Tuesday, September 13, 2011

Simple HBase Fixing Steps

HBase/Hadoop, like any other system, is not perfect. This is especially true here because it is a relatively young technology, and things can go completely wrong from even a small problem like an accidental shutdown. Reinstalling is real trouble and can even worsen the situation in such cases. Hadoop itself is relatively stable, so simple mistakes like an accidental shutdown will rarely cause you problems there. HBase, however, is notorious for such things. So these steps are only for cases where Hadoop is working perfectly fine (check with sudo -u hdfs hadoop dfsadmin -report) but you have screwed up HBase (running status in the hbase shell hangs or reports problems instead of giving you anything).

The idea is simple: empty the HBase data, including the system files, so that you get a working HBase back. This also means you need to back up your data on a regular basis to avoid real trouble. The steps are these (a combined script follows the list):
Shut down HBase (all servers, including ZooKeeper, the master and the regionservers), then:
1. Delete the /hbase directory:
sudo -u hdfs hadoop fs -rmr /hbase
2. Create an empty /hbase directory:
sudo -u hdfs hadoop fs -mkdir /hbase
3. Change its ownership from the default hdfs user to the hbase user:
sudo -u hdfs hadoop fs -chown hbase /hbase
4. Delete the ZooKeeper cache:
sudo rm -r /var/zookeeper/version-2/
5. Create the cache folder again:
sudo mkdir /var/zookeeper/version-2/
6. Change its ownership to zookeeper:
sudo chown zookeeper /var/zookeeper/version-2/
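
For convenience, here is the whole reset as one small script. This is just a sketch: the service names (hadoop-hbase-master, hadoop-hbase-regionserver, hadoop-zookeeper-server) are assumptions based on a packaged (CDH-style) install, and the ZooKeeper data path may differ on your machines. With a plain tarball install you would use bin/stop-hbase.sh and bin/start-hbase.sh instead.

#!/bin/bash
# reset-hbase.sh -- wipe HBase data so a broken HBase comes back clean.
# Verify Hadoop itself is healthy first: sudo -u hdfs hadoop dfsadmin -report
set -e

# Stop HBase daemons and ZooKeeper on every node (service names are assumptions).
sudo service hadoop-hbase-regionserver stop
sudo service hadoop-hbase-master stop
sudo service hadoop-zookeeper-server stop

# Remove and recreate /hbase in HDFS, owned by the hbase user.
sudo -u hdfs hadoop fs -rmr /hbase
sudo -u hdfs hadoop fs -mkdir /hbase
sudo -u hdfs hadoop fs -chown hbase /hbase

# Wipe and recreate the ZooKeeper cache (path may differ on your install).
sudo rm -r /var/zookeeper/version-2/
sudo mkdir /var/zookeeper/version-2/
sudo chown zookeeper /var/zookeeper/version-2/

# Start everything back up and check status from the HBase shell.
sudo service hadoop-zookeeper-server start
sudo service hadoop-hbase-master start
sudo service hadoop-hbase-regionserver start
echo "status" | hbase shell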

Now start HBase and try status. It may not hang this time, but it can still give you odd results (if you are unlucky like me). In that case, give it one more try: stop HBase and restart it. Now status should work as usual. Then start your map/reduce jobs again to reload the data.

This may not be the best fix, but it is a quick one. I have experimented with other methods and have screwed things up every time. But, as usual, analyse the problem first: the HBase problem may be simple enough to fix without removing all the data, so look at the log files before anything else. And back up frequently and regularly; don't wait for problems, because they are very common in HBase. That's part of what NoSQL means: you don't have the stability of SQL servers, and it can break very easily.
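
Since regular backups are the whole safety net here, one simple way to take them is the Export map/reduce job that ships with HBase. A minimal sketch, assuming a table named claim and an HDFS backup path /backup (both made up for illustration):

# Dump one table to a sequence-file directory in HDFS.
hbase org.apache.hadoop.hbase.mapreduce.Export claim /backup/claim-20110913

# Restore later into an existing table with the same families.
hbase org.apache.hadoop.hbase.mapreduce.Import claim /backup/claim-20110913

Run the Export line per table from cron, and these reset steps stop being scary.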

Note: We did not shut down Hadoop or make any changes to it, apart from removing the /hbase folder.

Tuesday, June 7, 2011

Experimenting with HBase schemas



We are trying various ways to fit relational healthcare data into HBase's non-relational form. This is a challenging job since the tables are not fixed, and we are trying to stay generic by driving everything from a configuration file. Here are the types of tables we are working with right now.
  1. Regular tables: These are regular relational tables that map directly into the domain. In healthcare there are many such tables, like Claims, Eligibility, etc., so they find a direct mapping into HBase too: we would have claim, eligibility, etc. as HBase tables. As the NoSQL camp preaches, these tables are completely denormalized and hold every field that may be involved in any query on that table, fields that would normally be put in separate tables under normalized conditions.
  2. Index tables: These tables exist for performance reasons; we use them as our optimization step. In a student-course example, an index table would have studentid+courseid as its row id, in addition to the course id field kept in the student table. The former is used by the optimizer, while the latter is used by non-optimized code via filters. Index tables are built from regular-table fields as a way of indexing them. (A small shell sketch of both kinds follows.)
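
To make the two kinds of tables concrete, here is a minimal HBase shell sketch using the student-course example (the table, family and id names are illustrations, not our production schema):

# Regular table: one row per student, fully denormalized.
create 'student', 'demographics', 'course'
put 'student', 's42', 'demographics:name', 'Jeff'
put 'student', 's42', 'course:c7', ''

# Index table: composite row id studentid+courseid lets the optimizer
# seek directly instead of filtering the whole student table.
create 'student_course_idx', 'info'
put 'student_course_idx', 's42+c7', 'info:exists', '1'

# All courses of s42 via a bounded scan (',' sorts right after '+').
scan 'student_course_idx', {STARTROW => 's42+', STOPROW => 's42,'}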

Also, there are some variations in column naming within regular tables. Some columns sound perfectly normal in the relational world: a column name that maps to the domain, with a value. Let's call them regular columns, e.g. a name column in the student table holding the value 'Jeff'. The other class of columns is pretty weird and looks like geek work done rather untidily: here the column name is what would normally be the value, so a value like 'Jeff' becomes the column name itself. What, then, does the column value hold? That is a fair question, and the answer depends on the column design. We use such columns for the code column family, which contains different types of codes with their attributes. A code is associated with a claim line number: a claim bills every healthcare cost on a separate line, and there is an assigned code for everything that gets billed, from medical procedures to diagnoses. To keep track of which line numbers hold which codes, each column name is the code value prefixed by a short tag denoting the type of code, and the column value is simply the matching line numbers in a comma-delimited list. (See the shell sketch below.)
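
In the shell this looks like the following sketch (the DIAG/PROC prefixes and the claim row id are made up for illustration):

# Row = one claim; the 'code' family uses value-as-column-name.
# Column name = code-type prefix + code value.
# Column value = comma-delimited claim line numbers holding that code.
put 'claim', 'claim1001', 'code:DIAG_4019', '1,3'
put 'claim', 'claim1001', 'code:PROC_99213', '2'

# One get fetches every code on the claim.
get 'claim', 'claim1001', {COLUMN => 'code'}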

The index tables also come in different types, depending on the column types they are based on. Regular index tables are those whose row id is built from a regular column: in the student table, if there is a column named course, then its index table's row id will be studentid+courseid. Index tables based on columns whose values occur as column names fall into a separate category. For such tables we do not create one index table per column name; instead we create a single table per column family, with all the columns of that family represented in its row id. So the row id is the column family name + the column name (which, as discussed earlier, is really the column's value). (A sketch follows.)
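
Continuing the claim sketch above, the family-based index table could be populated like this (again with made-up names):

# One index table for the whole 'code' family.
# Row id = family name + column name; value = claims carrying the code.
put 'code_idx', 'code+DIAG_4019', 'info:claims', 'claim1001'

# Everything indexed under the family via a bounded prefix scan.
scan 'code_idx', {STARTROW => 'code+', STOPROW => 'code,'}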

Before we settled on the above schema, we researched different ways of doing the schema design. After researching, we listed four types of schemas, each with its own merits and demerits, so I thought I would mention them too as part of our experimenting with HBase schemas. The problem was mapping a many-to-many relationship, as in student-course. For simplicity, consider designing only the student table, such that it stores all the demographic information of a student as well as that student's course information. The schemas are listed below (a composite-key sketch follows the list):
  1. JSON string column: There are two column families: demographics and course. The demographics family has a single column which stores all the information (name, address, DOB, etc.) of the student as a JSON string. The course family stores each course's information as a JSON string in its own column, with an integer index as the column name. Its merit is space: among all the contenders it took the least, roughly half of what the others normally took. To get all courses of a student, one only has to issue a single get request, which returns results in a matter of milliseconds. The demerit is having to parse a JSON string for every query, which can slow down reads significantly for complex queries.
  2. Multiple column families: This schema puts each field of a course in a separate column family, so that all course information for a student is still accessed with a single get request. If a course has the attributes lecturer, name and courseid, then we have three families with those names. In each family the columns are maintained in strict order, each corresponding to one course: the first column in the lecturer family holds the lecturer of the first course, the first column in the name family holds the name of the first course, and so on. Its demerits are the large number of families and the space requirement: among all the tables it takes the most space, since null fields have to be stored to maintain the ordering. Also, for performance reasons the number of column families should be kept as small as possible, which this schema totally violates. Its merit is the fastest reads for different types of queries among all the tables, since a single get does all the work.
  3. Partially normalized: This schema resembles a normalized table in some form, hence the name. Here we have only two families, demographics and course, as in the first schema. The demographics family has all non-null fields as columns, while the course family has all non-null courseids as columns. To get course information, one has to look up a separate course table with courseid as its row id. This is what we adopted for our regular tables, as mentioned above. Its merit is that a single get fetches all course ids for a student. Its demerit is that we need a second lookup into the course table whenever we query on course attributes like name, lecturer, etc. An efficient way to do that second lookup is a bulk get.
  4. Composite key: This schema builds the row id from a combination of ids, e.g. studentid+courseid. All non-null course information can then be put into a single family, avoiding the separate lookup the third schema requires. To access all courses of one student, one does a range scan of the form studentid-*; this can be done easily using a filter, and if the course id is known to be a long or an integer, the start and end keys of the scan can be set to the minimum and maximum integer/long values. Demographic information then lives in a separate student table with a single family. Our idea for index tables derives from this type of schema.
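
As a final illustration, here is the composite-key schema in the HBase shell (a sketch with made-up ids; the '+' separator and the stop-row trick are just one way to bound the prefix scan):

# Composite row id studentid+courseid; all course fields in one family.
create 'student_course', 'info'
put 'student_course', 's42+c7', 'info:name', 'Databases'
put 'student_course', 's42+c7', 'info:lecturer', 'Smith'
put 'student_course', 's42+c9', 'info:name', 'Algorithms'

# All courses of student s42 in one range scan.
scan 'student_course', {STARTROW => 's42+', STOPROW => 's42,'}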