Tuesday, June 7, 2011

Experimenting with Hbase schemas



We are trying various ways in which we could fit relational healthcare data in hbase's non-relational form. This is a challenging job since tables are not fixed and we are trying generic ways using configuration file to work with. So, here are various types of tables that we are working right now.
  1. Regular tables: These are regular relational table that have direct mapping into domain. In healthcare domain, there might be many such tables like Claims, Eligibility, etc. So, they are sort of ones that one can find direct mapping into hbase too. So, we would have claim, eligibility, etc as hbase table. As NoSQL campaign says, these table are completely denormalized and hold every field that may be involved in any query used for that table that are normally put in separate table under normalized conditions.
  2. Index Tables: These tables are for performance reasons. We use them as our optimization steps. In student-course example, this table will be like having row id with studentid+courseid in addition to having course id field in student table. The former will be used by optimizer but latter will be used by non-optimized codes using filter. Index tables are based on regular table fields as a way of indexing them.

Also, there are some variations for column name in regular table. There are ones that sound perfectly normal in relational world where you have column name that maps to domain with a value. Lets call them the regular columns. Eg. Column name in student table having value 'Jeff'. The other class of column are pretty weird and seem like job of geeks done pretty untidily. Here name of column is what would normally be 'value' like value 'Jeff' be column name. So, what does column value hold? Well that is a fair question to ask but the answer is dependent on column design. We use such column for code column family. It contains different types of codes with their attributes. A code is associated with claim line number which basically bills every health care cost in a separate line. So, they have assigned code for everything that gets billed from medical procedure to diagnosis. To keep track of what line numbers hold what codes we have each code value prefixed by certain value denoting type of code as column name. The column value is simply line numbers in a comma delimited list.

The index tables are also of different types depending on column types. They are based on columns of regular table. Regular index table are those table that have row id as regular column. In student table, if there is column with name course then its index table row id will be studentid+courseid. Such index tables are regular. Those index table that are based on column whose value occur as column name fall in separate category. For such tables, We do not create index table based on column names but rather a single table based on column family having all columns from that family in its row id. So, the row id shall be column family name + column name (basically value of column as discussed earlier).

Before we settled on the above schema, we researched on different ways of schema design. After researching, we listed four types of schemas. Each of them had their own merits and demerits. So, I thought of mentioning them too as our experimenting on hbase schemas. The problem was mapping many to many relationship as in student-course. For simplicity, consider designing student table only so that one can store all demographics information of student as well as course information of that student. The schemas are listed below
  1. JSON string column: There will be two column family: demographics and course. The demographics family will have only single column which will store all information (name, address, DOB, etc) of that student as JSON string. The course family will have each course information as JSON string in each column having integer index as column name. Its merit includes less space. Among all other contenders, it took the least space equivalent to half of what others took normally. To get all course of a student, one has to only apply get request which returns result in matters of milliseconds. The demerits include parsing JSON string for every query that may slow down reading significantly for complex queries.
  2. Multiple Column Family: This schema puts each field of course as separate column family so that all course information of a student is still accessed using single get request. So, if course has attribute lecturer, name and courseid then we have three family each with that name. In each family we have column maintained in strict order each corresponding to one course. So, first column in lecturer family will hold lecturer name of first course. Similarly, first column in name family hold name of first course. So, such schemas has large number of families which can have high space requirement and forms its demerits. Among all other tables, it takes lots of space as null field has to be stored to maintain order. Also, for performance reasons column family number should be as less as possible. So, this schema totally violates it. The merit includes the fastest read for different types of query among all tables as single get does all the work.
  3. Partially Normalized: This schema is related to normalized table in some form and hence the name. Here, we have only two family: demographics and course as in first. Demographics family has all non null fields as column while course family has all non null courseid's as column. To get course information, one has to lookup into separate course table having courseid as rowid. This is what we adapted for regular table as mentioned above. Merits include using get we can get all course id for a student. But, demerit include we need to double lookup into course table if we want to query on course attributes like DOB, lecturer, etc. An efficient way for doing in such case is bulk get.
  4. Composite Key: This schema contains row id as combination of various ids. So, row id can be studentid + courseid. So, all non null information of course can now be put into single family avoiding separate lookup as was case with 3rd schema. For accessing all course of one student, one has to do range scan like student-*. This can be done easily using filter. If course id is known to be long or integer then start and end keys can be specified in range scan as minimum and maximum integer/long values. For storing demographics information of student, one has to have separate student table holding demographics information in single family. Our idea about index table derive from this type of schema. 

5 comments:

  1. Hi Sumit,
    I am trying to model a schema for storing EDI 837 claims in HBASE.

    Can you please help me out with the data model in HBASE.
    I have created the mapping document and have the attributes that should go with Claim header and claim detail table.

    Thanks,
    Dipankar

    ReplyDelete
  2. As said in the blog above you have many ways you can do it. Claims are primary transaction data so you can keep it simple by putting all the fields in as column. You can group all procedure fields into one column family all diagnosis related fields in one column family. One update i would like to give regarding index is that dont use hbase for it. use lucene instead so you can do descending as well as ascending order sort. Good luck and keep posted.

    ReplyDelete
  3. So does this look good to you.

    Have CF :ClaimHeader Column :ClaimVer Value :4010/5010
    Column :Type Value : Professional
    etc...

    and have all Loop 2400 details (transactions) segregated into separate CF like Diagnosis ,Procedure Codes.

    and each rowkey would comprise of self generated seq_claim_id|line number|sub line code

    ReplyDelete
  4. I am beginner in HBase. Finally, I got my answer after many hours of surfing the web by reading your article. hope to see more posts on this technology.

    ReplyDelete
  5. Appreciate your work, very informative blog on HBase. I just wanted to share information about HBase Online Training. Hope it helps blog here.

    ReplyDelete