Thursday, February 14, 2013

Understand HBase data replication



I often see people confused about how data replication in HBase relates to Hadoop. In today's post I will try to provide some insight into that and explain why Hadoop's fsck command reports under-replicated blocks.


HBase is a NoSQL database that uses HDFS as its underlying file system, so it leverages the benefits that HDFS provides.

HBase and HDFS replication

The default replication factor of HDFS is 3, so if you create an HBase table and put some data in it, the data is written to HDFS and HDFS creates three copies of it.

What's the problem then? (Under-replicated blocks)

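Suppose the replication factor of your Hadoop cluster is set to two and you build an HBase cluster on top of it. HBase writes to HDFS as a client, and unless it can see the cluster's setting it requests the default replication factor of three for its files. If the cluster cannot hold three replicas of a block (for example, because it has only two DataNodes), Hadoop's fsck command will report those blocks as under-replicated.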
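The reasoning can be sketched with a toy model. This is not HDFS code, only an illustration under stated assumptions: the file paths, the function name, and the two-DataNode cluster are all hypothetical, and the model relies only on the fact that HDFS keeps at most one replica of a block per DataNode.

```python
def under_replicated(files, live_datanodes):
    """Return {path: (actual_replicas, target_replicas)} for files that
    cannot reach the replication factor requested when they were written.

    files: mapping of path -> replication factor requested by the writing
           client (hypothetical paths, for illustration only).
    live_datanodes: number of DataNodes available to hold replicas.
    """
    report = {}
    for path, target in files.items():
        # At most one replica of a block can live on each DataNode.
        actual = min(target, live_datanodes)
        if actual < target:
            report[path] = (actual, target)
    return report


files = {
    # Written by a client that picked up dfs.replication=2 from the cluster.
    "/user/data/part-00000": 2,
    # Written by HBase, which fell back to the default of 3.
    "/hbase/test2/cf2/hfile": 3,
}

print(under_replicated(files, live_datanodes=2))
```

In this model only the HBase file is flagged, because it asks for three replicas on a cluster that can hold two. With three or more DataNodes nothing would be flagged, although the HBase data would then be stored three times instead of the intended two.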

To avoid this problem and bind HBase's replication factor to Hadoop's replication factor, you can use one of the following options:

1. Create a symlink to hdfs-site.xml in the <Hbase_Home>/conf dir
2. Define the dfs.replication property in hbase-site.xml
3. Copy hdfs-site.xml to the <Hbase_Home>/conf dir
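
As a rough sketch of option 2, the property could be added to hbase-site.xml as shown here. The value 2 is an assumption taken from the example cluster above; use whatever your hdfs-site.xml specifies.

```xml
<!-- hbase-site.xml: make HBase's HDFS client request the same
     replication factor as the cluster (2 is assumed from the example) -->
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
```

HBase needs a restart to pick up the change, and the setting applies only to files written afterwards; files that already exist keep the replication factor they were created with.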

Difference between replication factor and replication scope

If you run the describe command on an HBase table, it gives you output like the following:

hbase(main):005:0> describe 'test2'
DESCRIPTION                                                                                       ENABLED                                            
 {NAME => 'test2', FAMILIES => [{NAME => 'cf2', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0',  true                                               
 COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY =>                                                     
 'false', BLOCKCACHE => 'true'}]}                                                                                                                    
1 row(s) in 0.0580 seconds

The "REPLICATION_SCOPE => '0'" setting is entirely different from the replication factor.

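Replication scope is a column-family-level attribute that controls whether that family's data is replicated to another HBase cluster (part of HBase data backup): a value of 0 means replication is disabled, and 1 means it is enabled. The replication factor, by contrast, is the number of copies of the table data kept within the same cluster (part of data availability).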


Conclusion: HBase uses a replication factor of three by default, and it does not follow HDFS's configured replication factor until we instruct it to do so. HBase's replication factor is about data availability within the same cluster, while replication scope is about replicating HBase data from the original cluster to a backup cluster.






3 comments:

  1. I think REPLICATION_SCOPE is not that; it is just whether replication is enabled for that particular table (http://blog.cloudera.com/blog/2012/08/hbase-replication-operational-overview/). "REPLICATION_SCOPE is a column-family level attribute and its value can be either 0 or 1. A value of 0 means replication is disabled, and 1 means replication is enabled".
    I liked the part where you explained the conflicts between the Hadoop and HBase replication factors, but I have yet to test that myself. Even so, I think you might be right, and that could have been the source of one of the problems I had in the past with my cluster. See my blog at algarecu.wordpress.com for a complete review on that later as well. Cheers, Al.

  2. Dude..This is an excellent post...It helped me...Thanks a lot bro..:)
