Tuesday, February 19, 2013

AmitsTechnical Blog: Understand HBase data replication

Thursday, February 14, 2013

Understand HBase data replication



I have often seen people confused about how data replication in HBase relates to Hadoop. In today's post I will try to provide some insight into that and explain why the hadoop fsck command reports under-replicated blocks.


HBase is a NoSQL database that uses HDFS as its underlying file system, so it leverages the benefits that HDFS provides.

HBase and HDFS replication

The default replication factor of HDFS is 3, so if you create an HBase table and put some data in it, the data is written to HDFS and HDFS creates three copies of it.

What's the problem then? (Under-replicated blocks)

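Suppose the replication factor of your Hadoop cluster is set to two and you build an HBase cluster on top of it. HBase writes to HDFS as a client, and unless it can see the cluster's setting it uses the default replication factor of three. Its files therefore ask for three replicas, and when the cluster cannot provide them (for example, when it has only two DataNodes), hadoop's fsck command reports under-replicated blocks.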
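To make the arithmetic concrete, here is a minimal sketch of when a block counts as under-replicated. It is only a toy model, not HDFS code: the function names are made up for illustration, and it assumes a small cluster with just two DataNodes, where HDFS keeps at most one replica of a block per DataNode.

```python
def live_replicas(target_replication: int, datanodes: int) -> int:
    """HDFS keeps at most one replica of a block on each DataNode."""
    return min(target_replication, datanodes)


def is_under_replicated(target_replication: int, datanodes: int) -> bool:
    """A block is under-replicated when it has fewer live replicas than its target."""
    return live_replicas(target_replication, datanodes) < target_replication


# Assumed scenario: two DataNodes, cluster configured with replication 2,
# while HBase writes its files with the default of 3.
hbase_default = 3
cluster_setting = 2
print(is_under_replicated(hbase_default, datanodes=2))    # True
print(is_under_replicated(cluster_setting, datanodes=2))  # False
```

With enough DataNodes to hold three replicas, the same check returns False even for files written with a factor of three.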

To avoid this problem and tie HBase's replication factor to Hadoop's replication factor, you can use one of the following options:

1. Create a symlink to hdfs-site.xml in the <Hbase_Home>/conf dir
2. Define the dfs.replication property in hbase-site.xml
3. Copy hdfs-site.xml into the <Hbase_Home>/conf dir
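After applying any of these options, it can help to check which dfs.replication value HBase's conf directory actually provides. The sketch below is one rough way to do that with the Python standard library. The helper names are hypothetical, and the lookup order (hbase-site.xml first, then hdfs-site.xml, then the client default of 3) is a simplification of how Hadoop layers its configuration files, so treat the result as a hint rather than the final word.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Optional


def read_property(site_file: Path, name: str) -> Optional[str]:
    """Return a property value from a Hadoop-style *-site.xml file, or None."""
    if not site_file.exists():
        return None
    root = ET.parse(site_file).getroot()
    for prop in root.findall("property"):
        if prop.findtext("name", default="").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


def hbase_replication(hbase_conf: Path, client_default: str = "3") -> str:
    """Replication factor suggested by the files in HBase's conf directory.

    Simplified lookup: hbase-site.xml, then hdfs-site.xml (symlinked or copied),
    then the HDFS client default when neither file sets dfs.replication.
    """
    for file_name in ("hbase-site.xml", "hdfs-site.xml"):
        value = read_property(hbase_conf / file_name, "dfs.replication")
        if value is not None:
            return value
    return client_default


# Demo against a throwaway directory standing in for <Hbase_Home>/conf.
with tempfile.TemporaryDirectory() as tmp:
    conf_dir = Path(tmp)
    print(hbase_replication(conf_dir))  # no site files yet: client default
    (conf_dir / "hbase-site.xml").write_text(
        "<configuration><property><name>dfs.replication</name>"
        "<value>2</value></property></configuration>"
    )
    print(hbase_replication(conf_dir))  # value defined in hbase-site.xml
```

To check a real installation, call hbase_replication with the path to your own conf directory.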

Difference between replication factor and replication scope

If you run the describe command on an HBase table, it gives you output something like this:

hbase(main):005:0> describe 'test2'
DESCRIPTION                                                                                       ENABLED                                            
 {NAME => 'test2', FAMILIES => [{NAME => 'cf2', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0',  true                                               
 COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY =>                                                     
 'false', BLOCKCACHE => 'true'}]}                                                                                                                    
1 row(s) in 0.0580 seconds

The "REPLICATION_SCOPE => '0'" setting is entirely different from the replication factor.

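Replication scope is a per-column-family flag that controls whether the family's data is copied to another HBase cluster (part of HBase data backup): 0 means the data stays in the local cluster, 1 means it is replicated to the configured peer clusters. The replication factor, on the other hand, is the number of copies of the table data kept within the same cluster (part of data availability).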
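As a small illustration, the sketch below pulls the REPLICATION_SCOPE value of each column family out of describe output like the one shown earlier and prints what it means. In HBase, 0 is the local scope (the family is not replicated to other clusters) and 1 is the global scope (the family is replicated to peer clusters). The function name and the regular expression are my own, written against the output format above, so other HBase versions may print the description differently.

```python
import re

# Meaning of the two common scope values in HBase.
SCOPE_MEANING = {
    0: "local: edits stay in this cluster",
    1: "global: edits are shipped to peer clusters",
}


def replication_scopes(describe_output: str) -> dict:
    """Map each column family name to its REPLICATION_SCOPE value."""
    # A family is printed as {NAME => 'cf', ..., REPLICATION_SCOPE => 'n', ...}.
    # Excluding braces keeps the table-level {NAME => 'table', FAMILIES => [...
    # entry from matching, because a '{' sits between it and the scope.
    pattern = re.compile(r"\{NAME => '([^']+)'[^{}]*?REPLICATION_SCOPE => '(\d+)'")
    return {family: int(scope) for family, scope in pattern.findall(describe_output)}


sample = (
    "{NAME => 'test2', FAMILIES => [{NAME => 'cf2', BLOOMFILTER => 'NONE', "
    "REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '3'}]}"
)
for family, scope in replication_scopes(sample).items():
    print(family, scope, SCOPE_MEANING.get(scope, "unknown"))
```

Nothing in this output says how many copies HDFS keeps of the table's files. That number comes from dfs.replication, not from the table description.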


Conclusion :- HBase uses a replication factor of three by default and does not follow HDFS's configured replication factor until we instruct it to do so. The replication factor is about data availability within the same cluster, while the replication scope is about backing up HBase data from the original cluster to a backup cluster.