Monday, March 18, 2013

HBase Architecture

HBase Tables and Regions

A table is made up of any number of regions.
A region is specified by its startKey and endKey.
  • Empty table: (Table, NULL, NULL)
  • Two-region table: (Table, NULL, “com.ABC.www”) and (Table, “com.ABC.www”, NULL)
Each region may live on a different node and is made up of several HDFS files and blocks, each of which is replicated by Hadoop.
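The startKey/endKey ranges above can be sketched in a few lines of Python. This is an illustrative model only, not the HBase client API; an empty key stands in for NULL, meaning "unbounded on this side":

```python
# Hypothetical sketch of how a row key maps to a region by its
# (startKey, endKey) range. '' plays the role of NULL (unbounded).

def find_region(regions, row_key):
    """regions: list of (start_key, end_key) pairs; '' means unbounded."""
    for start, end in regions:
        if (start == "" or row_key >= start) and (end == "" or row_key < end):
            return (start, end)
    return None

# The two-region table from the example above:
regions = [("", "com.ABC.www"), ("com.ABC.www", "")]
find_region(regions, "com.AAA.www")  # falls in the first region
find_region(regions, "com.XYZ.www")  # falls in the second region
```

Because keys are compared lexicographically, every possible row key falls into exactly one region.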

HBase Tables:-
  • Tables are sorted by Row in lexicographical order
  • Table schema only defines its column families
  • Each family consists of any number of columns
  • Each column consists of any number of versions
  • Columns only exist when inserted; NULLs are free
  • Columns within a family are sorted and stored together
  • Everything except table names is byte[]
  • HBase table format: (Row, Family:Column, Timestamp) -> Value
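The bullet points above can be demonstrated with a small sketch: model the table as a map keyed by the full cell coordinates. This is purely illustrative (HBase stores byte[] in HFiles, not a Python dict); negating the timestamp mimics HBase's convention of sorting newer versions first:

```python
# Illustrative model of the (Row, Family:Column, Timestamp) -> Value
# format. Sorting the keys shows rows in lexicographical order,
# columns within a family grouped together, and newest versions first.

cells = {}

def put(row, family, column, ts, value):
    cells[(row, f"{family}:{column}", -ts)] = value  # -ts: newest sorts first

put("com.ABC.www", "anchor", "link", 2, "v2")
put("com.ABC.www", "anchor", "link", 1, "v1")
put("com.AAA.www", "anchor", "link", 1, "home")

sorted(cells)  # com.AAA row first; within com.ABC, version ts=2 before ts=1
```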

HBase uses HDFS as its reliable storage layer. It handles checksums, replication, and failover.

HBase consists of:
  • Java API, and gateways for REST, Thrift, and Avro
  • The Master manages the cluster
  • RegionServers manage the data
  • ZooKeeper is used as the “neural network” and coordinates the cluster
Data is stored in memory and flushed to disk at regular intervals or based on size.
  • Small flushes are merged in the background to keep number of files small
  • Reads check the memory stores first and the disk-based files second
  • Deletes are handled with “tombstone” markers
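The read path and tombstone behaviour described above can be sketched as a merge over layers. This is a conceptual model, not HBase internals; each layer is newest-first, and a tombstone in a newer layer hides older values until a compaction removes both:

```python
# Conceptual sketch: reads consult the MemStore, then store files
# (newest first). A "tombstone" marker hides any older value.

TOMBSTONE = object()  # stand-in for HBase's delete marker

def read(row, memstore, store_files):
    for layer in [memstore] + store_files:  # memstore wins over files
        if row in layer:
            value = layer[row]
            return None if value is TOMBSTONE else value
    return None

memstore = {"row2": TOMBSTONE}               # delete not yet compacted away
store_files = [{"row1": "a"}, {"row2": "old"}]

read("row1", memstore, store_files)  # found in a store file
read("row2", memstore, store_files)  # tombstone hides the old value
```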
MemStores:-
After data is written to the WAL, the RegionServer saves KeyValues in the memory store (MemStore).
  • Flush to disk is based on size, set by hbase.hregion.memstore.flush.size
  • Default size is 64 MB
  • Uses a snapshot mechanism to write the flush to disk while still serving reads from it and accepting new data at the same time
Compactions:-
Two types: Minor and Major Compactions
Minor Compactions
  • Combine last “few” flushes
  • Triggered by number of storage files
Major Compactions
  • Rewrite all storage files
  • Drop deleted data, as well as values exceeding the TTL and/or the maximum number of versions
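What a major compaction drops can be illustrated with a small sketch. This is a conceptual model only (real compactions merge sorted HFiles); cells are (row, column, timestamp, value) tuples, and a tombstone both disappears itself and shadows the older cells beneath it:

```python
# Illustrative sketch: a major compaction rewrites all cells, dropping
# tombstoned values, cells older than the family TTL, and versions
# beyond the configured maximum.

TOMBSTONE = "__deleted__"  # stand-in for HBase's delete marker

def major_compact(cells, now, ttl, max_versions):
    kept, versions_seen = [], {}
    # Newest first, so version counting keeps the most recent values.
    for row, col, ts, value in sorted(cells, key=lambda c: -c[2]):
        if value == TOMBSTONE:
            versions_seen[(row, col)] = max_versions  # shadow older cells
            continue                                  # and drop the marker
        if now - ts > ttl:
            continue                                  # expired by TTL
        n = versions_seen.get((row, col), 0)
        if n >= max_versions:
            continue                                  # too many versions
        versions_seen[(row, col)] = n + 1
        kept.append((row, col, ts, value))
    return kept
```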
Key Cardinality:-
The best performance is gained from queries that restrict by row key
  • Time range bound reads can skip store files
  • So can Bloom Filters
  • Selecting column families reduces the amount of data to be scanned    
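Why a time-range-bound read can skip store files is easy to sketch: each store file records the min/max timestamp of its cells, so a scan whose range does not overlap a file's range never needs to open it. The file names and tuple layout here are hypothetical:

```python
# Sketch of store-file skipping: a scan only touches files whose
# recorded timestamp range overlaps the scan's time range.

def files_to_read(store_files, scan_min_ts, scan_max_ts):
    """store_files: list of (name, min_ts, max_ts) per file."""
    return [name for name, lo, hi in store_files
            if lo <= scan_max_ts and hi >= scan_min_ts]

files = [("f1", 0, 100), ("f2", 100, 200), ("f3", 200, 300)]
files_to_read(files, 150, 180)  # only "f2" overlaps the scan range
```

Bloom filters work analogously: a per-file filter answers "this row/column is definitely not in this file", letting reads skip it.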

Fold, Store, and Shift:-
All values are stored with their full coordinates, including: Row Key, Column Family, Column Qualifier, and Timestamp
  • Folds columns into “row per column”
  • NULLs are cost free as nothing is stored
  • Versions are multiple “rows” in folded table
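The folding of a logical row into per-cell "rows" can be sketched as below. This is illustrative only; the point is that NULL columns simply produce no cell, and each stored cell carries its full coordinates:

```python
# Sketch of "fold, store, and shift": a logical row with some NULL
# columns folds into one physical cell per non-NULL value. Another
# timestamp for the same column would just be one more folded row.

def fold(row_key, families, ts):
    """families: {family: {column: value-or-None}}."""
    return [(row_key, f"{fam}:{col}", ts, val)
            for fam, cols in families.items()
            for col, val in cols.items()
            if val is not None]        # NULLs are simply not stored

fold("row1", {"cf": {"a": "1", "b": None, "c": "3"}}, 10)
# only "cf:a" and "cf:c" produce cells; the NULL in "cf:b" costs nothing
```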
 
DDI:-
Stands for Denormalization, Duplication, and Intelligent Keys
Block Cache
Region Splits
