A table is made up of any number of regions.
A region is specified by its startKey and endKey.
- Empty table: (Table, NULL, NULL)
- Two-region table: (Table, NULL, “com.ABC.www”) and (Table, “com.ABC.www”, NULL)
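The region examples above can be turned into a small lookup sketch. It is only an illustration: `find_region` is a hypothetical helper, `None` stands in for NULL, and it assumes the start key is inclusive and the end key exclusive, which the notes above do not state.

```python
from bisect import bisect_right

def find_region(regions, row_key):
    """Return the (start_key, end_key) pair of the region holding row_key.

    `regions` must be sorted by start key and cover the whole key space:
    the first start key and the last end key are None (NULL).
    Assumption: start keys are inclusive, end keys exclusive.
    """
    # The split points are the start keys of every region except the first.
    split_points = [start for start, _ in regions[1:]]
    return regions[bisect_right(split_points, row_key)]

empty_table = [(None, None)]
two_regions = [(None, b"com.ABC.www"), (b"com.ABC.www", None)]

print(find_region(empty_table, b"any.row.key"))
print(find_region(two_regions, b"com.AAA.www"))
print(find_region(two_regions, b"com.ABC.www"))
```

Because row keys are compared as raw bytes, a binary search over the split points is enough to route any key to its region.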
HBase Tables:-
- Tables are sorted by Row in lexicographical order
- Table schema only defines its column families
- Each family consists of any number of columns
- Each column consists of any number of versions
- Columns only exist when inserted, NULLs are free
- Columns within a family are sorted and stored together
- Everything except table names is byte[]
- HBase table format: (Row, Family:Column, Timestamp) -> Value
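To make the (Row, Family:Column, Timestamp) -> Value mapping concrete, here is a toy in-memory model. `ToyTable` and its methods are made up for illustration and are not the HBase client API; it only mirrors the points in the list above (a schema of column families, columns created on insert, several versions per column).

```python
import time

class ToyTable:
    """Toy model of (row, family:qualifier, timestamp) -> value."""

    def __init__(self, families):
        self.families = set(families)  # the schema names only column families
        self.cells = {}                # (row, family, qualifier) -> {timestamp: value}

    def put(self, row, family, qualifier, value, ts=None):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family!r}")
        # Default to the current time in milliseconds when no timestamp is given.
        ts = time.time_ns() // 1_000_000 if ts is None else ts
        # The column comes into existence on first insert.
        self.cells.setdefault((row, family, qualifier), {})[ts] = value

    def get(self, row, family, qualifier, versions=1):
        """Return up to `versions` (timestamp, value) pairs, newest first."""
        stored = self.cells.get((row, family, qualifier), {})
        return sorted(stored.items(), reverse=True)[:versions]

    def rows(self):
        """Row keys in lexicographical (byte) order."""
        return sorted({row for row, _, _ in self.cells})

table = ToyTable([b"anchor"])
table.put(b"com.ABC.www", b"anchor", b"home", b"v1", ts=1)
table.put(b"com.ABC.www", b"anchor", b"home", b"v2", ts=2)
print(table.get(b"com.ABC.www", b"anchor", b"home", versions=2))
```

A column that was never written simply has no entry, which is why NULLs cost nothing in this model.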
HBase uses HDFS as its reliable storage layer. HDFS handles checksums, replication and failover.
HBase consists of:
- Java API, Gateway for REST, Thrift, Avro
- Master manages cluster
- RegionServers manage data
- ZooKeeper is used as the “neural network” and coordinates the cluster
Data is stored in memory and flushed to disk at regular intervals or based on size
- Small flushes are merged in the background to keep the number of files small
- Reads check the memory stores first and the disk-based files second
- Deletes are handled with “tombstone” markers
After data is written to the WAL (write-ahead log), the RegionServer saves the KeyValues in the memory store
- The flush to disk is based on size, set by hbase.hregion.memstore.flush.size
- Default size is 64MB (128MB in newer HBase releases)
- Uses a snapshot mechanism: the memory store is snapshotted and written to disk while it keeps serving reads and accepting new data
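The write and read paths described above can be sketched with a toy store. Everything here is hypothetical (`ToyStore`, `TOMBSTONE`, the byte accounting); it is single-threaded, so it only shows the memory store being swapped for a fresh one at flush time, not real concurrent serving.

```python
TOMBSTONE = object()  # hypothetical stand-in for a delete marker

class ToyStore:
    def __init__(self, flush_size=64):
        self.flush_size = flush_size  # bytes; plays the role of hbase.hregion.memstore.flush.size
        self.wal = []                 # append-only log, written before the memory store
        self.memstore = {}            # newest writes: key -> value
        self.memstore_bytes = 0       # rough size accounting, good enough for a sketch
        self.store_files = []         # flushed, immutable "files", oldest first

    def put(self, key, value):
        self.wal.append((key, value))  # 1. log the edit
        self.memstore[key] = value     # 2. apply it in memory
        self.memstore_bytes += len(key) + (0 if value is TOMBSTONE else len(value))
        if self.memstore_bytes >= self.flush_size:
            self.flush()               # 3. flush once the size threshold is reached

    def delete(self, key):
        # A delete is just another write: a tombstone that hides older values.
        self.put(key, TOMBSTONE)

    def flush(self):
        # Swap in a fresh memory store first, then write out the snapshot.
        snapshot, self.memstore = self.memstore, {}
        self.memstore_bytes = 0
        self.store_files.append(dict(sorted(snapshot.items())))

    def get(self, key):
        # Memory store first, then store files from newest to oldest.
        for source in [self.memstore, *reversed(self.store_files)]:
            if key in source:
                value = source[key]
                return None if value is TOMBSTONE else value
        return None

store = ToyStore(flush_size=10)
store.put(b"row1", b"aaaaaa")  # 4 + 6 bytes reaches the threshold and triggers a flush
store.delete(b"row1")          # the tombstone stays in the memory store
print(len(store.store_files), store.get(b"row1"))
```

Note that the flushed file still contains the old value; the tombstone in the newer layer is what makes the read return nothing.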
Compactions come in two types: minor and major
Minor Compactions
- Combine last “few” flushes
- Triggered by number of storage files
Major Compactions
- Rewrite all storage files
- Drop deleted data and values that exceed the TTL and/or the maximum number of versions
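The difference between the two compaction types can be sketched as follows. This is a simplified model, not HBase's actual algorithm: a store file is a list of `(row, timestamp, value)` cells, a value of `None` is a tombstone, and a tombstone is assumed to hide every cell of that row with an equal or older timestamp.

```python
from collections import defaultdict

def minor_compact(files, count):
    """Merge the newest `count` store files into one, keeping every cell."""
    if count < 2 or len(files) < count:
        return files
    merged = [cell for f in files[-count:] for cell in f]
    merged.sort(key=lambda c: (c[0], -c[1]))  # by row, newest version first
    return files[:-count] + [merged]

def major_compact(files, now, ttl, max_versions):
    """Rewrite all store files into one, dropping dead data."""
    by_row = defaultdict(list)
    for f in files:
        for row, ts, value in f:
            by_row[row].append((ts, value))
    out = []
    for row in sorted(by_row):
        cells = sorted(by_row[row], key=lambda c: c[0], reverse=True)
        # Timestamp of the newest tombstone for this row, if any.
        deleted_up_to = max((ts for ts, v in cells if v is None), default=None)
        kept = [
            (ts, v) for ts, v in cells
            if v is not None                                   # drop the tombstones themselves
            and (deleted_up_to is None or ts > deleted_up_to)  # drop deleted data
            and now - ts <= ttl                                # drop values past their TTL
        ]
        out.extend((row, ts, v) for ts, v in kept[:max_versions])  # cap the versions
    return [out]

files = [
    [(b"a", 1, b"v1"), (b"b", 1, b"x")],
    [(b"a", 2, b"v2")],
    [(b"b", 3, None), (b"a", 3, b"v3")],
]
print(minor_compact(files, 2))
print(major_compact(files, now=10, ttl=100, max_versions=2))
```

The minor compaction only reduces the file count and must keep tombstones, since older files it did not touch may still hold the deleted values. Only the major compaction sees all files and can safely drop them.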
Key Cardinality:-
The best performance is gained from using row keys
- Time-range-bound reads can skip store files
- Bloom filters allow store files to be skipped as well
- Selecting column families reduces the amount of data to be scanned
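As a rough sketch of how a time-range-bound read skips store files: assume each file records the smallest and largest timestamp it contains. The dictionary layout and `files_to_scan` are hypothetical, and the range is treated as inclusive on both ends for simplicity.

```python
def files_to_scan(store_files, min_ts, max_ts):
    """Keep only the files whose timestamp range overlaps [min_ts, max_ts]."""
    return [
        f for f in store_files
        if f["max_ts"] >= min_ts and f["min_ts"] <= max_ts
    ]

store_files = [
    {"name": "f1", "min_ts": 1, "max_ts": 10},
    {"name": "f2", "min_ts": 11, "max_ts": 20},
    {"name": "f3", "min_ts": 21, "max_ts": 30},
]
selected = [f["name"] for f in files_to_scan(store_files, 12, 25)]
print(selected)
```

A Bloom filter plays a similar role for row keys: if the filter says a key is definitely not in a file, that file need not be read.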
All values are stored with their full coordinates, including: Row Key, Column Family, Column Qualifier, and Timestamp
- Folds columns into “row per column”
- NULLs are cost free as nothing is stored
- Versions are multiple “rows” in folded table
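The folding described above can be illustrated by flattening one logical row into a list of fully qualified cells. `fold` is a hypothetical helper and the tuple layout is an assumption made for the sketch; the newest-first ordering of versions is also an assumption not stated in the notes.

```python
def fold(row_key, families):
    """Flatten {family: {qualifier: {timestamp: value}}} into one cell per version."""
    cells = []
    for family, columns in families.items():
        for qualifier, versions in columns.items():
            for ts, value in versions.items():
                # Every cell carries its full coordinates.
                cells.append((row_key, family, qualifier, ts, value))
    # Sort by row, family and qualifier; newest timestamp first within a column.
    return sorted(cells, key=lambda c: (c[0], c[1], c[2], -c[3]))

logical_row = {
    b"info": {b"name": {2: b"new", 1: b"old"}},
    b"anchor": {b"home": {5: b"x"}},
}
for cell in fold(b"r1", logical_row):
    print(cell)
```

Two versions of `info:name` become two "rows" in the folded output, while columns that were never written produce no cell at all.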
DDI:-
Stands for Denormalization, Duplication and Intelligent Keys
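One common reading of "intelligent keys" is to design the row key so that related rows sort next to each other. The key “com.ABC.www” used earlier looks like a reversed host name, so the sketch below assumes that scheme; `reversed_domain_key` is a hypothetical helper, not something the notes define.

```python
def reversed_domain_key(hostname):
    """Build a row key by reversing the dot-separated parts of a host name."""
    return ".".join(reversed(hostname.split("."))).encode("utf-8")

hosts = ["www.XYZ.org", "www.ABC.com", "mail.ABC.com"]
keys = sorted(reversed_domain_key(h) for h in hosts)
print(keys)
```

With keys built this way, all hosts of one domain sit in one contiguous key range, so a single scan can cover them.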
Block Cache
Region Splits