Tuesday, February 26, 2013

What is Hive

Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL. At the same time, this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.



Hive is an abstraction on top of MapReduce: it allows users to query data in the Hadoop cluster without knowing Java or MapReduce. It uses the HiveQL language, which is very similar to SQL.

Eight points about Hive:
  • Hive was originally developed at Facebook
  • Provides a very SQL-like language
  • Can be used by people who know SQL
  • Enabling Hive requires almost no extra work by the system administrator
  • Hive ‘layers’ table definitions on top of data in HDFS
  • Hive tables are stored in Hive’s ‘warehouse’ directory in HDFS (/user/hive/warehouse by default)
  • Tables are stored in subdirectories of the warehouse directory
  • Actual data is stored in flat files: control-character-delimited text, or SequenceFiles
Hive is a data warehousing tool on top of Hadoop. Its query language is very similar to SQL:
  • SQL like Queries
  • SHOW TABLES, DESCRIBE, DROP TABLE
  • CREATE TABLE, ALTER TABLE
  • SELECT, INSERT
Hive Limitations
  • Not all ‘standard’ SQL is supported
  • No support for UPDATE or DELETE
  • No support for INSERTing single rows
  • Relatively limited number of built-in functions
  • No datatypes for date or time - use the STRING datatype instead (newer Hive versions add date/time datatypes); a workaround is sketched below
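As a minimal sketch of the STRING workaround (the events table and its columns are invented for illustration), dates stored as 'yyyy-MM-dd' strings sort lexically in the same order as they do chronologically, so range filters still work:

hive> CREATE TABLE events (id INT, event_date STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
hive> SELECT * FROM events WHERE event_date >= '2013-01-01' AND event_date < '2013-02-01';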

Hive Architecture



Metastore: stores the system catalog
Driver: manages the life cycle of a HiveQL query as it moves through Hive; also manages the session handle and session statistics
Query compiler: compiles HiveQL into a directed acyclic graph of map/reduce tasks
Execution engine: executes the tasks in proper dependency order; interacts with Hadoop
HiveServer: provides a Thrift interface and JDBC/ODBC for integrating other applications
Client components: CLI, web interface, JDBC/ODBC interface
Extensibility interfaces: SerDe, User-Defined Functions (UDFs) and User-Defined Aggregate Functions (UDAFs)
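As a rough sketch of the extensibility interface, a custom UDF packaged in a jar can be registered from the Hive shell; the jar path, function name and class name below are hypothetical:

hive> ADD JAR /tmp/my_udfs.jar;
hive> CREATE TEMPORARY FUNCTION to_upper AS 'com.example.hive.udf.ToUpper';
hive> SELECT to_upper(name) FROM example;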

Monday, February 25, 2013

Hive Commands

To launch the Hive shell, start a terminal and run: $ hive
Note: example is the table name used in all of the queries below.

hive>
Hive : Creating Tables
hive> CREATE TABLE example (id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
hive> DESCRIBE example;
hive> SHOW TABLES;

Hive : Loading Data Into Hive
Data is loaded into Hive with the LOAD DATA INPATH statement – Assumes that the data is already in HDFS
hive> LOAD DATA INPATH 'file_txtdata.txt' INTO TABLE example;
If the data is on the local filesystem, use LOAD DATA LOCAL INPATH – the file is copied into HDFS automatically
hive> LOAD DATA LOCAL INPATH 'file_txtdata.txt' INTO TABLE example;
Hive : SELECT Queries
Hive supports most familiar SELECT syntax
hive> SELECT * FROM example LIMIT 10;
hive> SELECT * FROM example WHERE id > 100 ORDER BY name ASC LIMIT 10;

Joining Tables
SELECT e.name, e.dep, s.id FROM example e JOIN sample s ON (e.dep = s.dep) WHERE e.id >= 20;

Creating User-Defined Functions
INSERT OVERWRITE TABLE u_data_new
SELECT TRANSFORM (userid, movieid, rating, unixtime) USING 'python weekday_mapper.py' AS (userid, movieid, rating, weekday) FROM u_data;
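Note: this query uses Hive's TRANSFORM clause to stream rows through an external Python script rather than a compiled Java UDF. Assuming the u_data and u_data_new tables from the Hive MovieLens getting-started example, the script must first be shipped with the job:

hive> ADD FILE weekday_mapper.py;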

Join Query: sample
1. Create the table
CREATE TABLE example (ID int, SUBJECT string, PRODUCT string, PERIOD int, START_TIME int, OPERATION string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
2. Load the data (save the file in the relevant folder first)
hive> LOAD DATA LOCAL INPATH 'file_txtdata.txt' INTO TABLE example;
3. Run the join query
SELECT A.*
FROM example A
JOIN (
  SELECT id, MAX(start_time) AS start_time
  FROM example
  WHERE start_time < 25
  GROUP BY id ) MAXSP
ON A.id = MAXSP.id AND A.start_time = MAXSP.start_time;

Using NOT IN / IN in a Hive query
SELECT * FROM example WHERE NOT array_contains(array(7,6,5,4,2,12), id);

Hive Applications, Components, Data Model and Layout

Hive Applications
  • Log processing
  • Text mining
  • Document indexing
  • Customer-facing business intelligence (e.g., Google Analytics)
  • Predictive modeling, hypothesis testing
Hive Components
  • Shell: allows interactive queries, like a MySQL shell connected to a database – also supports web and JDBC clients
  • Driver: session handles, fetch, execute
  • Compiler: parse, plan, optimize
  • Execution engine: DAG of stages (M/R, HDFS, or metadata)
  • Metastore: schema, location in HDFS, SerDe
Data Model
  • Tables – Typed columns (int, float, string, date, boolean) – also list and map types (for JSON-like data); see the example after this list
  • Partitions – e.g., to range-partition tables by date
  • Buckets – Hash partitions within ranges (useful for sampling, join optimization)
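A minimal sketch tying these together (the logs table and its columns are invented): a table partitioned by date and hash-bucketed by id:

hive> CREATE TABLE logs (id INT, msg STRING) PARTITIONED BY (dt STRING) CLUSTERED BY (id) INTO 32 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;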
Metastore
  • Database: namespace containing a set of tables
  • Holds table definitions (column types,physical layout)
  • Partition data
  • Uses JPOX ORM for implementation; can be stored in Derby, MySQL, many other relational databases
Physical Layout
  • Warehouse directory in HDFS – e.g., /home/hive/warehouse
  • Tables stored in subdirectories of warehouse – Partitions, buckets form subdirectories of tables
  • Actual data stored in flat files – Control char-delimited text, or SequenceFiles – With custom SerDe, can use arbitrary format
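This layout can be seen directly with the Hadoop shell, assuming the default warehouse directory and the example table created earlier:

$ hadoop fs -ls /user/hive/warehouse
$ hadoop fs -ls /user/hive/warehouse/example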

Wednesday, February 20, 2013

Hadoop Ecosystem

Hadoop Ecosystem Map

 

1. Large data on the web
2. Nutch was built to crawl this web data
3. Large volumes of data had to be saved - HDFS was introduced
4. How to use this data? Reporting
5. The MapReduce framework was built for coding and running analytics
6. Unstructured data - web logs, click streams, Apache logs, server logs - Fuse, WebDAV, Chukwa, Flume and Scribe
7. Sqoop and HIHO for loading data into HDFS - RDBMS data
8. High-level interfaces required over low-level MapReduce programming - Hive, Pig, Jaql
9. BI tools with advanced UI reporting
10. Workflow tools over MapReduce processes and high-level languages - Oozie
11. Monitor and manage Hadoop, run jobs/Hive, view HDFS - high-level views - Hue, Karmasphere, Eclipse plugin, Cacti, Ganglia
12. Support frameworks - Avro (serialization), ZooKeeper (coordination)
13. More high-level interfaces/uses - Mahout, Elastic MapReduce
14. OLTP is also possible - HBase
15. Lucene is a text search engine library written in Java.

  • HBase is the Hadoop database for random read/write access.
  • Hive provides data warehousing tools to extract, transform and load data, and query this data stored in Hadoop files.
  • Pig is a platform for analyzing large data sets. It is a high level language for expressing data analysis.
  • Oozie - Workflow scheduler for interdependent Hadoop jobs. A simple workflow has a start control node, a map-reduce action node, a kill control node, and an end control node.
  • FLUME - Highly reliable, configurable streaming data collection
  • SQOOP -Integrate databases and data warehouses with Hadoop
  • HUE - User interface framework and SDK for visual Hadoop applications
  • Eclipse is a popular IDE donated by IBM to the open source community.
  • Lucene is a text search engine library written in Java.
  • Jaql (pronounced 'jackal') is a query language for JavaScript Object Notation (JSON).
  • ZooKeeper - Coordination service for distributed applications
  • Avro is a data serialization system.
  • UIMA is an architecture for the development, discovery, composition and deployment of components for the analysis of unstructured data.

Tuesday, February 19, 2013

Data Disk Failure, Heartbeats and Re-Replication

Each Datanode sends a Heartbeat message to the Namenode periodically. The Namenode detects the absence of Heartbeats, marks those Datanodes as dead, and does not forward any new IO requests to them. Any data that was registered to a dead Datanode is no longer available to HDFS. The Namenode constantly tracks which blocks need to be replicated and initiates replication whenever necessary. The need for re-replication may arise for many reasons: a Datanode may become unavailable, a replica may become corrupted, a hard disk on a Datanode may fail, or the replication factor of a file may be increased.
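Block health, including under-replicated blocks and block locations, can be inspected from the command line; the warehouse path below is just an example:

$ hadoop fsck / -blocks -locations
$ hadoop fsck /user/hive/warehouse/example -files -blocks -locations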

Cluster Rebalancing
The HDFS architecture is compatible with data rebalancing schemes. A scheme might automatically move data from one Datanode to another if the free space on a Datanode falls below a certain threshold.

Checkpoint Node
The Checkpoint node periodically creates checkpoints of the namespace. It downloads fsimage and edits from the active NameNode, merges them locally, and uploads the new image back to the active NameNode. The Checkpoint node usually runs on a different machine than the NameNode. It stores the latest checkpoint in a directory that is structured the same way as the NameNode's directory. Multiple Checkpoint nodes may be specified in the cluster configuration file.

Backup Node
The Backup node provides the same checkpointing functionality as the Checkpoint node, and in addition maintains an in-memory, up-to-date copy of the file system namespace that is always synchronized with the active NameNode state. The Backup node receives the stream of file system edits from the NameNode and applies them to its own copy of the namespace in memory, thus creating a backup of the namespace.

Rebalancer
HDFS data may not always be placed uniformly across the DataNodes. One common reason is the addition of new DataNodes to an existing cluster. While placing new blocks (data for a file is stored as a series of blocks), the NameNode considers various parameters before choosing the DataNodes to receive these blocks.
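When a cluster does become unbalanced, the balancer tool can be run to redistribute blocks; the threshold (a percentage of disk-usage deviation) is optional:

$ hadoop balancer -threshold 10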

Rack Awareness
Typically, large Hadoop clusters are arranged in racks, and network traffic between nodes within the same rack is much more desirable than network traffic across racks. In addition, the NameNode tries to place replicas of a block on multiple racks for improved fault tolerance.

Safemode
Safemode for the NameNode is essentially a read-only mode for the HDFS cluster, in which it does not allow any modifications to the file system or blocks. Normally the NameNode leaves Safemode automatically after the DataNodes have reported that most file system blocks are available.
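Safemode can be checked and, if necessary, left manually from the command line:

$ hadoop dfsadmin -safemode get
$ hadoop dfsadmin -safemode leave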

MapReduce Architecture

MapReduce consists of two phases: Map and Reduce.
 
MapReduce: The Mapper
When the map function starts producing output, it is not simply written to disk. The process is more involved: it takes advantage of buffering writes in memory and doing some presorting for efficiency. Each map task has a circular memory buffer that it writes its output to. The buffer is 100 MB by default (the io.sort.mb property) and its size can be changed. When the contents of the buffer reach a certain threshold (80% by default, set by io.sort.spill.percent), a background thread starts to spill the contents to disk; if the buffer fills up while the spill is in progress, the map blocks until the spill is complete.

Before it writes to disk, the thread divides the data into partitions corresponding to the reducers. Within each partition the data is sorted by key, and if there is a combiner function, it is run on the output of the sort, so there is less data to write to local disk and to transfer to the reducer.
The Mapper reads data in the form of key/value pairs and outputs zero or more key/value pairs.
MapReduce Flow - The Mapper
Each of the portions (RecordReader, Mapper, Partitioner, Reducer, etc.) handles one stage of this flow.
 
MapReduce: The Reducer
The map output file sits on the local disk of the machine that ran the map task. The reduce task needs the map output for its particular partition from several map tasks across the cluster. The map tasks may finish at different times, so the reduce task starts copying their outputs as soon as each completes. This is known as the copy phase of the reduce task; by default five threads copy in parallel, and this can be changed with the mapred.reduce.parallel.copies property.
When all the map outputs have been copied, the reduce task moves into the sort phase. For example, if there were 50 map outputs and the merge factor (io.sort.factor, 10 by default) was 10, there would be 5 rounds. Each round would merge 10 files into one, so at the end there would be five intermediate files. A final round merges these five files and feeds the result directly into the reduce phase. The output of the reduce phase is written directly to the output filesystem, typically HDFS.
 
MapReduce Architecture
 
After the Map phase is over, all the intermediate values for a given intermediate key are combined together into a list. This list is given to a Reducer:
  • There may be a single Reducer, or multiple Reducers
  • This is specified as part of the job configuration
  • All values associated with a particular intermediate key are guaranteed to go to the same Reducer
  • The intermediate keys, and their value lists, are passed to the Reducer in sorted key order
  • This step is known as the ‘shuffle and sort’
 
The Reducer outputs zero or more final key/value pairs
  • These are written to HDFS
  • In practice, the Reducer usually emits a single key/value pair for each input key
The MapReduce Flow: Reducers to Outputs
  • Each of the portions (Reducer, RecordWriter, output file) handles one stage of this flow

What is MapReduce

MapReduce was published in 2004 by Google. Hadoop can run MapReduce programs written in various languages, such as Java, Ruby, Python, and C++. A job consists of two phases, Map and then Reduce, with a stage in between known as the shuffle and sort. Map tasks work on relatively small portions of data, typically a single HDFS block. MapReduce is a method for distributing a task across multiple nodes; its key features are automatic parallelization and distribution.
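A quick way to try MapReduce without writing any code is to run one of the example jobs that ship with Hadoop; the jar file name varies by Hadoop version and the input/output paths below are made up:

$ hadoop jar hadoop-examples-*.jar wordcount /user/hadoop/input /user/hadoop/output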

Shuffle and Sort
MapReduce makes the guarantee that the input to every reducer is sorted by key. The process by which the system performs this sort and transfers the map outputs to the reducers is known as the shuffle.
Counters
Counters are a useful channel for gathering statistics about the job, for quality control or for application-level statistics.
Secondary Sort
The MapReduce framework sorts the records by key before they reach the reducers.

Thursday, February 14, 2013

HDFS Architecture

Namenode (master node) - keeps the metadata, i.e. where each file's blocks are stored.
Datanode (slave node) - keeps the actual data.


HDFS has a master/slave architecture. An HDFS cluster consists of a single Namenode, a secondary Namenode and a number of Datanodes. The Namenode is the master: it manages the file system namespace and regulates access to files by clients; without the Namenode, there is no way to access the files in the HDFS cluster. The Datanodes are the slaves and hold the actual data. HDFS allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of Datanodes. The Datanodes are responsible for serving read and write requests from the file system’s clients. The Datanodes also perform block creation, deletion, and replication upon instruction from the Namenode.

Data Replication
HDFS stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time. The Namenode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the Datanodes in the cluster. Receipt of a Heartbeat implies that the Datanode is functioning properly. A Blockreport contains a list of all blocks on a Datanode.
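For example, the replication factor of an existing file can be changed from the Hadoop shell (-w waits for the change to complete); the path below is illustrative:

$ hadoop fs -setrep -w 5 /user/hive/warehouse/example/file_txtdata.txt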

Secondary NameNode
The secondary NameNode merges the file system image and the edits log files periodically and keeps the edits log size within a limit. It is usually run on a different machine than the primary NameNode, since its memory requirements are on the same order as the primary NameNode's. The secondary NameNode is started by bin/start-dfs.sh on the nodes specified in the conf/masters file. The secondary NameNode stores the latest checkpoint in a directory which is structured the same way as the primary NameNode's directory, so that the checkpointed image is always ready to be read by the primary NameNode if necessary.

What is HDFS

HDFS, the Hadoop Distributed File System, is responsible for storing data on the cluster.
HDFS is a filesystem written in Java, based on Google’s GFS, and sits on top of a native filesystem such as ext3, ext4 or XFS.
HDFS is a filesystem designed for storing very large files and running on clusters of commodity hardware.
Data is split into blocks and distributed across multiple nodes in the cluster
  • Each block is typically 64MB or 128MB in size
  • Each block is replicated multiple times
  • Default is to replicate each block three times
  • Replicas are stored on different nodes
Files in HDFS
Files in HDFS are ‘write once’ - No random writes to files are allowed
Files in HDFS are broken into block-sized chunks, which are stored as independent units.

When data is loaded into the system, it is split into ‘blocks’ - Typically 64MB or 128MB.
A good split size tends to be the size of an HDFS block, 64 MB by default, although this can be changed for the cluster.
HDFS files carry three types of permission, as in other filesystems: the read permission (r), the write permission (w), and the execute permission (x).
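Permissions can be inspected and changed with the usual shell commands; the paths below assume the default warehouse directory and the example table used earlier:

$ hadoop fs -ls /user/hive/warehouse/example
$ hadoop fs -chmod 644 /user/hive/warehouse/example/file_txtdata.txt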
Namenode
Without the metadata on the NameNode, there is no way to access the files in the HDFS cluster.
When a client application wants to read a file:
  • It communicates with the NameNode to determine which blocks make up the file, and which DataNodes those blocks reside on
  • It then communicates directly with the DataNodes to read the data.