Big data: Hadoop Ecosystem

Hadoop Ecosystem Map

1.Large data on the web

2.Nutch built to crawl this web data

3.Large volume of data had to be saved - HDFS introduced

4.How to use this data? Report

5.MapReduce framework built for coding and running analytics

6.Unstructured data – web logs, click streams, Apache logs, server logs – collected with FUSE, WebDAV, Chukwa, Flume and Scribe

7.Sqoop and HIHO for loading RDBMS data into HDFS

8.High level interfaces required over low level MapReduce programming – Hive, Pig, Jaql

9.BI tools with advanced UI reporting

10.Workflow tools over MapReduce processes and high level languages - Oozie

11.Monitor and manage Hadoop, run jobs/Hive, view HDFS – high level view - Hue, Karmasphere, Eclipse plugin, Cacti, Ganglia

12.Support frameworks - Avro (serialization), ZooKeeper (coordination)

13.More high level interfaces/uses - Mahout, Elastic MapReduce

14.OLTP also possible – HBase

15.Text search - Lucene
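
As a rough illustration of step 5, the MapReduce idea can be sketched in plain Python: a map step emits key/value pairs, the framework groups them by key, and a reduce step aggregates each group. This is only a single-process sketch and not Hadoop's API; the function names (map_phase, shuffle, reduce_phase, word_count) and the sample log lines are made up for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, roughly what the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical web-log style input, one record per line.
    logs = ["GET /index.html", "GET /about.html", "POST /login"]
    print(word_count(logs))
```

On a real cluster the map and reduce steps run in parallel on many machines over data stored in HDFS; here everything runs in one process just to show the data flow.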

HBase is the Hadoop database for random read/write access.
Hive provides data warehousing tools to extract, transform and load data, and query this data stored in Hadoop files.
Pig is a platform for analyzing large data sets. It includes Pig Latin, a high level language for expressing data analysis programs.
Oozie - Workflow for interdependent Hadoop jobs. A simple workflow has four nodes: a start control node, a map-reduce action node, a kill control node, and an end control node. The action node does the work; the control nodes define where the workflow begins and how it ends.
Flume - Highly reliable, configurable streaming data collection
Sqoop - Integrates databases and data warehouses with Hadoop
Hue - User interface framework and SDK for visual Hadoop applications
Eclipse is a popular IDE donated by IBM to the open source community.
Lucene is a text search engine library written in Java.
Jaql (pronounced "jackal") is a query language for JSON (JavaScript Object Notation) data.
ZooKeeper - Coordination service for distributed applications
Avro is a data serialization system.
UIMA (Unstructured Information Management Architecture) is an architecture for the development, discovery, composition and deployment of analytics for unstructured data.
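
The four-node Oozie workflow described above (start, map-reduce action, kill, end) can be modelled as a tiny graph walk. This is a hedged sketch in plain Python, not Oozie's XML workflow format or any Oozie API; the node names ("mr-node", "fail"), the WORKFLOW dict layout and the run_action callback are assumptions made for illustration.

```python
# Hypothetical in-memory model of a four-node workflow.
# The action node has two outgoing transitions: "ok" and "error".
WORKFLOW = {
    "start": {"type": "start", "to": "mr-node"},
    "mr-node": {"type": "action", "ok": "end", "error": "fail"},
    "fail": {"type": "kill"},
    "end": {"type": "end"},
}


def run_workflow(workflow, run_action):
    """Walk the workflow from the start node.

    run_action(name) stands in for submitting a job and must return
    True on success or False on failure. Returns "SUCCEEDED" when the
    end node is reached and "KILLED" when the kill node is reached.
    """
    name = "start"
    while True:
        node = workflow[name]
        kind = node["type"]
        if kind == "start":
            name = node["to"]
        elif kind == "action":
            # Follow the "ok" transition on success, "error" otherwise.
            name = node["ok"] if run_action(name) else node["error"]
        elif kind == "kill":
            return "KILLED"
        elif kind == "end":
            return "SUCCEEDED"
        else:
            raise ValueError(f"unknown node type: {kind}")


if __name__ == "__main__":
    print(run_workflow(WORKFLOW, lambda name: True))   # job succeeds
    print(run_workflow(WORKFLOW, lambda name: False))  # job fails
```

The point of the sketch is only the shape of the workflow: one action in the middle, and control nodes deciding whether the run finishes cleanly or is killed.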
