Wednesday, February 20, 2013

Hadoop Ecosystem

Hadoop Ecosystem Map

 

1. Large amounts of data accumulated on the web.
2. Nutch was built to crawl this web data.
3. The large volume of data had to be saved, so HDFS was introduced.
4. How can this data be used? Reporting.
5. The MapReduce framework was built for coding and running analytics.
6. Unstructured data (web logs, click streams, Apache logs, server logs) is collected with FUSE, WebDAV, Chukwa, Flume and Scribe.
7. Sqoop and HIHO load RDBMS data into HDFS.
8. High-level interfaces were required over low-level MapReduce programming: Hive, Pig, Jaql.
9. BI tools provide advanced UI reporting.
10. Workflow tools run over MapReduce processes and high-level languages: Oozie.
11. Tools to monitor and manage Hadoop, run jobs and Hive queries, and view HDFS at a high level: Hue, Karmasphere, the Eclipse plugin, Cacti, Ganglia.
12. Support frameworks: Avro (serialization), ZooKeeper (coordination).
13. More high-level interfaces and uses: Mahout, Elastic MapReduce.
14. OLTP is also possible: HBase.
15. Text search: Lucene, a text search engine library written in Java.
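To make the MapReduce step in the list above concrete, here is a minimal sketch of the map, shuffle and reduce phases in plain Python. It does not use Hadoop at all: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample log lines are made up for illustration, and a real job would be written against the Hadoop API or in a higher-level language such as Hive or Pig.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every whitespace-separated token."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, the job the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one key."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in-process over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical log-like input, only for demonstration.
    sample = ["GET /index.html", "GET /about.html", "POST /index.html"]
    print(word_count(sample))
```

In a real cluster the mappers and reducers run on different machines and the shuffle moves data over the network; the sketch only shows the shape of the computation.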

  • HBase is the Hadoop database for random read/write access.
  • Hive provides data warehousing tools to extract, transform and load data, and query this data stored in Hadoop files.
  • Pig is a platform for analyzing large data sets. It is a high level language for expressing data analysis.
  • Oozie - Workflow for interdependent Hadoop jobs. A simple workflow can consist of four nodes: a start control node, a map-reduce action node, a kill control node, and an end control node.
  • Flume - Highly reliable, configurable streaming data collection.
  • Sqoop - Integrates databases and data warehouses with Hadoop.
  • Hue - User interface framework and SDK for visual Hadoop applications.
  • Eclipse is a popular IDE donated by IBM to the open source community.
  • Lucene is a text search engine library written in Java.
  • Jaql (pronounced "jackal") is a query language for JavaScript Object Notation (JSON).
  • ZooKeeper - Coordination service for distributed applications
  • Avro is a data serialization system.
  • UIMA (Unstructured Information Management Architecture) is an architecture for the development, discovery, composition and deployment of analytics for unstructured data.
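The four-node Oozie workflow mentioned in the list above (start, map-reduce action, kill, end) can be pictured as a small graph of transitions. The sketch below models it in plain Python; it is not Oozie's API, and real Oozie workflows are defined in XML. The node names, the WORKFLOW dictionary and the run_workflow function are all hypothetical.

```python
# Hypothetical in-memory model of a four-node workflow; not Oozie's real format.
WORKFLOW = {
    "start": {"type": "start", "to": "wordcount"},
    "wordcount": {"type": "action", "ok": "end", "error": "fail"},
    "fail": {"type": "kill", "message": "map-reduce job failed"},
    "end": {"type": "end"},
}


def run_workflow(workflow, run_action):
    """Follow transitions from 'start' until an end or kill node is reached.

    run_action is a callable that takes an action node name and returns
    True on success or False on failure. Returns (final_node, path).
    """
    name = "start"
    path = [name]
    while True:
        node = workflow[name]
        if node["type"] in ("end", "kill"):
            return name, path
        if node["type"] == "start":
            name = node["to"]
        else:
            # Action node: pick the 'ok' or 'error' transition.
            name = node["ok"] if run_action(name) else node["error"]
        path.append(name)


if __name__ == "__main__":
    print(run_workflow(WORKFLOW, lambda action: True))
    print(run_workflow(WORKFLOW, lambda action: False))
```

The point of the kill node is visible here: a failed action does not simply stop, it routes to a named node that ends the workflow with an error.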
