Friday, November 23, 2012

Hadoop

Hadoop is an open-source project overseen by the Apache Software Foundation.
It was originally based on papers published by Google in 2003 and 2004.

Hadoop consists of two core components (an HDFS usage sketch follows these lists):
  • The Hadoop Distributed File System (HDFS)
  • MapReduce
Hadoop Ecosystem
  • Pig, Hive, HBase, Flume, Oozie, Sqoop, etc.
Distributed systems evolved to allow developers to use multiple machines for a single job, for example:
  • MPI
  • PVM
  • Condor
Hadoop supports:
  • Partial Failure
  • Data Recoverability
  • Component Recovery
  • Consistency
  • Scalability
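
As a quick illustration of the HDFS component listed above, here is a minimal sketch (assuming a reachable HDFS cluster and a made-up path /user/demo/hello.txt) that writes a small file through Hadoop's Java FileSystem API and reads it back; HDFS transparently splits files into blocks and replicates them across DataNodes.

// Minimal HDFS sketch: the cluster address comes from the default
// configuration (core-site.xml on the classpath); the path is hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();       // picks up fs.defaultFS
        FileSystem fs = FileSystem.get(conf);           // handle to the cluster's HDFS
        Path path = new Path("/user/demo/hello.txt");   // hypothetical location

        // Write a small file; its blocks are replicated across DataNodes.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("hello, HDFS");
        }

        // Read the same file back.
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}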

Hadoop’s History


Hadoop was created by Doug Cutting and Michael J. Cafarella. Doug, who was working at Yahoo at the time, named it after his son's toy elephant. It was originally developed to support distribution for the Nutch search engine project.

Hadoop is based on work done by Google in the late 1990s/early 2000s
– Specifically, on papers describing the Google File System (GFS) published in 2003, and MapReduce published in 2004.

Core Hadoop Concepts

 
Applications are written in high-level code
  • Developers need not worry about network programming, temporal dependencies or low-level infrastructure.

Nodes talk to each other as little as possible
  • Developers should not write code which communicates between nodes.
  • ‘Shared nothing’ architecture.

Data is spread among machines in advance
  • Computation happens where the data is stored, wherever possible.
  • Data is replicated multiple times on the system for increased availability and reliability.
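
To make these points concrete, here is a minimal word-count job sketched with the standard Hadoop MapReduce Java API (class names and paths are illustrative, not from the original notes). The developer writes only a map and a reduce function in high-level Java; the framework schedules map tasks on the nodes that hold the input blocks, shuffles the intermediate (word, count) pairs, and handles all communication between nodes.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: runs where the input block is stored and emits (word, 1) pairs.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: the framework groups all counts for a word; we just sum them.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /user/demo/input
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /user/demo/output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Such a job would typically be packaged into a jar and launched with something like hadoop jar wordcount.jar WordCount /user/demo/input /user/demo/output, with both paths living in HDFS.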

Thursday, November 22, 2012

Big Data

Big Data refers to the growing challenge that organizations face as they deal with large, fast-growing sources of data that also present a complex range of
analysis and use problems. These can include:
  • Having a computing infrastructure that can ingest, validate, and analyze high volumes (size and/or rate) of data
  • Assessing mixed data (structured and unstructured) from multiple sources
  • Dealing with unpredictable content with no apparent schema or structure
  • Enabling real-time or near-real-time collection, analysis, and answers
Big Data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of
data, by enabling high-velocity capture, discovery, and/or analysis.
The data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and
cell phone GPS signals to name a few. This data is big data.

Three V's of Big Data
  • Volume describes the amount of data generated by organizations or individuals.
  • Variety describes structured and unstructured data, such as text, sensor data, audio, video, click streams, log files and more.
  • Velocity describes the frequency at which data is generated, captured and shared.

Big Data and Hadoop

Hadoop is the Apache open-source software framework for working with Big Data. It was derived from Google technology and put into practice by Yahoo and others. But Big Data is too varied and complex for a one-size-fits-all solution. While Hadoop has surely captured the greatest name recognition, it is just one of three classes of technologies well suited to storing and managing Big Data. Hadoop is a software framework, which means it includes a number of components that were specifically designed to solve large-scale distributed data storage, analysis and retrieval tasks.

The Need for Big Data

  • Google processes 20 PB a day.
  • Wayback Machine has 3 PB + 100 TB/month.
  • Facebook has 2.5 PB of user data + 15 TB/day.
  • eBay has 6.5 PB of user data + 50 TB/day.