Big data: 2012

Friday, November 23, 2012

Hadoop

Hadoop is an open-source project overseen by the Apache Software Foundation.

It was originally based on papers published by Google in 2003 and 2004.

Hadoop consists of two core components:

The Hadoop Distributed File System (HDFS)
MapReduce
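
To make the MapReduce half of that pair concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps, using word counting as the example. It is plain Python, not the Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and a real job would run the map and reduce steps on many machines over data stored in HDFS.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group emitted values by key, the step that sits between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)

def word_count(lines):
    # Map every line, then group by word, then reduce each group.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data is big", "hadoop stores big data"])
print(counts)
```

Each call to `map_phase` looks at one line only, and each call to `reduce_phase` looks at one word only, which is what lets a framework spread the work across machines.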

Hadoop Ecosystem

Pig, Hive, HBase, Flume, Oozie, Sqoop, etc.

Distributed systems evolved to allow developers to use multiple machines for a single job. Examples include:

MPI
PVM
Condor

Hadoop Supports

Partial Failure
Data Recoverability
Component Recovery
Consistency
Scalability

Hadoop was created by Doug Cutting and Michael J. Cafarella. Doug, who was working at Yahoo at the time, named it after his son's toy elephant. It was originally developed to support distribution for the Nutch search engine project.

Hadoop is based on work done by Google in the late 1990s and early 2000s, specifically on papers describing the Google File System (GFS), published in 2003, and MapReduce, published in 2004.

Core Hadoop Concepts

Applications are written in high-level code

Developers need not worry about network programming, temporal dependencies or low-level infrastructure.

Nodes talk to each other as little as possible

Developers should not write code which communicates between nodes.
‘Shared nothing’ architecture.

Data is spread among machines in advance

Computation happens where the data is stored, wherever possible.
Data is replicated multiple times on the system for increased availability and reliability.
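
The last two ideas, replication and moving computation to the data, can be sketched in a few lines. This is a toy model under stated assumptions: blocks are placed round-robin (real HDFS uses its own placement policy), the replication factor of 3 is passed in as a parameter, and the names `place_blocks` and `schedule` are hypothetical, not part of Hadoop.

```python
def place_blocks(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin (an assumed policy)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

def schedule(block, placement, free_nodes):
    """Prefer a free node that already holds the block; otherwise use any free node."""
    for node in placement[block]:
        if node in free_nodes:
            return node, True   # data-local: no network copy needed
    return sorted(free_nodes)[0], False  # remote: the block must be read over the network

nodes = ["n1", "n2", "n3", "n4"]
placement = place_blocks(["b0", "b1"], nodes)
print(placement)
print(schedule("b0", placement, {"n3", "n4"}))  # a replica holder is free
print(schedule("b0", placement, {"n4"}))        # only a non-holder is free
```

Because every block lives on several nodes, losing one node does not lose the data, and the scheduler has more than one place where the task can run next to its input.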

Thursday, November 22, 2012

Big Data

Big Data is about the growing challenge that organizations face as they deal with large and fast-growing sources of data or information that also present a complex range of analysis and use problems. These can include:

Having a computing infrastructure that can ingest, validate, and analyze high volumes (size and/or rate) of data
Assessing mixed data (structured and unstructured) from multiple sources
Dealing with unpredictable content with no apparent schema or structure
Enabling real-time or near-real-time collection, analysis, and answers

Big Data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery, and/or analysis.
The data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. This data is big data.

Three V's of Big Data

Volume describes the amount of data generated by organizations or individuals.
Variety describes structured and unstructured data, such as text, sensor data, audio, video, click streams, log files and more.
Velocity describes the frequency at which data is generated, captured and shared.

Big Data-Hadoop

Hadoop is the Apache open-source software framework for working with Big Data. It was derived from Google technology and put into practice by Yahoo and others. But Big Data is too varied and complex for a one-size-fits-all solution. While Hadoop has surely captured the greatest name recognition, it is just one of three classes of technologies well suited to storing and managing Big Data. Hadoop is a software framework, which means it includes a number of components that were specifically designed to solve large-scale distributed data storage, analysis and retrieval tasks.

The Need for Big Data

Google processes 20 PB a day.
Wayback Machine has 3 PB + 100 TB/month.
Facebook has 2.5 PB of user data + 15 TB/day.
eBay has 6.5 PB of user data + 50 TB/day.
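
A rough back-of-the-envelope calculation shows why volumes like these call for many machines. The sketch below assumes a sustained read rate of 100 MB/s per machine and a cluster of 10,000 machines; both numbers are illustrative assumptions, not figures from any of the companies above, and `hours_to_scan` is a hypothetical helper.

```python
PB = 10**15  # bytes in a petabyte (decimal)
MB = 10**6   # bytes in a megabyte (decimal)

def hours_to_scan(total_bytes, machines, mb_per_sec_per_machine=100):
    """Hours needed to read `total_bytes` once, with perfect parallelism (an idealization)."""
    bytes_per_sec = machines * mb_per_sec_per_machine * MB
    return total_bytes / bytes_per_sec / 3600

single = hours_to_scan(20 * PB, machines=1)
cluster = hours_to_scan(20 * PB, machines=10_000)
print(f"one machine:     {single:,.0f} hours")
print(f"10,000 machines: {cluster:.1f} hours")
```

Under these assumptions a single machine would need roughly 55,556 hours (more than six years) just to read 20 PB once, while 10,000 machines reading in parallel would need about 5.6 hours, which is the only way a daily workload of that size can finish within the day.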
