Thursday, February 14, 2013

HDFS Architecture

Namenode (master node) - It keeps the address of each file, that is, the metadata recording where the file's blocks are stored.
Datanode (slave node) - It keeps the actual data.


HDFS has a master/slave architecture. An HDFS cluster consists of a single Namenode, a secondary Namenode, and a number of Datanodes. The Namenode is the master node: it manages the file system namespace and regulates access to files by clients. Without the Namenode, there is no way to access the files in the HDFS cluster. The Datanodes are the slave nodes: they keep the actual data. HDFS allows user data to be stored in files. Internally, a file is split into one or more blocks, and these blocks are stored in a set of Datanodes. The Datanodes are responsible for serving read and write requests from the file system's clients. They also perform block creation, deletion, and replication upon instruction from the Namenode.
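To make the split between metadata and data concrete, here is a small toy sketch in Python. It is not Hadoop code: ToyNamenode, ToyDatanode, write_file and read_file are hypothetical names made up for this illustration, and the round-robin block placement is an assumption chosen only to keep the example short. The point is that the Namenode object only records which block lives where, while the Datanode objects hold the bytes.

```python
# Toy model of the master/slave split. All names are hypothetical, not Hadoop APIs.

class ToyDatanode:
    """Slave node: stores the actual block contents."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes

    def write_block(self, block_id, data):
        self.blocks[block_id] = data

    def read_block(self, block_id):
        return self.blocks[block_id]


class ToyNamenode:
    """Master node: stores only the namespace (path -> block locations)."""

    def __init__(self, datanode_names):
        self.datanode_names = list(datanode_names)
        self.namespace = {}  # path -> list of (block_id, datanode name)
        self.next_block_id = 0

    def create(self, path):
        self.namespace[path] = []

    def allocate_block(self, path):
        # Assumption: simple round-robin placement, just for illustration.
        block_id = self.next_block_id
        self.next_block_id += 1
        node = self.datanode_names[block_id % len(self.datanode_names)]
        self.namespace[path].append((block_id, node))
        return block_id, node

    def block_locations(self, path):
        return list(self.namespace[path])


def write_file(namenode, datanodes, path, data, block_size):
    """Client side: ask the Namenode where to put each block, then send
    the bytes to that Datanode."""
    namenode.create(path)
    for offset in range(0, len(data), block_size):
        block_id, node = namenode.allocate_block(path)
        datanodes[node].write_block(block_id, data[offset:offset + block_size])


def read_file(namenode, datanodes, path):
    """Client side: look up block locations, then fetch the blocks in order."""
    return b"".join(
        datanodes[node].read_block(block_id)
        for block_id, node in namenode.block_locations(path)
    )
```

For example, writing a 10-byte file with a block size of 4 across three toy Datanodes produces two full blocks and one shorter last block, and read_file puts them back together by asking the toy Namenode for the locations first.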

Data Replication
HDFS stores each file as a sequence of blocks. All blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time. The Namenode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the Datanodes in the cluster. Receipt of a Heartbeat implies that the Datanode is functioning properly. A Blockreport contains a list of all blocks on a Datanode.
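The way Heartbeats and Blockreports feed replication decisions can be sketched roughly as below. This is only an illustration: ReplicationMonitor, its methods and the heartbeat_timeout parameter are hypothetical names, and treating a Datanode as dead after a fixed timeout is a simplifying assumption, not a description of Hadoop's actual logic.

```python
# Toy sketch of replication bookkeeping. All names are hypothetical, not Hadoop APIs.

class ReplicationMonitor:
    def __init__(self, heartbeat_timeout):
        self.heartbeat_timeout = heartbeat_timeout  # assumed: same unit as "now"
        self.last_heartbeat = {}   # datanode name -> time of last Heartbeat
        self.block_reports = {}    # datanode name -> set of block ids it holds
        self.target = {}           # block id -> wanted replication factor

    def set_replication(self, block_id, factor):
        self.target[block_id] = factor

    def heartbeat(self, datanode, now):
        self.last_heartbeat[datanode] = now

    def block_report(self, datanode, block_ids):
        # A Blockreport lists all blocks on the Datanode, so it replaces
        # whatever was recorded before.
        self.block_reports[datanode] = set(block_ids)

    def live_datanodes(self, now):
        return {
            dn for dn, seen in self.last_heartbeat.items()
            if now - seen <= self.heartbeat_timeout
        }

    def under_replicated(self, now):
        """Return {block id: number of missing replicas}, counting only
        replicas on Datanodes that are still sending Heartbeats."""
        live = self.live_datanodes(now)
        missing = {}
        for block_id, wanted in self.target.items():
            have = sum(
                1 for dn in live
                if block_id in self.block_reports.get(dn, set())
            )
            if have < wanted:
                missing[block_id] = wanted - have
        return missing
```

If a Datanode stops sending Heartbeats, its replicas no longer count, so any block that drops below its replication factor shows up in under_replicated and could then be copied to another Datanode.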

Secondary NameNode
The secondary NameNode periodically merges the file system image with the edits log, which keeps the size of the edits log within a limit. It usually runs on a different machine from the primary NameNode, since its memory requirements are of the same order as the primary NameNode's. The secondary NameNode is started by bin/start-dfs.sh on the nodes specified in the conf/masters file. It stores the latest checkpoint in a directory that is structured the same way as the primary NameNode's directory, so that the checkpointed image is always ready to be read by the primary NameNode if necessary.
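The merge step can be pictured with a toy sketch like the following. It does not use the real on-disk formats of the file system image or the edits log: the image is modeled as a plain dictionary, each edit as a tuple, and the operation names ("create", "delete", "set_replication") as well as the functions apply_edit and checkpoint are assumptions invented for this example.

```python
import copy

# Toy sketch of a checkpoint. The data layout and names are hypothetical,
# not the real fsimage / edits formats.

def apply_edit(image, edit):
    """Apply one logged operation to the in-memory image."""
    op, path = edit[0], edit[1]
    if op == "create":
        image[path] = {"replication": edit[2]}
    elif op == "delete":
        image.pop(path, None)
    elif op == "set_replication":
        image[path]["replication"] = edit[2]
    else:
        raise ValueError("unknown edit operation: %r" % (op,))


def checkpoint(fsimage, edits):
    """Replay the edits on a copy of the image and return the merged image
    together with a fresh, empty edits log."""
    new_image = copy.deepcopy(fsimage)
    for edit in edits:
        apply_edit(new_image, edit)
    return new_image, []
```

After a checkpoint the merged image already contains every logged change, so the edits log can start again from empty, which is how its size stays bounded.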
