Thursday, February 14, 2013

What is HDFS?

HDFS, the Hadoop Distributed File System, is responsible for storing data on the cluster.
HDFS is a filesystem written in Java, based on Google’s GFS, and sits on top of a native filesystem such as ext3, ext4, or XFS.
It is designed to store very large files and to run on clusters of commodity hardware.
Data is split into blocks and distributed across multiple nodes in the cluster (a short sketch of setting these values follows the list):
  • Each block is typically 64MB or 128MB in size
  • Each block is replicated multiple times
  • The default is to replicate each block three times
  • Replicas are stored on different nodes
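
The block size and replication factor are cluster-wide defaults, but they can also be set per file from the Java API. Here is a minimal sketch, assuming a reachable cluster; the path is hypothetical, and the create() overload shown takes the replication factor and block size directly:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSettings {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/data.bin");   // hypothetical path

        // create(path, overwrite, bufferSize, replication, blockSize):
        // three replicas per block, 128MB blocks, for this file only.
        FSDataOutputStream out = fs.create(
                file,
                true,                   // overwrite if the file exists
                4096,                   // I/O buffer size in bytes
                (short) 3,              // replication factor
                128L * 1024 * 1024);    // block size: 128MB
        out.writeBytes("example data");
        out.close();
    }
}
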
Files in HDFS
Files in HDFS are ‘write once’; no random writes to files are allowed.
Files in HDFS are broken into block-sized chunks, which are stored as independent units.
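
A minimal sketch of the write-once model through the Java API (the path is hypothetical): bytes are streamed sequentially, and the file becomes immutable once the stream is closed.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnce {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataOutputStream out = fs.create(new Path("/user/demo/log.txt")); // hypothetical path
        out.writeBytes("record 1\n");   // writes always go at the end of the stream
        out.writeBytes("record 2\n");
        out.close();                    // once closed, the contents cannot be changed
        // FSDataOutputStream offers no way to seek backwards and overwrite
        // earlier bytes, which is what ‘no random writes’ means in practice.
    }
}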

When data is loaded into the system, it is split into ‘blocks’, typically 64MB or 128MB.
A good split size tends to be the size of an HDFS block, 64MB by default, although this can be changed for the cluster.
As in a POSIX filesystem, there are three types of permission: the read permission (r), the write permission (w), and the execute permission (x).
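
Permissions can be inspected and changed from the same Java API. A small sketch, again with a hypothetical path; 0644 gives the owner read/write and everyone else read-only:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class Permissions {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/log.txt");             // hypothetical path
        fs.setPermission(file, new FsPermission((short) 0644)); // rw-r--r--
        System.out.println(fs.getFileStatus(file).getPermission());
    }
}

The shell equivalent is hadoop fs -chmod 644 followed by the path.
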
NameNode
The NameNode stores the metadata for the filesystem: the directory tree and the mapping of each file to its blocks. Without the metadata on the NameNode, there is no way to access the files in the HDFS cluster.
When a client application wants to read a file:
  • It communicates with the NameNode to determine which blocks make up the file, and which DataNodes those blocks reside on
  • It then communicates directly with the DataNodes to read the data (sketched below).
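
A sketch of that two-step read path in the Java API. The path is hypothetical; getFileBlockLocations() is the call that asks the NameNode for the block-to-DataNode mapping, and open()/read() then pull the bytes from the DataNodes:

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadPath {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/log.txt");   // hypothetical path
        FileStatus status = fs.getFileStatus(file);

        // Step 1: the NameNode answers with the blocks that make up the
        // file and the DataNodes that hold each replica.
        for (BlockLocation b : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset=" + b.getOffset()
                    + " length=" + b.getLength()
                    + " hosts=" + Arrays.toString(b.getHosts()));
        }

        // Step 2: the client reads the block data directly from the
        // DataNodes; the bytes never pass through the NameNode.
        FSDataInputStream in = fs.open(file);
        byte[] buf = new byte[4096];
        int n = in.read(buf);
        System.out.println("read " + n + " bytes");
        in.close();
    }
}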
