It is the most important component of the Hadoop Ecosystem. HDFS is the primary storage system for Hadoop. Hadoop distributed file system (HDFS) is a java based file system that provides scalable, fault tolerance, reliable and cost-efficient data storage for Big data. HADOOP based applications make use of HDFS. HDFS is designed for storing very large data files, running on clusters of commodity hardware. It is fault tolerant, scalable, and extremely simple to expand.
There are two major components of Hadoop HDFS- NameNode and DataNode.
It is also known as Master node. NameNode does not store actual data or dataset. NameNode stores Metadata i.e. number of blocks, their location, on which Rack, which Datanode the data is stored and other details. It consists of files and directories.
It is also known as Slave. HDFS Datanode is responsible for storing actual data in HDFS. Datanode performs read and write operation as per the request of the clients. Replica block of Datanode consists of 2 files on the file system. The first file is for data and the second file is for recording the block’s metadata. HDFS Metadata includes checksums for data. At startup, each Datanode connects to its corresponding Namenode and does handshaking. Verification of namespace ID and software version of DataNode take place by handshaking. At the time of mismatch found, DataNode goes down automatically.
Read/write operations in HDFS operate at a block level. Data files in HDFS are broken into block-sized chunks, which are stored as independent units. Default block-size is 64 MB.
HDFS operates on a concept of data replication wherein multiple replicas of data blocks are created and are distributed on nodes throughout a cluster to enable high availability of data in the event of node failure.
Read Operation In HDFS
Data read request is served by HDFS, NameNode and DataNode. Let's call the reader as a 'client'. Below diagram depicts file read operation in Hadoop.
- Client initiates read request by calling 'open()' method of FileSystem object; it is an object of type DistributedFileSystem.
- This object connects to namenode using RPC and gets metadata information such as the locations of the blocks of the file. Please note that these addresses are of first few blocks of the file.
- In response to this metadata request, addresses of the DataNodes having copy of that block, is returned back.
Once addresses of DataNodes are received, an object of type FSDataInputStream is returned to the client. FSDataInputStream contains DFSInputStream which takes care of interactions with DataNode and NameNode. In step 4 shown in above diagram, client invokes 'read()' method which causes DFSInputStream to establish a connection with the first DataNode with the first block of file.
Data is read in the form of streams wherein client invokes 'read()' method repeatedly. This process of read() operation continues till it reaches the end of the block.
- Once end of block is reached, DFSInputStream closes the connection and moves on to locate the next DataNode for the next block
- Once client has done with the reading, it calls close() method.
Write Operation In HDFS
- The client initiates write operation by calling 'create()' method of Distributed File System object which creates a new file - Step no. 1 in the above diagram.
- DistributedFileSystem object connects to the NameNode using RPC call and initiates new file creation. However, this file creates operation does not associate any blocks with the file. It is the responsibility of NameNode to verify that the file (which is being created) does not exist already and the client has correct permissions to create a new file. If the file already exists or client does not have sufficient permission to create a new file, then IOException is thrown to the client. Otherwise, the operation succeeds and a new record for the file is created by the NameNode.
- Once the new record in NameNode is created, an object of type FSDataOutputStream is returned to the client. The client uses it to write data into the HDFS. Data write method is invoked (step 3 in the diagram).
- FSDataOutputStream contains DFSOutputStream object which looks after communication with DataNodes and NameNode. While the client continues writing data, DFSOutputStream continues creating packets with this data. These packets are enqueued into a queue which is called as DataQueue.
- There is one more component called DataStreamer which consumes this DataQueue. DataStreamer also asks NameNode for allocation of new blocks thereby picking desirable DataNodes to be used for replication.
- Now, the process of replication starts by creating a pipeline using DataNodes. In our case, we have chosen replication level of 3 and hence there are 3 DataNodes in the pipeline.
- The DataStreamer pours packets into the first DataNode in the pipeline.
- Every DataNode in a pipeline stores packet received by it and forwards the same to the second DataNode in the pipeline.
- Another queue, 'Ack Queue' is maintained by DFSOutputStream to store packets which are waiting for acknowledgment from DataNodes.
- Once acknowledgment for a packet in the queue is received from all DataNodes in the pipeline, it is removed from the 'Ack Queue'. In the event of any DataNode failure, packets from this queue are used to reinitiate the operation.
- After the client is done with the writing data, it calls close() method (Step 9 in the diagram) Call to close(), results into flushing remaining data packets to the pipeline followed by waiting for acknowledgment.
- Once a final acknowledgment is received, NameNode is contacted to tell it that the file write operation is complete.
Access HDFS using JAVA API
In this section, we try to understand Java interface used for accessing Hadoop's file system.
In order to interact with Hadoop's file system programmatically, Hadoop provides multiple JAVA classes. Package named org.apache.hadoop.fs contains classes useful in manipulation of a file in Hadoop's filesystem. These operations include, open, read, write, and close. Actually, file API for Hadoop is generic and can be extended to interact with other filesystems other than HDFS.
Reading a file from HDFS, programmatically
Object java.net.URL is used for reading the contents of a file. To begin with, we need to make Java recognize Hadoop's hdfs URL scheme. This is done by calling a setURLStreamHandlerFactory method on URL object and an instance of FsUrlStreamHandlerFactory is passed to it. This method needs to be executed only once per JVM, hence it is enclosed in a static block.