The read/write operational flow in HDFS

To get a better understanding of HDFS, we need to understand the flow of operations for the following two scenarios:

  • A file is written to HDFS
  • A file is read from HDFS

HDFS uses a write-once, read-many model: a file is written once and read many times. Data that has already been written cannot be altered in place. However, a file can be reopened so that data is appended to it. All files in HDFS are stored as data blocks.
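The two ideas in this paragraph, block storage and append-only modification, can be illustrated with a small sketch. It is not HDFS code: the `WriteOnceFile` class, the `split_into_blocks` helper, and the 4-byte block size are hypothetical names and values chosen for illustration, and real clusters use a much larger, configurable block size.

```python
from typing import List

# Toy block size for illustration only (an assumption, not an HDFS default).
BLOCK_SIZE = 4  # bytes


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Cut a byte string into consecutive fixed-size blocks; the last may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class WriteOnceFile:
    """Hypothetical model of a file that allows appends but no in-place edits."""

    def __init__(self, data: bytes = b"") -> None:
        self._data = bytes(data)

    def append(self, more: bytes) -> None:
        # Appending adds new bytes to the end of the file.
        self._data += more

    def overwrite(self, offset: int, new: bytes) -> None:
        # Existing data cannot be altered once written.
        raise PermissionError("data cannot be altered once written")

    def blocks(self, block_size: int = BLOCK_SIZE) -> List[bytes]:
        return split_into_blocks(self._data, block_size)
```

In this sketch, appending to a file whose last block is only partly filled tops up that block before starting a new one, while any attempt to overwrite raises an error.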

Writing files in HDFS

The following sequence of steps occurs when a client tries to write a file to HDFS:

  1. The client informs the namenode daemon that it wants to write a file. The namenode daemon checks to see whether the file already exists.
  2. If it exists, an appropriate message is sent back to the client. If it does not exist, the namenode daemon makes a metadata entry for the new file.
  3. The file to be written is split into data packets at the client end and a data queue is built. The packets in the queue are then streamed to the datanodes in the cluster.
  4. The namenode daemon supplies the list of datanodes, which it prepares based on the configured data replication factor. A pipeline is built to perform the writes to all the datanodes in that list.
  5. The first packet from the data queue is then transferred to the first datanode daemon in the pipeline. The packet is stored on that datanode daemon and is then forwarded to the next datanode daemon in the pipeline. This process continues until the packet is written to the last datanode daemon in the pipeline.
  6. The sequence is repeated for all the packets in the data queue. For every packet written on the datanode daemon, a corresponding acknowledgement is sent back to the client.
  7. If a packet fails to write to one of the datanodes, that datanode daemon is removed from the pipeline and the remaining packets are written to the healthy datanodes. The namenode daemon notices that the block is under-replicated and arranges for it to be replicated on another datanode daemon.
  8. After all the packets are written, the client performs a close action, indicating that the packets in the data queue have been completely transferred.
  9. The client informs the namenode daemon that the write operation is now complete.
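The namenode's part in the steps above (the existence check, the metadata entry, the datanode list, and the completion notice) can be sketched as a toy class. `ToyNameNode` and its methods are hypothetical names, not the Hadoop API. The node selection here simply takes the first N datanodes, whereas a real namenode uses a rack-aware placement policy.

```python
class ToyNameNode:
    """Hypothetical stand-in for the namenode's metadata handling."""

    def __init__(self, datanodes, replication=3):
        self.datanodes = list(datanodes)  # names of known datanodes
        self.replication = replication    # configured replication factor
        self.files = {}                   # path -> {"complete": bool}

    def create(self, path):
        # Steps 1 and 2: reject the request if the file already exists,
        # otherwise make a metadata entry for the new file.
        if path in self.files:
            raise FileExistsError(path)
        self.files[path] = {"complete": False}

    def choose_pipeline(self):
        # Step 4: one datanode per replica. Simplified assumption:
        # take the first N nodes instead of applying a placement policy.
        return self.datanodes[:self.replication]

    def complete(self, path):
        # Step 9: the client reports that the write has finished.
        self.files[path]["complete"] = True
```

With four datanodes and a replication factor of 3, `choose_pipeline()` returns three of them, and a second `create()` call for the same path raises `FileExistsError`.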

The following diagram shows the data block replication process across the datanodes during a write operation in HDFS:

Figure: Writing files in HDFS
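The packet flow through the pipeline, including the acknowledgements and the handling of a failed datanode, can be approximated with a short in-memory simulation. All names here (`write_through_pipeline`, `fail_on`, the 4-byte packet size) are assumptions made for illustration. The sketch does not model the network, and it leaves out the namenode's later re-replication of the under-replicated block.

```python
from collections import deque


def split_into_packets(data, packet_size):
    """Step 3: build the data queue of packets on the client side."""
    return deque(data[i:i + packet_size] for i in range(0, len(data), packet_size))


def write_through_pipeline(data, pipeline, packet_size=4, fail_on=None):
    """Stream packets through the datanodes in pipeline order.

    fail_on is an optional (datanode, packet_index) pair that simulates
    a failed write (hypothetical test hook).
    Returns (storage, acks, live_pipeline).
    """
    queue = split_into_packets(data, packet_size)
    storage = {dn: [] for dn in pipeline}  # packets held by each datanode
    live = list(pipeline)                  # datanodes still in the pipeline
    acks = []                              # packet indexes acknowledged to the client
    index = 0
    while queue:
        packet = queue.popleft()
        for dn in list(live):
            if fail_on == (dn, index):
                # Step 7: drop the failed datanode and keep writing
                # to the remaining healthy datanodes.
                live.remove(dn)
                continue
            storage[dn].append(packet)     # step 5: store, then pass along
        acks.append(index)                 # step 6: acknowledge the packet
        index += 1
    return storage, acks, live
```

For example, writing `b"abcdefghij"` through `["dn1", "dn2", "dn3"]` with `fail_on=("dn2", 1)` leaves `dn2` holding only the first packet, while `dn1` and `dn3` hold all three.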

Reading files in HDFS

The following steps occur when a client tries to read a file in HDFS:

  1. The client contacts the namenode daemon to get the location of the data blocks of the file it wants to read.
  2. The namenode daemon returns the list of addresses of the datanodes for the data blocks.
  3. For any read operation, HDFS tries to return the node with the data block that is closest to the client. Here, closest refers to network proximity between the datanode daemon and the client.
  4. Once the client has the list, it connects to the closest datanode daemon and starts reading the data block using a stream.
  5. After the block is read completely, the connection to the datanode is terminated. The datanode daemon that hosts the next block in the sequence is then identified, and that data block is streamed. This goes on until the last data block for that file is read.
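Step 3 relies on a notion of network proximity. One simple way to model it, shown here as an assumption and not as the exact Hadoop implementation, is to treat the cluster as a tree (data center, rack, node) and count the hops from each node up to their closest common ancestor. The location strings such as `/dc1/rack1/node1` and both function names are hypothetical.

```python
def network_distance(a, b):
    """Sum of hops from each location up to their closest common ancestor.

    Locations are hypothetical tree paths such as "/dc1/rack1/node1".
    """
    pa = a.strip("/").split("/")
    pb = b.strip("/").split("/")
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)


def closest_replica(client, replicas):
    """Pick the replica location with the smallest distance to the client."""
    return min(replicas, key=lambda r: network_distance(client, r))
```

Under this model, a replica on the client's own node is at distance 0, one on the same rack is at distance 2, and replicas on other racks or in other data centers are progressively farther away.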

The following diagram shows the read operation of a file in HDFS:

Figure: Reading files in HDFS
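The read sequence as a whole can be sketched as a loop over the blocks of a file. This is an illustrative model and not the HDFS client: `read_file` and its three dictionary arguments are hypothetical, and the distance of each datanode from the client is supplied as a plain lookup table.

```python
def read_file(path, block_locations, datanode_storage, distance_to_client):
    """Read a file block by block, each from the closest datanode.

    block_locations:    path -> ordered list of (block_id, [datanodes]),
                        standing in for the namenode's answer (steps 1 and 2).
    datanode_storage:   datanode -> {block_id: bytes}.
    distance_to_client: datanode -> number, where smaller means closer.
    Returns (content, datanodes_read_from).
    """
    content = b""
    trace = []
    for block_id, holders in block_locations[path]:
        # Steps 3 and 4: choose the closest datanode that holds this block.
        nearest = min(holders, key=lambda dn: distance_to_client[dn])
        # Step 5: read the whole block, then move on to the next one.
        content += datanode_storage[nearest][block_id]
        trace.append(nearest)
    return content, trace
```

The returned trace shows that consecutive blocks of one file may be read from different datanodes, depending on where the closest replica of each block is located.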