HDFS I/O

An HDFS read operation from a client involves the following:

The client asks the NameNode where the data blocks of a given file are stored.
The NameNode responds with the block IDs and the locations of the hosts (DataNodes) where each block can be found.
The client then contacts those DataNodes with the respective block IDs and fetches the data, preserving the order of the blocks within the file.
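
The read steps above can be sketched as a small in-memory model. This is not the Hadoop client API: the DataNode and NameNode classes, the read_file function, and the block IDs below are hypothetical names chosen for illustration, and the fallback to another replica is a simplifying assumption about how a client might handle an unavailable host.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    """Toy DataNode (hypothetical): holds block bytes keyed by block ID."""
    name: str
    blocks: dict = field(default_factory=dict)

    def read_block(self, block_id):
        return self.blocks[block_id]


@dataclass
class NameNode:
    """Toy NameNode (hypothetical): holds metadata only, never file contents."""
    # filename -> ordered list of (block_id, [names of DataNodes holding it])
    block_map: dict = field(default_factory=dict)

    def get_block_locations(self, filename):
        return self.block_map[filename]


def read_file(filename, namenode, datanodes):
    """Fetch blocks in file order, trying each listed replica in turn."""
    chunks = []
    for block_id, hosts in namenode.get_block_locations(filename):
        for host in hosts:
            node = datanodes.get(host)
            if node is not None and block_id in node.blocks:
                chunks.append(node.read_block(block_id))
                break
        else:
            # No reachable DataNode holds this block.
            raise IOError(f"no live replica for block {block_id}")
    return b"".join(chunks)


# Two DataNodes; blk_1 is replicated on both, blk_2 lives only on dn2.
dn1 = DataNode("dn1", {"blk_1": b"hello "})
dn2 = DataNode("dn2", {"blk_1": b"hello ", "blk_2": b"world"})
nn = NameNode({"/demo.txt": [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn2"])]})
cluster = {"dn1": dn1, "dn2": dn2}

data = read_file("/demo.txt", nn, cluster)
```

In this model the NameNode only answers the location query; the bytes themselves come from the DataNodes and are reassembled by the client in block order.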

An HDFS write operation from a client involves the following:

The client contacts the NameNode, which verifies the necessary permissions and updates the namespace with the filename.
If the file already exists, the NameNode throws an error; otherwise, it returns an FSDataOutputStream to the client, which points to the data queue.
The data queue negotiates with the NameNode to allocate new blocks on suitable DataNodes.
The data is then copied to the first of those DataNodes and, as per the replication strategy, further copied from that DataNode to the rest of the DataNodes.
Note that the data never passes through the NameNode, as that would cause a performance bottleneck.
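
The write steps above can likewise be sketched as a toy in-memory model. Everything here is an illustrative assumption rather than Hadoop code: the class and function names are hypothetical, the round-robin placement stands in for the real block placement policy, and the 4-byte block size is chosen only to keep the example small.

```python
class DataNode:
    """Toy DataNode (hypothetical): stores blocks and forwards them downstream."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write_block(self, block_id, data, pipeline):
        # Store the block locally, then forward it to the next node in the pipeline.
        self.blocks[block_id] = data
        if pipeline:
            pipeline[0].write_block(block_id, data, pipeline[1:])


class NameNode:
    """Toy NameNode (hypothetical): tracks the namespace and block locations only."""

    def __init__(self, datanodes, replication=3):
        self.datanodes = datanodes
        self.replication = replication
        self.namespace = {}  # filename -> list of (block_id, [DataNode names])
        self.next_id = 0

    def create(self, filename):
        # Reject the request if the file already exists.
        if filename in self.namespace:
            raise FileExistsError(filename)
        self.namespace[filename] = []

    def allocate_block(self, filename):
        # Assumed placement: rotate through the DataNodes (not the real policy).
        block_id = f"blk_{self.next_id}"
        start = self.next_id % len(self.datanodes)
        self.next_id += 1
        count = min(self.replication, len(self.datanodes))
        targets = [self.datanodes[(start + i) % len(self.datanodes)]
                   for i in range(count)]
        self.namespace[filename].append((block_id, [d.name for d in targets]))
        return block_id, targets


def write_file(filename, data, namenode, block_size=4):
    """Split data into blocks and push each one through its DataNode pipeline."""
    namenode.create(filename)
    for offset in range(0, len(data), block_size):
        block_id, pipeline = namenode.allocate_block(filename)
        # The client sends the block to the first DataNode only;
        # replication happens from DataNode to DataNode.
        pipeline[0].write_block(block_id, data[offset:offset + block_size],
                                pipeline[1:])


nodes = [DataNode(f"dn{i}") for i in range(4)]
nn = NameNode(nodes, replication=3)
write_file("/demo.txt", b"hello world", nn)
```

The toy NameNode records only filenames, block IDs, and locations, while the block bytes travel from the client to the first DataNode and then along the pipeline, which mirrors the point that file data bypasses the NameNode.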
