HDFS Write

The client connects to the NameNode and asks for permission to write to HDFS. The NameNode looks up its metadata and plans the blocks, the DataNodes to be used to store the blocks, and the replication strategy to be used. The NameNode does not handle any data; it only tells the client where to write. Once the first DataNode receives the block, based on the replication strategy, the NameNode tells the first DataNode where else to replicate. So, the DataNode that received the block from the client sends it over to the second DataNode (where a copy of the block is to be written), and the second DataNode then sends it to a third DataNode (if the replication factor is 3).
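The forwarding chain described above can be sketched as a small simulation. This is illustrative only, assuming a replication factor of 3; the function and node names (`write_block`, `dn1`, and so on) are invented for the example and are not the real HDFS client API:

```python
def write_block(block, pipeline, stored):
    """Simulate the HDFS write pipeline: each DataNode in the pipeline
    persists its copy of the block, then forwards the block downstream."""
    if not pipeline:
        return                                  # end of the replication chain
    head, rest = pipeline[0], pipeline[1:]
    stored.setdefault(head, set()).add(block)   # this DataNode stores its copy
    write_block(block, rest, stored)            # forward to the next DataNode

# The NameNode (simulated here by a hard-coded list) chose this
# three-DataNode pipeline for a replication factor of 3.
cluster = {}
write_block("blk_001", ["dn1", "dn2", "dn3"], cluster)
# All three DataNodes now hold a replica of blk_001.
```

Note that the client only ever talks to the first DataNode; the remaining copies are produced DataNode-to-DataNode, which keeps the client's outbound bandwidth at one copy of the data.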

The following is the flow of a write request from a client. First, the client gets the block locations from the NameNode and then writes to the first DataNode. The DataNode that receives the block replicates it to the DataNodes that should hold the replica copies. This happens for every block the client writes. If a DataNode fails in the middle, the block is re-replicated to another DataNode as determined by the NameNode.
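The failure-handling step can also be sketched as a simulation of the NameNode's bookkeeping. This is a simplified illustration, assuming the NameNode tracks which DataNodes hold each block; `re_replicate` and the node names are assumptions made for the example:

```python
def re_replicate(block_map, live_nodes, replication=3):
    """NameNode-style check: for each block, drop replicas on dead
    DataNodes and, if the block is under-replicated, copy it to
    additional live DataNodes until the replication factor is met."""
    for block, holders in block_map.items():
        holders &= live_nodes                      # discard dead replicas
        candidates = [n for n in live_nodes if n not in holders]
        while len(holders) < replication and candidates:
            holders.add(candidates.pop())          # place a new replica
        block_map[block] = holders
    return block_map

# blk_001 was on dn1, dn2, dn3; dn2 has since failed.
blocks = {"blk_001": {"dn1", "dn2", "dn3"}}
blocks = re_replicate(blocks, live_nodes={"dn1", "dn3", "dn4", "dn5"})
# The block is copied to one of the remaining live DataNodes,
# restoring three replicas.
```

The real NameNode does this continuously from DataNode heartbeats and block reports, but the invariant is the same: every block should have `replication` live copies.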

So far, we have seen how HDFS provides a distributed filesystem using blocks, the NameNode, and DataNodes. Once data is stored at a PB scale, it is also important to actually process the data to serve the various use cases of the business.

The MapReduce framework was created within Hadoop to perform distributed computation. We will look at this further in the next section.
