Introducing Hadoop | 51
HDFS to manage all the machines and storage spaces in the cluster. It does this by setting up a
master-slave configuration, which is discussed below.
Within the cluster, one of the machines is designated as the master node. The master node
is responsible for coordinating the storage across all the other nodes in the cluster, which are the
slave nodes. On this master node, HDFS runs a process that receives all requests made to the
cluster and forwards them to the slave nodes that hold the data. Let us examine this process
in detail. The master node is called the NameNode, and all other machines in the cluster are
designated as DataNodes. Thus, there is one NameNode per cluster, while the number of
DataNodes depends on the number of machines in that cluster.
An easy way to understand the roles that the NameNode and DataNodes play within Hadoop is
to take the example of this book. If the data stored in the distributed file system is the text of
the book, then the NameNode is the table of contents, while the chapters, where the actual
content of the book is found, correspond to the DataNodes. The table of contents tells the
reader where each piece of content is located.
The NameNode serves two primary functions. First, any request from a client is passed to the
NameNode, because the NameNode knows where to find the required data: it holds the directory
structure and knows which piece of a file is located on which DataNode. Second, it also holds
the metadata for each file, that is, information about the file other than its actual content.
Examples of metadata include file permissions, how the file is split up and where the replicas
of the file are stored. The function of a DataNode is simply to physically store the actual
file contents.
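The division of labour described above can be sketched as a toy model. The class, method names and hostnames below are illustrative assumptions, not real HDFS internals: the point is only that the NameNode maps file paths to blocks and blocks to the DataNodes holding them, while the DataNodes hold the bytes themselves.

```python
# Toy model of the NameNode's role: it maps file paths to ordered block
# IDs, and each block ID to the DataNodes holding a copy. All names and
# structures here are illustrative, not actual Hadoop internals.

class NameNode:
    def __init__(self):
        self.directory = {}        # file path -> ordered list of block IDs
        self.block_locations = {}  # block ID -> list of DataNode hostnames

    def add_file(self, path, block_ids):
        self.directory[path] = list(block_ids)

    def register_block(self, block_id, datanodes):
        self.block_locations[block_id] = list(datanodes)

    def locate(self, path):
        """Answer a client's read request: which DataNodes hold each block?"""
        return [(b, self.block_locations[b]) for b in self.directory[path]]

nn = NameNode()
nn.add_file("/logs/web.log", ["blk_1", "blk_2"])
nn.register_block("blk_1", ["datanode-a", "datanode-b"])
nn.register_block("blk_2", ["datanode-b", "datanode-c"])

for block_id, nodes in nn.locate("/logs/web.log"):
    print(block_id, nodes)
```

Note that the client never reads file contents through the NameNode; it only asks the NameNode *where* the blocks are, then contacts the DataNodes directly.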
Points to Ponder
❐ The NameNode runs the main process that holds the HDFS metadata.
❐ The NameNode receives block reports from each DataNode in the cluster and consolidates those
reports to create the HDFS metadata.
❐ All the Hadoop daemons, i.e., NameNode, DataNode, Secondary NameNode,
ResourceManager and NodeManager, are Java processes.
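The second point above can be illustrated with a short sketch. The data shapes are hypothetical (the real block-report protocol is more involved), but the idea is the same: each DataNode reports the block IDs it stores, and the NameNode inverts those reports into block-to-location metadata.

```python
# Sketch of block-report consolidation: each DataNode periodically
# reports the block IDs it holds, and the NameNode inverts those
# reports into block -> locations metadata. Hypothetical data shapes,
# not the actual Hadoop wire protocol.

from collections import defaultdict

def consolidate(block_reports):
    """block_reports: dict of DataNode name -> list of block IDs it stores."""
    locations = defaultdict(list)
    for datanode, blocks in block_reports.items():
        for block_id in blocks:
            locations[block_id].append(datanode)
    return dict(locations)

reports = {
    "datanode-a": ["blk_1", "blk_3"],
    "datanode-b": ["blk_1", "blk_2"],
    "datanode-c": ["blk_2", "blk_3"],
}
metadata = consolidate(reports)
print(metadata["blk_1"])  # the nodes holding a replica of blk_1
```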
3.4.2 Storing and Reading Files from HDFS
Now that you have understood the functioning of the NameNode and DataNodes, let us look at
how files are stored and read in HDFS.
Let us consider a very large text file for the sake of our discussion. Note that this is very typical
of a data set that HDFS would store. The file is split into smaller pieces of information called
blocks (Figure 3.2). The blocks are all of the same size, which allows HDFS to deal with files of
different lengths in the same manner. This is because HDFS does not deal with a file as a whole.
Instead, it deals only with the blocks of a file, where each block, except possibly the last, is of
the same size. This makes the entire storage management mechanism within HDFS rather simple.
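The splitting scheme is easy to demonstrate. The 16-byte block size below is purely for the demo (real HDFS block sizes are far larger, 128 MB by default in recent versions): every block comes out the same size except possibly the last.

```python
# Fixed-size splitting: every block is the same size except possibly
# the last. The 16-byte block size is only for the demo; HDFS defaults
# are far larger (128 MB in recent versions).

def split_into_blocks(data, block_size):
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

data = b"x" * 100          # a 100-byte "file"
blocks = split_into_blocks(data, 16)
print([len(b) for b in blocks])   # six full 16-byte blocks plus a 4-byte tail
```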
At any point in time, only a block of data is dealt with. The block is also the unit of replica-
tion and fault tolerance. Therefore, there is no need to maintain multiple copies of the entire file.
Instead, multiple copies of each block into which the file is split are kept. This brings about
uniformity and standardization: because the blocks are of equal size, processing a block always
means dealing with the same amount of data.
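Block-level replication can be sketched as follows. The round-robin placement here is a deliberate simplification (real HDFS uses a rack-aware placement policy), and the node names and default replication factor of 3 are illustrative; the point is that each block is copied to several DataNodes independently, so no single node ever needs the whole file.

```python
# Sketch of block-level replication: each block is copied to several
# DataNodes independently. The round-robin placement below is a
# simplification of HDFS's real rack-aware placement policy.

def place_replicas(block_ids, datanodes, replication=3):
    placement = {}
    n = len(datanodes)
    for i, block_id in enumerate(block_ids):
        # Assign `replication` distinct nodes per block, round-robin.
        placement[block_id] = [datanodes[(i + k) % n] for k in range(replication)]
    return placement

nodes = ["dn-1", "dn-2", "dn-3", "dn-4"]
plan = place_replicas(["blk_1", "blk_2", "blk_3"], nodes)
for block_id, replicas in plan.items():
    print(block_id, replicas)
```

If any one node fails, every block it held still has copies on other nodes, which is exactly the fault tolerance the paragraph above describes.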