HDFS to manage all the machines and storage spaces in the cluster. It does this by setting up a
master-slave configuration, which is discussed below.
Within the cluster, one of the machines is designated as the master node. The master node
is responsible for coordinating the storage across all other nodes in the cluster, which are the
slave nodes. On this master node, HDFS runs a process that receives all requests made to the
cluster and forwards them to the slave nodes that contain the data. Let us examine this process
in detail. The master node is called the NameNode. All other machines in the cluster are
designated as DataNodes. Thus, there is one NameNode per cluster, while the number of
DataNodes depends on the number of machines in that cluster.
An easy way to understand the roles that the NameNode and DataNodes play within Hadoop is
to take the example of this book. If the data stored in the distributed file system is the text of
the book, then the NameNode is the table of contents, while the chapters, where the actual
content of the book is found, are the DataNodes. The table of contents tells you where the
content you want to read is located.
The NameNode serves two primary functions. First, any request from a client is passed to the
NameNode, because it knows where the required data can be found: the NameNode holds the
directory structure and knows which piece of a file is located on which DataNode. Second, it
holds the metadata for each file, as distinct from the actual content. Examples of metadata
include file permissions, how the file is split up and where the replicas of the file are stored.
The function of the DataNode is simply to physically store the actual file contents.
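To make this concrete, here is a minimal Java sketch that uses Hadoop's standard
org.apache.hadoop.fs API to print the kind of metadata the NameNode serves for a file. The
path /data/sample.txt is a hypothetical example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileMetadata {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // connects to the NameNode
        FileStatus status = fs.getFileStatus(new Path("/data/sample.txt")); // hypothetical path

        // Metadata served by the NameNode, not the DataNodes
        System.out.println("Permissions: " + status.getPermission());
        System.out.println("Length     : " + status.getLen() + " bytes");
        System.out.println("Block size : " + status.getBlockSize() + " bytes");
        System.out.println("Replicas   : " + status.getReplication());
    }
}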
Points to Ponder
The NameNode runs the main process that holds the HDFS metadata.
The NameNode receives block reports from each DataNode in the cluster and consolidates those
reports to create the HDFS metadata.
All the Hadoop daemons, i.e., NameNode, DataNode, Secondary NameNode,
ResourceManager and NodeManager, are Java processes.
3.4.2 Storing and Reading Files from HDFS
Now that you have understood the functioning of the NameNode and DataNodes, let us look at
how files are stored and read in HDFS.
Let us consider a very large text file for the sake of our discussion; this is typical of the kind
of data set that HDFS would store. The file is split into smaller pieces of information called
blocks (Figure 3.2). Note that the blocks are all of the same size, which allows HDFS to deal
with files of different lengths in the same manner. This is because HDFS does not deal with a
file as a whole. Instead, it deals only with the blocks of a file, where each block, except the last,
is of the same size. This makes the entire storage management mechanism within HDFS rather simple.
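As a simple illustration of the arithmetic, the following sketch splits a hypothetical 500 MB
file into 128 MB blocks; every block is full-sized except the last.

public class BlockSplit {
    public static void main(String[] args) {
        long fileLen = 500L * 1024 * 1024;    // a hypothetical 500 MB file
        long blockSize = 128L * 1024 * 1024;  // the common HDFS default of 128 MB

        long fullBlocks = fileLen / blockSize;   // 3 full blocks of 128 MB
        long lastBlock = fileLen % blockSize;    // 1 trailing block of 116 MB
        long total = fullBlocks + (lastBlock > 0 ? 1 : 0);
        System.out.println(total + " blocks; last block = " + lastBlock + " bytes");
    }
}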
At any point in time, only a block of data is dealt with. The block is also the unit of replication
and fault tolerance, so there is no need to maintain multiple copies of the entire file. Instead,
multiple copies of each block are kept. Because the blocks are of equal size, this brings
uniformity and standardization: whenever a block of data is processed, the same amount of data
is dealt with.
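Replication and block size can even be chosen per file at creation time. The sketch below uses
the standard FileSystem.create overload that accepts a replication factor and a block size; the
path and the values shown are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateWithReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // overwrite=true, 4 KB buffer, 3 replicas per block, 128 MB blocks
        FSDataOutputStream out = fs.create(new Path("/data/sample.txt"),
                true, 4096, (short) 3, 128L * 1024 * 1024);
        out.writeUTF("hello HDFS");  // each block will be stored on 3 DataNodes
        out.close();
    }
}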
However, a critical question arises about the optimum size of a block, and here a trade-off has
to be made. If the block size is increased, the parallelism that can be achieved while processing
a file is reduced, because the file is broken into fewer chunks of data and there will be fewer
processes working on those chunks (one process works on one piece of data). On the other hand,
if the block size is too small, then one file will have a large number of splits, which will
require a large number of processes to run on them. This increases the overhead of coordinating
those processes and aggregating their results, and the increased overhead tends to dominate the
execution time of the task.
The time taken to read a block of data from disk can be broken down into two components. The
first is the time taken to seek to the position on disk where the block physically resides
(the seek time). The second is the time taken to actually read the block (the transfer time).
A good ratio between seek time and transfer time needs to be achieved, and it has been observed
that a block size of 128 MB provides an optimum balance between the two.
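This trade-off can be checked with back-of-the-envelope arithmetic. In the sketch below, the
10 ms seek time and 100 MB/s transfer rate are illustrative assumptions, not figures from the
text.

public class SeekTransferRatio {
    public static void main(String[] args) {
        double seekMs = 10.0;         // assumed average disk seek time
        double transferMBps = 100.0;  // assumed sustained transfer rate

        for (long blockMB : new long[] {1, 16, 128}) {
            double transferMs = blockMB / transferMBps * 1000;
            double overhead = seekMs / (seekMs + transferMs) * 100;
            System.out.printf("%4d MB block: seek is %.1f%% of read time%n",
                    blockMB, overhead);
        }
        // With 128 MB blocks the seek costs under 1% of the read,
        // while blocks stay small enough to keep many processes busy.
    }
}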
Each of these blocks is then stored on a different node in the cluster (Figure 3.3); note that
the entire file is not stored together on one node. Once the blocks are distributed across the
nodes in a cluster, how do you know where the blocks of a particular file are located? How do
you track down the data in a single file when you have spread it out in blocks distributed
across different DataNodes in the cluster?
FIGURE 3.2 A large text file broken down into blocks (Block 1 through Block 8)
FIGURE 3.3 Blocks of a file stored on different DataNodes (Blocks 1 and 2 on DataNode 1, Blocks 3 and 4 on DataNode 2, Blocks 5 and 6 on DataNode 3, Blocks 7 and 8 on DataNode 4)
This is where the NameNode comes into play. As shown in Figure 3.4, the NameNode contains, for
every file, a mapping of where the blocks of that file reside within the cluster.
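This mapping can also be inspected from client code. The sketch below asks the NameNode, via
the standard getFileBlockLocations call, which DataNodes hold each block of a hypothetical
file /data/file1.txt.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockMap {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/file1.txt")); // hypothetical path

        // One BlockLocation per block, with the DataNodes that hold its replicas
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("offset " + b.getOffset() + ", length " + b.getLength()
                    + " -> " + String.join(", ", b.getHosts()));
        }
    }
}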
Now that you understand how the blocks are distributed across DataNodes and how the NameNode
contains the desired mapping, let us examine how a file is read from HDFS. Reading a file from
HDFS is a two-step process. First, you need to know where the blocks of data in that file are
located; for this, the metadata in the NameNode is used to look up the block locations in the
DataNodes. Once the block locations are obtained, you reach out to the respective DataNodes and
read the blocks.
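In client code, the two steps are hidden behind a single open() call: the returned stream
consults the NameNode for block locations and then reads bytes directly from the DataNodes. A
minimal sketch, assuming a hypothetical file /data/file1.txt:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFromHdfs {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Step 1 happens internally: the client asks the NameNode where the first block lives
        FSDataInputStream in = fs.open(new Path("/data/file1.txt")); // hypothetical path
        // Step 2: bytes are read from the DataNode holding that block
        byte[] buf = new byte[4096];
        int n = in.read(buf);
        System.out.println("read " + n + " bytes from the beginning of the file");
        in.close();
    }
}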
Let us look at an example. In a cluster where a file is distributed across DataNodes as described
above, let us read the beginning of a file, say File 1. A client comes in and makes a request to
the NameNode, saying it wants to read 'File 1' from the beginning (Figure 3.5). The NameNode
consults its mapping and responds with the location of Block 1 of File 1, which in this example
is DataNode 1.
FIGURE 3.4 NameNode containing the mapping of the blocks of a file to the DataNodes (File 1: Blocks 1 and 2 on DataNode 1, Blocks 3 and 4 on DataNode 2, Blocks 5 and 6 on DataNode 3, Blocks 7 and 8 on DataNode 4)
FIGURE 3.5 Step 1 of reading a file in HDFS: the client requests the NameNode for the location of Block 1 of File 1