creating a total of three copies. In this way, if a particular node fails, the data is not lost, because by default it is also stored on at least two other nodes.
Figure 2.1 shows a very large text file split into smaller pieces of information called blocks. HDFS deals only with the blocks of a file, and every block except the last is of the same size. The block is also the unit of replication and fault tolerance.
Based on various considerations that will be covered in the subsequent chapter, it has been observed that a block size of 128 megabytes (1 megabyte = 10^6 bytes) is optimal.
Each of these blocks is then stored on a different node in the cluster (refer to Figure 2.2).
Once the blocks are distributed across the nodes of a cluster, the NameNode comes into play. The NameNode contains, for every file, a mapping of where the blocks of that file reside within the cluster (refer to Figure 2.3).
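The bookkeeping involved can be sketched in a few lines of Python. This is only an illustrative sketch, not how HDFS is actually implemented: the splitting function, the placement policy and the node names are hypothetical, and real HDFS additionally replicates every block (three copies by default).

# Illustrative sketch only (not the HDFS implementation): split a file's
# contents into fixed-size blocks and build a NameNode-style mapping.
BLOCK_SIZE = 128 * 10**6  # 128 megabytes, as discussed above

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Every block except possibly the last has the same size."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def assign_blocks(num_blocks: int, data_nodes: list):
    """Place two consecutive blocks per node, mirroring Figure 2.3.
    Real HDFS placement also accounts for replication and rack awareness."""
    return {f"Block {b + 1}": data_nodes[(b // 2) % len(data_nodes)]
            for b in range(num_blocks)}

nodes = ["DataNode 1", "DataNode 2", "DataNode 3", "DataNode 4"]
mapping = assign_blocks(8, nodes)  # {'Block 1': 'DataNode 1', ...}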
2.7.2 MapReduce
MapReduce is an algorithmic approach to dealing with Big Data. It provides a way to process a large amount of data by breaking it up into several smaller chunks, processing each of those chunks in parallel, and finally aggregating the outcome from each process to produce a single unified outcome.
FIGURE 2.1 A large text file broken down into blocks (Block 1 through Block 8)
FIGURE 2.2 Blocks of a file stored in different DataNodes
DataNode 1: Block 1, Block 2        DataNode 3: Block 5, Block 6
DataNode 2: Block 3, Block 4        DataNode 4: Block 7, Block 8
MapReduce is a two-stage process. The first stage is called the ‘Map’ step and the second is called the ‘Reduce’ step. The ‘Map’ step concerns itself with breaking up the data into chunks and processing each of those chunks. The ‘Reduce’ step then takes the outputs of the ‘Map’ step and aggregates the outcomes from those processes. Specifically, the output from the ‘Map’ step is in a key-value format. The ‘Reduce’ step expects the data to be sorted by key; it then produces the aggregated output, where there is only one piece of ‘unified’ data corresponding to each key.
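These two stages can be captured in a short, single-machine Python sketch. It is only an analogy for the distributed framework (production Hadoop jobs are typically written in Java against the MapReduce API), and the names run_mapreduce, map_fn and reduce_fn are hypothetical:

# Single-machine sketch of the two MapReduce stages (an analogy only).
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, map_fn, reduce_fn):
    # 'Map' step: each record may emit any number of (key, value) pairs.
    pairs = [kv for record in records for kv in map_fn(record)]
    # The framework sorts the emitted pairs by key before reducing.
    pairs.sort(key=itemgetter(0))
    # 'Reduce' step: exactly one 'unified' output per key.
    return {key: reduce_fn(key, [value for _, value in group])
            for key, group in groupby(pairs, key=itemgetter(0))}

The word-count example that follows plugs directly into this shape: the map function emits a pair for every word, and the reduce function sums the values for each key.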
A very simple MapReduce task, depicted in Figure 2.4, considers a very large text file. The task is to count the number of times each word appears in that file; that is, the output should state, for every word, how many times it occurs in the file. The text file has several lines, as shown in Figure 2.4. In a real-world scenario, this could
FIGURE 2.4 Each partition is given to a different map process
Mary had a little lamb
Little lamb, little lamb
Mary had a little lamb
Its fleece was white as snow
And everywhere that Mary went
Mary went, Mary went
FIGURE 2.3 The NameNode contains the mapping of blocks for a file to the DataNodes in the cluster
DataNode 1: Block 1, Block 2        DataNode 3: Block 5, Block 6
DataNode 2: Block 3, Block 4        DataNode 4: Block 7, Block 8
NameNode mapping:
File 1, Block 1 → DataNode 1
File 1, Block 2 → DataNode 1
File 1, Block 3 → DataNode 2
File 1, Block 4 → DataNode 2
File 1, Block 5 → DataNode 3
File 1, Block 6 → DataNode 3
File 1, Block 7 → DataNode 4
File 1, Block 8 → DataNode 4
be raw data distributed across many machines in a cluster, so the entire file is not present on one
machine. Every machine has a subset of this file and each subset is called a partition. This is what
the Map phase has to work with.
Considering that the file is distributed across multiple machines, a Map process is run on each of these machines (Figure 2.5), and every Map process handles the input data present on that machine. All the mappers run in parallel. Within each mapper, the records are processed one at a time.
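On a single machine, this parallelism can be imitated with a process pool. The sketch below is an analogy only: the division of the rhyme into three two-line partitions mirrors the three ‘M’ processes in Figure 2.5, and the punctuation handling is a simplifying assumption.

# One map process per partition, run in parallel (an analogy for
# mappers running on different machines in the cluster).
from multiprocessing import Pool

def map_partition(lines):
    # Each mapper processes its records (lines) one at a time and
    # emits a (word, 1) pair for every word it sees.
    return [(word.strip(",.").lower(), 1)
            for line in lines for word in line.split()]

if __name__ == "__main__":
    partitions = [
        ["Mary had a little lamb", "Little lamb, little lamb"],
        ["Mary had a little lamb", "Its fleece was white as snow"],
        ["And everywhere that Mary went", "Mary went, Mary went"],
    ]
    with Pool(processes=len(partitions)) as pool:
        mapper_outputs = pool.map(map_partition, partitions)
    # mapper_outputs[0] begins [('mary', 1), ('had', 1), ('a', 1), ...]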
For every record processed by the mapper, the Map phase emits a key-value pair (Figure 2.6), and the choice of key-value pair depends on the final output expected from the program. In this example, the objective is to count word frequencies, so the output of the Map phase comprises every word in a single line (a single record processed by the map process running on that machine), along with a count of 1.
So, multiple mappers process the inputs available to them, and in the output produced, each individual word carries a count of 1. This output is passed on to another process, the reducer. The reducer accepts as input every word from the input data set with a count of 1, and then sums up all the counts associated with each word; that is, the reducer combines all values that have the same key (Figure 2.7).
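To complete the sketch, the reducer can be written as a few lines that sum the counts for every key. Again, this is illustrative Python rather than the actual Hadoop Reducer API:

# Reduce step: combine all values that share the same key.
from collections import Counter

def reduce_counts(mapper_outputs):
    """Sum the 1s emitted by the mappers, producing one total per word."""
    totals = Counter()
    for pairs in mapper_outputs:       # one list of pairs per mapper
        for word, count in pairs:      # e.g. ('lamb', 1)
            totals[word] += count
    return dict(totals)

# Fed the mapper outputs from the sketch above, this yields
# {'mary': 5, 'little': 4, 'lamb': 4, 'went': 3, 'had': 2, ...}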
FIGURE 2.5 Map processes work on records in parallel
Partition 1 → M: ‘Mary had a little lamb’, ‘Little lamb, little lamb’
Partition 2 → M: ‘Mary had a little lamb’, ‘Its fleece was white as snow’
Partition 3 → M: ‘And everywhere that Mary went’, ‘Mary went, Mary went’
FIGURE 2.6 Mapper outputs
Input to one map process: ‘Mary had a little lamb’, ‘Little lamb, little lamb’
Output: {Mary, 1}, {had, 1}, {a, 1}, {little, 1}, {lamb, 1}, {little, 1}, {lamb, 1}, {little, 1}, {lamb, 1}