70 | Big Data Simplied
To sum up, ‘Map’ is the step performed in parallel, and ‘Reduce’ is the step that combines the intermediate results produced by the ‘Map’ phase. The output of each ‘Map’ phase is placed into an intermediate state, where a set of intermediate events takes place: shuffling, sorting, partitioning and combining (overall, we can say a ‘Group By’ with respect to keys). This intermediate stage is handled by the MapReduce framework itself and operates on keys rather than values. It is performed in the local file system of each DataNode, which means that the Map phase output <key, value> pairs are copied from HDFS to the local file system; after shuffling, sorting, partitioning and combining have been applied to each key, the grouped <key, value> pairs are transferred back into HDFS from the local file system for the main aggregation, or operation, part in the Reduce phase.
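The ‘Group By’ step described above can be sketched in plain Python. This is a simplified, single-machine illustration of what the framework does between Map and Reduce; in a real Hadoop job the shuffle happens across DataNodes, and the sample pairs below are made up for the example.

```python
from collections import defaultdict

# Intermediate <key, value> pairs as emitted by a Map phase
# (illustrative data, not from a real job).
map_output = [("lamb", 1), ("mary", 1), ("lamb", 1), ("mary", 1), ("little", 1)]

def shuffle_and_sort(pairs):
    """Group values by key -- the 'Group By' the framework performs
    between Map and Reduce. Keys come back in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

grouped = shuffle_and_sort(map_output)
# Each key now carries the list of values the Reduce phase will aggregate,
# e.g. ("lamb", [1, 1]).
```

Note that the grouping touches only the keys; the values are merely collected into a list per key, ready for the Reduce phase to aggregate.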
4.2.1 A MapReduce Example
Let us consider a very simple MapReduce task. Consider a very large text file, and let us assume that we are given the task of counting the number of times each word occurs in that text file. We require an output where, for every word, we get the count of its occurrences in that large text file.
Let us consider the input to be a text file with several lines (Figure 4.5). In a real-world
scenario, this could be raw data in petabytes. It is distributed across many machines in a cluster,
so the entire file is not present on one machine. Every machine has a subset of this file and each
subset is called a partition. This is what the Map phase has to work with.
Imagine the file distributed across multiple machines, with a Map process running on each of them; every Map process works on the input data present on its own machine. All the mappers run in parallel. Within each mapper, records are processed one at a time.
FIGURE 4.3 Internal flow of a MapReduce job
[Figure: HDFS → MAP (transformation phase) → intermediate events in the local file system → REDUCE (operation phase) → HDFS. The Map class reads input data from HDFS and transforms each record into a <key, value> pair; the Map output <key, val> pairs are copied from HDFS into the local file system, and the resultant <key, val> pairs, grouped by key, are copied back into HDFS. The Reduce phase then performs the main operation on the values (as per the business requirement) and replicates the desired output into HDFS. The intermediate stage is the most time-consuming phase of a MapReduce job.]
FIGURE 4.4 Each partition is given to a different map process
[Figure: the sample input text, split into partitions:]

Mary had a little lamb
Little lamb, little lamb
Mary had a little lamb
Its fleece was white as snow
And everywhere that Mary went
Mary went, Mary went

FIGURE 4.5 Map processes work on records in parallel
[Figure: the same six lines of input, with each partition fed to its own map process (M).]
FIGURE 4.6 Mapper outputs
[Figure: a map process (M) reads the records “Mary had a little lamb” and “Little lamb, little lamb” and emits one pair per word: {Mary, 1}, {had, 1}, {a, 1}, {little, 1}, {lamb, 1}, {little, 1}, {lamb, 1}, {little, 1}, {lamb, 1}.]
For every record processed by the mapper, the Map phase produces a key-value pair, and that key-value pair depends on the output ultimately expected from the program. For instance, assume that we are supposed to count word frequencies. The output of the Map phase then consists of every word in a single line (a single record processed by the map process running on that machine), along with a count of 1.
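A minimal sketch of such a mapper in Python (the function name `map_phase` is chosen for illustration; a Hadoop mapper would subclass the framework’s Mapper class instead):

```python
def map_phase(record):
    """Emit a <word, 1> pair for every word in one input record (line)."""
    for word in record.split():
        yield (word, 1)

# One record from the sample input produces one pair per word.
pairs = list(map_phase("Mary had a little lamb"))
```

Every word is emitted with the same count of 1; the actual tallying is deferred entirely to the Reduce phase.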
FIGURE 4.7 Functioning of the reduce step
[Figure: three map processes (M) emit {Mary, 1}, {little, 1} and {lamb, 1} pairs; the reducer (R) sums the counts for each key, producing {Mary, 5}, {little, 4} and {lamb, 4}.]
So multiple mappers process the inputs available to them, and in the output produced thereby, each individual word has a count of 1. This output is passed on to another process, the reducer. The reducer accepts as input every word from the input data set with a count of 1, and then sums up all the counts associated with each word. In other words, the reducer combines all values which have the same key.
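The reducer for word count is correspondingly small. A sketch (the name `reduce_phase` is illustrative; the framework calls the reducer once per key, passing that key and all of its grouped values):

```python
def reduce_phase(key, values):
    """Combine all values that share one key by summing the counts."""
    return (key, sum(values))

# The framework hands the reducer a key and its grouped list of 1s.
result = reduce_phase("lamb", [1, 1, 1, 1])  # the 4 'lamb' pairs of Figure 4.7
```

Because the shuffle has already grouped the pairs by key, the reducer never needs to search for matching words; it only aggregates the value list it is given.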
Writing a MapReduce program ultimately boils down to answering two key questions as stated
below.
1. What key-value pair should be produced in the map phase, such that using those keys and
values, the reduce phase can produce the final result?
2. How should these values with the same key be combined in the reduce phase to produce the
expected outcome?
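Putting both answers together for the word-count task, a single-machine sketch of the whole pipeline might look as follows (this collapses the distributed machinery into plain Python, keeps matching case-sensitive, and does not strip punctuation; a real job would run the three stages on separate processes across the cluster):

```python
from collections import defaultdict

def run_wordcount(lines):
    # Question 1: the map phase emits a <word, 1> pair for every word.
    intermediate = []
    for line in lines:
        for word in line.split():
            intermediate.append((word, 1))

    # Framework step: shuffle/sort groups the values by key.
    grouped = defaultdict(list)
    for word, count in intermediate:
        grouped[word].append(count)

    # Question 2: the reduce phase sums the values for each key.
    return {word: sum(counts) for word, counts in grouped.items()}

counts = run_wordcount([
    "Mary had a little lamb",
    "Little lamb little lamb",
])
```

The two design questions map directly onto the two loops: the first decides what pairs to emit, and the final dictionary comprehension decides how values sharing a key are combined.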