70 | Big Data Simplified
To sum up, ‘Map’ is the step performed in parallel, and ‘Reduce’ is the step that combines the intermediate results produced by the ‘Map’ phase. Each ‘Map’ phase output is placed into an intermediate state for intermediate events. These events are of four types, namely shuffling, sorting, partitioning and combining (overall, we can call this a ‘Group By’ with respect to keys). These intermediate events are handled by the MapReduce framework itself, and they operate on keys rather than on values. They are performed in the local file system of each DataNode, which means that the Map phase output <key, value> pairs are copied from HDFS to the local file system; after processes such as shuffling, sorting, partitioning and combining have been applied to each key, the grouped <key, value> pairs are transferred back into HDFS from the local file system for the main aggregation or operation part in the Reduce phase.
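The grouping described above can be sketched in plain Python. This is only an in-memory simulation of the shuffle/sort/group-by-key step, not the Hadoop framework's actual implementation; the sample keys and values are illustrative:

```python
from itertools import groupby

# Intermediate <key, value> pairs as emitted by several mappers
map_output = [("apple", 1), ("banana", 1), ("apple", 1),
              ("cherry", 1), ("banana", 1)]

# Shuffle/sort: order the pairs by key so that equal keys become adjacent
sorted_pairs = sorted(map_output, key=lambda kv: kv[0])

# Group by key: each key now maps to the list of all its values,
# which is exactly the input a reducer receives for that key
grouped = {k: [v for _, v in g]
           for k, g in groupby(sorted_pairs, key=lambda kv: kv[0])}
# grouped == {"apple": [1, 1], "banana": [1, 1], "cherry": [1]}
```

In a real cluster this grouping happens on the local file system of each DataNode, and the partitioning step additionally decides which reducer each key is routed to.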
4.2.1 A MapReduce Example
Let us consider a very simple MapReduce task. Consider a very large text file, and let us assume that we are given the task of counting the number of times each word occurs in that text file. We require an output where, for every word, we get the count of times it occurs in that large text file.
Let us consider the input to be a text file with several lines (Figure 4.5). In a real-world
scenario, this could be raw data in petabytes. It is distributed across many machines in a cluster,
so the entire file is not present on one machine. Every machine has a subset of this file and each
subset is called a partition. This is what the Map phase has to work with.
Imagine the file distributed across multiple machines, with a Map process run on each of these machines; every Map process works on the input data present on that machine. All the mappers run in parallel. Within each mapper, the records are processed one at a time.
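The word-count job described above can be sketched as a pair of map and reduce functions in Python. This is a single-process simulation of what Hadoop distributes across the cluster; the function names and the tiny sample partitions are illustrative, not Hadoop API calls:

```python
from collections import defaultdict

def map_phase(partition):
    """Emit a <word, 1> pair for every word in one partition (a list of lines)."""
    for line in partition:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(grouped):
    """Sum the values collected under each key to get the final counts."""
    return {word: sum(counts) for word, counts in grouped.items()}

# Two partitions, as if the input file were split across two DataNodes
partitions = [["the quick brown fox"],
              ["the lazy dog the end"]]

# Run a mapper on each partition, then shuffle: group all pairs by key
grouped = defaultdict(list)
for partition in partitions:
    for word, one in map_phase(partition):
        grouped[word].append(one)

result = reduce_phase(grouped)
# result["the"] == 3
```

Each mapper sees only its own partition, so the word "the" is counted locally in both partitions; only after the shuffle groups all <"the", 1> pairs together can the reducer produce the global count.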
FIGURE 4.3 Internal flow of a MapReduce job. The Map (transformation) phase reads input data from HDFS and transforms each record into a <key, value> pair. The Map output <key, value> pairs are copied from HDFS into the local file system, where the intermediate events run and the pairs are grouped by key before being copied back into HDFS; this is the most time-consuming phase of a MapReduce job. The Reduce (operation) phase then performs the main operation on the values (as per the business requirement) and the desired output is replicated into HDFS.