Introducing MapReduce | 69
on the same machine. Outputs from the 'Map' phase are then transferred across the network to a
node, where the 'Reduce' phase runs. In reality, the 'Reduce' phase can also run on multiple
nodes; for the sake of simplicity, however, imagine 'Reduce' as running on a single node
where all the data from the 'Map' phase is collected.
FIGURE 4.2 Architecture of a MapReduce job
[Figure: the Master (ResourceManager, holding 100% of the job) distributes work to three DataNode/NodeManager pairs, DN1/NM1 (32%), DN2/NM2 (30%) and DN3/NM3 (38%), which carry out parallel and distributed processing, reading input from HDFS and writing intermediate outputs to the local file system of each DataNode.]
In the architecture above, the ResourceManager (RM) distributes the overall job
(100%) across the available NodeManagers (NMs) in the Hadoop cluster according to
resource availability (CPU usage, free memory, JVM load, etc.). It follows that
if the number of NMs is increased with sufficient resources, overall job performance
improves automatically. Each NM executes its tasks with the help of an ApplicationMaster (or
AM, an internal executor inside the NM that is managed by YARN) and reports back to the
RM at a fixed interval (3 seconds by default); this periodic report is called the 'Heartbeat' signal.
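The heartbeat idea can be sketched as follows. This is an illustrative toy simulation in Python, not the actual YARN protocol or API; the class and method names (`ResourceManager`, `NodeManager`, `receive_heartbeat`) and the fixed 10%-per-round progress are assumptions made up for the example, while the 3-second default interval and the node shares (32%, 30%, 38%) come from the text and figure.

```python
# Toy simulation of the heartbeat concept: each (hypothetical) NodeManager
# periodically reports its task progress to the ResourceManager.
# Illustration only -- not the real YARN protocol.

HEARTBEAT_INTERVAL = 3  # seconds; YARN's default heartbeat interval

class ResourceManager:
    def __init__(self):
        self.progress = {}  # node id -> last reported % of the job done

    def receive_heartbeat(self, node_id, percent_done):
        self.progress[node_id] = percent_done

class NodeManager:
    def __init__(self, node_id, share):
        self.node_id = node_id
        self.share = share  # % of the overall job assigned to this node
        self.done = 0

    def heartbeat(self, rm):
        # pretend 10% of the job completes between heartbeats
        self.done = min(self.done + 10, self.share)
        rm.receive_heartbeat(self.node_id, self.done)

rm = ResourceManager()
nodes = [NodeManager("NM1", 32), NodeManager("NM2", 30), NodeManager("NM3", 38)]
for _ in range(4):  # four heartbeat rounds
    for nm in nodes:
        nm.heartbeat(rm)
    # in a real cluster each NM would wait HEARTBEAT_INTERVAL seconds here
print(rm.progress)  # each node converges on its assigned share
```

The RM's view of cluster progress is built entirely from these periodic reports, which is why a missed heartbeat is how Hadoop detects a failed node.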
As a developer, one has to write the logic for only two functions,
map() and reduce();
everything else, such as fault tolerance, replication, and the bookkeeping needed
to work on a distributed system, is taken care of behind the scenes by the Hadoop
framework.
The 'Map' phase runs on one subset of the data, so several 'Map' processes run
in parallel to cover the entire data set. The code written for the 'Map' phase should process only one
record. Thus, while writing the code, imagine a single record as the input and work out what
the output should be for that one record. The output should be in the form of key-value pairs.
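A minimal sketch of this one-record-in, key-value-pairs-out contract, using plain Python in place of Hadoop's Java Mapper API (the function name `map_record` and the word-count logic are assumptions chosen for illustration):

```python
# Illustrative sketch only: a word-count style map() in plain Python,
# standing in for Hadoop's Java Mapper API. It processes exactly one
# record (here, one line of text) and emits (key, value) pairs.

def map_record(line):
    pairs = []
    for word in line.split():
        pairs.append((word.lower(), 1))  # key = word, value = a count of 1
    return pairs

print(map_record("Big Data data"))  # [('big', 1), ('data', 1), ('data', 1)]
```

Note that the function knows nothing about other records; that independence is what lets Hadoop run many 'Map' processes on different subsets at once.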
The 'Reduce' phase, on the other hand, collates all the key-value pairs it receives from
the 'Map' phase and combines the results in a meaningful manner. It takes the
'Map' outputs, finds all the values associated with the same key, and then combines those
values in a specific manner depending on the program's requirement: for example,
summing them, counting them, or averaging them, thereby producing the final output.
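Continuing the word-count illustration, the grouping-by-key and combining steps can be sketched in plain Python (the helper names `shuffle` and `reduce_counts` are made up for this example; in real Hadoop the grouping happens in the framework's shuffle phase and the combining in a Java Reducer):

```python
from collections import defaultdict

# Illustrative sketch only: grouping map outputs by key, then combining
# each key's values -- here by summing, as a word count would.

def shuffle(map_outputs):
    groups = defaultdict(list)
    for key, value in map_outputs:
        groups[key].append(value)  # all values for a key end up together
    return groups

def reduce_counts(groups):
    return {key: sum(values) for key, values in groups.items()}

map_outputs = [("big", 1), ("data", 1), ("data", 1)]
print(reduce_counts(shuffle(map_outputs)))  # {'big': 1, 'data': 2}
```

Swapping `sum` for `len` or an average changes what the job computes without touching the map side, which is exactly the division of labour the text describes.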