68 | Big Data Simplied
4.1 INTRODUCTION
MapReduce is an algorithmic approach to deal with big data. It provides a way of taking a large
amount of data, breaking it up into several smaller chunks, processing each of those chunks in
parallel, and finally aggregating the outcome from each process to produce a single unified
outcome.
MapReduce is a two-stage process. The first step is called the ‘Map’ step and the second is
called the ‘Reduce’ step. The ‘Map’ step concerns itself with breaking up the data into chunks and
processing each of those chunks. The ‘Reduce’ step takes the output of the ‘Map’ step and aggre-
gates the outcomes from those processes. Specifically, the output from the ‘Map’ step is a set of
key-value pairs. The ‘Reduce’ step expects this data to be sorted by key; it then produces the
aggregated output, with exactly one piece of ‘unified’ data corresponding to each key.
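As a quick illustration of this key-value flow, consider a hypothetical word-count job (the classic
example, not one specific to this book) over two input lines, ‘big data’ and ‘big deal’. The ‘Map’
step processes each line and emits one (key, value) pair per word:

    (big, 1), (data, 1)        from the line ‘big data’
    (big, 1), (deal, 1)        from the line ‘big deal’

The framework then sorts and groups these pairs by key, and the ‘Reduce’ step aggregates the
values for each key, producing exactly one unified record per key:

    (big, 2), (data, 1), (deal, 1)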
Let us examine these steps in some detail.
4.2 PROCESSING DATA WITH MapReduce
Evidently, we have a huge amount of data to process, and processing this voluminous data
requires multiple machines. Corporate giants like Google, Amazon and Facebook, which regularly
process data at this scale, run vast data centres filled with machines that work together as a
distributed system. A distributed system consists of processes that run on multiple interconnected
machines.
As discussed earlier, MapReduce is a programming paradigm that allows processing on a dis-
tributed computing system. The data set that MapReduce works on is itself distributed across
multiple machines in a cluster. How do we run processes on multiple machines, and bring them
together to achieve a common objective?
As discussed in the preceding section, MapReduce breaks up the processing of each task into
two distinct phases. The first phase, the ‘Map’, is run on multiple nodes in a cluster; the second
phase, the ‘Reduce’, takes the outputs of the ‘Map’ phase and aggregates the outcomes to produce
a final result. MapReduce sits on top of HDFS, which means it takes its input data from HDFS and
the final processed output is written back to HDFS.
FIGURE 4.1 High-level MapReduce framework: the framework takes input data from HDFS for
processing as per the business requirement, and the processed output data is written back into
HDFS by the framework.
The data set within a cluster, as we know, is partitioned across multiple DataNodes and is also
replicated for fault tolerance. A ‘Map’ process works on data that resides
on the same machine. Outputs from the ‘Map’ phase are then transferred across the network to a
node, where the ‘Reduce’ phase runs. Now, in reality, the ‘Reduce’ phase can also run on multiple
nodes. However, for the sake of simplicity, let us imagine ‘Reduce’ as running on a single node
where all the data from the ‘Map’ phase is collected.
FIGURE 4.2 Architecture of a MapReduce job: the master (RM, holding 100% of the job) distributes
work across the DataNodes/Node Managers (DN1/NM1, DN2/NM2, DN3/NM3, …, DNx/NMx, receiving,
for example, 32%, 30% and 38% of the work) for parallel and distributed processing; input and
output reside in HDFS, while intermediate events are kept in the local file system of each DataNode.
In the above architecture, the Resource Manager (RM) distributes the overall job (100%) across
the available Node Managers (NMs) in the Hadoop cluster according to resource availability (CPU
usage, available memory, JVM load, etc.). It follows that if the number of NMs is increased, with
sufficient resources, the overall job performance improves. Every NM executes its tasks with the
help of the Application Master (AM), an internal executor inside the NM that is handled by YARN,
and reports back to the RM at a specific time interval (by default, 3 seconds); this periodic report
is called the ‘Heartbeat’ signal.
As a developer, one has to write the logic for only two functions,
map() and reduce(),
and everything else, such as logic for fault tolerance, for replication, and accounting for the fact
that we are working on a distributed system, is taken care of behind the scenes by the Hadoop
framework.
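To make this concrete, here is a minimal sketch of how those two functions are typically wired
into a job with the Hadoop Java API, assuming the classic word-count example; the
WordCountMapper and WordCountReducer classes named here are hypothetical placeholders for
the mapper and reducer sketched in the two code examples that follow, and the input and output
paths refer to HDFS directories supplied as command-line arguments:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        // The driver only describes the job; scheduling, replication and
        // fault tolerance are handled by the framework behind the scenes.
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);      // hypothetical mapper class
        job.setReducerClass(WordCountReducer.class);    // hypothetical reducer class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input is read from HDFS and the final output is written back to HDFS.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note that nothing in this driver deals with scheduling, replication or failure handling; that is
precisely the part the framework takes care of.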
Each ‘Map’ process runs on one subset of the data, so several ‘Map’ processes run in order to
process the entire data set. The code written for the ‘Map’ phase should process only one record.
Thus, while writing the code, imagine one record as the input and figure out what the output
should be for that one record. The output should be in the form of key-value pairs.
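As an illustration (a hypothetical sketch following the word-count example, written against the
org.apache.hadoop.mapreduce API), a mapper receives exactly one record at a time, here one line
of text keyed by its byte offset, and emits a (word, 1) key-value pair for every word in that line:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Process exactly one record: split the line into words and
        // emit a (word, 1) pair for each of them.
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}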
On the other hand, the ‘Reduce’ phase collates all the key-value pairs that it receives from
the ‘Map’ phase and figures out how to combine the results in a meaningful manner. It takes the
‘Map’ outputs, finds all the values that are associated with the same key and then combines these
values in a specific manner depending upon the requirements of the program, for example, by
summing them up, counting them or finding an average, thereby producing the final output.
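Continuing the same hypothetical word-count sketch, the reducer receives each key together with
all of the values that the ‘Map’ phase emitted for that key, and combines them, in this case by
summing them up to produce one final count per word:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        // All values for the same key arrive together; aggregate them.
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        // Exactly one unified output record per key.
        context.write(word, new IntWritable(sum));
    }
}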