be raw data distributed across many machines in a cluster, so the entire file is not present on one
machine. Every machine has a subset of this file and each subset is called a partition. This is what
the Map phase has to work with.
Because the file is distributed across multiple machines, a Map process runs on each of
these machines (Figure 2.5), and every Map process handles the input data present on its own machine.
All the mappers run in parallel. Within each mapper, the processing logic is applied to one
record at a time.
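The arrangement described above can be imitated on a single machine. The following is only a minimal sketch: the names `partitions` and `run_mapper` are hypothetical, the input lines are borrowed from Figure 2.5, and threads stand in for the separate machines of a real cluster.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions: each inner list stands in for the subset of the
# file that is stored on one machine in the cluster.
partitions = [
    ["Mary had a little lamb", "Little lamb, little lamb"],
    ["Mary had a little lamb", "Its fleece was white as snow"],
    ["And everywhere that Mary went", "Mary went, Mary went"],
]

def run_mapper(partition):
    """Handle only the given partition, one record (line) at a time."""
    records_seen = 0
    for record in partition:
        # Per-record map logic would go here; this sketch only counts records.
        records_seen += 1
    return records_seen

# One mapper per partition, all submitted together. Threads are an assumption
# made for illustration; a real cluster runs mappers on different machines.
with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    records_per_mapper = list(pool.map(run_mapper, partitions))

print(records_per_mapper)
```

Each mapper sees only its own partition, which is why no single process needs the whole file.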
For every record it processes, the mapper emits one or more key-value
pairs (Figure 2.6); what the key and the value are depends on the final output expected from
the program. For example, suppose the objective is to count word frequencies. The output of the
Map phase then comprises every word in a single line (one line being a single record processed by
the map process running on that machine), each paired with a count of 1.
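A minimal sketch of such a map function for the word-count example follows. The function name `map_record` is hypothetical, and lowercasing words and dropping punctuation are assumptions made here so that "Little" and "little" count as the same word.

```python
import re

def map_record(record):
    """Emit a (word, 1) pair for every word in one record (one line)."""
    # Assumption: words are lowercased and punctuation is discarded.
    words = re.findall(r"[a-z']+", record.lower())
    return [(word, 1) for word in words]

pairs = map_record("Mary had a little lamb")
print(pairs)
# [('mary', 1), ('had', 1), ('a', 1), ('little', 1), ('lamb', 1)]
```

The mapper does no adding up of its own; a word that occurs three times in a line is simply emitted three times, each with a count of 1.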
Multiple mappers thus process the inputs available to them, and in their output
each individual word has a count of 1. This output is passed on to another process, the
reducer. The reducer accepts as input every word from the input data set with a count of 1, and
then sums all the counts associated with each word. In other words, the reducer combines all values
that have the same key (Figure 2.7).
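The summing step can be sketched as follows. This is an illustration under stated assumptions rather than a framework implementation: `reduce_counts` and `mapper_output` are hypothetical names, and the input pairs are the lowercased equivalents of the mapper output for the first two lines of the rhyme.

```python
from collections import defaultdict

def reduce_counts(pairs):
    """Sum the counts of all pairs that share the same key (word)."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# Hypothetical mapper output: every word arrives with a count of 1.
mapper_output = [
    ("mary", 1), ("had", 1), ("a", 1), ("little", 1), ("lamb", 1),
    ("little", 1), ("lamb", 1), ("little", 1), ("lamb", 1),
]

word_counts = reduce_counts(mapper_output)
print(word_counts)
# {'mary': 1, 'had': 1, 'a': 1, 'little': 3, 'lamb': 3}
```

Because every incoming value is 1, summing the values for a key is the same as counting how many times that word was emitted.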
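FIGURE 2.5 Map processes work on records in parallel
[Figure: the input text is split into partitions, each handled by its own Map process (M).]
Partition 1: Mary had a little lamb / Little lamb, little lamb
Partition 2: Mary had a little lamb / Its fleece was white as snow
Partition 3: And everywhere that Mary went / Mary went, Mary went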
FIGURE 2.6 Mapper outputs
[Figure: one Map process (M) and the key-value pairs it emits.]
Input: Mary had a little lamb / Little lamb, little lamb
Output: {Mary, 1} {had, 1} {a, 1} {little, 1} {lamb, 1} {little, 1} {lamb, 1} {little, 1} {lamb, 1}