80 | Big Data Simplified
On a cluster running Hadoop MapReduce, we can consider that these key
value pairs are sent to the machine on which the reducer process runs.
FIGURE 4.12 Sorting of key value pairs
[Figure: three mappers (M) read Split 1, Split 2, and Split 3 and emit key value pairs for the keys A, B, C, and D.]
Now, these key value pairs are sorted so that all values with the same key are grouped
together before being fed to the reducer. Sorting means that all profile views for a particular
member, say John, are available as one group.
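The sort-and-group step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how the framework does it internally. The sample pairs and the name `sort_and_group` are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical map output: one (key, 1) pair per profile view, in arrival order.
map_output = [("A", 1), ("B", 1), ("A", 1), ("C", 1), ("B", 1), ("D", 1), ("A", 1)]


def sort_and_group(pairs):
    """Sort pairs by key and collect all values for the same key together."""
    grouped = defaultdict(list)
    for key, value in sorted(pairs, key=lambda kv: kv[0]):
        grouped[key].append(value)
    return dict(grouped)


grouped = sort_and_group(map_output)
for key, values in grouped.items():
    print(key, values)
```

After this step, every key appears once with the full list of its values, which is the form in which the reducer receives its input.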
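FIGURE 4.13 Reducer function applied to sorted results
[Figure: three mappers (M) over Split 1, Split 2, and Split 3 feed a single reducer (R), which receives the pairs sorted by key and emits one result for each of the keys A, B, C, and D.]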
The reducer is code that we have written to sum up the values associated with the same key.
As we can see, a lot goes on here beyond the map and reduce logic for which we
write code. The MapReduce framework handles all of these steps behind the scenes:
collecting the key value pairs, transferring them across the network to the
cluster node on which the reduce job runs, and sorting them so that the values associated with
the same key appear together.
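As a rough sketch of the reduce logic itself, the summing reducer might look like the following in Python. The function name `reduce_sum` and the grouped sample input are assumptions for illustration. In a real job, the framework delivers the grouped values to the reducer.

```python
def reduce_sum(key, values):
    """Sum all values that share the same key."""
    return key, sum(values)


# Hypothetical reducer input: values already collected and sorted by key.
grouped_input = {"A": [1, 1, 1], "B": [1, 1], "C": [1], "D": [1]}

totals = dict(reduce_sum(key, values) for key, values in grouped_input.items())
print(totals)
```

Only the summing is code that we write; the grouping in `grouped_input` stands in for the work the framework does before the reducer is called.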
4.3.2 Using Multiple Reducers
Let us now consider that we want two reducers running on two different nodes. There are now
two partitions to which the keys can be sent. In this scenario, we have to figure out which
key is sent to which reducer. This process is called assigning partitions.
M04 Big Data Simplified XXXX 01.indd 80 5/10/2019 9:58:21 AM
Introducing MapReduce | 81
FIGURE 4.14 Keys assigned to specific partitions
[Figure: three mappers (M) over Split 1, Split 2, and Split 3; the output of each mapper is divided into Partition 1 and Partition 2, which feed two reducers (R).]
Therefore, after the map phase, the MapReduce framework assigns each key to a partition.
This assignment can be controlled by the developer, who can also increase the
amount of parallelism by running more reducers. For example, we can decide that key A
goes to partition 1, key B to partition 2, key C to partition 1, and key D to partition 2.
Thus, each key is assigned to exactly one partition.
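The assignment just described (A and C to partition 1, B and D to partition 2) can be written down as a lookup table. The Python sketch below is illustrative only. The names `PARTITION_OF` and `route` and the sample pairs are assumptions, not part of any MapReduce API.

```python
# Assignment from the text: A and C go to partition 1, B and D to partition 2.
PARTITION_OF = {"A": 1, "B": 2, "C": 1, "D": 2}


def route(pairs, partition_of):
    """Place each (key, value) pair into the partition assigned to its key."""
    partitions = {}
    for key, value in pairs:
        partitions.setdefault(partition_of[key], []).append((key, value))
    return partitions


# Hypothetical output of one mapper.
map_output = [("A", 1), ("B", 1), ("A", 1), ("C", 1), ("D", 1)]
routed = route(map_output, PARTITION_OF)
print(routed[1])  # pairs destined for the first reducer
print(routed[2])  # pairs destined for the second reducer
```

Because the table maps each key to exactly one partition, all pairs for a given key end up at the same reducer regardless of which mapper produced them.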
FIGURE 4.15 Partition function determines where each key goes
[Figure: three mappers (M) over Split 1, Split 2, and Split 3; the partition function places each key from every mapper into Partition 1 or Partition 2, and each partition feeds one of the two reducers (R).]
Internally, the framework runs a partition function to determine where each key goes.
The partition function has just one job: to look at the key and determine which
partition or node it belongs to. The way we partition the keys determines the
efficiency of the MapReduce operation. The partitioning should not be skewed such that one reducer
receives a large number of keys while the other reducer receives far fewer.
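One common way to build such a partition function is to hash the key and take the remainder modulo the number of reducers. The Python sketch below assumes that approach and uses `zlib.crc32` as a stand-in deterministic hash. It is not the framework's own hash function, and the names `partition_for` and `keys_per_partition` are hypothetical. Partitions are numbered from 0 here.

```python
import zlib
from collections import Counter


def partition_for(key, num_reducers):
    """Map a key to a partition index in the range 0..num_reducers-1.

    crc32 is an assumed stand-in hash, chosen because it gives the same
    result on every run and on every machine.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def keys_per_partition(keys, num_reducers):
    """Count the distinct keys landing in each partition, to spot skew."""
    counts = Counter(partition_for(key, num_reducers) for key in set(keys))
    return [counts.get(p, 0) for p in range(num_reducers)]


keys = ["A", "B", "A", "C", "D", "A"]
print(keys_per_partition(keys, 2))
```

Because the function depends only on the key, the same key always goes to the same partition. Comparing the counts returned by `keys_per_partition` is a simple way to check whether one reducer would receive far more keys than another.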
So, the cluster manager distributes the keys, which are the outputs of the map phase, to the
right partitions. Notice that A keys are always in partition 1 and B keys always belong to
partition 2. The same is true for the other codes. The number of partitions is equal to the number