MapReduce, as the name implies, has a map phase and a reduce phase. The map phase applies a function to each element of its input, transforming that element into a new value. In MapReduce, the output of the map phase is a set of key-value pairs.
In practice, map functions can be complex and involve advanced functionalities.
The reduce phase takes the key-value pairs produced by the map function and performs a summarization operation. For example, consider the output of a map operation over the ages of students in different grades in a school:
| Student name | Class   | Age |
|--------------|---------|-----|
| John         | Grade 1 | 7   |
| Mary         | Grade 2 | 8   |
| Jill         | Grade 1 | 6   |
| Tom          | Grade 3 | 10  |
| Mark         | Grade 3 | 9   |
We can create a simple key-value pair from each record, taking Class as the key and Age as the value (any fields could serve; these are chosen just for the example). In this case, our key-value pairs would be (Grade 1, 7), (Grade 2, 8), (Grade 1, 6), (Grade 3, 10), and (Grade 3, 9).
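A minimal sketch of this map step in Python (the record layout and function name are illustrative, not part of any Hadoop API):

```python
# Each input record is a (name, class, age) tuple; the map function
# emits a (class, age) key-value pair for every record.
def map_student(record):
    name, grade, age = record
    return (grade, age)

students = [
    ("John", "Grade 1", 7),
    ("Mary", "Grade 2", 8),
    ("Jill", "Grade 1", 6),
    ("Tom",  "Grade 3", 10),
    ("Mark", "Grade 3", 9),
]

pairs = [map_student(r) for r in students]
# pairs == [("Grade 1", 7), ("Grade 2", 8), ("Grade 1", 6),
#           ("Grade 3", 10), ("Grade 3", 9)]
```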
An operation that calculates the average of the ages of students in each grade could then be defined as a reduce operation.
More concretely, we can sort the output and then send the tuples corresponding to each grade to a different server.
For example, Server A would receive the tuples (Grade 1, 7) and (Grade 1, 6), Server B would receive the tuple (Grade 2, 8), Server C would receive the tuples (Grade 3, 10) and (Grade 3, 9). Each of the servers, A, B, and C, would then find the average of the tuples and report back (Grade 1, 6.5), (Grade 2, 8), and (Grade 3, 9.5).
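The per-server averaging can be sketched as a reduce function (names here are illustrative); each call stands in for one server receiving all the ages for a single grade:

```python
# Reduce step: one call per key (grade), receiving every value (age)
# that was routed to this server, and returning the average.
def reduce_average(grade, ages):
    return (grade, sum(ages) / len(ages))

# Server A, B, and C from the example:
print(reduce_average("Grade 1", [7, 6]))   # ('Grade 1', 6.5)
print(reduce_average("Grade 2", [8]))      # ('Grade 2', 8.0)
print(reduce_average("Grade 3", [10, 9]))  # ('Grade 3', 9.5)
```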
Observe that this process involved an intermediary step: sorting the output by key to determine which server each pair should be sent to, and then sending it there. And indeed, MapReduce requires a shuffle and sort phase, in which the key-value pairs are sorted and routed so that each reducer receives a fixed set of unique keys.
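The shuffle-and-sort phase can be sketched as grouping the mapper output by key, so that each reducer receives every value for the keys it owns (a simplification of what Hadoop does across machines):

```python
from collections import defaultdict

# Shuffle/sort sketch: sort the (key, value) pairs and group the
# values under their key. In a real cluster, each group would then
# be sent to the reducer responsible for that key.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in sorted(pairs):
        groups[key].append(value)
    return dict(groups)

pairs = [("Grade 1", 7), ("Grade 2", 8), ("Grade 1", 6),
         ("Grade 3", 10), ("Grade 3", 9)]
grouped = shuffle(pairs)
# grouped == {"Grade 1": [6, 7], "Grade 2": [8], "Grade 3": [9, 10]}
```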
In this example, if, say, there were only two servers instead of three, Server A could be assigned the keys for Grades 1 and 2, and Server B could be assigned the key for Grade 3.
In Hadoop, the following process takes place during MapReduce:
- The client sends a request for a task.
- The NameNode allocates DataNodes (individual servers) to perform the map operation and others to perform the reduce operation. Note that DataNode selection depends on data locality: only the servers where the required data resides perform the map operation, so that each mapper reads its input locally.
- DataNodes perform the map phase and produce key-value (k,v) pairs.
- As the mappers produce the (k, v) pairs, the pairs are sent to the reduce nodes based on which keys each node is assigned to compute. The allocation of keys to servers is determined by a partitioner function, which can be as simple as a hash of the key (the default in Hadoop).
- Once a reduce node has received the data for the keys it is responsible for, it applies the reduce function and generates the final output.
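A hash partitioner in the spirit of Hadoop's default (which computes the key's hash code modulo the number of reduce tasks) can be sketched as follows; MD5 is used here only to make the assignment deterministic across runs, and is not what Hadoop itself uses:

```python
import hashlib

# Partitioner sketch: map a key to one of num_reducers reduce nodes.
# Every pair sharing a key always lands on the same reducer.
def partition(key, num_reducers):
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_reducers

# With two reducers, each grade is consistently assigned to one node:
for key in ["Grade 1", "Grade 2", "Grade 3"]:
    print(key, "-> reducer", partition(key, 2))
```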