MapReduce functionality

All MapReduce programs/modules operate in two phases, as follows:

Map phase: This is the first phase. It converts a set of data into another set of data, in which individual elements are broken down into tuples (key-value pairs).
Reduce phase: This is the second phase. It takes the output of the map phase as its input and merges those data tuples into a smaller set of tuples.
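The two phases can be illustrated with a minimal, single-process sketch in Python. The word-count task and the function names map_phase and reduce_phase are assumptions chosen for illustration; they are not part of any MapReduce framework API, and a real job would run these steps on many nodes in parallel.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: break each input line into (word, 1) tuples."""
    pairs = []
    for line in lines:
        for word in line.lower().split():
            pairs.append((word, 1))
    return pairs


def reduce_phase(pairs):
    """Reduce: merge tuples that share a key into a smaller set of tuples."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


# Hypothetical input data, used only for this illustration.
lines = ["big data big cluster", "data node"]
mapped = map_phase(lines)
reduced = reduce_phase(mapped)
print(mapped)
print(reduced)
```

Here, the map phase emits one tuple per word, and the reduce phase collapses them into one tuple per distinct word, which is why the reduced set is smaller than the mapped set.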

A JobTracker divides a given problem into multiple map tasks. These tasks are distributed across the network to a number of slave nodes, referred to as TaskTrackers, for parallel processing. Generally, a map task runs on the cluster node that stores the data it processes. If that node is already heavily loaded, another node that is close to the data is chosen instead. Let's examine the work process of MapReduce, as shown in the following diagram:

Figure 2.3: Illustration of how MapReduce works
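Returning briefly to the task placement rule described before the figure, the following Python sketch shows one way such a decision could be made. It is a simplified illustration under stated assumptions, not the actual JobTracker scheduler: the pick_node function, the max_load threshold, the node names, and the use of rack membership as the measure of being "close to the data" are all hypothetical.

```python
def pick_node(block_hosts, load, rack_of, max_load):
    """Choose a node for a map task, preferring nodes that hold the data.

    block_hosts: nodes that store the input block (assumed known)
    load:        dict of node -> number of running tasks
    rack_of:     dict of node -> rack name (assumed proximity measure)
    max_load:    hypothetical threshold at which a node counts as heavily loaded
    """
    # 1. Prefer a node that stores the data and still has spare capacity.
    local = [n for n in block_hosts if load[n] < max_load]
    if local:
        return min(local, key=lambda n: load[n])

    # 2. Otherwise, prefer a node in the same rack as a copy of the data.
    data_racks = {rack_of[n] for n in block_hosts}
    nearby = [n for n in load
              if rack_of[n] in data_racks and load[n] < max_load]
    if nearby:
        return min(nearby, key=lambda n: load[n])

    # 3. As a last resort, fall back to the least loaded node overall.
    return min(load, key=lambda n: load[n])


# Hypothetical cluster state, used only for this illustration.
load = {"node1": 4, "node2": 1, "node3": 0}
rack_of = {"node1": "rackA", "node2": "rackA", "node3": "rackB"}

busy_choice = pick_node(["node1"], load, rack_of, max_load=4)
local_choice = pick_node(["node3"], load, rack_of, max_load=4)
print(busy_choice, local_choice)
```

In this example, node1 holds the data but is fully loaded, so the task goes to node2 in the same rack, whereas a task whose data lives on the idle node3 runs locally.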

The preceding diagram gives a brief overview of how the MapReduce algorithm works. A problem that is solved by a MapReduce program passes through several phases, in the order shown in the figure. Let's inspect each phase in detail, as follows:

Input phase: In the input phase, a record reader interprets each record in an input file and sends the parsed data to the mapper, in the form of key-value pairs. This is the first step in the MapReduce module.
Mapper: A mapper is a user-defined program module that takes a series of key-value pairs and processes each of them, in order to generate processed key-value pairs as the output.
Intermediate keys: The key-value pairs generated by the mappers are referred to as the intermediate keys. They serve as the input to the later phases, rather than as the final output.
Combiner: A combiner is a local reducer that groups similar data from the mapper into identifiable sets before that data leaves the map node. This is an optional phase that may or may not be present in any particular MapReduce program.
Shuffle and sort: The shuffling and sorting phase consumes the output from the mapper phase as its input. There is usually a large amount of intermediate data to be moved from all of the map nodes to all of the reduce nodes in the shuffle phase. The shuffle phase transfers this data through the network from the mapper disks, rather than from their main memories, and the intermediate output is sorted by key, so that all pairs with the same key are grouped together.
Reducer: The reducer consumes the grouped key-value data as input and executes a reducer function on each key and its group of values. The reducer function emits zero or more key-value pairs as output. This output is passed to the final step of the MapReduce module.
Output phase: There is an output formatter that translates the final key-value pairs from the reducer function and writes them into a file, using a record writer. The output file contains the final output of the subroutine.
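The phases listed above can be chained together in a small, single-process Python sketch. This is an illustration under stated assumptions rather than a real framework: the word-count task and every function name (record_reader, mapper, combiner, shuffle_and_sort, reducer, record_writer, run_job) are hypothetical, the input splits are plain strings instead of files, and the result is returned as text instead of being written to an output file.

```python
from collections import defaultdict


def record_reader(text):
    """Input phase: yield (character offset, line) pairs for one input split."""
    offset = 0
    for line in text.splitlines():
        yield offset, line
        offset += len(line) + 1  # +1 accounts for the newline character


def mapper(key, value):
    """Mapper: emit an intermediate (word, 1) pair for every word in the line."""
    for word in value.lower().split():
        yield word, 1


def combiner(pairs):
    """Combiner (optional): pre-aggregate the output of a single mapper."""
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def shuffle_and_sort(map_outputs):
    """Shuffle and sort: group values by key across all mappers, sorted by key."""
    grouped = defaultdict(list)
    for output in map_outputs:
        for key, value in output:
            grouped[key].append(value)
    return sorted(grouped.items())


def reducer(key, values):
    """Reducer: emit zero or more pairs per key; here, one total per word."""
    yield key, sum(values)


def record_writer(pairs):
    """Output phase: format the final pairs as tab-separated lines."""
    return "\n".join(f"{key}\t{value}" for key, value in pairs)


def run_job(splits):
    """Run every phase in order; each split stands in for one map task."""
    map_outputs = []
    for split in splits:
        intermediate = []
        for key, value in record_reader(split):
            intermediate.extend(mapper(key, value))
        map_outputs.append(combiner(intermediate))

    final = []
    for key, values in shuffle_and_sort(map_outputs):
        final.extend(reducer(key, values))
    return record_writer(final)


# Hypothetical input splits, used only for this illustration.
splits = ["big data big cluster\ndata node", "big node"]
result = run_job(splits)
print(result)
```

Because the combiner already sums the counts within each split, the reducer for the key big receives the two partial totals 2 and 1, rather than three separate 1 values, which reduces the amount of data that has to move during the shuffle.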
