An introduction to MapReduce

MapReduce is a programming model that allows you to express algorithms for efficient execution on a distributed system. The MapReduce model was first introduced by Google in 2004 (https://research.google.com/archive/mapreduce.html) as a way to automatically partition datasets across different machines, process each partition locally, and handle the communication between cluster nodes.

The MapReduce framework was used in cooperation with a distributed filesystem, the Google File System (GFS or GoogleFS), which was designed to partition and replicate data across the computing cluster. Partitioning was useful for storing and processing datasets that wouldn't fit on a single node, while replication ensured that the system could handle failures gracefully. Google used MapReduce, in conjunction with GFS, to index its web pages. Later on, the MapReduce and GFS concepts were implemented by Doug Cutting (at the time, an employee at Yahoo!), resulting in the first versions of the Hadoop Distributed File System (HDFS) and Hadoop MapReduce.

The programming model exposed by MapReduce is actually quite simple. The idea is to express the computation as a combination of two, fairly generic, steps: Map and Reduce. Some readers will probably be familiar with Python's map and reduce functions; however, in the context of MapReduce, the Map and Reduce steps are capable of representing a broader class of operations.
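
Readers who want to refresh their memory of those builtins can look at the following minimal sketch (reduce lives in functools in Python 3; the numbers and lambdas are purely illustrative):

from functools import reduce

numbers = [1, 2, 3, 4]

# map applies a function to every element of a sequence.
squares = map(lambda x: x * x, numbers)   # -> 1, 4, 9, 16

# reduce combines the elements of a sequence into a single value.
total = reduce(lambda acc, x: acc + x, squares)

print(total)  # 30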

Map takes a collection of data as input and produces a transformation on this data. What Map generally emits is a series of (key, value) pairs that can be passed to a Reduce step. The Reduce step aggregates the items that share the same key and applies a function to each group to produce a usually smaller collection of values.
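
Before moving to a concrete example, the following sketch shows how the two steps fit together in plain Python; the function names, the sample records, and the use of a dictionary to group values by key are our own illustrative choices, not a framework API:

from collections import defaultdict

def map_step(record):
    # Emit zero or more (key, value) pairs for a single input record.
    yield (record["category"], record["amount"])

def reduce_step(key, values):
    # Aggregate all the values that share the same key.
    return key, sum(values)

records = [{"category": "a", "amount": 2},
           {"category": "b", "amount": 5},
           {"category": "a", "amount": 3}]

# Group the emitted pairs by key, then reduce each group.
groups = defaultdict(list)
for record in records:
    for key, value in map_step(record):
        groups[key].append(value)

results = [reduce_step(key, values) for key, values in groups.items()]
print(results)  # [('a', 5), ('b', 5)]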

The estimation of pi, which was shown in the last chapter, can be trivially expressed as a combination of Map and Reduce steps. In that case, the input was a collection of pairs of random numbers, the transformation (the Map step) was the hit test, and the Reduce step counted the number of times the hit test was True.
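
A possible single-process version of that conversion is sketched below; the hit_test helper and the sample size are our own choices and may differ from the implementation shown in the previous chapter:

import random

def hit_test(point):
    # Map step: emit 1 if the random point falls inside the unit circle.
    x, y = point
    return 1 if x * x + y * y <= 1 else 0

samples = 100000
points = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(samples)]

hits = map(hit_test, points)    # Map: transform each pair into 0 or 1
total_hits = sum(hits)          # Reduce: count how many tests succeeded
pi_estimate = 4 * total_hits / samples
print(pi_estimate)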

The prototypical example of the MapReduce model is the implementation of a word count; the program takes a series of documents as input and returns, for each word, the total number of occurrences in the document collection. The following figure illustrates the Map and Reduce steps of the word count program. On the left, we have the input documents. The Map operation produces, for each word, a (key, value) entry where the key is the word and the value is 1 (every occurrence contributes 1 to the final count).

We then perform the Reduce operation to aggregate all the elements that share the same key and produce the global count for each word. In the figure, we can see how all the values of the items with the key the are summed to produce the final entry (the, 4).
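
The word count can be sketched in pure Python as follows; the documents are made up, and a real framework would distribute the intermediate pairs across nodes rather than keep them in a single dictionary:

from collections import defaultdict

documents = ["the cat sat on the mat",
             "the dog chased the cat"]

# Map step: emit a (word, 1) pair for every word in every document.
pairs = []
for document in documents:
    for word in document.split():
        pairs.append((word, 1))

# Reduce step: sum the values of the pairs that share the same key.
counts = defaultdict(int)
for word, value in pairs:
    counts[word] += value

print(counts["the"])  # 4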

If we express our algorithm using Map and Reduce operations, the framework implementation will ensure that data production and aggregation are carried out efficiently, limiting the communication between nodes through clever algorithms.

However, how does MapReduce manage to keep communication to a minimum? Let's go through the journey of a MapReduce task. Imagine that you have a cluster with two nodes; a partition of the data (which is usually stored locally on each node) is loaded from disk on each node and is ready for processing. A mapper process is created on each node and processes its partition to produce the intermediate results.

Next, it is necessary to send the data to the reducers for further processing. To do this, however, all the items that share the same key must be shipped to the same reducer. This operation is called shuffling and is the principal communication task in the MapReduce model.

Note that, before the data exchange happens, it is necessary to assign a subset of keys to each reducer; this step is called partitioning. Once a reducer receives its own partition of keys, it is free to process the data and write the resulting output to disk.
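
The following single-process simulation makes the shuffling and partitioning steps concrete; assigning keys with a hash function is a common strategy, but the mapper outputs here are invented and real frameworks may partition keys differently:

from collections import defaultdict

num_reducers = 2

def partition(key):
    # Assign each key to one of the reducers.
    return hash(key) % num_reducers

# Intermediate (key, value) pairs produced by the mappers on each node.
mapper_outputs = [[("the", 1), ("cat", 1), ("the", 1)],   # node 1
                  [("dog", 1), ("the", 1)]]               # node 2

# Shuffling: ship every pair to the reducer responsible for its key.
reducer_inputs = [defaultdict(list) for _ in range(num_reducers)]
for output in mapper_outputs:
    for key, value in output:
        reducer_inputs[partition(key)][key].append(value)

# Each reducer aggregates the keys in its own partition and writes the result.
for reducer_id, groups in enumerate(reducer_inputs):
    for key, values in groups.items():
        print(reducer_id, key, sum(values))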

The MapReduce framework (through the Apache Hadoop project) has been extensively used in its original form by many companies and organizations. More recently, new frameworks that extend the ideas introduced by MapReduce have been developed to create systems that can express more complex workflows, use memory more efficiently, and support lean and efficient execution of distributed tasks.

In the following sections, we will describe two of the most widely used libraries in the Python distributed computing landscape: Dask and PySpark.
