Hadoop MapReduce

Hadoop MapReduce is a core component of the Hadoop framework. It is a software framework for processing large datasets in parallel across the nodes of a cluster. The topic can get confusing because people tend to use "MapReduce" to refer to both the general concept and the Hadoop component interchangeably, as if they were one and the same.

The concept of Map and Reduce will be covered in the Spark section later in this chapter. It is not tied to any specific framework or project. You may read articles implying that MapReduce is going away, to be replaced by something else. It is often not clearly spelled out, but such articles are referring to the Hadoop component, not to the concept.

The concept of Map and Reduce is not going anywhere; it has been around far longer than Hadoop and is a key part of many newer distributed processing frameworks. In this book, we will always refer to the component as Hadoop MapReduce to avoid any confusion.
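To see the concept apart from any framework, here is a minimal sketch in plain Java, using the map and reduce operations built into the standard streams API. The word list and the length-summing logic are purely illustrative.

import java.util.List;

public class MapReduceConcept {
    public static void main(String[] args) {
        List<String> words = List.of("big", "data", "processing");

        // Map: transform each element independently.
        // Reduce: combine the transformed values into a single result.
        int totalLength = words.stream()
                .map(String::length)      // map step: word -> its length
                .reduce(0, Integer::sum); // reduce step: sum the lengths

        System.out.println(totalLength); // prints 17
    }
}

Hadoop MapReduce, Spark, and other distributed frameworks apply this same pattern at scale, with the map and reduce steps running across many machines rather than within a single stream.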

In the Hadoop component, a MapReduce job normally divides the input dataset into independent chunks, which are processed by the map tasks in parallel across the nodes. The output of the map tasks is then sorted and passed as input to the reduce tasks. Hadoop MapReduce handles task scheduling and monitoring, and re-executes tasks if they fail.
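To make that flow concrete, the sketch below is modeled on the canonical word-count example from the Hadoop MapReduce documentation; the class name and command-line path arguments are illustrative. The map tasks emit a (word, 1) pair for each word in their chunk, the framework sorts and groups the pairs by word, and the reduce tasks sum the counts.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map task: runs in parallel on each input chunk,
    // emitting a (word, 1) pair for every word it sees.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce task: receives the sorted map output grouped by word
    // and sums the counts for each one.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        // The framework, not this code, handles scheduling, monitoring,
        // and re-execution of failed tasks.
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

A job like this would typically be packaged as a JAR and submitted with the hadoop jar command, with the HDFS input and output paths passed as arguments.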

Unless you are a professional Java programmer in the big data space, it is unlikely that you will interact with Hadoop MapReduce directly. In analytics, you are far more likely to use higher-level tools that rely on Hadoop MapReduce under the covers. For this reason, we will not spend much time on this component, even though it is a core part of the Hadoop ecosystem.
