Chapter 3. Understanding MapReduce

The previous two chapters have discussed the problems that Hadoop allows us to solve, and gave some hands-on experience of running example MapReduce jobs. With this foundation, we will now go a little deeper.

In this chapter we will be:

  • Understanding how key/value pairs are the basis of Hadoop tasks
  • Learning the various stages of a MapReduce job
  • Examining the workings of the map, reduce, and optional combine stages in detail
  • Looking at the Java API for Hadoop and using it to develop some simple MapReduce jobs
  • Learning about Hadoop input and output

Key/value pairs

Since Chapter 1, What It's All About, we have been talking about operations that process and produce their output in terms of key/value pairs, without explaining why. It is time to address that.

What it means

Firstly, we will clarify just what we mean by key/value pairs by highlighting similar concepts in the Java standard library. The java.util.Map interface is the parent of commonly used classes such as HashMap and (through some library backward reengineering) even the original Hashtable.

For any Java Map object, its contents are a set of mappings from a given key of a specified type to a related value of a potentially different type. A HashMap object could, for example, contain mappings from a person's name (String) to his or her birthday (Date).
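To make the analogy concrete, here is a minimal sketch using only the standard library; the class name, the people, and the dates are purely illustrative:

    import java.util.Date;
    import java.util.GregorianCalendar;
    import java.util.HashMap;
    import java.util.Map;

    public class BirthdayExample {
        public static void main(String[] args) {
            // Keys are names (String); values are birthdays (Date).
            Map<String, Date> birthdays = new HashMap<String, Date>();

            // Months in GregorianCalendar are zero-based, so 11 is December.
            birthdays.put("Ada Lovelace", new GregorianCalendar(1815, 11, 10).getTime());
            birthdays.put("Alan Turing", new GregorianCalendar(1912, 5, 23).getTime());

            // Looking up a value by its key.
            System.out.println(birthdays.get("Alan Turing"));
        }
    }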

In the context of Hadoop, we are referring to data that also comprises keys that relate to associated values. This data is stored in such a way that the various values in the data set can be sorted and rearranged across a set of keys. If we are using key/value data, it will make sense to ask questions such as the following:

  • Does a given key have a mapping in the data set?
  • What are the values associated with a given key?
  • What is the complete set of keys?
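Each of these questions corresponds directly to a standard java.util.Map operation. The following sketch continues the hypothetical name-to-birthday example; the method and variable names are our own, chosen only for illustration:

    import java.util.Date;
    import java.util.Map;
    import java.util.Set;

    public class KeyValueQuestions {
        // Each question above expressed as a standard Map operation.
        static void answer(Map<String, Date> birthdays, String name) {
            boolean hasMapping = birthdays.containsKey(name); // does a given key have a mapping?
            Date value = birthdays.get(name);                 // what value is associated with this key?
            Set<String> allKeys = birthdays.keySet();         // what is the complete set of keys?

            System.out.println(name + " -> " + (hasMapping ? value : "no mapping")
                    + "; all keys: " + allKeys);
        }
    }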

Think back to WordCount from the previous chapter. We will go into it in more detail shortly, but the output of the program is clearly a set of key/value relationships; for each word (the key), there is a count (the value) of its number of occurrences. Think about this simple example and some important features of key/value data will become apparent, as follows:

  • Keys must be unique but values need not be
  • Each value must be associated with a key, but a key could have no values (though not in this particular example)
  • Careful definition of the key is important; deciding on whether or not the counts are applied with case sensitivity will give different results
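The last point is worth seeing in code. The following is a plain, local sketch of word counting, not the MapReduce WordCount itself, written only to show how the choice of key (here, whether words are lower-cased first) changes the result:

    import java.util.HashMap;
    import java.util.Map;

    public class LocalWordCount {
        public static void main(String[] args) {
            String text = "Hadoop hadoop MapReduce";
            boolean caseSensitive = false; // the key-definition decision discussed above

            Map<String, Integer> counts = new HashMap<String, Integer>();
            for (String word : text.split("\\s+")) {
                // The choice of key determines which occurrences are counted together.
                String key = caseSensitive ? word : word.toLowerCase();
                Integer current = counts.get(key);
                counts.put(key, current == null ? 1 : current + 1);
            }
            // With caseSensitive false, both spellings share one key (hadoop=2, mapreduce=1);
            // with it true, Hadoop and hadoop are counted separately.
            System.out.println(counts);
        }
    }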

Note

Note that we need to define carefully what we mean by keys being unique here. This does not mean the key occurs only once; in our data set we may see a key occur numerous times and, as we shall see, the MapReduce model has a stage where all values associated with each key are collected together. The uniqueness of keys guarantees that if we collect together every value seen for any given key, the result will be an association from a single instance of the key to every value mapped in such a way, and none will be omitted.

Why key/value data?

Using key/value data as the foundation of MapReduce operations allows for a powerful programming model that is surprisingly widely applicable, as can be seen by the adoption of Hadoop and MapReduce across a wide variety of industries and problem scenarios. Much data is either intrinsically key/value in nature or can be represented in such a way. It is a simple model with broad applicability and semantics straightforward enough that programs defined in terms of it can be applied by a framework like Hadoop.

Of course, the data model itself is not the only thing that makes Hadoop useful; its real power lies in how it uses the techniques of parallel execution, and divide and conquer discussed in Chapter 1, What It's All About. We can have a large number of hosts on which we can store and process data, and we can even use a framework that manages the division of the larger task into smaller chunks and the combination of partial results into the overall answer. But we need this framework to provide us with a way of expressing our problems that doesn't require us to be experts in the execution mechanics; we want to express the transformations required on our data and then let the framework do the rest. MapReduce, with its key/value interface, provides such a level of abstraction, whereby the programmer only has to specify these transformations and Hadoop handles the complex process of applying them to arbitrarily large data sets.

Some real-world examples

To make this less abstract, let's think of some real-world data that takes the form of key/value pairs:

  • An address book relates a name (key) to contact information (value)
  • A bank account uses an account number (key) to associate with the account details (value)
  • The index of a book relates a word (key) to the pages on which it occurs (value)
  • On a computer filesystem, filenames (keys) allow access to any sort of data, such as text, images, and sound (values)

These examples are intentionally broad in scope, to help you see that key/value data is not some very constrained model used only in high-end data mining but a very common model that is all around us.

We would not be having this discussion if this was not important to Hadoop. The bottom line is that if the data can be expressed as key/value pairs, it can be processed by MapReduce.

MapReduce as a series of key/value transformations

You may have come across MapReduce described in terms of key/value transformations, in particular the intimidating one looking like this:

{K1, V1} -> {K2, List<V2>} -> {K3, V3}

We are now in a position to understand what this means:

  • The input to the map method of a MapReduce job is a series of key/value pairs that we'll call K1 and V1.
  • The output of the map method (and hence input to the reduce method) is a series of keys and an associated list of values that are called K2 and V2. Note that each mapper simply outputs a series of individual key/value outputs; these are combined into a key and list of values in the shuffle stage.
  • The final output of the MapReduce job is another series of key/value pairs, called K3 and V3.

These sets of key/value pairs don't have to be different; it would be quite possible to input, say, names and contact details and output the same, with perhaps some intermediary format used in collating the information. Keep this three-stage model in mind as we explore the Java API for MapReduce next. We will first walk through the main parts of the API you will need and then do a systematic examination of the execution of a MapReduce job.
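As a preview of that API, here is a minimal sketch of how the three sets of types appear as the generic parameters of a Mapper and a Reducer. It assumes the org.apache.hadoop.mapreduce API and WordCount-style types purely for illustration; we will build up real jobs properly in the following sections.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // K1/V1: the line offset and line text read from the input.
    // K2/V2: the word and a count of 1 emitted by the mapper.
    // K3/V3: the word and its total count written as the final output.
    public class WordCountSkeleton {

        public static class WordMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {   // <K1, V1, K2, V2>
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String word : value.toString().split("\\s+")) {
                    if (!word.isEmpty()) {
                        context.write(new Text(word), new IntWritable(1)); // emit (K2, V2)
                    }
                }
            }
        }

        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {   // <K2, V2, K3, V3>
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable value : values) {                        // the List<V2> for this K2
                    sum += value.get();
                }
                context.write(key, new IntWritable(sum));                 // emit (K3, V3)
            }
        }
    }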

Pop quiz – key/value pairs

Q1. The concept of key/value pairs is…

  1. Something created by and specific to Hadoop.
  2. A way of expressing relationships we often see but don't think of as such.
  3. An academic concept from computer science.

Q2. Are username/password combinations an example of key/value data?

  1. Yes, it's a clear case of one value being associated to the other.
  2. No, the password is more of an attribute of the username, there's no index-type relationship.
  3. We'd not usually think of them as such, but Hadoop could still process a series of username/password combinations as key/value pairs.