Time for action – WordCount with a combiner

Let's add a combiner to our first WordCount example. In fact, let's use our reducer as the combiner. Since the combiner must have the same interface as the reducer, this is something you'll often see, though note that the type of processing involved in the reducer will determine if it is a true candidate for a combiner; we'll discuss this later. Since we are looking to count word occurrences, we can do a partial count on the map node and pass these subtotals to the reducer.

  1. Copy WordCount1.java to WordCount2.java and change the driver class to add the following line between the definition of the Mapper and Reducer classes:
            job.setCombinerClass(WordCountReducer.class);
  2. Also change the class name to WordCount2 and then compile it.
    $ javac WordCount2.java
  3. Create the JAR file.
    $ jar cvf wc2.jar WordCount2*class
  4. Run the job on Hadoop.
    $ hadoop jar wc2.jar WordCount2 test.txt output
  5. Examine the output.
    $ hadoop fs -cat output/part-r-00000

What just happened?

This output may not be what you expected, as the value for the word is is now incorrectly specified as 1 instead of 2.

The problem lies in how the combiner and reducer will interact. The value provided to the reducer, which was previously (is, 1, 1), is now (is, 2) because our combiner did its own summation of the number of elements for each word. However, our reducer does not look at the actual values in the Iterable object, it simply counts how many are there.

When you can use the reducer as the combiner

You need to be careful when writing a combiner. Remember that Hadoop makes no guarantees on how many times it may be applied to map output, it may be 0, 1, or more. It is therefore critical that the operation performed by the combiner can effectively be applied in such a way. Distributive operations such as summation, addition, and similar are usually safe, but, as shown previously, ensure the reduce logic isn't making implicit assumptions that might break this property.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.140.197.136