reduceByKey

reduceByKey is another transformation for pair RDDs. It aggregates the values corresponding to each key using an associative reduce function, which is passed to the transformation as an argument.

Given the same PairRDD that we used for groupByKey, the reduceByKey operation can be performed as follows:


Java 7:

pairRDD.reduceByKey(new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer v1, Integer v2) throws Exception {
        return v1 + v2;
    }
});

Java 8:

pairRDD.reduceByKey((v1, v2) -> v1 + v2);

In the previous example, the reduceByKey transformation sums all the values corresponding to a key. The function accepts two parameters; both parameters and the return value have the same type as the value type of the PairRDD. The first parameter is the value accumulated so far and the second is the next value to be merged for that key.
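
To see the transformation end to end, the following is a minimal, self-contained sketch of the Java 8 form; the sample data, application name, and local master are illustrative assumptions, not part of the original example:

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class ReduceByKeyExample {
    public static void main(String[] args) {
        // Illustrative local configuration
        SparkConf conf = new SparkConf().setMaster("local").setAppName("reduceByKeyExample");
        try (JavaSparkContext jsc = new JavaSparkContext(conf)) {
            // Hypothetical pair RDD standing in for the one built in the groupByKey example
            JavaPairRDD<String, Integer> pairRDD = jsc.parallelizePairs(Arrays.asList(
                    new Tuple2<>("a", 1),
                    new Tuple2<>("b", 2),
                    new Tuple2<>("a", 3),
                    new Tuple2<>("b", 4)));

            // Sum the values for each key
            JavaPairRDD<String, Integer> summed = pairRDD.reduceByKey((v1, v2) -> v1 + v2);

            // Prints (a,4) and (b,6)
            summed.collect().forEach(System.out::println);
        }
    }
}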

Similar to groupByKey, the reduceByKey transformation also provides an overload that accepts a partitioner as an argument:

reduceByKey(Partitioner partitioner, Function2<V, V, V> func)
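
As a hedged illustration of that overload, the snippet below reuses the same summing function with a HashPartitioner; the partition count of 4 is an arbitrary assumption:

import org.apache.spark.HashPartitioner;

// Partition the result by key hash into 4 partitions while summing the values
JavaPairRDD<String, Integer> summed =
        pairRDD.reduceByKey(new HashPartitioner(4), (v1, v2) -> v1 + v2);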

For an RDD with multiple partitions, the reduceByKey transformation aggregates values at the partition level before shuffling, which reduces the amount of data sent over the network. This is similar to the combiner concept in Hadoop MapReduce. For a large dataset, the saving in network transfer can be substantial, so reduceByKey is recommended over groupByKey, as the comparison sketch below illustrates.
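
As a rough sketch of the difference (assuming the same pairRDD of integer values), both pipelines below compute identical per-key sums, but only the second pre-aggregates within each partition before the shuffle:

// groupByKey: shuffles every (key, value) pair, then sums on the receiving side
JavaPairRDD<String, Integer> viaGroupByKey = pairRDD.groupByKey()
        .mapValues(values -> {
            int sum = 0;
            for (Integer v : values) {
                sum += v;
            }
            return sum;
        });

// reduceByKey: combines values within each partition first, so less data crosses the network
JavaPairRDD<String, Integer> viaReduceByKey = pairRDD.reduceByKey((v1, v2) -> v1 + v2);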
