countByValue

The countByValue() action can be applied to both an RDD and a pair RDD; however, the returned object has a different signature in each case. When countByValue() is applied to a plain RDD, it counts the occurrences of each distinct element in that RDD and returns a Map with the RDD elements as keys and their occurrence counts as values.
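
The plain RDD case can be illustrated with a minimal sketch like the following; it assumes the same setup as the other snippets, that is, an existing JavaSparkContext named sparkContext, and the element values are chosen here only for illustration:

//countByValue on a plain RDD
JavaRDD<String> javaRDD = sparkContext.parallelize(Arrays.asList("a", "b", "a", "c", "a"));
Map<String, Long> countMap = javaRDD.countByValue();
//expected counts: a=3, b=1, c=1 (the iteration order of the map is not guaranteed)
for (Entry<String, Long> entry : countMap.entrySet()) {
  System.out.println("The element " + entry.getKey() + " occurs " + entry.getValue() + " times");
}

When the same action is applied to a pair RDD, each Tuple2 element is treated as a single value, as the following example shows: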

//countByValue
List<Tuple2<String, Integer>> list = new ArrayList<Tuple2<String, Integer>>();
list.add(new Tuple2<String, Integer>("a", 1));
list.add(new Tuple2<String, Integer>("b", 2));
list.add(new Tuple2<String, Integer>("c", 3));
list.add(new Tuple2<String, Integer>("a", 4));
JavaPairRDD<String, Integer> pairRDD = sparkContext.parallelizePairs(list);
Map<Tuple2<String, Integer>, Long> countByValueMap = pairRDD.countByValue();
for (Entry<Tuple2<String, Integer>, Long> entrySet : countByValueMap.entrySet()) {
  System.out.println("The key of Map is a tuple having value : "
      + entrySet.getKey()._1() + " and " + entrySet.getKey()._2()
      + " and the value of count is : " + entrySet.getValue());
}

When countByValue() is applied to a pair RDD, it returns the count of each unique value in the RDD as a map of (value, count) pairs, where a value is the whole Tuple2 element of the pair RDD. In the code snippet there are two tuples with the same key, a, but with different values; countByValue() therefore treats them as two distinct values, and the resulting map contains four entries, each with a count of 1. The final combine step happens locally on the driver, equivalent to running a single reduce task.
