CoGroup

This operation also groups two PairRDD. Consider, we have two PairRDD of <X,Y> and <X,Z> types . When CoGroup transformation is executed on these RDDs, it will return an RDD of <X,(Iterable<Y>,Iterable<Z>)> type. This operation is also called groupwith.

The following is an example of CoGroup transformation. Let's start with creating two pair RDDs:

JavaPairRDD<String, String> pairRDD1 = javaSparkContext.parallelizePairs(Arrays.asList(new Tuple2<String, String>("B", "A"), new Tuple2<String, String>("B", "D"), new Tuple2<String, String>("A", "E"), new Tuple2<String, String>("A", "B")));

JavaPairRDD<String, Integer> pairRDD2 = javaSparkContext.parallelizePairs(Arrays.asList(new Tuple2<String, Integer>("B", 2), new Tuple2<String, Integer>("B", 5), new Tuple2<String, Integer>("A", 7), new Tuple2<String, Integer>("A", 8)));

CoGroup transformation on these RDDs can be executed as follows:

JavaPairRDD<String, Tuple2<Iterable<String>, Iterable<Integer>>> cogroupedRDD = pairRDD1.cogroup(pairRDD2);

or

JavaPairRDD<String, Tuple2<Iterable<String>, Iterable<Integer>>> groupWithRDD = pairRDD1.groupWith(pairRDD2);

As this operation also requires shuffling of the same keys to one executor on a partitioned RDD, an overloaded function is available to provide partitioner:

cogroup(JavaPairRDD<K,V> other,Partitioner partitioner)

This section described various useful transformations, their usage, and their implementation in Java. In the next section, we will learn about various actions available in Spark that can be executed on RDDs.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.117.229.44