Join

The join transformation is used to join two pair RDDs. It is very similar to the join operation in a database. Suppose we have two pair RDDs of types <X,Y> and <X,Z>. When the join transformation is executed on these RDDs, it returns an RDD of type <X,(Y,Z)>, with one element for each pair of records that share a key.

The following is an example of the join operation:

Let's start by creating two pair RDDs:

JavaPairRDD<String, String> pairRDD1 = javaSparkContext.parallelizePairs(Arrays.asList(new Tuple2<String, String>("B", "A"), new Tuple2<String, String>("C", "D"), new Tuple2<String, String>("D", "E"), new Tuple2<String, String>("A", "B")));
JavaPairRDD<String, Integer> pairRDD2 = javaSparkContext.parallelizePairs(Arrays.asList(new Tuple2<String, Integer>("B", 2), new Tuple2<String, Integer>("C", 5), new Tuple2<String, Integer>("D", 7), new Tuple2<String, Integer>("A", 8)));

Now, pairRDD1 will be joined with pairRDD2 as follows:

JavaPairRDD<String, Tuple2<String, Integer>> joinedRDD = pairRDD1.join(pairRDD2);
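To make the result of the join concrete, the following is a minimal sketch of inner-join semantics written with plain Java collections. It is not Spark code and not how Spark implements the join; the class InnerJoinSketch and the helpers innerJoin and pair are hypothetical names introduced only for illustration, and Map.Entry stands in for Tuple2.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of inner-join semantics; this is NOT Spark code.
final class InnerJoinSketch {

    // Hypothetical helper: pairs every left value with every right value sharing its key.
    static <K, V, W> List<Map.Entry<K, Map.Entry<V, W>>> innerJoin(
            List<Map.Entry<K, V>> left, List<Map.Entry<K, W>> right) {
        // Index the right side by key so each left record can look up its matches.
        Map<K, List<W>> rightByKey = new HashMap<>();
        for (Map.Entry<K, W> r : right) {
            rightByKey.computeIfAbsent(r.getKey(), k -> new ArrayList<>()).add(r.getValue());
        }
        List<Map.Entry<K, Map.Entry<V, W>>> joined = new ArrayList<>();
        for (Map.Entry<K, V> l : left) {
            List<W> matches = rightByKey.get(l.getKey());
            if (matches == null) {
                continue; // a key with no match on the right is dropped
            }
            for (W w : matches) {
                joined.add(pair(l.getKey(), pair(l.getValue(), w)));
            }
        }
        return joined;
    }

    // Hypothetical stand-in for Tuple2.
    static <A, B> Map.Entry<A, B> pair(A a, B b) {
        return new SimpleImmutableEntry<>(a, b);
    }

    public static void main(String[] args) {
        // Same data as pairRDD1 and pairRDD2 above.
        List<Map.Entry<String, String>> left = Arrays.asList(
                pair("B", "A"), pair("C", "D"), pair("D", "E"), pair("A", "B"));
        List<Map.Entry<String, Integer>> right = Arrays.asList(
                pair("B", 2), pair("C", 5), pair("D", 7), pair("A", 8));
        System.out.println(innerJoin(left, right));
    }
}
```

For the sample data, every key appears exactly once on each side, so the joined result holds four elements: B with (A, 2), C with (D, 5), D with (E, 7), and A with (B, 8). The sketch keeps the order of the left list, whereas an RDD returned by join should not be assumed to have any particular order.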

leftOuterJoin, rightOuterJoin, and fullOuterJoin are also supported and can be executed as follows:

pairRDD1.leftOuterJoin(pairRDD2);
pairRDD1.rightOuterJoin(pairRDD2);
pairRDD1.fullOuterJoin(pairRDD2);
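The outer joins differ from join in how they treat keys that appear on only one side. The following is a minimal plain-Java sketch of left-outer-join semantics, not Spark code. The class LeftOuterJoinSketch, its helpers, and the sample data in main are hypothetical. It uses java.util.Optional to mark a missing right-hand value; in Spark 2.x and later the Java API uses its own org.apache.spark.api.java.Optional for the same purpose.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Plain-Java illustration of left-outer-join semantics; this is NOT Spark code.
final class LeftOuterJoinSketch {

    // Hypothetical helper: keeps every left record, with an empty Optional when no match exists.
    static <K, V, W> List<Map.Entry<K, Map.Entry<V, Optional<W>>>> leftOuterJoin(
            List<Map.Entry<K, V>> left, List<Map.Entry<K, W>> right) {
        Map<K, List<W>> rightByKey = new HashMap<>();
        for (Map.Entry<K, W> r : right) {
            rightByKey.computeIfAbsent(r.getKey(), k -> new ArrayList<>()).add(r.getValue());
        }
        List<Map.Entry<K, Map.Entry<V, Optional<W>>>> joined = new ArrayList<>();
        for (Map.Entry<K, V> l : left) {
            List<W> matches = rightByKey.get(l.getKey());
            if (matches == null) {
                // Unmatched left record survives, paired with an empty Optional.
                joined.add(pair(l.getKey(), pair(l.getValue(), Optional.<W>empty())));
            } else {
                for (W w : matches) {
                    joined.add(pair(l.getKey(), pair(l.getValue(), Optional.of(w))));
                }
            }
        }
        return joined;
    }

    // Hypothetical stand-in for Tuple2.
    static <A, B> Map.Entry<A, B> pair(A a, B b) {
        return new SimpleImmutableEntry<>(a, b);
    }

    public static void main(String[] args) {
        // Hypothetical data: key "E" exists only on the left side.
        List<Map.Entry<String, String>> left = Arrays.asList(pair("B", "A"), pair("E", "F"));
        List<Map.Entry<String, Integer>> right = Arrays.asList(pair("B", 2));
        System.out.println(leftOuterJoin(left, right));
    }
}
```

A right outer join is the mirror image: every right record is kept and the left value becomes optional. A full outer join keeps unmatched records from both sides, so both values are optional.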

When an RDD is partitioned across multiple nodes, records with the same key are shuffled to the same executor node for the join operation. For this reason, every join operation has an overload that lets the user provide a partitioner as well:

join(JavaPairRDD<K,W> other, Partitioner partitioner)
leftOuterJoin(JavaPairRDD<K,W> other, Partitioner partitioner)
rightOuterJoin(JavaPairRDD<K,W> other, Partitioner partitioner)
fullOuterJoin(JavaPairRDD<K,W> other, Partitioner partitioner)
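A partitioner decides which partition each key is sent to during the shuffle. The following plain-Java sketch shows the idea behind hash partitioning, where the partition is the non-negative remainder of the key's hash code divided by the number of partitions. The class HashPartitionSketch and the method partitionFor are hypothetical names used for illustration; in Spark itself, a call such as pairRDD1.join(pairRDD2, new HashPartitioner(4)) would supply a hash partitioner with four partitions.

```java
// Plain-Java illustration of hash partitioning; this is NOT Spark code.
final class HashPartitionSketch {

    // Hypothetical helper: maps a key to a partition index in [0, numPartitions).
    static int partitionFor(Object key, int numPartitions) {
        if (key == null) {
            return 0; // assumption of this sketch: null keys go to partition 0
        }
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        // Equal keys always produce the same index, so matching records from
        // both RDDs end up in the same partition and can be joined there.
        for (String key : new String[] {"A", "B", "C", "D"}) {
            System.out.println(key + " -> partition " + partitionFor(key, 4));
        }
    }
}
```

Because the index depends only on the key and the partition count, both sides of a join agree on where each key lives, provided they use the same partitioner.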