Custom Partitioner

Along with HashPartitioner and RangePartitioner, Apache Spark provides an option to specify a custom partitioner if required. To create a custom partitioner, the user needs to extend the abstract org.apache.spark.Partitioner class and implement its two required methods: numPartitions(), which returns the number of partitions, and getPartition(Object key), which maps a key to a partition index between 0 and numPartitions() - 1.
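For comparison, the following is a minimal, self-contained sketch of using the built-in HashPartitioner on a pair RDD; the class name, application name, master URL, and sample data here are illustrative:

import java.util.Arrays;

import org.apache.spark.HashPartitioner;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class HashPartitionerExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("HashPartitionerExample").setMaster("local[*]");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        JavaPairRDD<String, String> pairRdd = jsc.parallelizePairs(Arrays.asList(
                new Tuple2<>("India", "Asia"),
                new Tuple2<>("Germany", "Europe")), 3);
        // HashPartitioner places each key by (essentially) taking its hashCode()
        // modulo the number of partitions
        JavaPairRDD<String, String> hashPartitioned = pairRdd.partitionBy(new HashPartitioner(2));
        System.out.println(hashPartitioned.getNumPartitions()); // prints 2
        jsc.close();
    }
}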

Consider a pair RDD whose keys are of type String. Our requirement is to split the tuples into two partitions based on the length of the keys: all keys with an odd length should go to one partition, and all keys with an even length to the other.

Let's create a custom partitioner based on this requirement:

import org.apache.spark.Partitioner;

public class CustomPartitioner extends Partitioner {

    // Two partitions: one for even-length keys, one for odd-length keys
    final int maxPartitions = 2;

    @Override
    public int getPartition(Object key) {
        // Even-length keys map to partition 0, odd-length keys to partition 1
        return ((String) key).length() % maxPartitions;
    }

    @Override
    public int numPartitions() {
        return maxPartitions;
    }
}
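Note that getPartition() must return a value in the range 0 to numPartitions() - 1, which the modulo operation guarantees here. As a general practice (not shown in this example), a custom partitioner should also override equals() and hashCode(), so that Spark can recognize when two RDDs share the same partitioning and avoid unnecessary shuffles.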

Now, we will use this partitioner to partition the RDD with String keys:

JavaPairRDD<String, String> pairRdd = jsc.parallelizePairs(Arrays.asList(
        new Tuple2<String, String>("India", "Asia"),
        new Tuple2<String, String>("Germany", "Europe"),
        new Tuple2<String, String>("Japan", "Asia"),
        new Tuple2<String, String>("France", "Europe")), 3);
JavaPairRDD<String, String> customPartitioned = pairRdd.partitionBy(new CustomPartitioner());
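As a quick, optional check, the returned RDD now carries our partitioner; partitioner() exposes it wrapped in an Optional:

// Optional check: the RDD should now report our custom partitioner
System.out.println(customPartitioned.partitioner().isPresent()); // prints true
System.out.println(customPartitioned.getNumPartitions()); // prints 2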

To verify that the partitioner is working as expected, we will print the keys belonging to each partition, as follows:

JavaRDD<String> mapPartitionsWithIndex = customPartitioned.mapPartitionsWithIndex(
        (index, tupleIterator) -> {
            List<String> list = new ArrayList<>();
            while (tupleIterator.hasNext()) {
                list.add("Partition number:" + index + ",key:" + tupleIterator.next()._1());
            }
            return list.iterator();
        }, true);
System.out.println(mapPartitionsWithIndex.collect());

It will produce the following output, which shows that the partitioner is behaving as expected. France has an even-length key (six characters), so it lands in partition 0; India, Germany, and Japan have odd-length keys and land in partition 1:

[Partition number:0,key:France, Partition number:1,key:India, Partition number:1,key:Germany, Partition number:1,key:Japan]

In this section, we looked in detail at RDD partitioning and learned about the different partitioners, with worked examples of each. In the next section, we will learn about some advanced transformations available in Spark.
