As already stated, for any shuffle operation on a pair RDD, the destination partition for each key is decided by the partitioner used during the shuffle. The choice of partitioner therefore plays a major role in how much data is moved across the network.
The basic requirement of a partitioner is that equal keys must land in the same partition. Apache Spark provides a transformation called partitionBy, which lets the user repartition an RDD at any point with a partitioner of their choice. Also, as stated previously, the user can supply their own partitioner to any pair RDD transformation that requires shuffling, such as reduceByKey, aggregateByKey, foldByKey, and so on.
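The contract that equal keys always land in the same partition is what makes shuffle transformations work: once records are routed by key, a reduceByKey-style aggregation can run locally within each partition. The following is a minimal sketch in plain Python of that routing idea (it is not Spark's actual implementation; `partition_by` and `partition_func` are illustrative names):

```python
def partition_by(pairs, num_partitions, partition_func=hash):
    """Route each (key, value) pair to a partition chosen purely from its key.

    Equal keys always produce the same partition index, so they always
    land in the same partition -- the basic contract of a partitioner.
    """
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        # Python's % always yields a non-negative index here, even for
        # negative hash values.
        index = partition_func(key) % num_partitions
        partitions[index].append((key, value))
    return partitions

pairs = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
partitions = partition_by(pairs, 4)

# Every key lives in exactly one partition, so a reduceByKey-style
# aggregation needs no further cross-partition data movement.
for key in {k for k, _ in pairs}:
    homes = [i for i, part in enumerate(partitions)
             if any(k == key for k, _ in part)]
    assert len(homes) == 1
```

Note that all occurrences of `"a"` and `"b"` end up co-located, which is exactly why supplying the same partitioner to consecutive transformations can avoid repeated shuffles.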
Spark ships with two built-in partitioners and also provides an option to specify a custom partitioner. The following are the built-in partitioner implementations:
- Hash Partitioner
- Range Partitioner
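The two strategies differ in how they map a key to a partition: a hash partitioner uses the key's hash code modulo the partition count, while a range partitioner assigns contiguous key ranges to partitions based on precomputed boundaries (Spark's RangePartitioner derives those boundaries by sampling the RDD). A simplified sketch of both ideas in plain Python (not Spark's actual implementation; the boundary values below are made up for illustration):

```python
import bisect

def hash_partition(key, num_partitions):
    # HashPartitioner-style: the partition follows from the key's hash code.
    return hash(key) % num_partitions

def range_partition(key, boundaries):
    # RangePartitioner-style: keys are split at sorted boundaries, so each
    # partition holds a contiguous range of the key space. This is what
    # makes range partitioning suitable for sorted output (e.g. sortByKey).
    return bisect.bisect_left(boundaries, key)

# Boundaries [10, 20, 30] define four partitions:
# keys <= 10 -> 0, (10, 20] -> 1, (20, 30] -> 2, keys > 30 -> 3.
boundaries = [10, 20, 30]
print(range_partition(5, boundaries))   # 0
print(range_partition(25, boundaries))  # 2
```

Hash partitioning spreads keys evenly when hash codes are well distributed but preserves no ordering; range partitioning keeps partition-level ordering at the cost of needing representative boundaries, which is why Spark samples the data to compute them.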