Partitioner

As already stated, for any shuffle operation on a pair RDD, the destination partition of each key is decided by the partitioner used during the shuffle. The choice of partitioner therefore plays a major role in determining how much data is moved across the network.

The basic requirement of a partitioner is that all records with the same key land in the same partition. Apache Spark provides a transformation called partitionBy, which lets the user repartition an RDD at any point with a partitioner of their choice. Also, as stated previously, the user can supply a partitioner of their own choosing to any pair RDD transformation that requires shuffling, such as reduceByKey, aggregateByKey, foldByKey, and so on; both approaches are sketched below.
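As an illustration, here is a minimal Scala sketch of both approaches. The sample data, the partition count of 4, and the local master setting are illustrative assumptions, not taken from the text:

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    object PartitionByExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("partitioner-demo").setMaster("local[*]"))

        val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)))

        // Explicitly repartition the pair RDD with a HashPartitioner of 4 partitions.
        val partitioned = pairs.partitionBy(new HashPartitioner(4))

        // Alternatively, pass the partitioner directly to a shuffle transformation;
        // reduceByKey then uses it to decide the destination partition of each key.
        val sums = pairs.reduceByKey(new HashPartitioner(4), _ + _)

        println(sums.collect().mkString(", "))
        sc.stop()
      }
    }

Because both resulting RDDs share the same partitioner, a later join between them can avoid re-shuffling the already-partitioned data.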

There are two built-in partitioners available in Spark, and Spark also provides an option to specify a custom partitioner. The following partitioner implementations are available in Spark (a short sketch of the two built-in ones follows the list):

  1. Hash Partitioner
  2. Range Partitioner
  3. Custom Partitioner
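
Here is a brief Scala sketch of the two built-in partitioners; the key type, data, and partition count are illustrative assumptions:

    import org.apache.spark.{HashPartitioner, RangePartitioner, SparkConf, SparkContext}

    object BuiltInPartitioners {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("built-in-partitioners").setMaster("local[*]"))

        val pairs = sc.parallelize(1 to 100).map(i => (i, i.toString))

        // HashPartitioner places a key by taking a non-negative modulo of its
        // hashCode against the number of partitions.
        val hashPart = new HashPartitioner(4)
        println(s"key 42 lands in partition ${hashPart.getPartition(42)}")

        // RangePartitioner samples the RDD to estimate boundaries so that sorted
        // key ranges end up in roughly equal-sized partitions.
        val rangePart = new RangePartitioner(4, pairs)
        val ranged = pairs.partitionBy(rangePart)
        println(s"range-partitioned into ${ranged.getNumPartitions} partitions")

        sc.stop()
      }
    }

HashPartitioner is Spark's default for most key types, while RangePartitioner is what sortByKey uses, since it keeps keys globally ordered across partitions.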