HashPartitioner

HashPartitioner is the default partitioner in Spark and works by calculating a hash value for each key of the RDD elements. All the elements with the same hashcode end up in the same partition as shown in the following code snippet:

partitionIndex = hashcode(key) % numPartitions

The following is an example of the String hashCode() function and how we can generate partitionIndex:

scala> val str = "hello"
str: String = hello

scala> str.hashCode
res206: Int = 99162322

scala> val numPartitions = 8
numPartitions: Int = 8

scala> val partitionIndex = str.hashCode % numPartitions
partitionIndex: Int = 2
The default number of partitions is either from the Spark configuration parameter spark.default.parallelism or the number of cores in the cluster

The following diagram is an illustration of how hash partitioning works. We have an RDD with 3 elements a, b, and e. Using String hashcode we get the partitionIndex for each element based on the number of partitions set at 6:

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.221.182.150