repartition

repartition applies a transformation function to input partitions to repartition the input into fewer or more output partitions in the output RDD.

As shown in the following code snippet, this is how we can map an RDD of a text file to an RDD with more partitions:

scala> val rdd_two = sc.textFile("wiki1.txt")
rdd_two: org.apache.spark.rdd.RDD[String] = wiki1.txt MapPartitionsRDD[8] at textFile at <console>:24

scala> rdd_two.partitions.length
res21: Int = 2

scala> val rdd_three = rdd_two.repartition(5)
rdd_three: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[25] at repartition at <console>:26

scala> rdd_three.partitions.length
res23: Int = 5

The following diagram explains how repartition works. You can see that a new RDD is created from the original RDD, essentially redistributing the partitions by combining/splitting them as needed:

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.144.237.122