ShuffledRDD

ShuffledRDD redistributes the elements of an RDD by key so that all values for the same key end up in the same partition, and therefore on the same executor, where aggregation or combining logic can be applied. The class is parameterized by the key type K, the value type V, and the combiner type C:

class ShuffledRDD[K, V, C] extends RDD[(K, C)]

A good example is what happens when reduceByKey() is called on a pair RDD. The following reduceByKey operation on pairRDD counts the records for each State:

scala> val pairRDD = statesPopulationRDD.map(record => (record.split(",")(0), 1))
pairRDD: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[82] at map at <console>:27

scala> pairRDD.take(5)
res101: Array[(String, Int)] = Array((State,1), (Alabama,1), (Alaska,1), (Arizona,1), (Arkansas,1))

scala> val shuffledRDD = pairRDD.reduceByKey(_+_)
shuffledRDD: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[83] at reduceByKey at <console>:29

scala> shuffledRDD.take(5)
res102: Array[(String, Int)] = Array((Montana,7), (California,7), (Washington,7), (Massachusetts,7), (Kentucky,7))
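
To make the mechanics behind that result more concrete, the shuffle can be modeled by hand. The following is only a minimal sketch in plain Scala (no Spark involved), not Spark's actual implementation. The names ShuffleSketch, partitionFor, combineLocally, and shuffleAndReduce are hypothetical, and the sample records are made up for illustration. It assumes hash partitioning: each key is assigned to a partition by a non-negative modulo of its hash code, values are first combined within each input partition, and the partial sums are then routed to the partition that owns the key and merged there.

```scala
// Plain-Scala model of a hash-partitioned shuffle; illustrative only, no Spark required.
object ShuffleSketch {
  // Assumed partitioning rule: non-negative key hash modulo the partition count.
  def partitionFor(key: String, numPartitions: Int): Int = {
    val mod = key.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod
  }

  // Map side: sum the values per key inside a single input partition.
  def combineLocally(records: Seq[(String, Int)]): Map[String, Int] =
    records.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

  // Shuffle and reduce side: route each partial sum to the partition
  // that owns its key, then merge the partial sums per key.
  def shuffleAndReduce(
      inputPartitions: Seq[Seq[(String, Int)]],
      numPartitions: Int): Vector[Map[String, Int]] = {
    val partials = inputPartitions.flatMap(p => combineLocally(p).toSeq)
    val routed = partials.groupBy { case (k, _) => partitionFor(k, numPartitions) }
    Vector.tabulate(numPartitions) { i =>
      routed.getOrElse(i, Seq.empty[(String, Int)])
        .groupBy(_._1)
        .map { case (k, vs) => k -> vs.map(_._2).sum }
    }
  }

  def main(args: Array[String]): Unit = {
    // Two hypothetical input partitions of (State, 1) records.
    val input = Seq(
      Seq(("Alabama", 1), ("Alaska", 1), ("Alabama", 1)),
      Seq(("Alaska", 1), ("Arizona", 1), ("Alabama", 1))
    )
    val output = shuffleAndReduce(input, numPartitions = 3)
    output.zipWithIndex.foreach { case (part, i) => println(s"partition $i: $part") }
  }
}
```

Whichever output partition a State lands in, it lands in exactly one, so its count can be completed there without consulting any other partition. That is the property the shuffle exists to provide.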

The following diagram illustrates the shuffle by key, which sends records with the same key (State) to the same partition:
