Pair RDD

Pair RDDs are RDDs consisting of key-value tuples, a shape that suits many use cases such as aggregating, sorting, and joining data. The keys and values can be simple types such as integers and strings, or more complex types such as case classes, arrays, lists, and other collections. This extensible key-value data model offers many advantages and is the fundamental concept behind the MapReduce paradigm.

A PairRDD can be created easily by applying a transformation to any RDD that converts its elements into key-value pairs.
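As a minimal sketch of this idea, assuming a spark-shell session where sc is already available, either map or keyBy can turn a plain RDD into an RDD of pairs. The words collection and the names byWord and byFirstLetter are made up purely for illustration:

```scala
// Assumes a spark-shell session, where sc (the SparkContext) is predefined.
// The sample data below is hypothetical and only for illustration.
val words = sc.parallelize(Seq("spark", "scala", "rdd"))

// map: build the (key, value) tuple explicitly; here, word -> its length
val byWord = words.map(word => (word, word.length))

// keyBy: derive the key with a function and keep the whole element as the value
val byFirstLetter = words.keyBy(word => word.head)

byWord.collect().foreach(println)
byFirstLetter.collect().foreach(println)
```

Both results are ordinary RDDs of tuples (RDD[(String, Int)] and RDD[(Char, String)] respectively), which is all that is needed for the key-value operations to become available on them.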

Let's read statesPopulation.csv into an RDD using the SparkContext, which is available as sc.

The following example shows a basic RDD of state population records, and the PairRDD obtained from it by splitting each record into a tuple (pair) of state and population:

scala> val statesPopulationRDD = sc.textFile("statesPopulation.csv")
statesPopulationRDD: org.apache.spark.rdd.RDD[String] = statesPopulation.csv MapPartitionsRDD[47] at textFile at <console>:25

scala> statesPopulationRDD.first
res4: String = State,Year,Population

scala> statesPopulationRDD.take(5)
res5: Array[String] = Array(State,Year,Population, Alabama,2010,4785492, Alaska,2010,714031, Arizona,2010,6408312, Arkansas,2010,2921995)

scala> val pairRDD = statesPopulationRDD.map(record => (record.split(",")(0), record.split(",")(2)))
pairRDD: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[48] at map at <console>:27

scala> pairRDD.take(10)
res59: Array[(String, String)] = Array((Alabama,4785492), (Alaska,714031), (Arizona,6408312), (Arkansas,2921995), (California,37332685), (Colorado,5048644), (Delaware,899816), (District of Columbia,605183), (Florida,18849098))
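The map above calls split twice per record, keeps the population as a String, and treats the header line like any other record. The following is a sketch of one possible alternative, not the method used in this example: it splits each record once, drops the header, and converts the population to a Long so that the values can be aggregated. The names header, typedPairRDD, and maxByState are illustrative, and the file is assumed to follow the State,Year,Population layout shown earlier with no malformed rows:

```scala
// Assumes the spark-shell session above, with statesPopulationRDD already defined.
val header = statesPopulationRDD.first()

val typedPairRDD = statesPopulationRDD
  .filter(record => record != header)      // drop the header line
  .map { record =>
    val fields = record.split(",")         // split once per record
    (fields(0), fields(2).trim.toLong)     // (state, population as Long)
  }

// With numeric values, per-key aggregation becomes possible,
// for example the largest population recorded for each state.
val maxByState = typedPairRDD.reduceByKey((a, b) => math.max(a, b))

maxByState.take(5).foreach(println)
```

Converting the value to a numeric type up front is a design choice: it fails fast on unparsable rows, and it lets operations such as reduceByKey work on the numbers directly rather than on strings.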

The following diagram of the preceding example shows how the RDD elements are converted to (key, value) pairs:
