HadoopRDD

HadoopRDD provides the core functionality for reading data stored in HDFS using the older MapReduce API (org.apache.hadoop.mapred) from the Hadoop 1.x libraries. HadoopRDD is the default implementation used, and can be seen whenever data is loaded from a file system into an RDD:

class HadoopRDD[K, V] extends RDD[(K, V)]

When loading the state population records from the CSV, the underlying base RDD is actually HadoopRDD as in the following code snippet:

scala> val statesPopulationRDD = sc.textFile("statesPopulation.csv")
statesPopulationRDD: org.apache.spark.rdd.RDD[String] = statesPopulation.csv MapPartitionsRDD[93] at textFile at <console>:25

scala> statesPopulationRDD.toDebugString
res110: String =
(2) statesPopulation.csv MapPartitionsRDD[93] at textFile at <console>:25 []
| statesPopulation.csv HadoopRDD[92] at textFile at <console>:25 []
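
Under the hood, sc.textFile delegates to sc.hadoopFile with the old-API TextInputFormat, which is what produces the HadoopRDD[92] visible in the debug string above; the key is the byte offset of each line and the value is the line itself. A rough sketch of the equivalent explicit call (assuming a Spark shell where sc is in scope, and the same statesPopulation.csv file used above):

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// Roughly what sc.textFile("statesPopulation.csv") does internally:
// build a HadoopRDD of (byte offset, line) pairs using the old
// MapReduce API's TextInputFormat...
val records = sc.hadoopFile[LongWritable, Text, TextInputFormat](
  "statesPopulation.csv")

// ...then map each pair down to just the line text, dropping the offset.
val statesPopulationRDD = records.map(pair => pair._2.toString)
```

The extra map step is why toDebugString shows a MapPartitionsRDD layered on top of the HadoopRDD.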

The following diagram is an illustration of a HadoopRDD created by loading a text file from the file system into an RDD:
