NewHadoopRDD

NewHadoopRDD provides the core functionality for reading data stored in HDFS, HBase tables, and Amazon S3 using the new MapReduce API from the Hadoop 2.x libraries. Because NewHadoopRDD can read data in many different formats, it is used to interact with several external systems.

Prior to NewHadoopRDD, HadoopRDD was the only available option; it used the old MapReduce API from Hadoop 1.x. The declaration of NewHadoopRDD is as follows:

class NewHadoopRDD[K, V](
    sc: SparkContext,
    inputFormatClass: Class[_ <: InputFormat[K, V]],
    keyClass: Class[K],
    valueClass: Class[V],
    @transient private val _conf: Configuration)
  extends RDD[(K, V)]

As seen in the preceding code snippet, NewHadoopRDD takes an input format class, a key class, and a value class. Let's look at examples of NewHadoopRDD.

The simplest example is to use SparkContext's wholeTextFiles function to create a WholeTextFileRDD. WholeTextFileRDD actually extends NewHadoopRDD, as shown in the following code snippet:

scala> val rdd_whole = sc.wholeTextFiles("wiki1.txt")
rdd_whole: org.apache.spark.rdd.RDD[(String, String)] = wiki1.txt MapPartitionsRDD[3] at wholeTextFiles at <console>:31

scala> rdd_whole.toDebugString
res9: String =
(1) wiki1.txt MapPartitionsRDD[3] at wholeTextFiles at <console>:31 []
| WholeTextFileRDD[2] at wholeTextFiles at <console>:31 []
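
Each element of the resulting pair RDD is a (filePath, fileContent) tuple, which makes wholeTextFiles handy when every file should be processed as a single unit. The following is a minimal sketch of inspecting those pairs (it assumes the same wiki1.txt sample file loaded above):

// each element is (file path, entire file content)
val fileSizes = rdd_whole.map { case (path, content) => (path, content.length) }
// prints each file's path alongside the length of its content
fileSizes.collect().foreach(println)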

Let's look at another example, where we will use the newAPIHadoopFile function of the SparkContext:

import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat
import org.apache.hadoop.io.Text

val newHadoopRDD = sc.newAPIHadoopFile("statesPopulation.csv", classOf[KeyValueTextInputFormat], classOf[Text], classOf[Text])
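
KeyValueTextInputFormat splits each input line at the first tab character and returns Hadoop Text objects for both the key and the value. Text objects are mutable and reused by the record reader, so a common next step is to convert them to plain Strings before caching or collecting the records. The following is a minimal sketch, assuming the same statesPopulation.csv file:

// convert the mutable Hadoop Text objects into immutable Strings
val stringRDD = newHadoopRDD.map { case (k, v) => (k.toString, v.toString) }
stringRDD.take(2).foreach(println)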