Reading data from an external source

A second method for creating an RDD is by reading data from an external distributed source such as Amazon S3, Cassandra, or HDFS. For example, if you create an RDD from HDFS, the distributed blocks in HDFS are read in parallel by the individual nodes in the Spark cluster.

Each node in the Spark cluster essentially performs its own input-output operations, independently reading one or more of the HDFS blocks. In general, Spark makes a best effort to keep as much of an RDD as possible in memory. Data can also be cached to reduce input-output operations, allowing nodes in the Spark cluster to avoid repeatedly re-reading, say, HDFS blocks that might be remote to the Spark cluster. There are several caching strategies that can be used within your Spark program, which we will examine later in the section on caching.
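As a minimal sketch (not part of the original transcript, and assuming the same wiki1.txt file used in the example below), an RDD can be marked for caching with persist(), or its shorthand cache(); the first action materializes the partitions in memory, and later actions reuse them instead of re-reading the blocks:

scala> import org.apache.spark.storage.StorageLevel
scala> val cachedRdd = sc.textFile("wiki1.txt").persist(StorageLevel.MEMORY_ONLY)
scala> cachedRdd.count   // first action reads the file and fills the cache
scala> cachedRdd.count   // subsequent actions reuse the cached partitions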

The following is an RDD of text lines loaded from a text file using the SparkContext and the textFile() function. The textFile() function loads the input data as a text file, with each newline-terminated line becoming an element in the RDD. The function call also internally uses HadoopRDD (shown in the next chapter) to load the data as several partitions, distributed across the cluster as needed.

scala> val rdd_two = sc.textFile("wiki1.txt")
rdd_two: org.apache.spark.rdd.RDD[String] = wiki1.txt MapPartitionsRDD[8] at textFile at <console>:24

scala> rdd_two.count
res6: Long = 9

scala> rdd_two.first
res7: String = Apache Spark provides programmers with an application programming interface centered on a data structure called the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines, that is maintained in a fault-tolerant way.
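To see the partitioning described above, you can inspect the RDD directly. The following lines are an illustrative sketch rather than part of the original transcript (the variable name rdd_two_split is hypothetical); they query the number of partitions and show how a minimum partition count can be suggested when reading the file:

scala> rdd_two.getNumPartitions            // number of partitions Spark created for the file
scala> val rdd_two_split = sc.textFile("wiki1.txt", 4)   // hint a minimum of 4 partitions
scala> rdd_two_split.getNumPartitions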