wholeTextFiles

wholeTextFiles() loads multiple text files into a pair RDD of <filename, textOfFile> pairs, where each pair holds the path of a file and the entire content of that file. This is useful when loading many small text files. It differs from the textFile API, which produces one record per line: when wholeTextFiles() is used, the entire content of each file is loaded as a single record:

def wholeTextFiles(path: String, minPartitions: Int = defaultMinPartitions): RDD[(String, String)]

The following is an example of loading a text file into an RDD using wholeTextFiles():

scala> val rdd_whole = sc.wholeTextFiles("wiki1.txt")
rdd_whole: org.apache.spark.rdd.RDD[(String, String)] = wiki1.txt MapPartitionsRDD[37] at wholeTextFiles at <console>:25

scala> rdd_whole.take(10)
res56: Array[(String, String)] =
Array((file:/Users/salla/spark-2.1.1-bin-hadoop2.7/wiki1.txt,Apache Spark provides programmers with an application programming interface centered on a data structure called the resilient distributed dataset (RDD), a read-only multiset of data
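To make the contrast with textFile concrete, the following is a minimal sketch rather than a definitive implementation. It assumes a local Spark installation on the classpath; the object name WholeTextFilesSketch, the run() helper, and the temporary files a.txt and b.txt are hypothetical and exist only for this illustration. It writes two small files, then counts the records each API produces for the same directory:

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical object name, used only for this illustration.
object WholeTextFilesSketch {

  // Returns (records from textFile, records from wholeTextFiles)
  // for the same temporary input directory.
  def run(): (Long, Long) = {
    val conf = new SparkConf()
      .setAppName("wholeTextFiles-sketch")
      .setMaster("local[2]") // assumption: run locally with two threads
    val sc = new SparkContext(conf)
    try {
      // Hypothetical input: a temporary directory holding two small files.
      val dir = Files.createTempDirectory("whole-text-files-sketch")
      Files.write(dir.resolve("a.txt"),
        "first line\nsecond line\n".getBytes(StandardCharsets.UTF_8))
      Files.write(dir.resolve("b.txt"),
        "only line\n".getBytes(StandardCharsets.UTF_8))
      val path = dir.toUri.toString

      // textFile: one record per line, across every file in the directory.
      val lines = sc.textFile(path)

      // wholeTextFiles: one (filePath, fileContent) record per file.
      // minPartitions is only a suggested minimum number of partitions.
      val files = sc.wholeTextFiles(path, minPartitions = 2)

      (lines.count(), files.count())
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (lineRecords, fileRecords) = run()
    println(s"textFile records: $lineRecords")
    println(s"wholeTextFiles records: $fileRecords")
  }
}
```

Because each file becomes a single record, wholeTextFiles() suits many small files; a very large file would end up as one large string held by a single task.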
