Interaction with HDFS

The Hadoop Distributed File System (HDFS) is widely used for storing huge amounts of data and is the most popular storage system in the big data world. Its scalability, high availability, and fault tolerance make it an ideal filesystem for storing large volumes of data.

We have already explained some basics of HDFS in Chapter 1, Introduction to Spark. Please refer to it for further clarification of its workings.

To read or write data on HDFS, a client first connects to the NameNode, which holds all of the filesystem metadata. For a read request, the NameNode redirects the client to the DataNode(s) where the actual data resides; for a write request, it redirects the client to DataNode(s) with available resources (that is, storage space).

So, to read from HDFS, the client needs the NameNode address and the path of the file(s) to be read. A file can be read from HDFS in Spark as follows:

JavaRDD<String> hadoopFile = jsc.textFile("hdfs://" + "namenode-host:port/" + "absolute-file-path-on-hdfs");

For example:

JavaRDD<String> hadoopFile = jsc.textFile("hdfs://localhost:8020/user/spark/data.csv");
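The argument passed to textFile is a standard URI, so its components can be inspected with java.net.URI. This quick sketch breaks the example URL above into the NameNode host, port, and file path:

```java
import java.net.URI;

public class HdfsUriParts {
    public static void main(String[] args) {
        // Same URL as in the example above
        URI uri = URI.create("hdfs://localhost:8020/user/spark/data.csv");

        System.out.println(uri.getScheme()); // hdfs
        System.out.println(uri.getHost());   // localhost -> the NameNode host
        System.out.println(uri.getPort());   // 8020      -> the NameNode RPC port
        System.out.println(uri.getPath());   // /user/spark/data.csv -> absolute path on HDFS
    }
}
```

Here 8020 is a common default for the NameNode RPC port, but the actual port depends on the cluster configuration.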

Since the textFile method always returns an RDD of Strings, further transformations can be applied just as with any other RDD, for example a word count:

JavaPairRDD<String, Integer> hadoopWordCount = hadoopFile
        .flatMap(x -> Arrays.asList(x.split(", ")).iterator())
        .mapToPair(x -> new Tuple2<String, Integer>(x, 1))
        .reduceByKey((x, y) -> x + y);
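The flatMap/mapToPair/reduceByKey chain can be reasoned about without a cluster: split each line into tokens, pair every token with 1, and sum the values per key. This plain-Java sketch reproduces the same logic on an in-memory list (the input lines are made up for illustration):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LocalWordCount {
    public static void main(String[] args) {
        // Stand-in for the lines of the HDFS file (illustrative data)
        List<String> lines = Arrays.asList("a, b, a", "b, c");

        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            // flatMap: split each line into tokens
            for (String token : line.split(", ")) {
                // mapToPair + reduceByKey: pair each token with 1, then sum per key
                counts.merge(token, 1, Integer::sum);
            }
        }
        System.out.println(counts); // {a=2, b=2, c=1}
    }
}
```

The difference in Spark is that the lines are partitioned across the cluster and the per-key summation happens in parallel, with a shuffle before reduceByKey.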

The textFile method can also be used to read multiple files from HDFS as follows:

  • To read more than one file:
jsc.textFile("hdfs://" + "namenode-host:port/" + "absolute-file-path-of-file1" + "," + "hdfs://" + "namenode-host:port/" + "absolute-file-path-of-file2");
  • To read all the files in a directory:
jsc.textFile("hdfs://" + "namenode-host:port/" + "absolute-file-path-of-dir/*");
  • To read all the files in multiple directories:
jsc.textFile("hdfs://" + "namenode-host:port/" + "absolute-file-path-of-dir1/*" + "," + "hdfs://" + "namenode-host:port/" + "absolute-file-path-of-dir2/*");
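In each of the cases above, textFile receives a single comma-separated string of paths, so that string can also be assembled programmatically. A small sketch, in which the NameNode address and directory names are placeholders:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class HdfsPathList {
    public static void main(String[] args) {
        String nameNode = "hdfs://namenode-host:8020"; // placeholder NameNode address
        List<String> dirs = Arrays.asList("/data/dir1/*", "/data/dir2/*"); // placeholder dirs

        // Join into the comma-separated form that textFile expects
        String paths = dirs.stream()
                .map(dir -> nameNode + dir)
                .collect(Collectors.joining(","));

        System.out.println(paths);
        // hdfs://namenode-host:8020/data/dir1/*,hdfs://namenode-host:8020/data/dir2/*
        // jsc.textFile(paths) would then read all matching files as one RDD
    }
}
```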

To write the output back to HDFS, users can provide the HDFS URL as follows:

hadoopWordCount.saveAsTextFile("hdfs://" + "namenode-host:port/" + "absolute-file-path-of-output-dir");