Reading and writing data to HDFS

There are many ways to read data from and write data to HDFS. We will start by using the FileSystem API to create and write to a file in HDFS, followed by an application to read a file from HDFS and write it back to the local filesystem.

Getting ready

You will need to download the weblog_entries.txt dataset from the Packt website, http://www.packtpub.com/support.

How to do it...

Carry out the following steps to read and write data to HDFS:

  1. Once you have downloaded the test dataset, we can write an application to read a file from the local filesystem and write the contents to HDFS.
    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.io.OutputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class HdfsWriter extends Configured implements Tool {

        public int run(String[] args) throws Exception {
            String localInputPath = args[0];
            Path outputPath = new Path(args[1]);
            Configuration conf = getConf();
            // Get the filesystem named by fs.default.name in the
            // configuration; for HDFS this is DistributedFileSystem
            FileSystem fs = FileSystem.get(conf);
            // Create the destination file in HDFS, overwriting it
            // if it already exists
            OutputStream os = fs.create(outputPath);
            InputStream is = new BufferedInputStream(
               new FileInputStream(localInputPath));
            // Copy the local file into HDFS; the final argument
            // closes both streams when the copy completes
            IOUtils.copyBytes(is, os, conf, true);
            return 0;
        }

        public static void main(String[] args) throws Exception {
            int returnCode = ToolRunner.run(
               new HdfsWriter(), args);
            System.exit(returnCode);
        }
    }
  2. Next, we write an application to read the file we just created in HDFS and write its contents back to the local filesystem.
    import java.io.BufferedOutputStream;
    import java.io.FileOutputStream;
    import java.io.InputStream;
    import java.io.OutputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class HdfsReader extends Configured implements Tool {

        public int run(String[] args) throws Exception {
            Path inputPath = new Path(args[0]);
            String localOutputPath = args[1];
            Configuration conf = getConf();
            FileSystem fs = FileSystem.get(conf);
            // Open the file in HDFS for reading
            InputStream is = fs.open(inputPath);
            OutputStream os = new BufferedOutputStream(
                new FileOutputStream(localOutputPath));
            // Copy the HDFS file to the local filesystem; the final
            // argument closes both streams when the copy completes
            IOUtils.copyBytes(is, os, conf, true);
            return 0;
        }

        public static void main(String[] args) throws Exception {
            int returnCode = ToolRunner.run(
               new HdfsReader(), args);
            System.exit(returnCode);
        }
    }
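
Both classes are driven through ToolRunner, so once they are compiled and packaged they can be launched with the hadoop jar command. A minimal invocation sketch follows; the JAR name and the HDFS paths are assumptions, not values from the recipe:

    $ hadoop jar hdfs-recipes.jar HdfsWriter weblog_entries.txt /data/weblog_entries.txt
    $ hadoop jar hdfs-recipes.jar HdfsReader /data/weblog_entries.txt ./weblog_entries.txt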

How it works...

FileSystem is an abstract class that represents a generic filesystem. Most Hadoop filesystem implementations can be accessed and manipulated through a FileSystem object. To obtain an instance for the Hadoop Distributed File System, you call the method FileSystem.get(). The FileSystem.get() method examines the URI assigned to the fs.default.name parameter in the Hadoop configuration files on your classpath and chooses the correct FileSystem implementation to instantiate. For HDFS, the fs.default.name parameter is set to a URI with the hdfs:// scheme.
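
The same lookup can also be made explicit. As a minimal sketch (the namenode host and port below are assumptions):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

Configuration conf = new Configuration();
// Resolved from fs.default.name on the classpath
FileSystem defaultFs = FileSystem.get(conf);
// Explicitly target a particular HDFS namenode (host/port assumed)
FileSystem hdfs = FileSystem.get(URI.create("hdfs://localhost:8020"), conf);
// Explicitly target the local filesystem
FileSystem local = FileSystem.getLocal(conf);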

Once an instance of the FileSystem class has been created, the HdfsWriter class calls the create() method to create a file (or overwrite if it already exists) in HDFS. The create() method returns an OutputStream object, which can be manipulated using normal Java I/O methods. Similarly, HdfsReader calls the method open() to open a file in HDFS, which returns an InputStream object that can be used to read the contents of the file.
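
Because both streams are ordinary java.io streams, they can be wrapped with the usual Java decorators. A small sketch, assuming fs is an open FileSystem instance and the path is purely illustrative:

// Write a line of text to a new HDFS file with standard Java I/O
OutputStream os = fs.create(new Path("/tmp/example.txt"));
PrintWriter writer = new PrintWriter(os);
writer.println("hello hdfs");
writer.close();

// Read the line back with a standard BufferedReader
InputStream is = fs.open(new Path("/tmp/example.txt"));
BufferedReader reader = new BufferedReader(new InputStreamReader(is));
String line = reader.readLine();
reader.close();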

There's more...

The FileSystem API is extensive. To demonstrate some of the other methods available in the API, we can add some error checking to the HdfsWriter and HdfsReader classes we created.

To check whether the output file already exists before we call create() in HdfsWriter, use:

boolean exists = fs.exists(outputPath);

To check whether the path is a file, use:

boolean isFile = fs.isFile(inputPath);

To rename a file that already exists, use:

boolean renamed = fs.rename(inputPath, new Path("old_file.txt"));
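
Putting these together, the run() method of HdfsReader might look as follows with the checks added; the error messages and return codes are assumptions rather than part of the original recipe:

public int run(String[] args) throws Exception {
    Path inputPath = new Path(args[0]);
    String localOutputPath = args[1];
    Configuration conf = getConf();
    FileSystem fs = FileSystem.get(conf);
    // Fail fast if the source path is missing or is not a regular file
    if (!fs.exists(inputPath)) {
        System.err.println(inputPath + " does not exist");
        return 1;
    }
    if (!fs.isFile(inputPath)) {
        System.err.println(inputPath + " is not a file");
        return 1;
    }
    InputStream is = fs.open(inputPath);
    OutputStream os = new BufferedOutputStream(
        new FileOutputStream(localOutputPath));
    IOUtils.copyBytes(is, os, conf, true);
    return 0;
}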