Reading and writing data to SequenceFiles

The SequenceFile format is a flexible binary format included with the Hadoop distribution that can store both text and binary data. SequenceFiles store data as binary key-value pairs, which can be grouped together into blocks. The format supports compressing either the value portion of each record or an entire block of key-value pairs. SequenceFiles remain splittable even when using a compression codec that is not normally splittable, such as GzipCodec, because individual values (or blocks) are compressed rather than the file as a whole.
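
This recipe converts data with MapReduce, but SequenceFiles can also be read and written directly through the SequenceFile.Writer and SequenceFile.Reader classes. The following is a minimal sketch of that direct API; the HDFS path and the key-value contents are only examples:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileDirectExample {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/example.seq");   // example path only

        // Write a few key-value pairs as a SequenceFile
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, path, LongWritable.class, Text.class);
        try {
            writer.append(new LongWritable(1), new Text("first record"));
            writer.append(new LongWritable(2), new Text("second record"));
        } finally {
            writer.close();
        }

        // Read the pairs back in insertion order
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        try {
            LongWritable key = new LongWritable();
            Text value = new Text();
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + value);
            }
        } finally {
            reader.close();
        }
    }
}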

This recipe will demonstrate how to write data to and read data from SequenceFiles using MapReduce.

Getting ready

You will need to download the weblog_entries.txt dataset from the Packt website, http://www.packtpub.com/support, and place it in HDFS. You can load the weblog_entries.txt file into HDFS using the Hadoop FS shell as follows:

$ hadoop fs -put /path/on/local/filesystem/weblog_entries.txt /path/in/hdfs

How to do it...

  1. Once you have downloaded the test dataset and loaded it into HDFS, write a MapReduce application that reads the plain text file from HDFS and writes its contents to a SequenceFile in HDFS:
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;
    
    public class SequenceWriter extends Configured implements Tool {
    
        public int run(String[] args) throws Exception {
    
            Path inputPath = new Path(args[0]);
            Path outputPath = new Path(args[1]);
    
            Configuration conf = getConf();
            Job weblogJob = new Job(conf);
            weblogJob.setJobName("Sequence File Writer");
            weblogJob.setJarByClass(getClass());
            // Map-only job; the base Mapper class passes records through unchanged
            weblogJob.setNumReduceTasks(0);
            weblogJob.setMapperClass(Mapper.class);
            weblogJob.setMapOutputKeyClass(LongWritable.class);
            weblogJob.setMapOutputValueClass(Text.class);
            weblogJob.setOutputKeyClass(LongWritable.class);
            weblogJob.setOutputValueClass(Text.class);
            weblogJob.setInputFormatClass(TextInputFormat.class);
            weblogJob.setOutputFormatClass(SequenceFileOutputFormat.class);
    
            FileInputFormat.setInputPaths(weblogJob, inputPath);
            SequenceFileOutputFormat.setOutputPath(weblogJob, outputPath);
    
            if (weblogJob.waitForCompletion(true)) {
                return 0;
            }
            return 1;
        }
    
        public static void main(String[] args) throws Exception {
            int returnCode = ToolRunner.run(new SequenceWriter(), args);
            System.exit(returnCode);
        }
    }
  2. Now, write a second MapReduce job that reads the SequenceFile from HDFS and transforms it back to plain text:
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;
    
    public class SequenceReader extends Configured implements Tool {
    
        public int run(String[] args) throws Exception {
    
            Path inputPath = new Path(args[0]);
            Path outputPath = new Path(args[1]);
    
            Configuration conf = getConf();
            Job weblogJob = new Job(conf);
            weblogJob.setJobName("Sequence File Reader");
            weblogJob.setJarByClass(getClass());
            // Map-only job; the base Mapper class passes records through unchanged
            weblogJob.setNumReduceTasks(0);
            weblogJob.setMapperClass(Mapper.class);
            weblogJob.setMapOutputKeyClass(LongWritable.class);
            weblogJob.setMapOutputValueClass(Text.class);
            weblogJob.setOutputKeyClass(LongWritable.class);
            weblogJob.setOutputValueClass(Text.class);
            weblogJob.setInputFormatClass(SequenceFileInputFormat.class);
            weblogJob.setOutputFormatClass(TextOutputFormat.class);
    
            SequenceFileInputFormat.addInputPath(weblogJob, inputPath);
            FileOutputFormat.setOutputPath(weblogJob, outputPath);
    
            if (weblogJob.waitForCompletion(true)) {
                return 0;
            }
            return 1;
        }
    
        public static void main(String[] args) throws Exception {
            int returnCode = ToolRunner.run(new SequenceReader(), args);
            System.exit(returnCode);
        }
    }
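  3. Compile both classes and package them into a jar, then run the two jobs from the shell. The jar name, package name, and HDFS paths shown here are only placeholders; substitute your own:
    $ hadoop jar sequencefile-recipes.jar com.example.SequenceWriter \
        /path/in/hdfs/weblog_entries.txt /path/in/hdfs/weblog_entries_seq
    
    $ hadoop jar sequencefile-recipes.jar com.example.SequenceReader \
        /path/in/hdfs/weblog_entries_seq /path/in/hdfs/weblog_entries_text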

How it works...

MapReduce is an efficient way to transform data in HDFS. These two MapReduce jobs are simple to write, yet they transform the data using the distributed processing power of the cluster.

First, both MapReduce jobs are "map-only" jobs. This means that Hadoop will launch only mappers to process the test data. This is achieved by setting the number of reducers to 0, as shown in the following line of code:

weblogJob.setNumReduceTasks(0);

Next, we want the sequence writer job to read text input and save its output as a SequenceFile. To do this, the SequenceWriter class sets the input format class to TextInputFormat.class, as shown in the following line of code:

weblogJob.setInputFormatClass(TextInputFormat.class);

We also set the output format class to SequenceFileOutputFormat.class, as shown in the following line of code:

weblogJob.setOutputFormatClass(SequenceFileOutputFormat.class);

For the second application, we want to read a SequenceFile and write a plain text file. To do this, we reverse the input and output formats used in the sequence writer job.

In the sequence reader job, we set the input format to read SequenceFiles:

weblogJob.setInputFormatClass(SequenceFileInputFormat.class);

Then we set the output format to plain text:

weblogJob.setOutputFormatClass(TextOutputFormat.class);
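
You can spot-check the writer job's output without running the reader job at all; the hadoop fs -text command understands SequenceFiles and prints each record as tab-separated text. The output path below is only an example:

$ hadoop fs -text /path/in/hdfs/weblog_entries_seq/part-m-00000 | head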

There's more...

SequenceFiles have three compression options:

  • Uncompressed: Key-value pairs are stored uncompressed
  • Record compression: The value portion of each record is compressed
  • Block compression: An entire block of key-value pairs is compressed

You can compress SequenceFiles using the following methods when you set up your job. First, enable output compression:

SequenceFileOutputFormat.setCompressOutput(weblogJob, true);

Next, set the compression option you want to use; the following code sets the record compression option:

SequenceFileOutputFormat.setOutputCompressionType(weblogJob, SequenceFile.CompressionType.RECORD);

Or set the block compression option:

SequenceFileOutputFormat.setOutputCompressionType(weblogJob, SequenceFile.CompressionType.BLOCK);

Finally, choose a compression codec class, for example gzip:

SequenceFileOutputFormat.setOutputCompressorClass(weblogJob, GzipCodec.class);
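
Putting these calls together, a block-compressed version of the sequence writer job from this recipe might be configured as in the following sketch; the codec is only an example, and any installed compression codec can be substituted:

// Inside SequenceWriter.run(), after choosing the output format
weblogJob.setOutputFormatClass(SequenceFileOutputFormat.class);

// Turn on compression for the job's output
SequenceFileOutputFormat.setCompressOutput(weblogJob, true);

// Compress whole blocks of key-value pairs rather than individual records
SequenceFileOutputFormat.setOutputCompressionType(weblogJob,
        SequenceFile.CompressionType.BLOCK);

// Use gzip (org.apache.hadoop.io.compress.GzipCodec)
SequenceFileOutputFormat.setOutputCompressorClass(weblogJob, GzipCodec.class);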

See also

In the following recipes, we will continue to explore different data serialization libraries and formats:

  • Using Apache Avro to serialize data
  • Using Apache Thrift to serialize data
  • Using Protocol Buffers to serialize data