Compressing data using LZO

Hadoop supports a number of compression algorithms, including:

  • bzip2
  • gzip
  • DEFLATE

Hadoop provides Java implementations of these algorithms, so files can easily be compressed and decompressed using the FileSystem API or the MapReduce input and output formats.
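
For example, the following is a minimal sketch of compressing a file through the FileSystem API with the built-in gzip codec. The paths are placeholders chosen for this recipe's dataset, and any of the codecs listed above could be swapped in.

import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class CompressWithFileSystemApi {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Instantiate one of the codecs that ship with Hadoop (gzip here)
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

        // Placeholder paths; adjust them to your own files
        Path input = new Path("/test/weblog_entries.txt");
        Path output = new Path("/test/weblog_entries.txt" + codec.getDefaultExtension());

        InputStream in = fs.open(input);
        // Everything written to this stream is compressed by the codec
        CompressionOutputStream out = codec.createOutputStream(fs.create(output));

        IOUtils.copyBytes(in, out, conf);
        out.finish();
        IOUtils.closeStream(in);
        IOUtils.closeStream(out);
    }
}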

However, there is a drawback to storing data in HDFS using the compression formats listed previously: they are not splittable. Once a file is compressed using any of the codecs that Hadoop provides, it cannot be decompressed without reading the whole file from the beginning.

To understand why this is a drawback, you must first understand how Hadoop MapReduce determines the number of mappers to launch for a given job. The number of mappers launched is roughly equal to the input size divided by dfs.block.size (the default block size is 64 MB). The chunks of work that each mapper receives are called input splits. For example, if the input to a MapReduce job was an uncompressed 128 MB file, this would probably result in two mappers being launched (128 MB/64 MB).
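
As a quick sanity check of that arithmetic, the short sketch below reproduces the mapper estimate for the 128 MB example; the sizes are assumed values taken from the text, not anything read from a cluster.

public class EstimateMappers {
    public static void main(String[] args) {
        // Assumed values from the example: 128 MB input, 64 MB default dfs.block.size
        long inputSizeBytes = 128L * 1024 * 1024;
        long blockSizeBytes = 64L * 1024 * 1024;

        // Roughly one mapper per input split, and one split per HDFS block
        long mappers = (inputSizeBytes + blockSizeBytes - 1) / blockSizeBytes;

        System.out.println("Estimated mappers: " + mappers); // prints 2
    }
}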

Since files compressed using the bzip2, gzip, and DEFLATE codecs cannot be split, each such file must be handed to a single mapper as one input split. Using the previous example, if the input to a MapReduce job was a 128 MB gzip-compressed file, the MapReduce framework would launch only one mapper.

Now, where does LZO fit into all of this? The LZO algorithm was designed for fast decompression while offering compression speeds comparable to DEFLATE. In addition, thanks to the hard work of the Hadoop community, LZO-compressed files are splittable.

Note

bzip2 is splittable as of Hadoop Version 0.21.0; however, the algorithm does have some performance limitations and should be investigated thoroughly before being used in a production environment.

Getting ready

You will need to download the LZO codec implementation for Hadoop from https://github.com/kevinweil/hadoop-lzo.

How to do it...

Perform the following steps to set up LZO and then compress and index a text file:

  1. First, install the LZO development packages.

    On Red Hat Linux, use:

    # yum install lzo lzo-devel

    On Ubuntu, use:

    # apt-get install liblzo2-dev
  2. Navigate to the directory where you extracted the hadoop-lzo source, and build the project.
    # cd kevinweil-hadoop-lzo-6bb1b7f/
    # export JAVA_HOME=/path/to/jdk/
    # ./setup.sh
  3. If the build was successful, you should see:
    BUILD SUCCESSFUL
  4. Copy the built JAR files to the Hadoop lib folder on your cluster.
    # cp build/hadoop-lzo*.jar /path/to/hadoop/lib/
  5. Copy the native libraries to the Hadoop native lib folder on your cluster.
    # tar -cBf - -C build/hadoop-lzo-0.4.15/lib/native/ . | tar -xBvf - -C /path/to/hadoop/lib/native
  6. Next, update core-site.xml to use the LZO codec classes.
    <property>
      <name>io.compression.codecs</name>
      <value>org.apache.hadoop.io.compress.GzipCodec,
        org.apache.hadoop.io.compress.DefaultCodec,
        org.apache.hadoop.io.compress.BZip2Codec,
        com.hadoop.compression.lzo.LzoCodec,
        com.hadoop.compression.lzo.LzopCodec
      </value>
    </property>
    <property>
      <name>io.compression.codec.lzo.class</name>
      <value>com.hadoop.compression.lzo.LzoCodec</value>
    </property>
  7. Finally, update the following environment variables in your hadoop-env.sh script:
    export HADOOP_CLASSPATH=/path/to/hadoop/lib/hadoop-lzo-X.X.XX.jar
    export JAVA_LIBRARY_PATH=/path/to/hadoop/lib/native/hadoop-lzo-native-lib:/path/to/hadoop/lib/native/other-native-libs

    Now test the installation of the LZO library.

  8. Compress the test dataset:
    $ lzop weblog_entries.txt
  9. Put the compressed weblog_entries.txt.lzo file into HDFS:
    $ hadoop fs -put weblog_entries.txt.lzo /test/weblog_entries.txt.lzo
  10. Run the MapReduce LZO indexer to index the weblog_entries.txt.lzo file:
    $ hadoop jar /usr/lib/hadoop/lib/hadoop-lzo-0.4.15.jar com.hadoop.compression.lzo.DistributedLzoIndexer /test/weblog_entries.txt.lzo

You should now see two files in the /test folder:

$ hadoop fs -ls /test
/test/weblog_entries.txt.lzo
/test/weblog_entries.txt.lzo.index

How it works...

This recipe involved a lot of steps. After we moved the LZO JAR files and native libraries into place, we updated the io.compression.codecs property in core-site.xml. Both HDFS and Hadoop MapReduce share this configuration file, and the value of the io.compression.codecs property will be used to determine which codecs are available to the system.
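
To check which codec the configuration resolves for a given file, you can ask CompressionCodecFactory, which reads io.compression.codecs from the loaded configuration. The following is a minimal sketch; the path is a placeholder, and the lookup simply returns null if no codec is registered for the extension (for example, if the LZO classes are missing from the classpath).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class ResolveCodec {
    public static void main(String[] args) {
        // core-site.xml (and with it io.compression.codecs) is loaded from the classpath
        Configuration conf = new Configuration();
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);

        // The factory maps file extensions (.gz, .bz2, .lzo, ...) to registered codecs
        CompressionCodec codec = factory.getCodec(new Path("/test/weblog_entries.txt.lzo"));

        System.out.println(codec == null
                ? "No codec registered for this extension"
                : "Resolved codec: " + codec.getClass().getName());
    }
}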

Finally, we ran DistributedLzoIndexer. This is a MapReduce application that will read one or more LZO compressed files and index the LZO block boundaries of each file. Once this application has been run on an LZO file, the LZO file can be split and sent to multiple mappers by using the included input format LzoTextInputFormat.
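
The following minimal job driver sketches where LzoTextInputFormat plugs in. It assumes the new org.apache.hadoop.mapreduce API and that the hadoop-lzo JAR (which provides com.hadoop.mapreduce.LzoTextInputFormat) is on the classpath; the default map and reduce classes are kept only so the wiring stays visible.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import com.hadoop.mapreduce.LzoTextInputFormat;

public class LzoInputDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "read indexed LZO input");
        job.setJarByClass(LzoInputDriver.class);

        // LzoTextInputFormat consults the .lzo.index file (when present) and
        // creates one input split per indexed range instead of one per file
        job.setInputFormatClass(LzoTextInputFormat.class);

        // Default (identity) map and reduce; records arrive as <byte offset, line of text>
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path("/test/weblog_entries.txt.lzo"));
        FileOutputFormat.setOutputPath(job, new Path("/test/lzo_output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}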

There's more...

In addition to DistributedLzoIndexer, the Hadoop LZO library also includes a class named LzoIndexer. LzoIndexer runs as a standalone application, without launching a MapReduce job, to index LZO files in HDFS. To index the weblog_entries.txt.lzo file in HDFS, run the following command:

$ hadoop jar /usr/lib/hadoop/lib/hadoop-lzo-0.4.15.jar com.hadoop.compression.lzo.LzoIndexer /test/weblog_entries.txt.lzo

See also

  • Using Apache Thrift to serialize data
  • Using Protocol Buffers to serialize data