Compression

Compression techniques in Hive can significantly reduce the amount of data transferred between mappers and reducers by compressing the intermediate and final output data, so the query performs better as a result. To compress the intermediate files produced between multiple MapReduce jobs, we need to set the following property (false by default) in the command-line session or in the hive-site.xml file:

> SET hive.exec.compress.intermediate=true
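For reference, the equivalent entry in hive-site.xml is a standard Hadoop configuration property; a minimal sketch (placed inside the file's <configuration> element) looks like this:

<!-- inside <configuration> in hive-site.xml -->
<property>
  <name>hive.exec.compress.intermediate</name>
  <value>true</value>
</property>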

Then, we need to decide which compression codec to configure. A list of commonly supported codecs is in the following table:

Compression  Codec                                        Extension  Splittable
-----------  -------------------------------------------  ---------  ----------
Deflate      org.apache.hadoop.io.compress.DefaultCodec   .deflate   N
Gzip         org.apache.hadoop.io.compress.GzipCodec      .gz        N
Bzip2        org.apache.hadoop.io.compress.BZip2Codec     .bz2       Y
LZO          com.hadoop.compression.lzo.LzopCodec         .lzo       N
LZ4          org.apache.hadoop.io.compress.Lz4Codec       .lz4       N
Snappy       org.apache.hadoop.io.compress.SnappyCodec    .snappy    N

Deflate (.deflate) is the default codec, with a balanced compression ratio and CPU cost. Gzip achieves a very high compression ratio, but at a correspondingly high CPU cost. Bzip2 is splittable, but, like Gzip, it compresses too slowly to be practical given its heavy CPU cost. LZO files are not natively splittable, but we can preprocess them (using com.hadoop.compression.lzo.LzoIndexer) to create an index that determines the file splits; a sketch of this step follows the next command. For the balance of CPU cost and compression ratio, LZ4 and Snappy do a better job than Deflate, and Snappy is the more popular of the two. Since most compressed files are not splittable, it is not recommended to compress a single big file. The best practice is to produce compressed files no bigger than a couple of HDFS block sizes, so that each file takes less time to process. The compression codec can be specified in mapred-site.xml, hive-site.xml, or a command-line session as follows:

> SET hive.intermediate.compression.codec=
org.apache.hadoop.io.compress.SnappyCodec
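As noted above, an LZO file can be made splittable by building an index for it first. The following is a minimal sketch of that preprocessing step; the jar location and the HDFS path /user/hive/warehouse/sales/data.lzo are assumptions for illustration:

$ hadoop jar /path/to/hadoop-lzo.jar com.hadoop.compression.lzo.LzoIndexer \
/user/hive/warehouse/sales/data.lzo

The indexer writes a data.lzo.index file next to the original, which MapReduce then consults to determine the splits.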

Intermediate compression only saves disk space for queries that trigger multiple MapReduce jobs. To save disk space further, the actual Hive output files can be compressed as well. When the hive.exec.compress.output property is set to true, Hive uses the codec configured by the mapreduce.output.fileoutputformat.compress.codec property to compress the data in HDFS, as follows. These properties can be set in hive-site.xml or in the command-line session:

> SET hive.exec.compress.output=true
> SET mapreduce.output.fileoutputformat.compress.codec=
org.apache.hadoop.io.compress.SnappyCodec
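To see the effect end to end, here is a short session sketch; the employee and employee_backup table names are hypothetical, and the .snappy extension assumes a default text-format output table:

> SET hive.exec.compress.output=true
> SET mapreduce.output.fileoutputformat.compress.codec=
org.apache.hadoop.io.compress.SnappyCodec
> CREATE TABLE employee_backup AS SELECT * FROM employee;

With these settings in effect, the files written under the employee_backup table's HDFS directory carry the .snappy extension.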