Choosing an appropriate file format and compression type for better performance

Impala processes large amounts of data stored in your Hadoop cluster. Hadoop places no restriction on the type of data that can be stored; however, some file formats and compression types give better data access performance than others. Impala can query most of the popular structured and unstructured file formats available in Hadoop, including files that are compressed. The following table lists the file formats and compression types that Impala supports:

File type      File format    Compression type
-------------  -------------  ----------------------------
Text           Unstructured   LZO
Avro           Structured     GZIP, BZIP2, deflate, Snappy
RCFile         Structured     GZIP, BZIP2, deflate, Snappy
SequenceFile   Structured     GZIP, BZIP2, deflate, Snappy
Parquet        Structured     GZIP, Snappy (default)

Now let's take a look at how choosing a proper file format can improve performance in Impala:

  • Sometimes the file format in which the data was originally stored does not provide the required performance. A possible solution is to create a new table with a different file format or compression type, and then use the INSERT statement to perform a one-time conversion. The new table will perform better only if you have chosen the new format or compression type carefully.
  • Processing compressed data requires disk I/O to read it and CPU cycles to decompress it. With uncompressed data, disk I/O is the only primary cost during processing. So, if your application architecture allows it, storing data uncompressed can speed up processing. However, uncompressed data takes up much more disk space than compressed data, so you need to weigh the storage cost against the performance gain.
  • Sometimes, changing the file format or compression type does not yield any performance gain and can even slow down processing. In this scenario, keeping the original file format and compression type is fine. The lesson here is to understand the file formats and compression types properly before choosing one for better performance.
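The one-time conversion described in the first point can be sketched as a short sequence of Impala statements. The following Python helper only builds those statements as strings; it does not connect to Impala. The table names sales_text and sales_parquet and the helper conversion_statements are hypothetical, the target format is assumed to be Parquet, and the COMPRESSION_CODEC query option name is an assumption based on more recent Impala releases (older releases used PARQUET_COMPRESSION_CODEC), so check the names against your Impala version.

```python
# Sketch: build the Impala statements for a one-time conversion to Parquet.
# Table names are hypothetical; statements are generated, not executed.

# Codecs the text lists for Parquet files.
PARQUET_CODECS = {"snappy", "gzip"}


def conversion_statements(source, target, codec="snappy"):
    """Return the SQL statements that copy `source` into a new Parquet table."""
    codec = codec.lower()
    if codec not in PARQUET_CODECS:
        raise ValueError(f"unsupported Parquet codec: {codec}")
    return [
        # New table with the same columns as the source, stored as Parquet.
        f"CREATE TABLE {target} LIKE {source} STORED AS PARQUET",
        # Assumed query option name; older Impala releases used
        # PARQUET_COMPRESSION_CODEC instead.
        f"SET COMPRESSION_CODEC={codec}",
        # One-time copy; the rows are rewritten in the target table's format.
        f"INSERT INTO {target} SELECT * FROM {source}",
    ]


if __name__ == "__main__":
    for statement in conversion_statements("sales_text", "sales_parquet"):
        print(statement + ";")
```

The printed statements can be pasted into impala-shell. Running the same queries against both tables afterwards is the simplest way to confirm whether the new format actually helps.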

    Tip

    Chapter 7, Advanced Impala Concepts, has more information about various file formats and compression types and how to use them in Impala.
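The trade-off between disk space and decompression cost discussed in the list above can be observed outside Impala with a small experiment. The following is a minimal sketch, assuming synthetic CSV-like rows and using only the codecs available in the Python standard library (gzip, bzip2, and deflate through zlib); Snappy and LZO need third-party packages and are left out. The helper names make_sample and measure are illustrative, and the numbers it prints depend on your data and hardware, so they say nothing definitive about Impala itself.

```python
import bz2
import gzip
import time
import zlib


def make_sample(rows=20000):
    """Build synthetic CSV-like bytes; the layout is purely illustrative."""
    lines = [f"{i},customer_{i % 100},{(i * 37) % 1000}.00\n" for i in range(rows)]
    return "".join(lines).encode("utf-8")


# Codec name -> (compress function, decompress function), all from the stdlib.
CODECS = {
    "gzip": (gzip.compress, gzip.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
    "deflate": (zlib.compress, zlib.decompress),
}


def measure(data):
    """Return compressed size, size ratio, and decompression time per codec."""
    results = {}
    for name, (compress, decompress) in CODECS.items():
        packed = compress(data)
        start = time.perf_counter()
        restored = decompress(packed)
        elapsed = time.perf_counter() - start
        if restored != data:
            raise RuntimeError(f"{name} round trip changed the data")
        results[name] = {
            "bytes": len(packed),
            "ratio": len(packed) / len(data),
            "decompress_seconds": elapsed,
        }
    return results


if __name__ == "__main__":
    sample = make_sample()
    print(f"uncompressed: {len(sample)} bytes")
    for name, stats in measure(sample).items():
        print(
            f"{name}: {stats['bytes']} bytes "
            f"(ratio {stats['ratio']:.2f}), "
            f"decompress {stats['decompress_seconds'] * 1000:.2f} ms"
        )
```

A lower ratio means less disk space and less data to read, while the decompression time is the extra CPU cost paid on every scan. Comparing the two columns for your own data gives a rough idea of which side of the trade-off matters more.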
