Impala is used to process large amounts of data stored in your Hadoop cluster. Hadoop places no restriction on the type of data that can be stored; however, some file types and compression codecs give better data access performance than others. Impala can query most of the popular structured and unstructured file formats available in Hadoop, along with the compression applied to a file. Here is a list of the supported file formats and compression types in Impala:
File type | File format | Compression type |
---|---|---|
Text | Unstructured | LZO |
Avro | Structured | GZIP, BZIP2, deflate, Snappy |
RCFile | Structured | GZIP, BZIP2, deflate, Snappy |
SequenceFile | Structured | GZIP, BZIP2, deflate, Snappy |
Parquet | Structured | GZIP, Snappy (Default) |
Now let's take a look at how choosing a proper file format can improve performance in Impala:
INSERT statement to perform a one-time conversion. This new table will provide comparatively better performance, provided you have chosen the new format or compression carefully.
Chapter 7, Advanced Impala Concepts, has more information about various file formats and compression types and how to use them in Impala.
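The one-time conversion mentioned above can be sketched as follows. This is only an illustration: the helper `conversion_statements` and the table names `sales_text` and `sales_parquet` are hypothetical, the target format is assumed to be Parquet, and the code merely assembles the SQL text without connecting to a cluster.

```python
# A minimal sketch: build the two Impala statements for a one-time
# conversion of an existing table into a new file format.
# The function name and the table names are hypothetical.

def conversion_statements(source_table, target_table, file_format="PARQUET"):
    """Return the SQL statements that copy source_table into a new
    table stored in file_format (assumed here to be Parquet)."""
    # Create an empty table with the same columns but a different format.
    create = (
        f"CREATE TABLE {target_table} LIKE {source_table} "
        f"STORED AS {file_format}"
    )
    # Copy every row across; Impala rewrites the data in the new format.
    insert = f"INSERT INTO {target_table} SELECT * FROM {source_table}"
    return [create, insert]


if __name__ == "__main__":
    for statement in conversion_statements("sales_text", "sales_parquet"):
        print(statement + ";")
```

The printed statements can then be run from impala-shell. Queries would need to be pointed at the new table afterwards, since the original table is left untouched.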