Processing different file and compression types in Impala

Impala reads files stored in HDFS, and these files can be of various types. Some files are written to HDFS directly from their source, while others are the output of MapReduce, Pig, or other applications running on Hadoop.

Impala supports only a limited set of file types on Hadoop; however, it covers the most popular Big Data file formats, which allows it to serve a very wide range of user requests. If Impala cannot load an input file type directly, you can perform the following steps to use a combination of Hive and Impala:

  1. Use the CREATE TABLE statement in the Hive shell to create the table over the input data.
  2. Run the INVALIDATE METADATA statement in the Impala shell so that Impala picks up the table created in Hive instead of generating errors about it.
  3. Now write your query statements in the Impala shell to achieve your objective (a minimal sketch of this workflow follows the list).
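The following is a minimal sketch of this workflow; the table name, columns, and HDFS location are hypothetical, and the Hive CREATE TABLE statement would also include whatever SERDE or STORED AS clauses your particular input format requires:

-- Step 1, in the Hive shell: create a table over the existing data
-- (table name, columns, and LOCATION are hypothetical)
CREATE EXTERNAL TABLE raw_events (eventID int, payload string)
    LOCATION '/data/raw_events';

-- Step 2, in the Impala shell: make Impala pick up the new table
INVALIDATE METADATA;

-- Step 3, in the Impala shell: query the table as usual
SELECT count(*) FROM raw_events;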

A very important point to note here is that Impala performance depends largely on the input file format and the compression algorithm used to compress the input files. Compression is used for two main reasons: compressed files require less disk space, and reading smaller files requires less disk I/O and fewer CPU resources to load them into memory. Once a file is loaded into memory, it is decompressed only when its data is required for processing. The following table lists the Impala-supported compression types and when to use each:

Compression type    Why use it?
Snappy              Very fast; the fastest for both compression and decompression
GZIP                The best option for saving disk space
LZO                 Use only with text files
BZIP2               Not a top choice, but Impala can read input files compressed with it
Deflate             Not a first or second choice; however, Impala can read input files compressed with it

The following are a few considerations to keep in mind when choosing an appropriate file format for a table with Impala:

  • When CREATE TABLE is used in Impala, the text file format is the default. Text files are easy for humans to read, which helps in troubleshooting problems; however, they do not provide fast processing of large amounts of data because of the significant disk read activity involved.
  • When performance is your primary consideration, use Snappy; when saving disk space is your primary consideration, use GZIP. LZO can also be used with text files to speed things up a little.
  • If your source files are already in one of Impala's supported formats, create the Impala table using the same file format in most cases, unless changing the format gives you a significant improvement in processing the source data.
  • If you do want to change the file format in Impala, first use CREATE TABLE to create a table with your desired file format and then use the INSERT statement to copy the data into the new table; this requires a one-time conversion from the source format (see the sketch after this list).
  • Data compression does not always mean faster processing. Compression saves time in disk I/O, but it also costs CPU cycles to decompress the data before processing, so it adds time elsewhere. Sometimes uncompressed data is processed so much faster that the gain in speed outweighs the extra disk space needed to store it uncompressed.
  • When using uncompressed text files with Impala, you can simply copy them onto HDFS first. After that, use CREATE TABLE and then load the data into the table, for example with LOAD DATA or an INSERT statement.
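The following is a minimal sketch of the conversion workflow described above, assuming a hypothetical comma-delimited file that has already been copied to the HDFS path /user/impala/raw_logs; the table names, columns, and path are illustrative only:

-- Staging table over the raw text data (schema and path are assumptions)
CREATE TABLE logs_text (userID int, url string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA INPATH '/user/impala/raw_logs' INTO TABLE logs_text;

-- One-time conversion into the desired file format
CREATE TABLE logs_parquet (userID int, url string) STORED AS PARQUETFILE;
INSERT OVERWRITE TABLE logs_parquet SELECT * FROM logs_text;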

Now let's take a look at some of the SQL statements that you can use with various input file types in Impala.

The regular text file format with Impala tables

By default, Impala uses the text file format when CREATE TABLE is issued. When data is inserted into such a table using INSERT, the Ctrl+A character (hex 01) is used as the default field delimiter. The default syntax is as follows:

CREATE TABLE users (userID int, username string);

To change the delimiter to, for example, a comma (,), a tab (\t), a pipe (|), or a character of your choice, you can use the following syntax:

CREATE TABLE users (userID int, username string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
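For example, a comma-delimited variant of the same table can be created and populated as follows; this is only a sketch, and it assumes that the users table created above already contains data:

CREATE TABLE users_csv (userID int, username string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;

-- Copy rows from the users table defined earlier into the new table
INSERT INTO TABLE users_csv SELECT userID, username FROM users;
SELECT * FROM users_csv;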

Tip

Please visit the Cloudera Impala documentation for text file format support at the following URL:

http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_txtfile.html

The Avro file format with Impala tables

With the Avro file format, you would have to create tables in Hive first, as shown in the following code snippet:

CREATE TABLE my_avro_table (userID int, userName string)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS
    INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  TBLPROPERTIES (
    'avro.schema.literal'='{
      "type": "record",
      "name": "user_record",
      "fields": [
        {"name": "userID", "type": "int"},
        {"name": "userName", "type": "string"}
      ]}');

-- The SELECT list must match the two columns of my_avro_table;
-- functional.alltypes is used here only as an example source table
INSERT OVERWRITE TABLE my_avro_table
    SELECT id, string_col FROM functional.alltypes;

Once the table is created in Hive, you can use it from Impala like any other table, as follows:

SELECT * FROM my_avro_table;
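Because the table was created through Hive rather than Impala, you may first need to refresh Impala's catalog so that the new table becomes visible; a minimal one-line sketch:

INVALIDATE METADATA;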

Tip

Please visit the Cloudera Impala documentation for Avro file format support at the following URL:

http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_avro.html

The RCFile file format with Impala tables

When you create a table with the RCFile format without associating any existing data with it, the syntax is as follows:

CREATE TABLE my_rcfile_table (userID int, userName string)
STORED AS RCFile;

Impala can query RCFile tables but cannot write to them, so you need to use Hive to write data into the table using an INSERT statement. In Hive, you don't need to specify the storage file type again, because Hive reads it from the table definition.
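The following is a minimal sketch of the Hive-side step, assuming a hypothetical staging table named staging_users that holds the source rows:

-- Run in the Hive shell; Hive writes RCFile data because that is the
-- storage format recorded for my_rcfile_table in the metastore
INSERT OVERWRITE TABLE my_rcfile_table
SELECT userID, userName FROM staging_users;

After loading data through Hive, run REFRESH my_rcfile_table (or INVALIDATE METADATA) in the Impala shell so that Impala sees the newly written files.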

Tip

Please visit the Cloudera Impala documentation for RCFile file support at the following URL:

http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_rcfile.html

The SequenceFile file format with Impala tables

As with RCFile, Impala supports creating tables that store SequenceFile data. To create an empty table for SequenceFile data in Impala, you just need to use the following syntax in the Impala shell:

CREATE TABLE my_sequencefile_table (userID int, userName string)
STORED AS SEQUENCEFILE;

The rest of the steps require you to use Hive to set up file compression and then write data into the table using the appropriate INSERT statement.
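The following is a minimal sketch of those Hive-side steps, using the Snappy codec and a hypothetical staging table named staging_users as the data source:

-- Run in the Hive shell: enable compressed output for this session
SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Write the compressed SequenceFile data
INSERT OVERWRITE TABLE my_sequencefile_table
SELECT userID, userName FROM staging_users;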

Tip

Please visit the Cloudera Impala documentation for SequenceFile file format support at the following URL:

http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_seqfile.html

The Parquet file format with Impala tables

You might be wondering what the Parquet file format is, so here is a little background. Parquet is a column-oriented binary file format designed to provide column-specific access to data. Because the data is stored by column and each column is stored separately, a query only needs to read the columns it actually references. This column-oriented access method makes query processing very fast and efficient, and Impala takes full advantage of it: Impala provides native support to create, manage, and query tables based on the Parquet file format.

The following is the syntax for creating a table that can store the Parquet file format in Impala:

CREATE TABLE my_parquet_table (userID int, userName string)
     STORED AS PARQUETFILE;

Because Impala supports writing the Parquet file format natively, you can use the INSERT statement, as shown in the following code snippet, to populate your Parquet table from another table:

INSERT OVERWRITE TABLE my_parquet_table
    SELECT * FROM other_table_name;

Tip

Please visit the Cloudera Impala documentation for Parquet file format support at the following URL:

http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_parquet.html
