Impala loads files stored in HDFS, and these files can be of various types. Some are written to HDFS directly from their source, while others are the output of MapReduce, Pig, or other applications running on Hadoop.
Impala supports a limited set of file types on Hadoop; however, it covers the most popular Big Data file formats, which lets it handle a wide range of user input. If Impala cannot read an input file type, you can perform the following steps to use a combination of Hive and Impala:
* Use the `CREATE TABLE` statement in the Hive shell to create the table with the input data.
* Use the `INVALIDATE METADATA` statement in the Impala shell so that Impala recognizes the new table and does not generate unsupported file type errors; a minimal sketch of this workflow follows this list.
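For example, here is a minimal sketch of that workflow, assuming a hypothetical table named `raw_events` (the format-specific clauses of the Hive DDL are omitted from this sketch):

-- In the Hive shell: create the table over the input data
CREATE TABLE raw_events (eventID int, payload string);

-- In the Impala shell: refresh Impala's view of the metastore, then query
INVALIDATE METADATA raw_events;
SELECT COUNT(*) FROM raw_events;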
A very important point to note here is that Impala performance depends largely on the input file format and the compression algorithm used to compress the input files. Compression is used for two main reasons: compressed files require less disk space to store, and reading smaller files requires less disk I/O and CPU to load them into memory. Once a file is loaded into memory, it is decompressed only when its data is required for processing. The following table shows the Impala-supported compression types along with their usage patterns and properties:
Compression type | Why use it?
---|---
Snappy | Very fast; it is the fastest at both compression and decompression
GZIP | The best option to save disk space
LZO | Use only with text files
BZIP2 | Not a top choice, but Impala can read input files compressed with it
Deflate | Not a first or second choice; however, Impala can read input files compressed with it
The following are a few considerations to keep in mind when choosing an appropriate file format for a table with Impala:
* When `CREATE TABLE` is used with Impala, text files are the default input format. Text is easy for humans to read and helps in troubleshooting problems; however, it does not provide fast processing for large amounts of data because of significant disk read activity.
* For better performance, you can use `CREATE TABLE` to create a table with your desired file format and then use the `INSERT` statement to copy data into the Impala table, which requires a one-time file conversion from source to Impala.
* If your data already exists in files of another format, you can likewise create the table with `CREATE TABLE` and then use the `INSERT` statement to copy them into Impala (an example appears in the Parquet section later).
Now let's take a look at some of the SQL statements that you can use with various input file types in Impala.
By default, Impala uses the text file format with the `CREATE TABLE` syntax. When data is inserted into such a table using `INSERT`, Ctrl+A (Hex 01) is used as the default field delimiter. The default syntax is as follows:
CREATE TABLE users (userID int, username string);
To change the delimiter to, for example, a comma (`,`), a pipe (`|`), or any character of your choice, you can use the following syntax:
CREATE TABLE users (userID int, username string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
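As a quick sketch of how the delimiter shows up on disk (assuming your Impala version supports the `INSERT ... VALUES` clause):

INSERT INTO users VALUES (1, 'john');
-- the resulting data file stores the row as: 1,john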
With the Avro file format, you would have to create tables in Hive first, as shown in the following code snippet:
CREATE TABLE my_avro_table (userID int, userName string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.literal'='{
  "type": "record",
  "name": "user_record",
  "fields": [
    {"name": "userID", "type": "int"},
    {"name": "userName", "type": "string"}
  ]}');

INSERT OVERWRITE TABLE my_avro_table SELECT id, string_col FROM functional.alltypes;
Once the table is created in Hive, you can use it in Impala just like any other table.
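Because the table was created through Hive, Impala's metadata cache does not know about it yet, so first run the `INVALIDATE METADATA` statement mentioned earlier in the Impala shell:

INVALIDATE METADATA my_avro_table;

After that, the table can be queried directly: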
SELECT * FROM my_avro_table;
When you create a table with the RCFile format without any existing data for the table, the syntax is as follows:
CREATE TABLE my_rcfile_table (userID int, userName string) STORED AS RCFile;
Impala can query RCFile tables but cannot write to them, so you need to use Hive to write data into the table using the `INSERT` statement. When inserting from Hive, you don't need to specify the storage file type; Hive takes care of it by default, picking it up from the table definition.
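As a sketch of that Hive-side step, assuming the `users` table created earlier as the source, run the following in the Hive shell:

INSERT OVERWRITE TABLE my_rcfile_table SELECT userID, username FROM users;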
Like RCFile, Impala supports the creation of tables that can store SequenceFile data. To create an empty table to store SequenceFile-type data in Impala, you just need to use the following syntax in the Impala shell:
CREATE TABLE my_sequencefile_table (userID int, userName string) STORED AS SEQUENCEFILE;
The rest of the steps require you to use Hive to set up file compression and then write data into the table using the appropriate `INSERT` statement.
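The following is a sketch of those Hive-side steps, enabling block-level Snappy compression before writing (again assuming the `users` table as the source; run in the Hive shell):

SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
INSERT OVERWRITE TABLE my_sequencefile_table SELECT userID, username FROM users;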
You might be wondering what the Parquet file format is, so here is a little background. Parquet is a column-oriented binary file format designed to provide efficient column-specific access to data. Because all columns are stored separately, a query looks up only the columns it needs rather than scanning entire rows. This column-oriented access method makes query processing very fast and efficient, and Impala takes advantage of it: Impala provides native support to create, manage, and query tables based on the Parquet file format.
The following is the syntax for creating a table that can store the Parquet file format in Impala:
CREATE TABLE my_parquet_table (userID int, userName string) STORED AS PARQUETFILE;
As Impala supports writing the Parquet file format, you can use the `INSERT` statement, as shown in the following code snippet, to populate your Parquet table from other tables:
INSERT OVERWRITE TABLE my_parquet_table SELECT * FROM other_table_name;
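As a small illustration of the columnar benefit, a query that touches a single column reads only that column's data from disk:

SELECT userName FROM my_parquet_table;
-- only the userName column is read, not entire rows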