Fine-tuning Impala performance

In this section, let's review a few key factors that affect Impala performance.

Partitioning

In this method, data is physically divided from frequently queried fields or columns into different values. This way, when a query is executed, it processes only a specific partition or a portion of the data, achieving significant faster results than the full dataset. In general, data files specific to a single table reside in a single directory. Using partitioning, you can distribute the data in a way such that a fraction of data should be read, depending on the query and its data-limiting clause. Once a partition is applied to a table, the data is physically loaded on a different location on the disk based on the query parameter, which provides faster access to the data when queried by the SELECT statement and the partition name.

You can write TABLE specific Impala SQL statements as follows to take advantage of the PARTITIONED BY method:

  • Add a partition when creating a table by adding the PARTITIONED BY clause as follows:
    CREATE TABLE […] PARTITIONED BY
  • You can modify an existing table to support partitions by using the ALTER TABLE statement followed by PARTITONED BY as follows:
    ALTER TABLE […] PARTITIONED BY

Join queries

It is a well-known fact that a query operating on multiple tables using the JOIN operation will take a long time to finish if it is not written correctly. Here are some techniques you can use in this regard:

  • A well-known technique for fine-tuning an Impala join query is to specify the tables in an optimal order by first having the table that has the maximum number of records or rows, which are part of the result set. After that, select the next largest table in terms of the number of rows in the results set, and finally the smallest table.
  • Another method to optimize a faster JOIN operation is to use the HINT clause with JOIN to select a specific Impala query planner. When a SELECT statement based on JOIN is given to Impala to execute, the Impala query planner works on it to find the best strategy. The Impala query planner first checks metadata and the number of records in the result set for each table in the SELECT statement and then chooses an appropriate JOIN strategy. You can get this information by using the EXPLAIN clause with your query. If you think that changing the JOIN strategy will be helpful, you can use the HINT clause as follows to apply the specific JOIN type. You can learn more about HINT in any detailed SQL documentation:
    SELECT table1.field1, table2.field1 FROM table1 JOIN
     [BROADCAST | SHUFFLE] table2 ON [condition….]

The SQL JOIN operation itself is very large and requires a great deal of understanding to optimize. While the preceding information is good for reference purposes, I would suggest reading more details on it in an external reference document to achieve optimum performance with JOIN queries.

Table and column statistics

In the previous section, we talked about the query planner. Now, we will learn how the query planner decides which strategy is best for it. The query planner uses individual column statistics by getting metadata from the metastore if it is available. All the columns, which are part of the result set in the JOIN query, are calculated for all the records, which helps the query planner to make its decision.

The Impala query planner also uses statistics for all the tables and partitions from the metastore and, based on this information, it makes a decision. These table, column, and partition statistics can be gathered using the ANALYZE TABLE statement by passing the table name. I would like to inform you that the Impala query planner does not create this information; instead it depends on Hive for this and the ANALYZE TABLE statement does work on Hive Shell only for now. The syntax of this SQL statement is as follows:

ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS all_column_list;
ANALYZE TABLE table_name PARTITION (partition_specs) COMPUTE STATISTICS FOR COLUMNS column_list;
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.145.199.112