In this section, let's review a few key factors that affect Impala performance.
In this method, data is physically divided from frequently queried fields or columns into different values. This way, when a query is executed, it processes only a specific partition or a portion of the data, achieving significant faster results than the full dataset. In general, data files specific to a single table reside in a single directory. Using partitioning, you can distribute the data in a way such that a fraction of data should be read, depending on the query and its data-limiting clause. Once a partition is applied to a table, the data is physically loaded on a different location on the disk based on the query parameter, which provides faster access to the data when queried by the SELECT
statement and the partition name.
You can write TABLE
specific Impala SQL statements as follows to take advantage of the PARTITIONED BY
method:
It is a well-known fact that a query operating on multiple tables using the JOIN
operation will take a long time to finish if it is not written correctly. Here are some techniques you can use in this regard:
join
query is to specify the tables in an optimal order by first having the table that has the maximum number of records or rows, which are part of the result set. After that, select the next largest table in terms of the number of rows in the results set, and finally the smallest table.JOIN
operation is to use the HINT
clause with JOIN
to select a specific Impala query planner. When a SELECT
statement based on JOIN
is given to Impala to execute, the Impala query planner works on it to find the best strategy. The Impala query planner first checks metadata and the number of records in the result set for each table in the SELECT
statement and then chooses an appropriate JOIN
strategy. You can get this information by using the EXPLAIN
clause with your query. If you think that changing the JOIN
strategy will be helpful, you can use the HINT
clause as follows to apply the specific JOIN
type. You can learn more about HINT
in any detailed SQL documentation:SELECT table1.field1, table2.field1 FROM table1 JOIN [BROADCAST | SHUFFLE] table2 ON [condition….]
The SQL JOIN
operation itself is very large and requires a great deal of understanding to optimize. While the preceding information is good for reference purposes, I would suggest reading more details on it in an external reference document to achieve optimum performance with JOIN
queries.
In the previous section, we talked about the query planner. Now, we will learn how the query planner decides which strategy is best for it. The query planner uses individual column statistics by getting metadata from the metastore if it is available. All the columns, which are part of the result set in the JOIN
query, are calculated for all the records, which helps the query planner to make its decision.
The Impala query planner also uses statistics for all the tables and partitions from the metastore and, based on this information, it makes a decision. These table, column, and partition statistics can be gathered using the ANALYZE TABLE
statement by passing the table name. I would like to inform you that the Impala query planner does not create this information; instead it depends on Hive for this and the ANALYZE TABLE
statement does work on Hive Shell only for now. The syntax of this SQL statement is as follows:
ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS all_column_list; ANALYZE TABLE table_name PARTITION (partition_specs) COMPUTE STATISTICS FOR COLUMNS column_list;
3.145.199.112