Now, we will spend some time in understanding the query-specific SQL statements used in Impala. Most of these statements are exactly the same as they are defined in SQL, so to learn more, I would suggest you to look at any SQL reference documentation. Here, I am covering some key information for reference purposes:
EXPLAIN
clause, we can learn the execution plan of a SQL statement by understanding low-level mechanisms that Impala will use to read and process the data in the whole cluster, and then finally show the results. You can use the EXPLAIN
clause ahead of a SELECT
statement as shown in the following example:[Hadoop.testdomain:21000] > EXPLAIN SELECT * FROM list; Explain query: select * from list PLAN FRAGMENT 0 PARTITION: UNPARTITIONED 1:EXCHANGE tuple ids: 0 PLAN FRAGMENT 1 PARTITION: RANDOM STREAM DATA SINK EXCHANGE ID: 1 UNPARTITIONED 0:SCAN HDFS table=default.list #partitions=0 size=0B tuple ids: 0 REFRESH table_name;
REFRESH
clause with the table name loads the latest metadata and block location of the data files for a particular table.Please refer to Chapter 2, The Impala Shell Commands and Interface, to understand more on how REFRESH
works and why it is so important to use.
JOIN
clause is used in SQL statements to select data from two or more tables and then return the result set containing items from some of all of those tables, depending on the conditions applied. The JOIN
query result set is filtered by including the corresponding join column names in the ON
clause or by comparison operators referencing columns from both tables in the WHERE
clause. To improve JOIN
performance, here are some suggestions:JOIN
operation on the biggest table first and then smaller tables3.145.179.85