Query-specific SQL statements in Impala

Now, we will spend some time in understanding the query-specific SQL statements used in Impala. Most of these statements are exactly the same as they are defined in SQL, so to learn more, I would suggest you to look at any SQL reference documentation. Here, I am covering some key information for reference purposes:

  • EXPLAIN: Using the EXPLAIN clause, we can learn the execution plan of a SQL statement by understanding low-level mechanisms that Impala will use to read and process the data in the whole cluster, and then finally show the results. You can use the EXPLAIN clause ahead of a SELECT statement as shown in the following example:
    [Hadoop.testdomain:21000] > EXPLAIN SELECT * FROM  list;
    Explain query: select * from list
    PLAN FRAGMENT 0
      PARTITION: UNPARTITIONED
    
      1:EXCHANGE
         tuple ids: 0 
    
    PLAN FRAGMENT 1
      PARTITION: RANDOM
    
      STREAM DATA SINK
        EXCHANGE ID: 1
        UNPARTITIONED
    
      0:SCAN HDFS
         table=default.list #partitions=0 size=0B
         tuple ids: 0
    REFRESH table_name;
  • REFRESH: In a multimode environment, the data files reside on multiple DataNodes while the Impala shell is interacting with the Impala daemon (which acts as the data coordinator for all other nodes). Data files can be updated on other nodes without any update event or information to the coordinator. In this situation, using the REFRESH clause with the table name loads the latest metadata and block location of the data files for a particular table.

    Please refer to Chapter 2, The Impala Shell Commands and Interface, to understand more on how REFRESH works and why it is so important to use.

  • JOIN: The JOIN clause is used in SQL statements to select data from two or more tables and then return the result set containing items from some of all of those tables, depending on the conditions applied. The JOIN query result set is filtered by including the corresponding join column names in the ON clause or by comparison operators referencing columns from both tables in the WHERE clause. To improve JOIN performance, here are some suggestions:
    • It is advisable to perform the JOIN operation on the biggest table first and then smaller tables
    • Join subsequent tables depending on which table has the most selective filter

    Note

    The JOIN clause itself is very detailed, so I have introduced it here only for reference; however, I would suggest you study some SQL documentation on JOIN to learn more about it.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.145.179.85