Impala processing strategy

Now let's review how Impala starts processing a query once it has been submitted through any of the supported interfaces:

  • When a query is submitted, Impala needs two kinds of metadata before it can start query processing (see the metadata example after this list):
    • Catalog information (table and database definitions) from the Hive metastore
    • File and block metadata from the HDFS NameNode
  • It is strongly recommended to run the Impala daemon on every DataNode, which lets Impala execute distributed query fragments directly against locally stored data. If the daemon is not running on every DataNode, Impala still plans and runs the query as efficiently and as quickly as it can.
  • At the time of writing this book, Impala supports only in-memory hash aggregation.
  • In the case of a JOIN operation, all of the tables referenced in the join must fit into the aggregate memory of the host or hosts where Impala executes the query.
  • When a JOIN operation is submitted, Impala uses either a broadcast or a partitioned join, depending on the query planner's decision, and follows the table order given in the SELECT statement (see the join example after this list).
  • Impala processes all queries in memory, so the memory available on each node is definitely a limiting factor. You must have enough memory to hold the resultant dataset, which can grow severalfold during complex JOIN operations.
  • If a query starts processing data and the resultant dataset cannot fit in the available memory, the query fails (see the memory-limit example after this list).
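The metadata requirement is also why Impala has to be told about changes made outside of it, for example, tables altered or data files added through Hive. A minimal sketch in Impala SQL, assuming a hypothetical table named sales:

    -- Reload the catalog entry for a table whose definition changed in the Hive
    -- metastore; running INVALIDATE METADATA without a table name refreshes the
    -- entire catalog.
    INVALIDATE METADATA sales;

    -- Pick up data files newly added to the table's existing HDFS directories;
    -- this re-reads the file and block metadata obtained from the NameNode.
    REFRESH sales;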
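To see which join strategy the planner picked, you can inspect the query plan, and you can override the planner with join hints. The following sketch uses the hypothetical tables big_table and small_table; in practice the planner's choice depends heavily on table and column statistics, which you can gather with COMPUTE STATS:

    -- Show the plan Impala intends to run; the join node indicates whether a
    -- broadcast or a partitioned (shuffle) join was chosen.
    EXPLAIN
    SELECT b.id, s.name
    FROM big_table b
    JOIN small_table s ON b.id = s.id;

    -- Force a broadcast join: the right-hand table is replicated to every node.
    SELECT STRAIGHT_JOIN b.id, s.name
    FROM big_table b
    JOIN [BROADCAST] small_table s ON b.id = s.id;

    -- Force a partitioned join: both inputs are hash-partitioned on the join key.
    SELECT STRAIGHT_JOIN b.id, s.name
    FROM big_table b
    JOIN [SHUFFLE] small_table s ON b.id = s.id;

The STRAIGHT_JOIN keyword tells the planner to keep the tables in the order written in the query, which is what makes the hints predictable.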
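Because a query fails once its working set no longer fits in memory, it is common to cap per-query memory explicitly so that a runaway join fails quickly instead of exhausting the hosts. A sketch using the MEM_LIMIT query option from impala-shell; the 2 GB value and table names are only illustrative:

    -- Cap the memory the queries in this session may use on each node.
    SET MEM_LIMIT=2gb;

    -- A memory-hungry join plus aggregation now fails with a memory-limit error
    -- instead of consuming all available memory on the executing hosts.
    SELECT b.id, COUNT(*) AS cnt
    FROM big_table b
    JOIN other_big_table o ON b.id = o.id
    GROUP BY b.id;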