Why Impala is faster than Hive in query processing

We have mentioned many times in this book that Impala is a very fast distributed data-processing framework, so you might want to know how Impala achieves such speed or what is behind Impala that makes it so fast. I would answer this question by providing the following key points:

  • While processing SQL-like queries, Impala does not write intermediate results on disk; instead full SQL processing is done in memory, which makes it faster.
  • With Impala, the query starts its execution instantly compared to MapReduce, which may take significant time to start processing larger SQL queries and this adds more time in processing.
  • Impala Query Planner uses smart algorithms to execute queries in multiple stages in parallel nodes to provide results faster, avoiding sorting and shuffle steps, which may be unnecessary in most of the cases.
  • Impala has information about each data block in HDFS, so when processing the query, it takes advantage of this knowledge to distribute queries more evenly in all DataNodes.
  • Another key reason for fast performance is that Impala first generates assembly-level code for each query. The assembly code executes faster than any other code framework because while Impala queries are running natively in memory, having a framework will add additional delay in the execution due to the framework overhead.
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.22.71.106