Chapter 7. Advanced Impala Concepts

In Chapter 6, Troubleshooting Impala, we discussed various Impala concepts that should give you enough information to take charge of Impala projects and manage them successfully. In this chapter, we cover more advanced material to help you excel in data-processing projects that use Impala. We describe how Impala works side by side with MapReduce in the same cluster without using it as a processing engine. We also explain why Impala has an edge over Hive, even though Impala depends on Hive as a key component. Finally, we cover some details on using HBase with Impala and on processing various Big Data input file formats on Hadoop with Impala.

Impala and MapReduce

The first thing to note is that Impala neither replaces MapReduce nor uses it as a processing engine. Impala is an alternative data-processing framework on Hadoop that processes data much faster than MapReduce. It works on data stored in the Hadoop storage layer using its own open source, in-memory processing framework, which avoids the overhead that MapReduce carries. Impala bypasses MapReduce and accesses data in HDFS natively through a distributed query engine designed specifically for fast data processing. Because each Impala daemon processes data locally on its DataNode, there is little or no network latency. MapReduce remains a capable framework for processing data directly on the DataNodes of a distributed cluster; however, executing SQL statements through MapReduce is inefficient, mainly because of disk access. Impala overcomes this inefficiency by processing data in memory. Impala runs side by side with MapReduce, using the same Hadoop core components and hardware infrastructure. Because Impala processes data in memory, Hadoop clusters with Impala installed need comparatively more memory.
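The disk-access point can be illustrated with a toy sketch in Python. It is only an analogy, not how MapReduce or Impala is actually implemented: the sample rows and the function names staged_on_disk and pipelined_in_memory are hypothetical, chosen for this illustration. Both functions compute the equivalent of a SUM grouped by region; the first writes its intermediate key/value pairs to a temporary file between the two phases, while the second aggregates as it reads.

```python
import json
import os
import tempfile
from collections import defaultdict

# Hypothetical sample rows: (region, sales amount).
ROWS = [("east", 10), ("west", 5), ("east", 7), ("west", 1)]


def staged_on_disk(rows):
    """Staged style: the first phase spills its output to a file,
    and the second phase reads that file back before aggregating."""
    fd, path = tempfile.mkstemp(suffix=".jsonl")
    try:
        with os.fdopen(fd, "w") as spill:
            for key, value in rows:  # first phase: emit key/value pairs
                spill.write(json.dumps([key, value]) + "\n")
        totals = defaultdict(int)
        with open(path) as spill:  # second phase: re-read and sum per key
            for line in spill:
                key, value = json.loads(line)
                totals[key] += value
        return dict(totals)
    finally:
        os.remove(path)  # clean up the intermediate file


def pipelined_in_memory(rows):
    """In-memory style: aggregate while scanning, with no intermediate file."""
    totals = defaultdict(int)
    for key, value in rows:
        totals[key] += value
    return dict(totals)


if __name__ == "__main__":
    print(staged_on_disk(ROWS))
    print(pipelined_in_memory(ROWS))
```

Both functions return the same totals. The only difference is the write and re-read of intermediate data, which is the kind of cost the in-memory approach avoids, at the price of holding the working data in memory.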
