In this section, we will learn a few helpful pointers to improve performance by modifying Impala daemon execution and the underlying platform where Impala performs user actions.
When queries are executed in Impala, data is read from HDFS that is distributed across multiple DataNodes in the form of data blocks. If Impala knows more information about these data blocks on HDFS, the data can be read faster and queries can achieve faster execution. To enable block location tracking for Impala, you just need to perform the following steps:
hdfs-site.xml
as follows:<property> <name>dfs.datanode.hdfs-blocks-metadata.enabled</name> <value>true</value> </property>
hdfs-site.xml
and core-site.xml
from the Hadoop cluster to each Impala node into the Impala configuration folder, /etc/impala/conf
.Computing data checksum for very large amounts of data could add a significant amount of time. So having a native library to perform checksum helps improve the performance. You can use the following information to enable native checksumming in Impala:
libhadoop.so
. If this library is not available, you might receive the following message in Impala logs, indicating native checksumming is not enabled:"Unable to load native-hadoop library for your platform... using built-in-java classes where applicable"
Short-circuit read means reading data locally from the filesystem instead of communicating first with DataNode, and it definitely improves performance. You must have Cloudera CDH 4.2 or higher to achieve faster and compatible short-circuit reading. The following guideline is provided based on the assumption that you have Cloudera CDH 4.2 or higher installed:
hdfs-site.xml
on each Impala node as follows:<property> <name>dfs.client.read.shortcircuit</name> <value>true</value> </property> <property> <name>dfs.domain.socket.path</name> <value>/var/run/hadoop-hdfs/dn._PORT</value> </property> <property> <name>dfs.client.file-block-storage-locations.timeout</name> <value>3000</value> </property>
/var/run/hadoop-hdfs/
is group writable for root users.hdfs-site.xml
and core-site.xml
from the Hadoop configuration to each Impala node configuration at /etc/impala/conf
.It is a fact that Impala performance improves if more nodes are added to the cluster. In the same way, Hadoop performance improves by adding more DataNodes and TaskTrackers. Having more nodes in the Hadoop cluster will distribute the data to more clusters, and queries will have more distribution, which ultimately will return higher performance.
You can improve query performance by restricting the amount of memory consumed by a query during its execution and you can do that by setting the –mem_limits
flag when starting Impala daemon. This flag will restrict the memory consumed only by a query; however, there is still memory available for starting Impala to cache metadata and perform other startup actions.
You might wonder about memory limitation impact on query execution as Impala has a strong dependency on available memory. If dataset size exceeds the available memory in a machine, the query will fail. The memory usages in Impala are not directly based on the input dataset size; instead it varies depending on types of query. An aggregation will require memory equivalent to the number of rows after grouping; however, join queries require memory equivalent to the combined size of remaining tables excluding the biggest table.
18.216.200.106