Improving performance

In this section, we will learn a few helpful pointers to improve performance by modifying Impala daemon execution and the underlying platform where Impala performs user actions.

Enabling block location tracking

When queries are executed in Impala, data is read from HDFS that is distributed across multiple DataNodes in the form of data blocks. If Impala knows more information about these data blocks on HDFS, the data can be read faster and queries can achieve faster execution. To enable block location tracking for Impala, you just need to perform the following steps:

  1. Modify the HDFS configuration hdfs-site.xml as follows:
    <property>
      <name>dfs.datanode.hdfs-blocks-metadata.enabled</name>
      <value>true</value>
    </property>
  2. Copy hdfs-site.xml and core-site.xml from the Hadoop cluster to each Impala node into the Impala configuration folder, /etc/impala/conf.
  3. Restart all DataNodes in your cluster.

Enabling native checksumming

Computing data checksum for very large amounts of data could add a significant amount of time. So having a native library to perform checksum helps improve the performance. You can use the following information to enable native checksumming in Impala:

  • If Impala is installed using Cloudera Manager, native checksumming is configured automatically and no action is needed.
  • To enable native checksumming on your self-installed Impala, you must build and install the Hadoop native library, libhadoop.so. If this library is not available, you might receive the following message in Impala logs, indicating native checksumming is not enabled:
    "Unable to load native-hadoop library for your platform... using built-in-java classes where applicable"

Enabling Impala to perform short-circuit read on DataNode

Short-circuit read means reading data locally from the filesystem instead of communicating first with DataNode, and it definitely improves performance. You must have Cloudera CDH 4.2 or higher to achieve faster and compatible short-circuit reading. The following guideline is provided based on the assumption that you have Cloudera CDH 4.2 or higher installed:

  1. Modify hdfs-site.xml on each Impala node as follows:
    <property>
        <name>dfs.client.read.shortcircuit</name>
        <value>true</value>
    </property>
    <property>
        <name>dfs.domain.socket.path</name>
        <value>/var/run/hadoop-hdfs/dn._PORT</value>
    </property>
    <property>
        <name>dfs.client.file-block-storage-locations.timeout</name>
        <value>3000</value>
    </property>
  2. Make sure that /var/run/hadoop-hdfs/ is group writable for root users.
  3. Copy hdfs-site.xml and core-site.xml from the Hadoop configuration to each Impala node configuration at /etc/impala/conf.
  4. Restart all DataNodes.

Adding more Impala nodes to achieve higher performance

It is a fact that Impala performance improves if more nodes are added to the cluster. In the same way, Hadoop performance improves by adding more DataNodes and TaskTrackers. Having more nodes in the Hadoop cluster will distribute the data to more clusters, and queries will have more distribution, which ultimately will return higher performance.

Optimizing memory usage during query execution

You can improve query performance by restricting the amount of memory consumed by a query during its execution and you can do that by setting the –mem_limits flag when starting Impala daemon. This flag will restrict the memory consumed only by a query; however, there is still memory available for starting Impala to cache metadata and perform other startup actions.

Query execution dependency on memory

You might wonder about memory limitation impact on query execution as Impala has a strong dependency on available memory. If dataset size exceeds the available memory in a machine, the query will fail. The memory usages in Impala are not directly based on the input dataset size; instead it varies depending on types of query. An aggregation will require memory equivalent to the number of rows after grouping; however, join queries require memory equivalent to the combined size of remaining tables excluding the biggest table.

Using resource isolation

If you are using Cloudera Manager, you have the ability to implement resource isolation using the cgroups mechanism and it can be achieved by configuring Cloudera Manager. For more information, please read the Cloudera Manager documentation on resource isolation.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.216.200.106