In the first part of this chapter, we are going to learn how to troubleshoot various Impala issues in different categories. We will use Impala logging to understand more about Impala execution, query processing, and possible issues. The objective of this chapter is to provide you with some critical information about Impala troubleshooting and log analysis so that you can manage the Impala cluster effectively and make it useful for your team and yourself. Let's start by troubleshooting various problems that come up while managing the Impala cluster.
Impala runs on DataNodes in a distributed clustered environment. So when we consider the potential issues with Impala, we also need to think about the problems within the platform itself that can impact Impala. In this section, we will cover most of these issues along with query, connectivity, and HDFS-specific issues.
If you find that Impala is not performing as expected, and you want to make sure it is configured correctly, it is best to check the Impala configuration. With Impala installed using Cloudera Manager, you can use the Impala debug web server at port 25000 to check the Impala configuration. Here is a small list describing what you could see in the Impala debug web server:
In Chapter 5, Impala Administration and Performance Improvements, we learned that enabling "block locality" helps Impala process queries faster. However, it is possible that "block locality" is not configured properly and you might not be taking advantage of this functionality. You can make sure by checking the logs for the following message:
Unknown disk id. This will negatively affect performance. Check your hdfs settings to enable block location metadata
If you see the preceding log message, it means that tracking block locality is not enabled. Therefore, configure it correctly as described in Chapter 5, Impala Administration and Performance Improvements.
We have also studied in Chapter 5, Impala Administration and Performance Improvements, that native checksumming improves performance. If you see the following log message, it means native checksumming is not enabled, and you need to configure it correctly as described in Chapter 5, Impala Administration and Performance Improvements.
Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
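To look for both of the preceding warnings at once, you can scan the Impala log directory from the command line. The following is a minimal sketch; the directory path is an assumption (/var/log/impala is the usual default on Cloudera-managed clusters, so adjust it for your installation):

```shell
#!/bin/sh
# Minimal sketch: scan Impala logs for the two performance warnings above.
# /var/log/impala is an assumed default; pass another directory as $1.
scan_impala_logs() {
    dir=${1:-/var/log/impala}
    if grep -Rq "Unknown disk id" "$dir" 2>/dev/null; then
        echo "WARNING: block locality tracking is not enabled"
    fi
    if grep -Rq "Unable to load native-hadoop library" "$dir" 2>/dev/null; then
        echo "WARNING: native checksumming is not enabled"
    fi
}

scan_impala_logs "$@"
```

If either warning is printed, revisit the corresponding configuration steps from Chapter 5.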
In this section, we will cover various connectivity scenarios and learn what could go wrong in each and how to troubleshoot them.
When you start the Impala shell by passing a hostname using the -i option, or let the Impala shell try connecting to the default Impala daemon running on the local machine, the connection may not be established. In that case, you will see a connection error as shown in the following screenshot:
To troubleshoot the preceding connection problem, you can try the following options:
- Use ping or another similar utility to check the connectivity between the machines.
- Use the ps command to get more information about the Impala process and confirm that it is running.

Impala also provides connections to third-party applications through the ODBC/JDBC driver running on the machine that is trying to connect to the Impala server. The connection may not work due to various reasons, which are given as follows:
The very first query-specific issue is a bad query. The Impala query interpreter is smart enough to guide you within the Impala shell when you submit a bad query; when you execute a query statement through an API instead, a detailed error about it in the log file will help you. Besides a bad query, you may also experience the following issues:
- If a table or its data is not found, you can sync the Hive metadata using the REFRESH statement to solve this problem. Also, make sure that Impala daemons are running on all the nodes.
- If JOIN operations are failing, it is very much possible that you are hitting a memory limitation. While checking the Impala logs, look for Out of Memory errors to confirm memory limitation-specific failures. As a JOIN operation is performed among multiple tables, it requires comparatively large memory to process, so adding more memory could solve this problem.

During the Impala installation process, the Impala username and group are created. Impala runs under this username and accesses system resources within this group. If you delete this user or group, or modify its access, Impala will either start behaving erratically or show some nondeterministic behavior. If you start Impala under the root user, it will also impact Impala execution by disabling direct reads. So if you suddenly experience such issues, please check the Impala user access settings and make sure that Impala is running as configured.
In this section, I will explain a few platform-specific issues so that, in the event Impala execution is sporadic or not working at all, you can troubleshoot the problem and find the appropriate resolution.
Impala has two main services, Impala daemon and statestore, and both these services are configured to use internal and external ports for effective communication. This is described in the following table:
| Component | Port | Type | Service description |
| --- | --- | --- | --- |
| Impalad | 21000 | External | Frontend port to communicate with the Impala shell |
| Impalad | 21050 | External | Frontend port for ODBC 2.0 |
| Impalad | 22000 | Internal | Backend port for Impala daemons to communicate with each other |
| Impalad | 23000 | Internal | Backend port to get updates from the statestore |
| Impalad | 25000 | External | Impala web interface for monitoring and troubleshooting |
| Statestored | 24000 | Internal | Port on which the statestore listens for registration/unregistration |
| Statestored | 25010 | External | Statestore web interface for monitoring and troubleshooting |
It is important to remember that if any of the preceding port configurations is wrong, or if any of these ports is blocked, you will experience various problems; in that case, make sure that the preceding port configuration is correct.
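One quick way to verify that these ports are reachable is a TCP probe. The following is a minimal sketch that uses bash's /dev/tcp device; IMPALAD_HOST is a placeholder for the host you want to test:

```shell
#!/bin/sh
# Minimal sketch: probe the impalad ports from the preceding table.
# Uses bash's /dev/tcp device via "bash -c"; IMPALAD_HOST is a placeholder.
check_port() {
    host=$1
    port=$2
    if timeout 2 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
        echo "$host:$port open"
    else
        echo "$host:$port closed or blocked"
    fi
}

for port in 21000 21050 22000 23000 25000; do
    check_port "${IMPALAD_HOST:-localhost}" "$port"
done
```

A "closed or blocked" result for a port that should be open usually points to a firewall rule or a misconfigured port setting.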
Impala runs on DataNodes, which have a dependency on the NameNode in the Hadoop environment. Various HDFS-specific issues, such as permission to read or write data on HDFS, space limitations, memory swapping, or latency, could impact Impala execution. Any of these issues could introduce instability in HDFS or impact the whole cluster, depending on how serious the problem is. In this situation, you would need to work with your Hadoop administrator to resolve these problems and get Impala up and running.
Impala can load and query various kinds of datafiles stored on Hadoop. Sometimes you may receive an error while reading these datafiles, or your query requests may fail. Most probably, this is because either the file format is not supported, or Impala is limited to queries only for that format and cannot process CREATE or INSERT requests. In the following table, you can see which file formats are supported and whether Impala can create, insert into, and query those files:
| File type | Format type | Compression type | Is CREATE and INSERT supported? | Is the query supported? |
| --- | --- | --- | --- | --- |
| Text | Unstructured | LZO | Yes | Yes |
| Avro | Structured | Snappy, GZIP, deflate, BZIP2 | No | Query only (use Hive to load the file) |
| RCFile | Structured | Snappy, GZIP, deflate, BZIP2 | | Query only (use Hive to load the file) |
| SequenceFile | Structured | Snappy, GZIP, deflate, BZIP2 | | Query only (use Hive to load the file) |
| Parquet | Structured | Snappy, GZIP | Yes | Yes |