Chapter 1. Getting Started with Impala

This chapter covers Impala, its core components, and its inner workings in detail. We will cover the Impala architecture, including the Impala daemon, statestore, and execution model, and how these components interact with each other. Impala metadata and the metastore are also discussed here, to understand how Impala maintains its information. Finally, we will study the various ways to interface with Impala.

The objective of this chapter is to provide enough information for you to kick-start Impala on a single-node experimental cluster or a multinode production cluster. This chapter covers the Impala essentials within the following broad categories:

  • System requirements
  • Installation
  • Configuration
  • Upgrading
  • Security
  • Impala architecture and execution

Impala is for a new breed of data wranglers who want to process data at lightning-fast speed using traditional SQL knowledge. It gives data analysts and scientists a way to access data stored on Hadoop at lightning speed by directly using SQL or other Business Intelligence tools. Impala processes data in place on the Hadoop storage layer, HDFS, so there is no need to migrate data from Hadoop to any other middleware, specialized system, or data warehouse. Impala provides data wranglers with a Massively Parallel Processing (MPP) query engine that runs natively on Hadoop.

Native on Hadoop means the engine runs on Hadoop and uses the Hadoop core component, HDFS, along with other components, such as Hive and HBase. To process data, Impala has its own execution component, which runs on each DataNode where the data is stored in blocks. A number of third-party applications can process data stored on Hadoop directly through Impala. The biggest advantage of Impala is that no data transformation or data movement is required for data stored on Hadoop. No data movement means all the processing happens where the data resides in the cluster. In other distributed systems, data is transferred over the network before it is processed; with Impala, processing happens at the place where the data is stored, which is one of the primary reasons why Impala is very fast in comparison to other large-scale data processing systems.

Before we learn more about Impala, let's see what the key Impala features are:

  • First and foremost, Impala is 100% open source under the Apache license
  • Impala is a native MPP engine, running on the Cloudera Hadoop distribution
  • Impala supports in-memory processing for data through SQL-like queries
  • Impala uses Hadoop Distributed File System (HDFS) and HBase
  • Impala supports integration with leading Business Intelligence tools, such as Tableau, Pentaho, MicroStrategy, Zoomdata, and so on
  • Impala supports a wide variety of input file formats: regular text files, files in CSV/TSV or other delimited formats, SequenceFiles, Avro, RCFile, LZO-compressed text, and Parquet (see the sketch after this list)
  • For third-party application connectivity, Impala supports the ODBC driver, SQL-like syntax, and the Beeswax GUI (in Apache Hue) from Apache Hive
  • Impala uses Kerberos authentication and role-based authorization with Sentry
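
As a quick illustration of the SQL-like syntax and the file format support, the following impala-shell commands create a Parquet table and query it. This is a minimal sketch; the table and column names are hypothetical:

    $ impala-shell -q "CREATE TABLE events (ts STRING, severity STRING, msg STRING) STORED AS PARQUETFILE"
    $ impala-shell -q "SELECT severity, COUNT(*) AS total FROM events GROUP BY severity"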

The key benefits of using Impala are:

  • Impala reads a table's metadata through the Hive metastore; however, it processes the data with its own distributed execution engine, which makes data processing very fast. So the very first benefit of using Impala is super-fast access to data in HDFS (see the sketch after this list).
  • Impala uses a SQL-like syntax to interact with data, so you can leverage existing BI tools to interact with data stored on Hadoop. Engineers with SQL expertise can benefit from Impala as they do not need to learn new languages and skills. Additionally, Impala offers higher performance and execution speed.
  • While running on Hadoop, Impala leverages the Hadoop file and data format, metadata, resource management, and security, all available on Hadoop.
  • As Impala interacts with data as it is stored in Hadoop, it preserves the full fidelity of data during analysis, avoiding the information loss that results from aggregations or conforming to fixed schemas.
  • Impala performs interactive analysis directly on data stored on Hadoop DataNodes without requiring data movement, which results in lightning-fast query results, because there are no network bottlenecks and no time is spent moving data.
  • Impala provides a single repository and metadata store from source to analysis, which enables more users to interact with a large amount of data. The presence of a single repository also reduces data movement, which helps in performing interactive analysis directly on full fidelity data.
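
For example, a table created or loaded through Hive becomes queryable from Impala once Impala refreshes its view of the shared metastore. A minimal sketch, with web_logs as a hypothetical table name:

    $ impala-shell -q "INVALIDATE METADATA"
    $ impala-shell -q "SELECT COUNT(*) FROM web_logs"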

Impala requirements

Impala is supported on 64-bit Linux-based operating systems. At the time of writing this book, Impala was supported on the following operating systems:

  • Red Hat Enterprise Linux 5.7/6.2/6.4
  • CentOS 5.7/6.2/6.4
  • SLES 11 with SP 1 or newer
  • Ubuntu 10.04/12.04
  • Debian 6.0.3

As Impala runs on Hadoop, it is also important to discuss the supported Hadoop version. At the time of writing this book, Impala was supported on the following Hadoop distributions:

  • Impala 1.1 and 1.1.1
    • Cloudera Hadoop CDH 4.1 or later
  • Impala 1.0
    • Cloudera Hadoop CDH 4.1 or later
  • Impala 0.7 and older
    • Cloudera Hadoop CDH 4.1 only

Besides CDH, Impala can run on other Hadoop distributions by compiling the source code and then configuring it correctly as required.

Note

Depending on the latest version of Impala, requirements might change, so please visit the Cloudera Impala website for updated information.

Dependency on Hive for Impala

Even though the common perception is that Impala needs Hive to function, this is not completely true. The fact is that only the Hive metastore is required for Impala to function, and Hive itself can be installed on some other client machine. Hive does not need to be installed on the same DataNodes as Impala; as long as Impala can access the Hive metastore, it will function as expected. In brief, the Hive metastore stores table- and partition-specific information, which is also called metadata.

As Hive uses PostgreSQL or MySQL as the backing database for its metastore, we can also consider that either PostgreSQL or MySQL is required for Impala.
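
To give a concrete idea of this dependency, the following is a minimal sketch of the hive-site.xml entries that point the metastore at a MySQL database; the hostname and database name are placeholders:

    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://metastore-host:3306/metastore</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
    </property>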

Dependency on Java for Impala

For those who don't know, Impala is written in C++. However, Impala uses Java to communicate with various Hadoop components. In Impala, the impala-dependencies.jar file, located at /usr/lib/impala/lib, includes all the required Java dependencies. The Oracle JVM is the officially supported JVM for Impala; other JVMs might cause problems while running Impala.
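
A quick way to verify both pieces on a node, assuming the package layout described above, is a sketch like the following:

    $ ls /usr/lib/impala/lib/impala-dependencies.jar   # Java dependencies bundled with Impala
    $ java -version                                    # should report the Oracle (HotSpot) JVM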

Hardware dependency

The source datasets processed by Impala, especially during join operations, can be very large, and because processing is done in memory, as an Impala user you must make sure that you have sufficient memory to process the join operations. The memory requirement depends on the size of the source dataset that you are going to process through Impala. Keep in mind that Impala cannot run queries whose working set is greater than the maximum available RAM; when memory is insufficient, Impala will not be able to process the query and the query will be canceled (see the sketch below).
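
One way to make this limit explicit is the MEM_LIMIT query option, which caps how much memory a query may consume per node. A minimal sketch inside impala-shell; the value and table names are illustrative:

    > SET MEM_LIMIT=4000000000;   -- cap the query at roughly 4 GB per node
    > SELECT a.id, b.total FROM big_table a JOIN other_table b ON (a.id = b.id);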

For best performance with Impala, it is suggested to have DataNodes with multiple storage disks, because disk I/O speed is often the bottleneck for Impala performance. The total amount of physical storage required depends on the source data that you want to process with Impala.

As Impala uses the SSE4.2 CPU instruction set, which is found mostly in recent processors, newer processors are suggested for better performance with Impala.

Networking requirements

Impala daemons running on DataNodes can process data stored on the local node as well as on remote nodes. To achieve the highest performance, Impala should complete data processing on local data instead of fetching remote data over a network connection. For local processing to happen, Impala matches the hostname provided to each Impala daemon with the IP address of the local DataNode by resolving the daemon's hostname flag to an IP address. For Impala to work with the local data stored on a DataNode, you must use a single IP interface for the DataNode and the Impala daemon on each machine. Since there is a single IP address, make sure that each Impala daemon's hostname flag resolves to the IP address of its own DataNode (see the sketch below).
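
For instance, with a package-based installation the daemon startup flags typically live in /etc/default/impala; the following is a minimal sketch, with dn01.example.com as a hypothetical DataNode hostname:

    # /etc/default/impala (other startup flags omitted)
    IMPALA_SERVER_ARGS="-hostname=dn01.example.com"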

User account requirements

When Impala is installed, a user named impala and a group named impala are created, and Impala runs under this user and group after installation. You must ensure that no one changes the impala user and group settings, and that no other application or system activity obstructs the functionality of the impala user and group. To achieve the highest performance, Impala uses direct reads, and because a root user cannot perform direct reads, Impala should not be executed as root; make sure that Impala is not running as the root user (see the sketch below).
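
A quick sanity check on any node, sketched with standard Linux commands:

    $ id impala       # verify that the impala user and group exist
    $ ps -u impala    # confirm the Impala processes run as impala, not root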
