Impala is the default MPP SQL engine for Kudu. Impala allows you to interact with Kudu using SQL. If you have experience with traditional relational databases where the SQL and storage engines are tightly integrated, you might find it unusual that Kudu and Impala are decoupled from each other. Impala was designed to work with other storage engines such as HDFS, HBase, and S3, not just Kudu. There’s also work underway to integrate other SQL engines such as Apache Drill (DRILL-4241) and Hive (HIVE-12971) with Kudu. Decoupling storage, SQL, and processing engines is common in the open source community.
The Impala-Kudu integration works well, but there is still work to be done. While it matches or exceeds traditional data warehouse platforms in terms of performance and scalability, Impala-Kudu still lacks some of the enterprise features found in most traditional data warehouse platforms; Kudu is a young project. We discuss some of these limitations later in the chapter.
Primary Key
Every Kudu table needs to have a primary key. When creating Kudu tables, the column or columns used as the primary key must be listed first. Kudu's primary key is implemented as a clustered index: the rows are stored physically in the tablet in the same order as the index. Also note that Kudu doesn't have an auto-increment feature, so you will have to include a unique primary key value when inserting rows into a Kudu table. If you don't have a primary key value, you can use Impala's built-in uuid() function or a more efficient method of generating a unique value.
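As a sketch, assuming a hypothetical `user_events` table with a string primary key, uuid() can supply the required unique key value:

```sql
-- Primary key column must be listed first in the column list
CREATE TABLE user_events (
  event_id STRING,
  event_name STRING,
  PRIMARY KEY (event_id)
)
PARTITION BY HASH (event_id) PARTITIONS 4
STORED AS KUDU;

-- uuid() generates a unique string to serve as the primary key value
INSERT INTO user_events VALUES (uuid(), 'login');
```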
Data Types
Like other relational databases, Kudu supports various data types (Table 4-1).
Table 4-1. List of Data Types, with Available and Default Encodings

| Data Type | Available Encodings | Default Encoding |
| --- | --- | --- |
| boolean | plain, run length | run length |
| 8-bit signed integer | plain, bitshuffle, run length | bitshuffle |
| 16-bit signed integer | plain, bitshuffle, run length | bitshuffle |
| 32-bit signed integer | plain, bitshuffle, run length | bitshuffle |
| 64-bit signed integer | plain, bitshuffle, run length | bitshuffle |
| unixtime_micros (64-bit microseconds since the Unix epoch) | plain, bitshuffle, run length | bitshuffle |
| single-precision (32-bit) IEEE-754 floating-point number | plain, bitshuffle | bitshuffle |
| double-precision (64-bit) IEEE-754 floating-point number | plain, bitshuffle | bitshuffle |
| UTF-8 encoded string (up to 64KB uncompressed) | plain, prefix, dictionary | dictionary |
| binary (up to 64KB uncompressed) | plain, prefix, dictionary | dictionary |
Because of the approximate nature of IEEE-754 floating-point values, i float and double are not appropriate for storing financial data. At the time of writing, support for the decimal data type is still under development (Apache Kudu 1.5 / CDH 5.13). Check KUDU-721 for more details. There are various workarounds. You can store financial data as string and then use Impala to cast the value to decimal every time you need to read the data. Since Parquet supports decimals, another workaround is to use Parquet for your fact tables and Kudu for dimension tables. Kudu committers are working on adding decimal support, which may be included in newer versions of Kudu.
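A sketch of the first workaround, assuming a hypothetical `orders` table whose `amount` column is stored as a string in Kudu:

```sql
-- Cast the string column to decimal at read time
SELECT order_id,
       CAST(amount AS DECIMAL(12,2)) AS amount
FROM orders;
```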
As shown in Table 4-1, Kudu columns can use different encoding types depending on the type of column. Supported encoding types include Plain, Bitshuffle, Run Length, Dictionary, and Prefix. By default, Kudu columns are uncompressed. Kudu supports column compression using the Snappy, zlib, or LZ4 compression codecs. Compression and encoding can significantly reduce space overhead and improve performance. Refer to Kudu's online documentation for more information on encoding and compression.
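As a sketch, assuming a hypothetical `metrics` table, per-column encoding and compression can be specified at table creation time:

```sql
CREATE TABLE metrics (
  host STRING ENCODING DICT_ENCODING COMPRESSION LZ4,  -- dictionary-encoded, LZ4-compressed
  metric STRING,
  value DOUBLE ENCODING BIT_SHUFFLE,
  PRIMARY KEY (host, metric)
)
PARTITION BY HASH (host) PARTITIONS 4
STORED AS KUDU;
```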
Note
In earlier versions of Kudu, date and time are represented as a BIGINT. You can use the TIMESTAMP data type in Kudu tables starting in Impala 2.9/CDH 5.12. However, there are several things to keep in mind. Kudu represents date and time columns using 64-bit values, while Impala represents date and time as 96-bit values. Nanosecond values generated by Impala are rounded when stored in Kudu. When reading and writing TIMESTAMP columns, there is overhead in converting between Kudu's 64-bit representation and Impala's 96-bit representation. There are two workarounds: use the Kudu client API or Spark to insert data, or continue using BIGINT to represent date and time. ii
Internal and External Impala Tables
You can create internal and external tables in Impala.
Internal Tables
Internal tables are created and managed by Impala. Internal tables are immediately visible to Impala as soon as they are created. Administration tasks such as dropping and renaming tables are performed using Impala. Here's an example of how to create an internal table in Impala.
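A minimal sketch, using a hypothetical `customers` table:

```sql
-- Impala creates and manages the underlying Kudu table
CREATE TABLE customers (
  id BIGINT,
  name STRING,
  age TINYINT,
  PRIMARY KEY (id)
)
PARTITION BY HASH (id) PARTITIONS 4
STORED AS KUDU;
```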
External Tables
Kudu tables created via the Kudu API or Spark are not immediately visible to Impala. The table exists in Kudu, but because it was not created via Impala, Impala does not know anything about it. An external table must be created in Impala that references the Kudu table. Dropping an external table only removes the mapping between Impala and the Kudu table; it does not drop the physical table in Kudu. Below is an example of how to create an external table in Impala for an existing Kudu table.
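A sketch, assuming a table named `customers` already exists in Kudu:

```sql
-- Creates only a mapping; dropping this table leaves the Kudu table intact
CREATE EXTERNAL TABLE customers
STORED AS KUDU
TBLPROPERTIES ('kudu.table_name' = 'customers');
```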
Changing Data
You can change the data stored in Kudu tables using SQL statements via Impala, just like traditional relational databases. This is one of the main reasons why you would use Kudu over immutable storage formats such as ORC or Parquet.
Inserting Rows
You can use a standard SQL insert statement.
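For example, against a hypothetical `customers` table:

```sql
INSERT INTO customers VALUES (1, 'John Doe', 35);
```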
Inserting rows with multiple VALUES subclauses.
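For example, assuming a hypothetical `customers` table:

```sql
INSERT INTO customers VALUES
  (2, 'Jane Smith', 30),
  (3, 'Mary Jones', 27);
```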
Bulk inserts are also supported.
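For example, loading from a hypothetical staging table:

```sql
INSERT INTO customers
SELECT id, name, age
FROM staging_customers;
```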
Updating Rows
Standard SQL update statement.
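For example, against a hypothetical `customers` table:

```sql
UPDATE customers SET age = 36 WHERE id = 1;
```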
Bulk updates are also supported.
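For example, updating every matching row in a hypothetical `customers` table:

```sql
UPDATE customers SET age = age + 1 WHERE age < 30;
```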
Upserting Rows
Upserts are supported. If the primary key does not exist in the table, the entire row is inserted.
However, if the primary key already exists, the columns are updated with the new values.
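A sketch against a hypothetical `customers` table; the first statement inserts a new row, while the second updates the existing row whose key is 1:

```sql
UPSERT INTO customers VALUES (4, 'Ben Davis', 42);  -- key 4 absent: row is inserted
UPSERT INTO customers VALUES (1, 'John Doe', 37);   -- key 1 present: row is updated
```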
Deleting Rows
Standard SQL Delete statement.
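For example, against a hypothetical `customers` table:

```sql
DELETE FROM customers WHERE id = 3;
```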
More complicated Delete statements are also supported.
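For example, deleting rows whose keys appear in another hypothetical table:

```sql
DELETE FROM customers
WHERE id IN (SELECT id FROM churned_customers);
```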
Note
As discussed in Chapter 2, Kudu doesn't support ACID-compliant transactions. An update will not be rolled back if it fails halfway through. Additional data validation must be performed after an update to ensure data integrity.
Changing Schema
Typical database administration tasks such as renaming tables, adding and dropping range partitions, and dropping tables are supported with Kudu. For more information, please consult Kudu’s online documentation.
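As a sketch, assuming a hypothetical `customers` table and a hypothetical range-partitioned `events` table:

```sql
-- Rename a table
ALTER TABLE customers RENAME TO customers_archive;

-- Add and drop range partitions
ALTER TABLE events ADD RANGE PARTITION '2018-01-01' <= VALUES < '2018-02-01';
ALTER TABLE events DROP RANGE PARTITION '2018-01-01' <= VALUES < '2018-02-01';

-- Drop a table
DROP TABLE customers_archive;
```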
Partitioning
Table partitioning is a common way to enhance performance, availability, and manageability of Kudu tables. Partitioning allows tables to be subdivided into smaller segments, or tablets. Partitioning enables Kudu to take advantage of partition pruning by allowing access to tablets at a finer level of granularity. Table partitioning is required for all Kudu tables and is completely transparent to applications. Kudu supports Hash, Range, and Composite Hash-Range and Hash-Hash partitioning. Below are a few examples of partitioning in Kudu. Partitioning is discussed in more detail in Chapter 2.
Hash Partitioning
Hash partitioning uses a hash key to distribute rows evenly across the different tablets.
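A sketch, assuming a hypothetical `accounts` table:

```sql
CREATE TABLE accounts (
  id BIGINT,
  name STRING,
  PRIMARY KEY (id)
)
PARTITION BY HASH (id) PARTITIONS 8  -- 8 tablets; rows spread by hash of id
STORED AS KUDU;
```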
Range Partitioning
Range partitioning stores ranges of data separately in different tablets.
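A sketch, assuming a hypothetical `events` table range-partitioned by month:

```sql
CREATE TABLE events (
  event_time TIMESTAMP,
  event_name STRING,
  PRIMARY KEY (event_time, event_name)
)
PARTITION BY RANGE (event_time) (
  PARTITION '2018-01-01' <= VALUES < '2018-02-01',
  PARTITION '2018-02-01' <= VALUES < '2018-03-01'
)
STORED AS KUDU;
```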
Hash-Range Partitioning
Hash partitioning spreads writes evenly across tablets, while range partitioning enables partition pruning and allows adding and removing partitions. Composite hash-range partitioning combines the strengths of both partitioning schemes while limiting their weaknesses. This is a good partitioning scheme for IoT use cases.
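A sketch of an IoT-style table, assuming a hypothetical `sensor_readings` table: writes are hashed across tablets by sensor, while time-based range partitions support pruning and rolling retention.

```sql
CREATE TABLE sensor_readings (
  sensor_id BIGINT,
  reading_time TIMESTAMP,
  value DOUBLE,
  PRIMARY KEY (sensor_id, reading_time)
)
PARTITION BY HASH (sensor_id) PARTITIONS 4,  -- spread writes by sensor
RANGE (reading_time) (                       -- prune and manage by time
  PARTITION '2018-01-01' <= VALUES < '2018-02-01'
)
STORED AS KUDU;
```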
Hash-Hash Partitioning
You can have multiple levels of hash partitions in your table. Each partition level should use a different hashed column.
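A sketch with two hash levels on different columns, assuming a hypothetical `web_logs` table:

```sql
CREATE TABLE web_logs (
  host STRING,
  url STRING,
  hits BIGINT,
  PRIMARY KEY (host, url)
)
PARTITION BY HASH (host) PARTITIONS 4,  -- first hash level
HASH (url) PARTITIONS 4                 -- second hash level on a different column
STORED AS KUDU;
```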
List Partitioning
You may have used list partitions if you are familiar with other relational database management systems such as Oracle. While Kudu doesn’t technically have list partitioning, you can imitate its behavior using range partitioning.
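A sketch of the imitation, assuming a hypothetical `customers_by_state` table: single-value range partitions behave like list partitions.

```sql
CREATE TABLE customers_by_state (
  state STRING,
  id BIGINT,
  name STRING,
  PRIMARY KEY (state, id)
)
PARTITION BY RANGE (state) (
  PARTITION VALUE = 'CA',  -- one single-value "list" partition per state
  PARTITION VALUE = 'NY',
  PARTITION VALUE = 'TX'
)
STORED AS KUDU;
```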
Note
Partitioning is not a panacea. It is just one of the many ways to optimize Kudu. You may still run into performance issues when ingesting data into Kudu if you are using default Kudu settings. Make sure you adjust maintenance_manager_num_threads, iii which sets the number of maintenance threads and can help speed up compactions and flushes. You can monitor the bloom_lookups_per_op metric and memory pressure rejections to see if compactions and flushes are affecting performance. Another setting you may need to adjust is memory_limit_hard_bytes, which controls the total amount of memory allocated to the Kudu daemon. iv Refer to the online Kudu documentation for more details.
Using JDBC with Apache Impala and Kudu
Federation with SQL Server Linked Server and Oracle Gateway
You can create database links from SQL Server and Oracle to Impala. This is sometimes useful when copying data back and forth is too burdensome and the data is relatively small. Additionally, if you need near-real-time access from SQL Server or Oracle to the latest data stored in Kudu and ETL is too slow, then accessing the data via a database link is an option. Here's an example of how to create a database link in SQL Server. Users can then access data stored in a remote Impala environment (as shown in Listing 4-1). Note that it is generally recommended to use OPENQUERY when executing queries to ensure the query is executed on the remote server.
Creating and using linked server from SQL Server to Impala
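A T-SQL sketch; the linked server name `ClouderaImpala` and the ODBC data source name are placeholders for your environment, which is assumed to have an Impala ODBC driver and system DSN configured:

```sql
-- Create a linked server over the Impala ODBC DSN
EXEC master.dbo.sp_addlinkedserver
     @server     = N'ClouderaImpala',
     @srvproduct = N'Impala',
     @provider   = N'MSDASQL',
     @datasrc    = N'ImpalaDSN';  -- hypothetical ODBC system DSN

-- OPENQUERY sends the query text to the remote server for execution
SELECT *
FROM OPENQUERY(ClouderaImpala, 'SELECT id, name FROM customers');
```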
You can create a database link from Oracle to Impala using Oracle’s ODBC Heterogeneous Gateway. Refer to Oracle’s documentation for more details.
Database links are appropriate for accessing small- to medium-sized data sets. If you need a full-blown data virtualization tool, you might want to look at PolyBase, Denodo, or TIBCO Data Virtualization (previously owned by Cisco).
Summary
Impala provides Kudu with a powerful MPP SQL engine. Together, they rival traditional data warehouse platforms in terms of performance and scalability. Kudu committers and contributors are hard at work adding more features and capabilities. Chapter 2 provides an in-depth discussion of Kudu, including its limitations. For more information on Impala, I refer you to Chapter 3. Chapter 8 covers using Impala and Kudu for big data warehousing. Chapter 9 shows users how to use Impala and Kudu for real-time data visualization.
References
- i. Microsoft; "Complete tutorial to understand IEEE floating point errors," Microsoft, 2018, https://support.microsoft.com/en-us/help/42980/-complete-tutorial-to-understand-ieee-floating-point-errors
- ii. Apache Impala; "TIMESTAMP Data Type," Apache Impala, 2018, https://impala.apache.org/docs/build/html/topics/impala_timestamp.html
- iii. Cloudera; "Apache Kudu Background Maintenance Tasks," Cloudera, 2018, https://www.cloudera.com/documentation/kudu/latest/topics/kudu_background_tasks.html
- iv. Todd Lipcon; "Re: How to calculate the optimal value of `maintenance_manager_num_threads`," 2017, https://www.mail-archive.com/[email protected]/msg00358.html
- v. Saikrishna Teja Bobba; "Tutorial: Using Impala JDBC and SQL with Apache Kudu," Progress, 2017, https://www.progress.com/blogs/tutorial-using-impala-jdbc-and-sql-with-apache-kudu