© Butch Quinto 2018
Butch Quinto, Next-Generation Big Data, https://doi.org/10.1007/978-1-4842-3147-0_4

4. High Performance Data Analysis with Impala and Kudu

Butch Quinto
Plumpton, Victoria, Australia

Impala is the default MPP SQL engine for Kudu. Impala allows you to interact with Kudu using SQL. If you have experience with traditional relational databases where the SQL and storage engines are tightly integrated, you might find it unusual that Kudu and Impala are decoupled from each other. Impala was designed to work with other storage engines such as HDFS, HBase, and S3, not just Kudu. There’s also work underway to integrate other SQL engines such as Apache Drill (DRILL-4241) and Hive (HIVE-12971) with Kudu. Decoupling storage, SQL, and processing engines is common in the open source community.

The Impala-Kudu integration works well, but there is still work to be done. While it matches or exceeds traditional data warehouse platforms in terms of performance and scalability, Impala-Kudu still lacks some of the enterprise features found in most of those platforms; Kudu is a young project. We discuss some of these limitations later in the chapter.

Primary Key

Every Kudu table needs to have a primary key. When creating Kudu tables, the column or columns used as the primary key must be listed first. Kudu's primary key is implemented as a clustered index; with a clustered index, the rows are stored physically in the tablet in the same order as the index. Also note that Kudu doesn't have an auto-increment feature, so you will have to include a unique primary key value when inserting rows into a Kudu table. If you don't have a primary key value, you can use Impala's built-in uuid() function or a more efficient method of generating a unique value, as sketched below.
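Here's a minimal sketch of generating a surrogate key with uuid(). Because uuid() returns a STRING, the sketch assumes a table keyed on a STRING column; the table and column names are hypothetical.

CREATE TABLE events (
  id STRING,
  event_name STRING,
  PRIMARY KEY(id)
)
PARTITION BY HASH PARTITIONS 4
STORED AS KUDU;

-- generate the key at insert time; uuid() returns a STRING
INSERT INTO events
SELECT uuid(), event_name FROM staged_events;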

Data Types

Like other relational databases, Kudu supports various data types (Table 4-1).

You may notice that Kudu does not support the decimal data type. This is a key limitation in Kudu. The float and double data types only store a very close approximation of a value instead of the exact value, as defined in the IEEE 754 specification. i
Table 4-1
List of Data Types, with Available and Default Encoding

Data Type                                                  | Encoding                      | Default
boolean                                                    | plain, run length             | run length
8-bit signed integer                                       | plain, bitshuffle, run length | bitshuffle
16-bit signed integer                                      | plain, bitshuffle, run length | bitshuffle
32-bit signed integer                                      | plain, bitshuffle, run length | bitshuffle
64-bit signed integer                                      | plain, bitshuffle, run length | bitshuffle
unixtime_micros (64-bit microseconds since the Unix epoch) | plain, bitshuffle, run length | bitshuffle
single-precision (32-bit) IEEE-754 floating-point number   | plain, bitshuffle             | bitshuffle
double-precision (64-bit) IEEE-754 floating-point number   | plain, bitshuffle             | bitshuffle
UTF-8 encoded string (up to 64KB uncompressed)             | plain, prefix, dictionary     | dictionary
binary (up to 64KB uncompressed)                           | plain, prefix, dictionary     | dictionary

Because of this behavior, float and double are not appropriate for storing financial data. At the time of writing, support for the decimal data type is still under development (Apache Kudu 1.5 / CDH 5.13); check KUDU-721 for more details. There are various workarounds. You can store financial data as a string and then use Impala to cast the value to decimal every time you need to read the data, as sketched below. Since Parquet supports decimals, another workaround is to use Parquet for your fact tables and Kudu for your dimension tables. Kudu committers are working on adding decimal support, which may be included in newer versions of Kudu.
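Here's a minimal sketch of the store-as-string workaround; the table name, column names, and the DECIMAL(18,2) precision and scale are assumptions for illustration.

CREATE TABLE payments (
  id BIGINT,
  amount STRING,  -- exact decimal value kept as text
  PRIMARY KEY(id)
)
PARTITION BY HASH PARTITIONS 4
STORED AS KUDU;

-- cast back to DECIMAL on read
SELECT id, CAST(amount AS DECIMAL(18,2)) AS amount
FROM payments;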

As shown in Table 4-1, Kudu columns can use different encoding types depending on the type of column. Supported encoding types include Plain, Bitshuffle, Run Length, Dictionary, and Prefix. By default, Kudu columns are uncompressed. Kudu supports column compression using the Snappy, zlib, or LZ4 compression codecs. Compression and encoding can significantly reduce space overhead and improve performance. Refer to Kudu's online documentation for more information on encoding and compression. The sketch below shows how encoding and compression can be declared per column.
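Here's a minimal sketch of per-column encoding and compression attributes in an Impala CREATE TABLE statement, assuming Impala 2.9/CDH 5.12 or later; the table and the specific encoding and compression choices are illustrative, not recommendations.

CREATE TABLE page_views (
  id BIGINT ENCODING BIT_SHUFFLE,
  url STRING ENCODING DICT_ENCODING COMPRESSION SNAPPY,
  payload STRING ENCODING PLAIN_ENCODING COMPRESSION LZ4,
  PRIMARY KEY(id)
)
PARTITION BY HASH PARTITIONS 4
STORED AS KUDU;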

Note

In earlier versions of Kudu, date and time are represented as a BIGINT. You can use the TIMESTAMP data type in Kudu tables starting in Impala 2.9/CDH 5.12. However, there are several things to keep in mind. Kudu represents date and time columns using 64-bit values, while Impala represents date and time as 96-bit values. Nanosecond values generated by Impala are rounded when stored in Kudu, and when reading and writing TIMESTAMP columns there is overhead converting between Kudu's 64-bit representation and Impala's 96-bit representation. There are two workarounds: use the Kudu client API or Spark to insert data, or continue using BIGINT to represent date and time, as sketched below. ii
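Here's a minimal sketch of the BIGINT workaround: store microseconds since the Unix epoch and convert on read. The table and column names are hypothetical.

CREATE TABLE readings (
  id BIGINT,
  event_time BIGINT,  -- microseconds since the Unix epoch
  PRIMARY KEY(id, event_time)
)
PARTITION BY HASH (id) PARTITIONS 4
STORED AS KUDU;

-- unix_timestamp() returns seconds; scale to microseconds on write
INSERT INTO readings VALUES (1, unix_timestamp(now()) * 1000000);

-- scale back to seconds and convert on read
SELECT id, from_unixtime(CAST(event_time / 1000000 AS BIGINT)) AS event_time
FROM readings;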

Internal and External Impala Tables

You can create internal and external tables in Impala.

Internal Tables

Internal tables are created and managed by Impala, and are visible to Impala as soon as they are created. Administration tasks such as dropping and renaming tables are performed using Impala. Here's an example of how to create an internal table in Impala.

CREATE TABLE users
(
  id BIGINT,
  name STRING,
  age TINYINT,
  salary FLOAT,
  PRIMARY KEY(id)
)
PARTITION BY HASH PARTITIONS 8
STORED AS KUDU;

External Tables

Kudu tables created via the Kudu API or Spark are not immediately visible to Impala. The table exists in Kudu, but because it was not created via Impala, Impala knows nothing about it. An external table must be created in Impala that references the Kudu table. Dropping an external table only removes the mapping between Impala and the Kudu table; it does not drop the physical table in Kudu. Below is an example of how to create an external table in Impala over an existing Kudu table.

CREATE EXTERNAL TABLE users
STORED AS KUDU
TBLPROPERTIES (
  'kudu.table_name' = 'kudu_users'
);

Changing Data

You can change the data stored in Kudu tables using SQL statements via Impala, just like traditional relational databases. This is one of the main reasons why you would use Kudu over immutable storage formats such as ORC or Parquet.

Inserting Rows

You can use a standard SQL INSERT statement.

INSERT INTO users VALUES (100, "John Smith", 25, 50000);

Inserting rows with multiple VALUES subclauses.

INSERT INTO users VALUES (100, "John Smith", 25, 50000),
  (200, "Cindy Nguyen", 38, 120000),
  (300, "Steve Mankiw", 60, 75000);

Bulk inserts are also supported.

INSERT INTO users SELECT * from users_backup;

Updating Rows

Standard SQL UPDATE statement.

UPDATE users SET age=39 WHERE name = 'Cindy Nguyen';

Bulk updates are also supported.

UPDATE users SET salary=150000 WHERE id > 100;

Upserting Rows

Upserts are supported. If the primary key does not exist in the table, the entire row is inserted.

UPSERT INTO users VALUES (400, "Mike Jones", 21, 80000);

However, if the primary key already exists, the columns are updated with the new values.

UPSERT INTO users VALUES (100, "John Smith", 27, 70000);

Deleting Rows

Standard SQL DELETE statement.

DELETE FROM users WHERE id < 3;

More complex DELETE statements are also supported.

DELETE FROM users WHERE id IN (SELECT id FROM users WHERE name = 'Steve Mankiw');

Note

As discussed in Chapter 2, Kudu doesn't support ACID-compliant transactions. An update will not be rolled back if it fails halfway through, so additional data validation must be performed after an update to ensure data integrity.

Changing Schema

Typical database administration tasks such as renaming tables, adding and dropping range partitions, and dropping tables are supported with Kudu; a few examples are sketched below. For more information, please consult Kudu's online documentation.
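Here's a minimal sketch of common schema-change statements in Impala; the table names are hypothetical, and the range-partition statements assume a table that is range partitioned by year, like the examples later in this chapter.

-- rename a table
ALTER TABLE users RENAME TO app_users;

-- add and drop range partitions
ALTER TABLE sensors ADD RANGE PARTITION VALUE = 2019;
ALTER TABLE sensors DROP RANGE PARTITION VALUE = 2016;

-- drop a table
DROP TABLE app_users;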

Partitioning

Table partitioning is a common way to enhance performance, availability, and manageability of Kudu tables. Partitioning allows tables to be subdivided into smaller segments, or tablets. Partitioning enables Kudu to take advantage of partition pruning by allowing access to tablets at a finer level of granularity. Table partitioning is required for all Kudu tables and is completely transparent to applications. Kudu supports Hash, Range, and Composite Hash-Range and Hash-Hash partitioning. Below are a few examples of partitioning in Kudu. Partitioning is discussed in more detail in Chapter 2.

Hash Partitioning

Hash partitioning uses a hash key to distribute rows evenly across the different tablets.

CREATE TABLE myTable (
 id BIGINT NOT NULL,
 name STRING,
 PRIMARY KEY(id)
)
PARTITION BY HASH PARTITIONS 4
STORED AS KUDU;

Range Partitioning

Range partitioning stores ranges of data separately in different tablets.

CREATE TABLE myTable (
  year INT,
  deviceid INT,
  totalamt INT,
  PRIMARY KEY (deviceid, year)
)
PARTITION BY RANGE (year) (
  PARTITION VALUE = 2016,
  PARTITION VALUE = 2017,
  PARTITION VALUE = 2018
)
STORED AS KUDU;

Hash-Range Partitioning

Hash partitioning spreads writes evenly across tablets, while range partitioning enables partition pruning and allows partitions to be added and removed. Composite partitioning combines the strengths of both partitioning schemes while limiting their weaknesses. This is a good partitioning scheme for IoT use cases.

CREATE TABLE myTable (
 id BIGINT NOT NULL,
 sensortimestamp BIGINT NOT NULL,
 sensorid INTEGER,
 temperature INTEGER,
 pressure INTEGER,
 PRIMARY KEY(id, sensortimestamp)
)
PARTITION BY HASH (id) PARTITIONS 16,
RANGE (sensortimestamp)
(
PARTITION unix_timestamp('2017-01-01') <= VALUES < unix_timestamp('2018-01-01'),
PARTITION unix_timestamp('2018-01-01') <= VALUES < unix_timestamp('2019-01-01'),
PARTITION unix_timestamp('2019-01-01') <= VALUES < unix_timestamp('2020-01-01')
)
STORED AS KUDU;

Hash-Hash Partitioning

You can have multiple levels of hash partitions in your table. Each partition level should use a different hashed column.

CREATE TABLE myTable (
  id BIGINT,
  city STRING,
  name STRING,
  age TINYINT,
  PRIMARY KEY (id, city)
)
PARTITION BY HASH (id) PARTITIONS 8,
             HASH (city) PARTITIONS 8
STORED AS KUDU;

List Partitioning

You may have used list partitions if you are familiar with other relational database management systems such as Oracle. While Kudu doesn’t technically have list partitioning, you can imitate its behavior using range partitioning.

CREATE TABLE myTable (
  city STRING,
  name STRING,
  age TINYINT,
  PRIMARY KEY (city, name)
)
PARTITION BY RANGE (city)
(
  PARTITION VALUE = 'San Francisco',
  PARTITION VALUE = 'Los Angeles',
  PARTITION VALUE = 'San Diego',
  PARTITION VALUE = 'San Jose'
)
STORED AS KUDU;

Note

Partitioning is not a panacea; it is just one of the many ways to optimize Kudu. You may still run into performance issues when ingesting data into Kudu if you are using default Kudu settings. Make sure you adjust maintenance_manager_num_threads, iii which controls the number of maintenance threads and can help speed up compactions and flushes. You can monitor the bloom_lookups_per_op metric and memory pressure rejections to see if compactions and flushes are affecting performance. Another setting you may need to adjust is memory_limit_hard_bytes, which controls the total amount of memory allocated to the Kudu daemon. iv Refer to the online Kudu documentation for more details.
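As a rough sketch, both settings are Kudu tablet server flags; the values below are illustrative, not recommendations, and how the flags are passed (flag file or command line) varies by distribution.

# number of maintenance threads (compactions, flushes)
--maintenance_manager_num_threads=4
# hard memory limit for the kudu-tserver process (4 GB)
--memory_limit_hard_bytes=4294967296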

Using JDBC with Apache Impala and Kudu

Popular BI and data visualization tools such as Power BI, Tableau, Qlik, OBIEE, and MicroStrategy (to mention a few) can access Apache Impala and Kudu using JDBC/ODBC. The Impala JDBC drivers can be downloaded from Cloudera's website; the Progress DataDirect JDBC driver is another alternative. v JDBC drivers from different companies sometimes include additional performance features, so your experience may vary. Figure 4-1 shows a sample screenshot of a Zoomdata dashboard visualizing data stored in Kudu via Impala JDBC/ODBC.
Figure 4-1
Kudu data accessible from Zoomdata in real time
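For reference, here's a typical connection string for the Cloudera Impala JDBC driver; the host name is a placeholder, and 21050 is the default impalad port for JDBC/HiveServer2-style connections.

jdbc:impala://impala-host.example.com:21050/default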

Federation with SQL Server Linked Server and Oracle Gateway

You can create database links from SQL Server and Oracle to Impala. This is sometimes useful when copying data back and forth is too burdensome and the data is relatively small. Additionally, if you need near real-time access from SQL Server or Oracle to the latest data stored in Kudu and ETL is too slow, then accessing the data via a database link is an option. Here's an example of how to create a database link in SQL Server; users can then access data stored in a remote Impala environment (as shown in Listing 4-1). Note that it is generally recommended to use OpenQuery when executing queries, to ensure the query is executed on the remote server.

EXEC master.dbo.sp_addlinkedserver
  @server = 'ClouderaImpala', @srvproduct = 'ClouderaImpala',
  @provider = 'MSDASQL', @datasrc = 'ClouderaImpala',
  @provstr = 'Provider=MSDASQL.1;Persist Security Info=True;User ID=;Password=';
GO

SELECT * FROM OpenQuery(ClouderaImpala, 'SELECT * FROM order_items');
SELECT * FROM OpenQuery(ClouderaImpala, 'SELECT * FROM orders');

SELECT count(*)
FROM [ClouderaImpala].[IMPALA].[default].order_items;

SELECT count(*)
FROM [ClouderaImpala].[IMPALA].[default].orders;
GO

-- CREATE VIEW must be the only statement in its batch, hence the GO separators
CREATE VIEW OrderItems_v AS
SELECT * FROM OpenQuery(ClouderaImpala, 'SELECT * FROM order_items');
GO

CREATE VIEW Orders_v AS
SELECT * FROM OpenQuery(ClouderaImpala, 'SELECT * FROM orders');
GO

CREATE VIEW OrderDetail_v AS
SELECT * FROM OpenQuery(ClouderaImpala, 'SELECT o.order_id, oi.order_item_id, o.order_date, o.order_status
FROM [IMPALA].[default].orders o, [IMPALA].[default].order_items oi
WHERE o.order_id = oi.order_item_order_id');
GO
Listing 4-1
Creating and using a linked server from SQL Server to Impala

You can create a database link from Oracle to Impala using Oracle’s ODBC Heterogeneous Gateway. Refer to Oracle’s documentation for more details.
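As a rough sketch, once the gateway and an Impala ODBC DSN are configured, the link itself is created with standard Oracle SQL; the link name, credentials, and TNS alias below are hypothetical.

-- run on the Oracle side; 'impala_dsn' is a TNS alias pointing at the
-- heterogeneous services gateway configured for the Impala ODBC DSN
CREATE DATABASE LINK impala_link
CONNECT TO "impala_user" IDENTIFIED BY "impala_password"
USING 'impala_dsn';

-- query a Kudu-backed Impala table through the link
SELECT * FROM "order_items"@impala_link;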

Database links are appropriate for accessing small to medium-sized data sets. If you need a full-blown data virtualization tool, you might want to look at PolyBase, Denodo, or TIBCO Data Virtualization (previously owned by Cisco).

Summary

Impala provides Kudu with a powerful MPP SQL engine. Together, they rival traditional data warehouse platforms in terms of performance and scalability. Kudu committers and contributors are hard at work adding more features and capabilities. Chapter 2 provides an in-depth discussion of Kudu, including its limitations. For more information on Impala, I refer you to Chapter 3. Chapter 8 covers using Impala and Kudu for big data warehousing, and Chapter 9 shows how to use Impala and Kudu for real-time data visualization.

References

i. Microsoft; "Complete tutorial to understand IEEE floating point errors," Microsoft, 2018, https://support.microsoft.com/en-us/help/42980/-complete-tutorial-to-understand-ieee-floating-point-errors

ii. Apache Impala; "TIMESTAMP Data Type," Apache Impala, 2018, https://impala.apache.org/docs/build/html/topics/impala_timestamp.html

iii. Cloudera; "Apache Kudu Background Maintenance Tasks," Cloudera, 2018, https://www.cloudera.com/documentation/kudu/latest/topics/kudu_background_tasks.html

iv. Todd Lipcon; "Re: How to calculate the optimal value of `maintenance_manager_num_threads`," 2017, https://www.mail-archive.com/[email protected]/msg00358.html

v. Saikrishna Teja Bobba; "Tutorial: Using Impala JDBC and SQL with Apache Kudu," Progress, 2017, https://www.progress.com/blogs/tutorial-using-impala-jdbc-and-sql-with-apache-kudu