Index design

Using indexes is a very common best practice for performance tuning in relational databases. Hive has supported index creation on tables/partitions since Hive v0.7.0. An index in Hive provides a key-based data view and better data access for certain operations, such as WHERE, GROUP BY, and JOIN. Using an index is always a cheaper alternative than full-table scans. The command to create an index in HQL is straightforward, as follows:

> CREATE INDEX idx_id_employee_id
> ON TABLE employee_id (employee_id)
> AS 'COMPACT'
> WITH DEFERRED REBUILD;
No rows affected (1.149 seconds)

In addition to this COMPACT index, which stores the pair of the indexed column's value and its block ID, HQL has also supported BITMAP indexes since v0.8.0 for column values with less variance, as shown in the following example:

> CREATE INDEX idx_gender_employee_id
> ON TABLE employee_id (gender_age)
> AS 'BITMAP'
> WITH DEFERRED REBUILD;
No rows affected (0.251 seconds)

The WITH DEFERRED REBUILD option in this example prevents the index from immediately being built. To build the index, we can issue the ALTER...REBUILD commands as shown in the following example. When data in the base table changes, the same command must be used again to bring the index up to date. This is an atomic operation. If the index rebuilt on a table has been previously indexed failed, the state of the index remains the same. See this example to build the index:

> ALTER INDEX idx_id_employee_id ON employee_id REBUILD;

No rows affected (111.413 seconds)

> ALTER INDEX idx_gender_employee_id ON employee_id REBUILD; No rows affected (82.23 seconds)

Once the index is built, a new index table is created for each index with the name in the format of <database_name>__<table_name>_<index_name>__:

> SHOW TABLES '*idx*';
+-----------+---------------------------------------------+-----------+
|TABLE_SCHEM|                 TABLE_NAME                  | TABLE_TYPE|
+-----------+---------------------------------------------+-----------+
|default    |default__employee_id_idx_id_employee_id__    |INDEX_TABLE|
|default    |default__employee_id_idx_gender_employee_id__|INDEX_TABLE|
+-----------+---------------------------------------------+-----------+

The index table contains the indexed column, the _bucketname (a typical file URI on HDFS), and _offsets (offsets for each row). Then, this index table can be referred to when we query the indexed columns from the indexed table, as shown here:

> DESC default__employee_id_idx_id_employee_id__;
+--------------+----------------+----------+
|   col_name   |   data_type    | comment  |
+--------------+----------------+----------+
| employee_id  | int            |          |
| _bucketname  | string         |          |
| _offsets     | array<bigint>  |          |
+--------------+----------------+----------+
3 rows selected (0.135 seconds)

> SELECT * FROM default__employee_id_idx_id_employee_id__;
+--------------+------------------------------------------------------+
| employee_id | _bucketname | _offsets |
+--------------+------------------------------------------------------+
| 100 | .../warehouse/employee_id/employee_id.txt | [0] |
| 101 | .../warehouse/employee_id/employee_id.txt | [66] |
| 102 | .../warehouse/employee_id/employee_id.txt | [123] |
| ... | ... ... | ... |
+--------------+-------------------------------------------+----------+
25 rows selected (0.219 seconds)

To drop an index, we can only use the DROP INDEX index_name ON table_name statement as follows. We cannot drop the index with a DROP TABLE statement:

> DROP INDEX idx_gender_employee_id ON employee_id;
No rows affected (0.247 seconds)
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.226.177.125