Use skewed/temporary tables

Besides regular internal, external, or partitioned tables, we should also consider using a skewed or temporary table for better design and performance.

Since Hive v0.10.0, HQL has supported a special kind of table for organizing skewed data. A skewed table can improve performance by automatically splitting the specified skewed values into separate files or directories. As a result, the total number of files or partition folders is smaller than it would be if the table were partitioned on that column, and a query can include or skip the skewed data quickly and efficiently. The following example creates a skewed table:

> CREATE TABLE sample_skewed_table (
> dept_no int,
> dept_name string
> )
> SKEWED BY (dept_no) ON (1000, 2000); -- Specify the skewed values
No rows affected (3.122 seconds)

> DESC FORMATTED sample_skewed_table;
+-----------------+------------------+---------+
| col_name | data_type | comment |
+-----------------+------------------+---------+
| ... | ... | |
| Skewed Columns: | [dept_no] | NULL |
| Skewed Values: | [[1000], [2000]] | NULL |
| ... | ... | |
+-----------------+------------------+---------+
33 rows selected (0.247 seconds)
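
To have the skewed values written to separate directories rather than separate files, the STORED AS DIRECTORIES clause can be appended to the same statement. The following is a minimal sketch; the table name sample_skewed_dir_table is hypothetical and only mirrors the example above:

> CREATE TABLE sample_skewed_dir_table (
> dept_no int,
> dept_name string
> )
> SKEWED BY (dept_no) ON (1000, 2000) -- Same skewed values as before
> STORED AS DIRECTORIES; -- One directory per skewed value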

On the other hand, using temporary tables in HQL to keep intermediate data during recursive data processing saves the effort of rebuilding a common or shared result set. In addition, temporary tables can leverage storage policy settings to keep their data on SSD or in memory, which further improves performance.
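
As a rough sketch of this usage, the following statements set the storage policy for temporary tables and then keep an intermediate aggregate in a temporary table. The table name tmp_dept_count and the query itself are hypothetical and reuse sample_skewed_table from the earlier example. The hive.exec.temporary.table.storage property accepts memory, ssd, or default, and whether ssd or memory actually takes effect depends on how the underlying HDFS storage is configured:

> SET hive.exec.temporary.table.storage=ssd; -- Or memory/default
> CREATE TEMPORARY TABLE tmp_dept_count AS
> SELECT dept_no, count(*) AS cnt
> FROM sample_skewed_table
> GROUP BY dept_no; -- Intermediate result kept for reuse

A temporary table is visible only to the session that created it and is dropped automatically when that session ends, so later queries in the same session can read tmp_dept_count instead of recomputing the aggregate.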
