Partitions

By default, a simple HQL query scans the whole table. This slows down the performance when querying a big table. This issue could be resolved by creating partitions, which are very similar to what's in the RDBMS. In Hive, each partition corresponds to a predefined partition column(s), which maps to subdirectories in the table's directory in HDFS. When the table gets queried, only the required partitions (directory) of data in the table are being read, so the I/O and time of the query is greatly reduced. Using partition is a very easy and effective way to improve performance in Hive.

The following is an example of partition creation in HQL:

> CREATE TABLE employee_partitioned (
> name STRING,
> work_place ARRAY<STRING>,
> gender_age STRUCT<gender:STRING,age:INT>,
> skills_score MAP<STRING,INT>,
> depart_title MAP<STRING,ARRAY<STRING>> 
-- This is regular column > ) > PARTITIONED BY (year INT, month INT)
-- Use lower case partition column > ROW FORMAT DELIMITED > FIELDS TERMINATED BY '|' > COLLECTION ITEMS TERMINATED BY ',' > MAP KEYS TERMINATED BY ':'; No rows affected (0.293 seconds)


> DESC employee_partitioned;
-- Partition columns are listed twice
+-------------------------+-------------------------------+---------+
| col_name | data_type | comment |
+-------------------------+-------------------------------+---------+
| name | string | |
| work_place | array<string> | |
| gender_age | struct<gender:string,age:int> | |
| skills_score | map<string,int> | |
| depart_title | map<string,array<string>> | |
| year | int | |
| month | int | |
| | NULL | NULL |
| # Partition Information | NULL | NULL |
| # col_name | data_type | comment |
| | NULL | NULL |
| year | int | |
| month | int | |
+-------------------------+-------------------------------+---------+
13 rows selected (0.38 seconds)


> SHOW PARTITIONS employee_partitioned; -- Check partitions +------------+ | partition | +------------+ +------------+ No rows selected (0.177 seconds)

From the preceding result, we can see that the partition is not enabled automatically. We have to use the ALTER TABLE ADD PARTITION statement to add static partitions to a table. Here, static means the partition is being added manually. This command changes the table's metadata but does not load data. If the data does not exist in the partition's location, queries will not return any results. To drop the partition metadata, use the ALTER TABLE ... DROP PARTITION statement. For external tables, ALTER does not change data but metadata, drop partition will not drop data inside the partition. In order to remove data, we can use the hdfs dfs -rm command to remove data from HDFS for the external table. For internal tables, ALTER TABLE ... DROP PARTITION will remove both partition and data. The following are more examples of common operations on partition tables:

  1. Perform partition operations, such as add, remove, and rename partitions:
      > ALTER TABLE employee_partitioned ADD -- Add multiple static 
partitions > PARTITION (year=2018, month=11) PARTITION (year=2018,
month=12); No rows affected (0.248 seconds)


> SHOW PARTITIONS employee_partitioned; +---------------------+ | partition | +---------------------+ | year=2018/month=11 | | year=2018/month=12 | +---------------------+ 2 rows selected (0.108 seconds)

-- Drop partition with PURGE at the end will remove completely
-- Drop partition will NOT remove data for external table
-- Drop partition will remove data with partition for internal table
> ALTER TABLE employee_partitioned

> DROP IF EXISTS PARTITION (year=2018, month=11);
> SHOW PARTITIONS employee_partitioned; +---------------------+ | partition | +---------------------+ | year=2018/month=12 | +---------------------+ 1 row selected (0.107 seconds)


> ALTER TABLE employee_partitioned
> DROP IF EXISTS PARTITION (year=2017); -- Drop all partitions in
2017

> ALTER TABLE employee_partitioned
> DROP IF EXISTS PARTITION (month=9); -- Drop all month is 9

> ALTER TABLE employee_partitioned -- Rename exisiting partition
values
> PARTITION (year=2018, month=12)
> RENAME TO PARTITION (year=2018,month=10);
No rows affected (0.274 seconds)


> SHOW PARTITIONS employee_partitioned;
+---------------------+
| partition |
+---------------------+
| year=2018/month=10 |
+---------------------+
2 rows selected (0.274 seconds)


-- Below is failed
-- Because all partition columns should be specified for partition
rename
> --ALTER TABLE employee_partitioned PARTITION (year=2018)
> --RENAME TO PARTITION (year=2017);
  1. Load data into a table partition once the partition is created:
      > LOAD DATA INPATH '/tmp/hivedemo/data/employee.txt'
> OVERWRITE INTO TABLE employee_partitioned
> PARTITION (year=2018, month=12);
No rows affected (0.96 seconds)


> SELECT name, year, month FROM employee_partitioned; -- Verify data
loaded
+----------+-------+--------+
| name | year | month |
+----------+-------+--------+
| Michael | 2018 | 12 |
| Will | 2018 | 12 |
| Shelley | 2018 | 12 |
| Lucy | 2018 | 12 |
+----------+-------+--------+
4 rows selected (37.451 seconds)

To avoid manually adding static partitions, dynamic partition insert (or multipartition insert) is designed for dynamically determining which partitions should be added and populated while scanning the input table. This part is introduced in more detail for the INSERT statement in Chapter 5, Data Manipulation. To populate data in the partition, we can use the LOAD or INSERT statements. The statement only loads the data in the specified partition lists.

Although partition columns map to directory names rather than data, we can query or select them like regular columns in HQL to narrow down the result set.

The use case for static and dynamic partition is quite different. Static partition is often used for an external table containing data newly landed in HDFS. In this case, it often uses the date, such as yyyyMMdd, as the partition column. Whenever the data of the new day arrives, we add the day-specific static partition (by script) to the table, and then the newly arrived data is queryable from the table immediately. For dynamic partition, it is often being used for data transformation between internal tables with partition columns derived from data itself; see Chapter 5Data Manipulation.
  1. Remove data from the partition. Note, removing data will not remove the partition information. In order to do a complete data cleaning, we can drop the partition described in step 1 after the data is removed:

      -- For internal table, we use truncate
> TRUNCATE TABLE employee_partitioned PARTITION
(year=2018,month=12);


-- For external table, we have to use hdfs command
> dfs -rm -r -f /user/dayongd/employee_partitioned;
  1. Add regular columns to a partition table. Note, we CANNOT add new columns as partition columns. There are two options when adding/removing columns from a partition table, CASCADE and RESTRICT. The commonly used CASCADE option cascades the same change to all the partitions in the table. However,  RESTRICT is the default, limiting column changes only to table metadata, which means the changes will be only applied to new partitions rather than existing partitions:

      > ALTER TABLE employee_partitioned ADD COLUMNS (work string) 
CASCADE;
  1. We can change the existing partition column data type:

      > ALTER TABLE employee_partitioned PARTITION COLUMN(year string);
No rows affected (0.274 seconds)

> DESC employee_partitioned; -- Verify the changes
+-------------------------+-------------------------------+---------+
| col_name | data_type | comment |
+-------------------------+-------------------------------+---------+
| name | string | |
| work_place | array<string> | |
| gender_age | struct<gender:string,age:int> | |
| skills_score | map<string,int> | |
| depart_title | map<string,array<string>> | |
| work | string | |
| year | int | |
| month | int | |
| | NULL | NULL |
| # Partition Information | NULL | NULL |
| # col_name | data_type | comment |
| | NULL | NULL |
| year | string | |
| month | int | |
+-------------------------+-------------------------------+---------+
13 rows selected (0.38 seconds)
Right now, we can only change the partition column data type. We cannot add/remove a column from partition columns. If we have to change the partition design, we must back up and recreate the table, and then migrate the data. In addition, we are NOT able to change a non-partition table to a partition table directly.
  1. Changing the partition's other properties in terms of file format, location, protections, and concatenation have the same syntax to alter the table statement:
      > ALTER TABLE employee_partitioned PARTITION (year=2018) 
> SET FILEFORMAT ORC;
> ALTER TABLE employee_partitioned PARTITION (year=2018)
> SET LOCATION '/tmp/data';
> ALTER TABLE employee_partitioned PARTITION (year=2018) ENABLE
NO_DROP;

> ALTER TABLE employee_partitioned PARTITION (year=2018) ENABLE
OFFLINE;

> ALTER TABLE employee_partitioned PARTITION (year=2018) DISABLE
NO_DROP;

> ALTER TABLE employee_partitioned PARTITION (year=2018) DISABLE
OFFLINE;

> ALTER TABLE employee_partitioned PARTITION (year=2018) CONCATENATE;
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.15.17.154