Buckets

Besides partitions, buckets are another technique for clustering datasets into more manageable parts to optimize query performance. Different from a partition, which corresponds to a directory, a bucket corresponds to segments of files in HDFS. For example, the employee_partitioned table from the previous section uses year and month as the top-level partitions. If there were a further request to use employee_id as a third level of partitioning, it would create too many partition directories. Instead, we can bucket the employee_partitioned table using employee_id as the bucket column. The value of this column is hashed into a user-defined number of buckets, so records with the same employee_id are always stored in the same bucket (segment of files). Bucket columns are defined with the CLUSTERED BY keywords. They are quite different from partition columns, since partition columns refer to directories, while bucket columns have to be actual columns of the table data. By using buckets, an HQL query can easily and efficiently perform sampling (see Chapter 6, Data Aggregation and Sampling), bucket map joins, and map-side joins (see Chapter 4, Data Correlation and Scope). An example of creating a bucket table is shown as follows:

-- Prepare the employee_id table and its dataset to populate the bucket table
> CREATE TABLE employee_id (
> name STRING,
> employee_id INT, 
> work_place ARRAY<STRING>,
> gender_age STRUCT<gender:STRING,age:INT>,
> skills_score MAP<STRING,INT>,
> depart_title MAP<STRING,ARRAY<STRING>>
> )
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '|'
> COLLECTION ITEMS TERMINATED BY ','
> MAP KEYS TERMINATED BY ':';
No rows affected (0.101 seconds)
  

> LOAD DATA INPATH '/tmp/hivedemo/data/employee_id.txt'
> OVERWRITE INTO TABLE employee_id;
No rows affected (0.112 seconds)

-- Create the bucket table
> CREATE TABLE employee_id_buckets (
> name STRING,
> employee_id INT, -- Use this table column as the bucket column later
> work_place ARRAY<STRING>,
> gender_age STRUCT<gender:STRING,age:INT>,
> skills_score MAP<STRING,INT>,
> depart_title MAP<STRING,ARRAY<STRING>>
> )
> CLUSTERED BY (employee_id) INTO 2 BUCKETS -- Multiple columns are supported
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '|'
> COLLECTION ITEMS TERMINATED BY ','
> MAP KEYS TERMINATED BY ':';
No rows affected (0.104 seconds)
To define the proper number of buckets, we should avoid having too much or too little data in each bucket. A better choice is somewhere near two blocks of data, such as 512 MB of data in each bucket. As a best practice, use 2^N (a power of two) as the number of buckets. For example, for roughly 4 GB of table data, eight buckets (2^3) of about 512 MB each would be a reasonable choice.

Bucketing has a close dependency on the data-loading process. To load data into a bucket table properly, we need to either set the maximum number of reducers to the same as the number of buckets specified in the table creation (for example, 2), or enable enforced bucketing (recommended), as follows:

> set mapred.reduce.tasks = 2;
No rows affected (0.026 seconds)


> set hive.enforce.bucketing = true; -- This is recommended
No rows affected (0.002 seconds)

To populate data into a bucket table, we cannot use the LOAD DATA statement, because it does not verify the data against the metadata. Instead, INSERT should always be used to populate the bucket table:

> INSERT OVERWRITE TABLE employee_id_buckets SELECT * FROM employee_id;
No rows affected (75.468 seconds)

-- Verify the buckets in HDFS from the shell
$hdfs dfs -ls /user/hive/warehouse/employee_id_buckets
Found 2 items
-rwxrwxrwx   1 hive hive   900 2018-07-02 10:54 /user/hive/warehouse/employee_id_buckets/000000_0
-rwxrwxrwx   1 hive hive   582 2018-07-02 10:54 /user/hive/warehouse/employee_id_buckets/000001_0
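Once the buckets are in place, the sampling mentioned earlier becomes straightforward. The following is a minimal sketch (not from the original text) using Hive's TABLESAMPLE clause against the employee_id_buckets table created above; BUCKET 1 OUT OF 2 ON employee_id matches the table's two-bucket definition:

-- Read only the first of the two bucket files instead of scanning the full table
> SELECT name, employee_id FROM employee_id_buckets
> TABLESAMPLE(BUCKET 1 OUT OF 2 ON employee_id);

Because the table is clustered by employee_id, sampling on the same column lets Hive read just one bucket file (such as 000000_0) rather than the whole dataset. Bucket map joins on the same column are covered in Chapter 4, Data Correlation and Scope.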