Partition table design

Hive partitioning is one of the most effective ways to improve query performance on larger tables. A query with partition filtering will only load data from the specified partitions (sub-directories), so it can execute much faster than a normal query that filters by a non-partitioning field. The selection of the partition key is always an important factor for performance. It should always be a low-cardinal attribute to avoid so many sub-directories overhead. The following are some attributes commonly used as partition keys:

Partitions by date and time: Use date and time, such as year, month, and day (even hours), as partition keys when data is associated with the date/time columns, such as load_date, business_date, run_date, and so on
Partitions by location: Use country, territory, state, and city as partition keys when data is location related
Partitions by business logic: Use department, sales region, applications, customers, and so on as partition keys when data can be separated evenly by business logic

Table of Contents for Partition table design

Create new playlist

Sign In

Sign Up

Table of Contents for
Partition table design