As an analytic table grows in size, it becomes apparent that newer data is read far more often than older data. While columnstore metadata and rowgroup elimination provide the ability to quickly filter out large amounts of columnstore index data, managing a table with millions or billions of rows can become cumbersome.
Equally important is the fact that older data tends to not change often. For a typical fact table, data is added onto the end of it in the order it is created, whereas older data remains untouched. If older data is modified, it is usually the result of software releases, upgrades, or other processes that fall squarely into the world of the data architect (that’s us!) to manage.
Table partitioning is a natural fit for a clustered columnstore index, especially if it is large. Partitioning allows data to be split into multiple filegroups within a single database. These filegroups can be stored in different data files that reside in whatever locations are ideal for the data contained within them. The beauty of partitioning is that the table is logically unchanged, with only its physical structure influenced by how it is partitioned. This means that application code does not need to change in order to benefit from it. There are many benefits to partitioning, each of which is described in this chapter.
Maintain Hot/Warm/Cold Data
If older data is used far less often, then it can be stored in separate files that reside on slower storage. For example, reporting data from 10 years ago that is maintained for posterity but rarely used can be placed on inexpensive NAS storage, whereas new data can go on speedy SSD or flash storage.
Partitioning allows storage to be tiered based on expected SLAs (service-level agreements). For partitions containing new data that is expected to be highly available and have low latency, fast and expensive storage can be used. For partitions containing older data that is rarely accessed, slower and cheaper storage can be used. The details are up to an organization, but the ability to efficiently divide up a large table automatically into different storage tiers is exceptionally useful and can save significant money in the long run.
Faster Data Movement/Migration
Having data divided up physically means that it can be copied and moved with ease. For example, consider a columnstore index with 50 billion rows that consumes 1TB of storage. If there were a need to migrate this table from its current server to a new server with minimal downtime, how would it be done? The simplest solutions involve database backups, copying data files, or using ETL to move the data slowly from one server to the other. All of these options work, but each is time-consuming and requires incurring some downtime or making the table read-only once the data movement process starts.
Partitioning allows older/unchanging data to reside in separate data files, which in turn means that those files can be freely backed up/copied to the new server ahead of time. Assuming nothing changes within them, that process can occur days, weeks, or months prior to the migration. ETL or similar processes could be used on the day of the migration to catch up the target database with new data prior to permanently cutting over from the old data source to the new one.
Partitioning opens up the ability to subdivide the migration logically, knowing that the physical storage of the table facilitates copying or moving each file one by one when needed. Instead of having to move a terabyte of data all at once or being forced to write ETL against the entire table, the table can be subdivided into smaller pieces, each of which is easier to move.
Partition Elimination
The logical definitions for each partition are not solely used for storage purposes. They also assist the query optimizer and allow it to automatically skip data when the filter in a query aligns with the partition function. For a query against a columnstore index, its filter can allow the metadata for rowgroups outside of the target partition to be ignored.
Partition functions are evaluated prior to columnstore metadata; therefore, data in irrelevant partitions is ignored when a query is executed. While columnstore metadata may be relatively lightweight, being able to skip the metadata for thousands of rowgroups can further improve analytic performance while reducing IO and memory consumption.
Database Maintenance
Some database maintenance tasks, such as backups and index maintenance, can be executed on a partition-by-partition basis. Since the data in older partitions rarely changes, it requires maintenance less often than newer data. Therefore, maintenance can be focused on the specific partitions that need it most. This also means that maintenance can complete far faster by no longer needing to operate on the entire table at one time.
Compression can be customized by partition. In a columnstore index, archive compression can be applied to older and less used partitions, whereas standard columnstore compression can be the default for newer data.
Partitions can be truncated individually, allowing data to be removed from a portion of a table quickly, without having to use a DELETE statement.
Partition switching allows data to be swapped in and out of a partitioned table quickly and efficiently. This can greatly speed up migrations, data archival, and other processes where a large volume of data needs to be moved at one time.
Partitioning in Action
To visualize how partitioning works, a new version of the test table Fact.Sale_CCI will be created. This version will be ordered by Invoice Date Key and also partitioned by year. Each partition needs to target a filegroup, which in turn will contain a data file. For this demonstration, a new filegroup and file will be created for each year represented in the table.
Script That Creates a New Filegroup for Each Year of Data in the Table
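A minimal sketch of such a script, assuming the database is named WideWorldImportersDW and the filegroup names shown here (both are illustrative):

```sql
-- Create one filegroup per year of data; the 2013 filegroup will also
-- hold anything older than 2014. Names are illustrative.
ALTER DATABASE WideWorldImportersDW ADD FILEGROUP WideWorldImportersDW_2013_fg;
ALTER DATABASE WideWorldImportersDW ADD FILEGROUP WideWorldImportersDW_2014_fg;
ALTER DATABASE WideWorldImportersDW ADD FILEGROUP WideWorldImportersDW_2015_fg;
ALTER DATABASE WideWorldImportersDW ADD FILEGROUP WideWorldImportersDW_2016_fg;
ALTER DATABASE WideWorldImportersDW ADD FILEGROUP WideWorldImportersDW_2017_fg;
```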
Script to Add Files to Each New Filegroup
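A sketch of the file-creation step, assuming the filegroup names from the previous step; the file paths, sizes, and growth settings are illustrative and would be tuned to the target storage:

```sql
-- Add one data file to each new filegroup. The path chosen for each file
-- determines which storage tier that year's data lands on.
ALTER DATABASE WideWorldImportersDW ADD FILE
    (NAME = WideWorldImportersDW_2013_data,
     FILENAME = 'C:\SQLData\WideWorldImportersDW_2013_data.ndf',
     SIZE = 256MB, FILEGROWTH = 256MB)
TO FILEGROUP WideWorldImportersDW_2013_fg;
-- Repeat for 2014 through 2017, targeting each corresponding filegroup.
```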
New database files can be placed on any storage available to the server. This is where the table’s data can be customized to meet whatever SLAs it is subject to. For this example, if data from prior to 2016 is rarely accessed, it could be placed into files on slower storage. Similarly, more recent data could be maintained on faster storage to support frequent analytics.
Creation of a Partition Function That Organizes Data by Date
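A sketch of the partition function, with four boundary values producing five partitions; the function name is an assumption:

```sql
-- RANGE RIGHT places each boundary value into the partition to its right,
-- which is the intuitive choice for dates: 1/1/2014 belongs with 2014's data.
CREATE PARTITION FUNCTION fact_Sale_CCI_years_function (DATE)
AS RANGE RIGHT FOR VALUES ('1/1/2014', '1/1/2015', '1/1/2016', '1/1/2017');
```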
Created with RANGE RIGHT, this function splits data into five partitions:

1. All dates < 1/1/2014
2. Dates >= 1/1/2014 and < 1/1/2015
3. Dates >= 1/1/2015 and < 1/1/2016
4. Dates >= 1/1/2016 and < 1/1/2017
5. Dates >= 1/1/2017

Had the function been created with RANGE LEFT instead, each boundary value would fall into the partition to its left:

1. All dates <= 1/1/2014
2. Dates > 1/1/2014 and <= 1/1/2015
3. Dates > 1/1/2015 and <= 1/1/2016
4. Dates > 1/1/2016 and <= 1/1/2017
5. Dates > 1/1/2017
This boundary configuration is somewhat counterintuitive for dates, but might be relevant for other data types.
Creation of a Partition Scheme to Assign Dates to Database Filegroups
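A sketch of the partition scheme, assuming the function and filegroup names used above; note that five partitions require exactly five filegroup targets:

```sql
-- Map each of the five partitions produced by the function, in order,
-- to one of the five yearly filegroups.
CREATE PARTITION SCHEME fact_Sale_CCI_years_scheme
AS PARTITION fact_Sale_CCI_years_function
TO (WideWorldImportersDW_2013_fg, WideWorldImportersDW_2014_fg,
    WideWorldImportersDW_2015_fg, WideWorldImportersDW_2016_fg,
    WideWorldImportersDW_2017_fg);
```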
The error is verbose enough to remind the user that the function generates more or fewer partitions than the partition scheme provides filegroups for. By writing out the time periods desired for the target table, the task of generating a partition function and partition scheme becomes far simpler.
Creation of a New Table That Is Partitioned by Year on a Date Column
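A trimmed sketch of the table definition; only two columns are shown, and the remaining Fact.Sale columns would be carried over unchanged. The key detail is the ON clause, which binds the table to the partition scheme on Invoice Date Key:

```sql
CREATE TABLE Fact.Sale_CCI_PARTITIONED
(   [Sale Key] BIGINT NOT NULL,
    [Invoice Date Key] DATE NOT NULL
    -- remaining Fact.Sale columns omitted for brevity
)
ON fact_Sale_CCI_years_scheme ([Invoice Date Key]);
```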
Data types between the partition function and column must match exactly. DATE and DATETIME are not compatible, nor are other data types that may seem similar. SQL Server does not automatically convert between data types when evaluating partition functions and will instead throw an error when the table is created.
1. A clustered rowstore index is created, ordered by Invoice Date Key.
2. Data is inserted into the table.
3. A clustered columnstore index is created, replacing the rowstore index.
Populating a Partitioned Table with an Ordered Data Set
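The three steps above can be sketched as follows, assuming the object names used so far in this example:

```sql
-- Step 1: a clustered rowstore index orders the data by Invoice Date Key.
CREATE CLUSTERED INDEX CCI_fact_Sale_CCI_PARTITIONED
ON Fact.Sale_CCI_PARTITIONED ([Invoice Date Key])
ON fact_Sale_CCI_years_scheme ([Invoice Date Key]);

-- Step 2: load the data.
INSERT INTO Fact.Sale_CCI_PARTITIONED
SELECT * FROM Fact.Sale;

-- Step 3: replace the rowstore index with a clustered columnstore index,
-- preserving the order established in step 1.
CREATE CLUSTERED COLUMNSTORE INDEX CCI_fact_Sale_CCI_PARTITIONED
ON Fact.Sale_CCI_PARTITIONED
WITH (DROP_EXISTING = ON)
ON fact_Sale_CCI_years_scheme ([Invoice Date Key]);
```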
Narrow Analytic Queries Executed Against Nonpartitioned and Partitioned Columnstore Indexes
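A sketch of the comparison queries; the aggregate on the Quantity column is an assumption about the test query, but the filter matches the January 2016 range described below:

```sql
SET STATISTICS IO ON;

-- Same narrow aggregate against the nonpartitioned table...
SELECT SUM([Quantity]) FROM Fact.Sale_CCI
WHERE [Invoice Date Key] >= '1/1/2016' AND [Invoice Date Key] < '2/1/2016';

-- ...and against the partitioned table.
SELECT SUM([Quantity]) FROM Fact.Sale_CCI_PARTITIONED
WHERE [Invoice Date Key] >= '1/1/2016' AND [Invoice Date Key] < '2/1/2016';
```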
The output of STATISTICS IO shows multiple ways in which query execution varied for each table. The most significant difference is in the reported segment reads. The nonpartitioned table read 2 segments while skipping 22, whereas the partitioned table read 1 segment while skipping 3. This IO reflects the original query that calculates a sum using only rows with an Invoice Date Key within January 2016.
For the table that is ordered and not partitioned, metadata needs to be reviewed from the entire table prior to using rowgroup elimination to skip segments. In the partitioned table, though, rowgroups that do not contain data from 2016 are automatically skipped. Since partition functions are evaluated prior to evaluating columnstore metadata, unrelated rowgroups are never read, including their metadata. This represents partition elimination in action. While partitioning is not intended to be a query optimization solution, it will have a positive impact on columnstore index IO and query speeds, especially on larger tables.
The execution plan for the nonpartitioned table is more complex as SQL Server determines that parallelism may help in processing the large number of rows. The execution plan for the partitioned table is simpler, as the optimizer chooses to not use parallelism. Note that the use of a hash match is to support the aggregate pushdown for the SUM operator into the columnstore index scan step. Also interesting to note is that the count of rows read in the execution plan is lower for the partitioned table (vs. the nonpartitioned table) when fewer partitions need to be read.
While the output of each query is identical and each execution plan produces the same result set, the ability to forgo parallelism saves computing resources, as is implied by the greatly reduced query cost for the partitioned table.
Query to Rebuild a Columnstore Index
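A sketch of a full index rebuild, assuming the index name used earlier:

```sql
-- Rebuilds every partition of the columnstore index at once.
ALTER INDEX CCI_fact_Sale_CCI_PARTITIONED
ON Fact.Sale_CCI_PARTITIONED
REBUILD;
```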
Query to Rebuild the Current Partition in a Columnstore Index
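A sketch of a targeted rebuild that uses the $PARTITION function to locate whichever partition contains today's date, then rebuilds only that partition:

```sql
-- Resolve the partition number containing the current date.
DECLARE @partition_number INT =
    $PARTITION.fact_Sale_CCI_years_function(CAST(GETDATE() AS DATE));

-- Rebuild only that partition, leaving the rest of the index untouched.
ALTER INDEX CCI_fact_Sale_CCI_PARTITIONED
ON Fact.Sale_CCI_PARTITIONED
REBUILD PARTITION = @partition_number;
```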
Query to Rebuild a Full Partition in a Columnstore Index
This rebuild operation requires 11 seconds to complete. Each partition rebuild is significantly faster than being forced to rebuild the entire index. This can greatly reduce the performance impact of index maintenance, when it is needed.
Another unique feature that can be demonstrated is the use of the SWITCH option to move the contents of a partition from one table to another using a single DDL operation. Because the operation is a DDL metadata operation and not a fully logged data movement process, it will be significantly faster. This opens up a variety of options for speedy data load and maintenance processes that otherwise would be prohibitively expensive.
For example, if it were determined that all data in Fact.Sale_CCI_PARTITIONED had an error starting on 1/1/2016, correcting it would be a challenge if left in place. Updating a columnstore index is a slow and expensive process that results in heavy fragmentation and is therefore worth avoiding if possible.
Script to Create a Staging Table for Use in a Partition Switching Operation
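A trimmed sketch of the staging table; its columns must match Fact.Sale_CCI_PARTITIONED exactly, and it is created on the 2016 filegroup because that is where the partition being switched out resides:

```sql
-- Staging table mirrors the source schema and lives on the same filegroup
-- as partition 4 (the 2016 data).
CREATE TABLE Fact.Sale_CCI_STAGING
(   [Sale Key] BIGINT NOT NULL,
    [Invoice Date Key] DATE NOT NULL
    -- remaining columns must match the source table exactly
)
ON WideWorldImportersDW_2016_fg;

-- The staging table also needs a matching clustered columnstore index.
CREATE CLUSTERED COLUMNSTORE INDEX CCI_fact_Sale_CCI_STAGING
ON Fact.Sale_CCI_STAGING;
```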
Note that the staging table is created on the same filegroup as the source data that is to be switched. If desired, the staging table could be created using the same partition scheme as Fact.Sale_CCI_PARTITIONED, which would allow the partition number to be specified in the partition switch, rather than having to explicitly provide a filegroup in the table create statement. Either syntax is valid; the choice can be left to whichever the operator finds easier to implement.
Script to Switch a Partition from a Partitioned Columnstore Index into a Staging Table
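A sketch of the switch itself; under the RANGE RIGHT function defined earlier, partition 4 holds the 2016 data:

```sql
-- A metadata-only operation: partition 4's rowgroups are reassigned
-- to the staging table rather than physically copied.
ALTER TABLE Fact.Sale_CCI_PARTITIONED
SWITCH PARTITION 4 TO Fact.Sale_CCI_STAGING;
```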
Validating the Row Count in the Staging Table After the Partition Switch Operation Completes
Example of Data Modification Using Staging Data. Data Is Moved Back into the Parent Table Using an INSERT Operation
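A sketch of this approach; the correction applied here is purely illustrative, standing in for whatever fix the hypothetical 2016 data error requires:

```sql
-- Fix the bad data in isolation, away from the partitioned table.
UPDATE Fact.Sale_CCI_STAGING
SET [Quantity] = [Quantity] + 1;  -- illustrative correction only

-- Move the corrected rows back with a fully logged INSERT.
INSERT INTO Fact.Sale_CCI_PARTITIONED
SELECT * FROM Fact.Sale_CCI_STAGING;

DROP TABLE Fact.Sale_CCI_STAGING;
```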
The update is an expensive operation, but by isolating it to a staging table, it causes no contention in the larger columnstore index. Once the update is complete, the data can be inserted back into the partitioned table and the staging table dropped.
Example of Data Modification Using Staging Data. Data Is Moved Back into the Parent Table Using Partition Switching
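A sketch of the switch-based alternative. Note that for the switch in to succeed, SQL Server requires a trusted CHECK constraint on the staging table's Invoice Date Key column guaranteeing that all rows fall within partition 4's boundaries:

```sql
-- Constraint proving all staged rows belong in partition 4 (2016).
ALTER TABLE Fact.Sale_CCI_STAGING
ADD CONSTRAINT CK_Sale_CCI_STAGING_2016
CHECK ([Invoice Date Key] >= '1/1/2016' AND [Invoice Date Key] < '1/1/2017');

-- Switch the corrected data back in as a metadata-only operation.
ALTER TABLE Fact.Sale_CCI_STAGING
SWITCH TO Fact.Sale_CCI_PARTITIONED PARTITION 4;

DROP TABLE Fact.Sale_CCI_STAGING;
```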
This alternative code will be significantly faster, as the partition switch is a speedy, minimally logged DDL operation, whereas the large INSERT in Listing 11-14 incurs the cost of writing all of this data back into Fact.Sale_CCI_PARTITIONED.
Partition switching is a versatile tool that can be used in a wide variety of data modification, archival, and deletion scenarios. Its use will vary between applications, but provides a fast way to shift data in and out of tables without the fragmentation and expense of a large write operation.
Partitioning Guidelines
While partitioning may sound like a win for any columnstore index, it should not be automatically implemented without research and foresight. Partitioning works well in larger tables, but does not provide value to small columnstore indexes. Therefore, capacity planning should be a step in the decision-making process to determine whether partitioning is a good fit or not for any given table.
Partition and Rowgroup Sizing
Because partitioning is applied to a table before rowgroups are created, partition boundaries will force rowgroups to be cut off. For example, if a nonpartitioned columnstore index contains 5,000,000 rows, those rows will likely reside in 5 rowgroups. If a new version of that table were created and partitioned on a date column that evenly spanned 10 years, then the result would be 500,000 rows per partition.
Rowgroups cannot span partitions, though. The new partitioned table would contain ten partitions, each of which consists of a single rowgroup. The result is a table with ten rowgroups instead of five.
Generally speaking, partitioning is not appropriate for a columnstore index unless the table contains at least tens or hundreds of millions of rows. Equally important, partitions need to be large enough to facilitate fully populated rowgroups. Therefore, the partition function and partition scheme need to ensure that each partition contains at least 2^20 (1,048,576) rows. Undersized partitions will result in undersized rowgroups, which will lead to fragmentation over time.
The larger a columnstore index, the more it will benefit from partitioning. Aligning partition boundaries to organizational needs can also help in determining how to implement partitioning on a columnstore index. If an organization archives a billion rows of data each quarter, then partitioning by quarter makes perfect sense. If instead, the archival is organized by year, then partitioning by year would be more relevant.
Partition Column Choice
The column chosen for the partition function should be the same column that the columnstore index is ordered by. Partitioning and rowgroup elimination work best when they operate on the same data sets and that can only be accomplished when they use the same sort criteria. Therefore, if a columnstore index is ordered by a particular date column, then partitioning should also be configured on that column. Partitioning by another column may result in worse performance as SQL Server needs to scan more rowgroups across more partitions to retrieve the data it needs.
Storage Choice
If data within a columnstore index is accessed differently, the data files used for the partitioned table can mirror that usage. If older data is rarely accessed and can tolerate more latency, then it can be moved onto slower/less expensive hardware. If newer data needs to be highly available with minimal latency, then it can be placed on faster and more expensive storage.
As a result, a large table can be split up to save money. Every terabyte of data that lands on cheaper storage represents cost savings that can be easily quantified. While partitioning’s primary purpose is not to save money, the ability to shift workloads via the strategic placement of data files can achieve this without significant effort.
When building a partitioned table, identify if the table has data that follows different usage patterns and assign data files based on that usage to slower or faster storage, when possible. If performance requirements change with time, data files can be moved between storage locations to ensure that SLAs are still met, even if data that was once rarely needed becomes critical to frequent analytic processes.
Additional Benefits
One of the greatest benefits of table partitioning is that it requires no application code changes. All partitioning structures are internal to SQL Server and have no bearing on an application beyond the performance experienced as the table is accessed. This also means that partitioning can be tested out for analytic data and kept or rolled back depending on the results of that testing. A "rollback" would consist of creating a second copy of the data and swapping it into the production location; this is a reasonable process for analytic data, where load processes sit squarely within the control of the data architecture team. This allows partitioning to be tested with minimal impact on the code that developers write to consume this data.
Partitioning is an optional step when implementing a columnstore index, but can improve maintenance, speed up analytic workloads, and potentially save money. This feature should be targeted at larger tables with at least tens of millions of rows and that are expected to grow rapidly over time.