© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
E. Pollack, Analytics Optimization with Columnstore Indexes in Microsoft SQL Server, https://doi.org/10.1007/978-1-4842-8048-5_11

11. Partitioning

Edward Pollack, Albany, NY, USA

As an analytic table grows in size, it becomes apparent that newer data is read far more often than older data. While columnstore metadata and rowgroup elimination provide the ability to quickly filter out large amounts of columnstore index data, managing a table with millions or billions of rows can become cumbersome.

Equally important is the fact that older data tends to not change often. For a typical fact table, data is added onto the end of it in the order it is created, whereas older data remains untouched. If older data is modified, it is usually the result of software releases, upgrades, or other processes that fall squarely into the world of the data architect (that’s us!) to manage.

Table partitioning is a natural fit for a clustered columnstore index, especially if it is large. Partitioning allows data to be split into multiple filegroups within a single database. These filegroups can be stored in different data files that reside in whatever locations are ideal for the data contained within them. The beauty of partitioning is that the table is logically unchanged, with its physical structure shaped by the details of how it is partitioned. This means that application code does not need to change in order to benefit from it. There are many benefits to partitioning, each of which is described in this chapter.

Maintain Hot/Warm/Cold Data

If older data is used far less often, then it can be stored in separate files that reside on slower storage. For example, reporting data from 10 years ago that is maintained for posterity but rarely used can be placed on inexpensive NAS storage, whereas new data can go on speedy SSD or flash storage.

The details will vary depending on the organization, but Figure 11-1 provides a simple representation of how an analytic table could be tiered to take advantage of different service levels.
Figure 11-1

How partitioning can influence storage speed and cost

Partitioning allows storage to be tiered based on expected SLAs (service-level agreements). For partitions containing new data that is expected to be highly available and have low latency, fast and expensive storage can be used. For partitions containing older data that is rarely accessed, slower and cheaper storage can be used. The details are up to an organization, but the ability to efficiently divide up a large table automatically into different storage tiers is exceptionally useful and can save significant money in the long run.

Faster Data Movement/Migration

Having data divided up physically means that it can be copied and moved with ease. For example, consider a columnstore index with 50 billion rows that consumes 1TB of storage. If there were a need to migrate this table from its current server to a new server with minimal downtime, how would it be done? The simplest solutions involve database backups, copying data files, or using ETL to move the data slowly from one server to the other. All of these options work, but they are time-consuming and require either downtime or making the table read-only once the data movement process starts.

Partitioning allows older/unchanging data to reside in separate data files, which in turn means that those files can be freely backed up/copied to the new server ahead of time. Assuming nothing changes within them, that process can occur days, weeks, or months prior to the migration. ETL or similar processes could be used on the day of the migration to catch up the target database with new data prior to permanently cutting over from the old data source to the new one.

Figure 11-2 illustrates the difference between migrating a large/monolithic table vs. a partitioned one.
Figure 11-2

Migration of a partitioned table vs. a nonpartitioned table

Partitioning makes it possible to subdivide the migration logically, knowing that the physical storage of the table will allow each file to be copied or moved one by one when needed. Instead of moving a terabyte of data all at once or being forced to write ETL against the entire table, the migration can operate on smaller pieces, each of which is easier to move.

Partition Elimination

The logical definitions for each partition are not solely used for storage purposes. They also assist the query optimizer and allow it to automatically skip data when the filter in a query aligns with the partition function. For a query against a columnstore index, an aligned filter allows the metadata for rowgroups outside of the target partitions to be ignored entirely.

For example, if a columnstore index contains data ranging from 2010 through 2021 and is partitioned by year (with a single partition per year), then a query requesting rows from January 2021 would be able to automatically skip all rowgroups in partitions prior to 2021. Figure 11-3 shows how partition elimination can reduce reads and further speed up queries against columnstore indexes.
Figure 11-3

Reading metadata in a partitioned columnstore index

Partition functions are evaluated prior to columnstore metadata; therefore, data in irrelevant partitions is ignored when a query is executed. While columnstore metadata may be relatively lightweight, being able to skip the metadata for thousands of rowgroups can further improve analytic performance while reducing IO and memory consumption.
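This granularity can be observed directly by inspecting rowgroup metadata per partition. The following query is a sketch that assumes a partitioned columnstore index such as the Fact.Sale_CCI_PARTITIONED demonstration table built later in this chapter:

SELECT
       partition_number,
       row_group_id,
       state_desc,
       total_rows
FROM sys.dm_db_column_store_row_group_physical_stats
WHERE object_id = OBJECT_ID('Fact.Sale_CCI_PARTITIONED')
ORDER BY partition_number, row_group_id;

Each rowgroup belongs to exactly one partition, which is what allows whole partitions of rowgroups to be discarded before their metadata is ever read.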

Database Maintenance

Some database maintenance tasks, such as backups and index maintenance, can be executed on a partition-by-partition basis. Since the data in older partitions rarely changes, it requires maintenance less often than newer data, and maintenance can be focused on the specific partitions that need it most. This also means that maintenance can complete far faster by no longer operating on the entire table at one time.

In addition, how data is managed can vary from partition to partition. Some ways in which data can be handled differently depending on the partition include:
  • Compression can be customized by partition. In a columnstore index, archive compression can be applied to older and less used partitions, whereas standard columnstore compression can be the default for newer data.

  • Partitions can be truncated individually, allowing data to be removed from a portion of a table quickly, without having to use a DELETE statement.

  • Partition switching allows data to be swapped in and out of a partitioned table quickly and efficiently. This can greatly speed up migrations, data archival, and other processes where a large volume of data needs to be moved at one time.
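As a sketch of the first two items, partition-level compression and truncation look like the following. The table name assumes the Fact.Sale_CCI_PARTITIONED demonstration table created later in this chapter, with partition 1 holding the oldest data:

-- Apply archive compression to the oldest partition only, leaving newer
-- partitions on standard columnstore compression:
ALTER TABLE Fact.Sale_CCI_PARTITIONED
REBUILD PARTITION = 1
WITH (DATA_COMPRESSION = COLUMNSTORE_ARCHIVE);

-- Remove all rows from the oldest partition without a DELETE statement
-- (supported in SQL Server 2016 and later):
TRUNCATE TABLE Fact.Sale_CCI_PARTITIONED
WITH (PARTITIONS (1));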

Partitioning in Action

To visualize how partitioning works, a new version of the test table Fact.Sale_CCI will be created. This version will be ordered by Invoice Date Key and also partitioned by year. Each partition needs to target a filegroup , which in turn will contain a data file. For this demonstration, a new filegroup and file will be created for each year represented in the table.

Listing 11-1 shows how new filegroups can be created that will be used to store data for the partitioned table.
ALTER DATABASE WideWorldImportersDW ADD FILEGROUP WideWorldImportersDW_2013_fg;
ALTER DATABASE WideWorldImportersDW ADD FILEGROUP WideWorldImportersDW_2014_fg;
ALTER DATABASE WideWorldImportersDW ADD FILEGROUP WideWorldImportersDW_2015_fg;
ALTER DATABASE WideWorldImportersDW ADD FILEGROUP WideWorldImportersDW_2016_fg;
ALTER DATABASE WideWorldImportersDW ADD FILEGROUP WideWorldImportersDW_2017_fg;
Listing 11-1

Script That Creates a New Filegroup for Each Year of Data in the Table

Once executed, the presence of the new database filegroups can be confirmed by checking the Filegroups menu within the database’s properties, as seen in Figure 11-4.
Figure 11-4

New filegroups that will be used to store partitioned data

While the five new filegroups are present in the database, they contain no files. The next step in this process is to add files to these filegroups (one file each). Listing 11-2 contains the code needed to add these files.
ALTER DATABASE WideWorldImportersDW ADD FILE
       (NAME = WideWorldImportersDW_2013_data, FILENAME = 'C:\SQLData\WideWorldImportersDW_2013_data.ndf',
        SIZE = 200MB, MAXSIZE = UNLIMITED, FILEGROWTH = 1GB)
TO FILEGROUP WideWorldImportersDW_2013_fg;
ALTER DATABASE WideWorldImportersDW ADD FILE
       (NAME = WideWorldImportersDW_2014_data, FILENAME = 'C:\SQLData\WideWorldImportersDW_2014_data.ndf',
        SIZE = 200MB, MAXSIZE = UNLIMITED, FILEGROWTH = 1GB)
TO FILEGROUP WideWorldImportersDW_2014_fg;
ALTER DATABASE WideWorldImportersDW ADD FILE
       (NAME = WideWorldImportersDW_2015_data, FILENAME = 'C:\SQLData\WideWorldImportersDW_2015_data.ndf',
        SIZE = 200MB, MAXSIZE = UNLIMITED, FILEGROWTH = 1GB)
TO FILEGROUP WideWorldImportersDW_2015_fg;
ALTER DATABASE WideWorldImportersDW ADD FILE
       (NAME = WideWorldImportersDW_2016_data, FILENAME = 'C:\SQLData\WideWorldImportersDW_2016_data.ndf',
        SIZE = 200MB, MAXSIZE = UNLIMITED, FILEGROWTH = 1GB)
TO FILEGROUP WideWorldImportersDW_2016_fg;
ALTER DATABASE WideWorldImportersDW ADD FILE
       (NAME = WideWorldImportersDW_2017_data, FILENAME = 'C:\SQLData\WideWorldImportersDW_2017_data.ndf',
        SIZE = 200MB, MAXSIZE = UNLIMITED, FILEGROWTH = 1GB)
TO FILEGROUP WideWorldImportersDW_2017_fg;
Listing 11-2

Script to Add Files to Each New Filegroup

For this demonstration, each file has the same size and growth settings, but for a real-world table, these numbers will vary. By inspecting the size of the data that is to be partitioned, the amount of space needed for each year should be relatively easy to calculate. Note that once a table’s data is migrated to a partitioned table, the data files that used to contain its data will now have additional free space that can be reclaimed, if needed. Figure 11-5 shows the new files listed under the Files menu within the database properties.
Figure 11-5

New database files that will be used to store partitioned data

New database files can be placed on any storage available to the server. This is where the table’s data can be customized to meet whatever SLAs it is subject to. For this example, if data from prior to 2016 is rarely accessed, it could be placed into files on slower storage. Similarly, more recent data could be maintained on faster storage to support frequent analytics.

The next step in configuring partitioning is to determine how to slice the table's data into the newly created files. This is achieved using a partition function. When working with a columnstore index, ensure that the partition function uses the same data type as the column the table is ordered by; this convention applies to clustered rowstore indexes as well. If the data type of the partition function does not match the column the table is ordered by, an error will be thrown when the table is created. For this demonstration, the partition function will split up data using the DATE data type, which corresponds to the Invoice Date Key column that Fact.Sale_CCI is ordered by, as seen in Listing 11-3.
CREATE PARTITION FUNCTION fact_Sale_CCI_years_function (DATE)
AS RANGE RIGHT FOR VALUES
       ('1/1/2014', '1/1/2015', '1/1/2016', '1/1/2017');
Listing 11-3

Creation of a Partition Function That Organizes Data by Date

RANGE RIGHT specifies that the boundaries created will be defined using the dates provided as starting points. The result is that data will be divided into five buckets, like this:
  1. All dates < 1/1/2014

  2. Dates >= 1/1/2014 and < 1/1/2015

  3. Dates >= 1/1/2015 and < 1/1/2016

  4. Dates >= 1/1/2016 and < 1/1/2017

  5. Dates >= 1/1/2017
If used, RANGE LEFT would result in date ranges where the inequality operators are adjusted so that boundaries are checked with greater than and less than or equal to, rather than what was presented earlier. If unsure of which to use, consider implementing RANGE RIGHT for time-based dimensions as it is typically a more natural division that cleanly divides up units of months, quarters, years, etc. The following list shows how the boundaries would be defined in the partition function in Listing 11-3 if RANGE LEFT were used:
  1. All dates <= 1/1/2014

  2. Dates > 1/1/2014 and <= 1/1/2015

  3. Dates > 1/1/2015 and <= 1/1/2016

  4. Dates > 1/1/2016 and <= 1/1/2017

  5. Dates > 1/1/2017

This boundary configuration is somewhat counterintuitive for dates, but might be relevant for other data types.
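The $PARTITION function can be used to confirm how a partition function maps values to partition numbers. For the RANGE RIGHT function created in Listing 11-3, this sketch shows where some sample dates land:

SELECT
       $PARTITION.fact_Sale_CCI_years_function('12/31/2013') AS pre_2014_partition, -- 1
       $PARTITION.fact_Sale_CCI_years_function('1/1/2014') AS boundary_partition,   -- 2
       $PARTITION.fact_Sale_CCI_years_function('6/15/2016') AS mid_2016_partition,  -- 4
       $PARTITION.fact_Sale_CCI_years_function('1/1/2017') AS final_partition;      -- 5

Because RANGE RIGHT was used, each boundary date lands in the partition to its right.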

The final step before partitioning can be implemented on a table is to create a partition scheme. A partition scheme is used to map the data boundaries defined in the partition function onto the filegroups defined earlier. This is where it can be decided which date ranges correspond to which physical database files. Listing 11-4 provides a partition scheme that assigns each date range defined by the partition function to one of the five newly created filegroups.
CREATE PARTITION SCHEME fact_Sale_CCI_years_scheme
AS PARTITION fact_Sale_CCI_years_function
TO (WideWorldImportersDW_2013_fg, WideWorldImportersDW_2014_fg, WideWorldImportersDW_2015_fg, WideWorldImportersDW_2016_fg, WideWorldImportersDW_2017_fg);
Listing 11-4

Creation of a Partition Scheme to Assign Dates to Database Filegroups

Note that the partition function will always specify one fewer boundary than there are filegroups. In this example, the partition function provides four dates that define five distinct date ranges, which are subsequently mapped onto the partition scheme and assigned filegroups. If the number of boundaries in the partition function is not exactly one fewer than the number of filegroups in the partition scheme, an error will be returned, similar to what is seen in Figure 11-6.
Figure 11-6

Error received if partition scheme contains too few filegroup entries

The error is verbose enough to remind the user that the function generates more or fewer partitions than the partition scheme can accommodate. By writing out the time periods desired for the target table, the task of generating a partition function and partition scheme becomes far simpler.
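Once created, the boundaries of a partition function can be verified against the intended design by querying the system catalog, as in this sketch against the function from Listing 11-3:

SELECT
       pf.name AS partition_function_name,
       prv.boundary_id,
       prv.value AS boundary_value
FROM sys.partition_functions pf
INNER JOIN sys.partition_range_values prv
ON prv.function_id = pf.function_id
WHERE pf.name = 'fact_Sale_CCI_years_function'
ORDER BY prv.boundary_id;

This should return the four boundary dates provided when the function was created.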

With database files and filegroups defined, as well as a partition function and scheme, the final step to implementing partitioning is to create a table using the newly created partition scheme. Listing 11-5 creates a new version of Fact.Sale and partitions it using the Invoice Date Key column on fact_Sale_CCI_years_scheme.
CREATE TABLE Fact.Sale_CCI_PARTITIONED
(      [Sale Key] [bigint] NOT NULL,
       [City Key] [int] NOT NULL,
       [Customer Key] [int] NOT NULL,
       [Bill To Customer Key] [int] NOT NULL,
       [Stock Item Key] [int] NOT NULL,
       [Invoice Date Key] [date] NOT NULL,
       [Delivery Date Key] [date] NULL,
       [Salesperson Key] [int] NOT NULL,
       [WWI Invoice ID] [int] NOT NULL,
       [Description] [nvarchar](100) NOT NULL,
       [Package] [nvarchar](50) NOT NULL,
       [Quantity] [int] NOT NULL,
       [Unit Price] [decimal](18, 2) NOT NULL,
       [Tax Rate] [decimal](18, 3) NOT NULL,
       [Total Excluding Tax] [decimal](18, 2) NOT NULL,
       [Tax Amount] [decimal](18, 2) NOT NULL,
       [Profit] [decimal](18, 2) NOT NULL,
       [Total Including Tax] [decimal](18, 2) NOT NULL,
       [Total Dry Items] [int] NOT NULL,
       [Total Chiller Items] [int] NOT NULL,
       [Lineage Key] [int] NOT NULL)
       ON fact_Sale_CCI_years_scheme ([Invoice Date Key]);
Listing 11-5

Creation of a New Table That Is Partitioned by Year on a Date Column

The only difference in the definition of a partitioned table vs. a nonpartitioned table is the last line, where an ON clause defines the partition scheme to use and which column will be evaluated to determine which filegroups the data will be stored in. The data type of the partition column and partition function must match or an error will be returned, similar to what is shown in Figure 11-7.
Figure 11-7

Error received if the data type of the partition column and partition function do not match

Data types between the partition function and column must match exactly. DATE and DATETIME are not compatible, nor are other data types that may seem similar. SQL Server does not automatically convert between data types when evaluating partition functions and will instead throw an error when the table is created.

Once created, Fact.Sale_CCI_PARTITIONED is managed in the same fashion as Fact.Sale_CCI_ORDERED was:
  1. A clustered rowstore index is created, ordered by Invoice Date Key.

  2. Data is inserted into the table.

  3. A clustered columnstore index is created, replacing the rowstore index.
The code in Listing 11-6 walks through each of these steps.
CREATE CLUSTERED INDEX CCI_fact_Sale_CCI_PARTITIONED ON Fact.Sale_CCI_PARTITIONED ([Invoice Date Key]);
INSERT INTO Fact.Sale_CCI_PARTITIONED
       ([Sale Key], [City Key], [Customer Key], [Bill To Customer Key], [Stock Item Key], [Invoice Date Key], [Delivery Date Key],
        [Salesperson Key], [WWI Invoice ID], Description, Package, Quantity, [Unit Price], [Tax Rate],
        [Total Excluding Tax], [Tax Amount], Profit, [Total Including Tax], [Total Dry Items],
        [Total Chiller Items], [Lineage Key])
SELECT
       Sale.[Sale Key], Sale.[City Key], Sale.[Customer Key], Sale.[Bill To Customer Key], Sale.[Stock Item Key], Sale.[Invoice Date Key], Sale.[Delivery Date Key],
       Sale.[Salesperson Key], Sale.[WWI Invoice ID], Sale.Description, Sale.Package, Sale.Quantity, Sale.[Unit Price], Sale.[Tax Rate],
       Sale.[Total Excluding Tax], Sale.[Tax Amount], Sale.Profit, Sale.[Total Including Tax], Sale.[Total Dry Items],
       Sale.[Total Chiller Items], Sale.[Lineage Key]
FROM Fact.Sale
CROSS JOIN
Dimension.City
WHERE City.[City Key] >= 1 AND City.[City Key] <= 110;
CREATE CLUSTERED COLUMNSTORE INDEX CCI_Fact_Sale_CCI_PARTITIONED ON Fact.Sale_CCI_PARTITIONED WITH (MAXDOP = 1, DROP_EXISTING = ON);
Listing 11-6

Populating a Partitioned Table with an Ordered Data Set

When complete, two tables will exist for demonstration purposes that are identical, except that one is partitioned and the other is not. Both are ordered by Invoice Date Key and will benefit from rowgroup elimination whenever filtered by that column. Listing 11-7 shows two queries that aggregate Quantity for a single month against each table.
SELECT
       SUM([Quantity])
FROM Fact.Sale_CCI_ORDERED
WHERE [Invoice Date Key] >= '1/1/2016'
AND [Invoice Date Key] < '2/1/2016';
SELECT
       SUM([Quantity])
FROM Fact.Sale_CCI_PARTITIONED
WHERE [Invoice Date Key] >= '1/1/2016'
AND [Invoice Date Key] < '2/1/2016';
Listing 11-7

Narrow Analytic Queries Executed Against Nonpartitioned and Partitioned Columnstore Indexes

While the queries are identical aside from the table name, the STATISTICS IO output illustrates the differences in execution between each, as seen in Figure 11-8.
Figure 11-8

STATISTICS IO for a nonpartitioned columnstore index vs. a partitioned columnstore index

The output of STATISTICS IO shows multiple ways in which query execution varied for each table. The most significant difference is in the reported segment reads. The nonpartitioned table read 2 segments while skipping 22, whereas the partitioned table read 1 segment while skipping 3. This IO reflects the original query that calculates a sum using only rows with an Invoice Date Key within January 2016.

For the table that is ordered and not partitioned, metadata needs to be reviewed from the entire table prior to using rowgroup elimination to skip segments. In the partitioned table, though, rowgroups that do not contain data from 2016 are automatically skipped. Since partition functions are evaluated prior to evaluating columnstore metadata, unrelated rowgroups are never read, including their metadata. This represents partition elimination in action. While partitioning is not intended to be a query optimization solution, it will have a positive impact on columnstore index IO and query speeds, especially on larger tables.
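One way to confirm that the data is distributed across partitions as expected is to group the table by its partition number, as in this sketch:

SELECT
       $PARTITION.fact_Sale_CCI_years_function([Invoice Date Key]) AS partition_number,
       COUNT(*) AS row_count
FROM Fact.Sale_CCI_PARTITIONED
GROUP BY $PARTITION.fact_Sale_CCI_years_function([Invoice Date Key])
ORDER BY partition_number;

A query whose date filter maps to a single partition number will read rowgroups from that partition alone.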

Partitioning also impacts the query optimizer, which evaluates a query across fewer rowgroups and rows, which can simplify execution plans. Figure 11-9 shows the execution plans for the analytic queries demonstrated in Listing 11-7.
Figure 11-9

Query execution plans for a nonpartitioned columnstore index vs. a partitioned columnstore index

The execution plan for the nonpartitioned table is more complex as SQL Server determines that parallelism may help in processing the large number of rows. The execution plan for the partitioned table is simpler, as the optimizer chooses to not use parallelism. Note that the hash match supports aggregate pushdown of the SUM operator into the columnstore index scan. Also interesting to note is that the count of rows read in the execution plan is lower on a partitioned table (vs. a nonpartitioned table) when fewer partitions need to be read.

While the output of each query is identical and each execution plan produces the same result set, the ability to forgo parallelism saves computing resources, as is implied by the greatly reduced query cost for the partitioned table.

In addition to partition elimination, index maintenance can be adjusted to skip older, less updated partitions. For example, if a rebuild was deemed necessary for this columnstore index to clean it up after some recent software releases, the nonpartitioned table would need to be rebuilt en masse, as seen in Listing 11-8.
ALTER INDEX CCI_fact_Sale_CCI_ORDERED ON Fact.Sale_CCI_ORDERED REBUILD;
Listing 11-8

Query to Rebuild a Columnstore Index

The rebuild operation takes 61 seconds to complete. For a larger columnstore index, it could be significantly longer. If the portion of data impacted by the software release is limited to newer data only, it is very likely that only the most current partition needs to be rebuilt. Listing 11-9 shows how an index rebuild can be executed against a single partition.
ALTER INDEX CCI_fact_Sale_CCI_PARTITIONED ON Fact.Sale_CCI_PARTITIONED REBUILD PARTITION = 5;
Listing 11-9

Query to Rebuild the Current Partition in a Columnstore Index

This rebuild only takes 1 second. This is because the most recent partition does not contain a full year of data. To provide a fairer assessment of rebuild times, Listing 11-10 shows a rebuild operation against a partition that contains a full year of data.
ALTER INDEX CCI_fact_Sale_CCI_PARTITIONED ON Fact.Sale_CCI_PARTITIONED REBUILD PARTITION = 4;
Listing 11-10

Query to Rebuild a Full Partition in a Columnstore Index

This rebuild operation requires 11 seconds to complete. Each partition rebuild is significantly faster than being forced to rebuild the entire index. This can greatly reduce the performance impact of index maintenance, when it is needed.
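To target maintenance at the partitions that need it most, deleted-row counts can be aggregated per partition; partitions with a high proportion of deleted rows are the strongest rebuild candidates. A sketch:

SELECT
       partition_number,
       SUM(total_rows) AS total_rows,
       SUM(deleted_rows) AS deleted_rows
FROM sys.dm_db_column_store_row_group_physical_stats
WHERE object_id = OBJECT_ID('Fact.Sale_CCI_PARTITIONED')
GROUP BY partition_number
ORDER BY partition_number;

Older partitions will typically show zero deleted rows and can be skipped by maintenance processes entirely.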

Another unique feature that can be demonstrated is the use of the SWITCH option to move the contents of a partition from one table to another using a single DDL operation. Because the operation is a DDL metadata operation and not a fully logged data movement process, it will be significantly faster. This opens up a variety of options for speedy data load and maintenance processes that otherwise would be prohibitively expensive.

For example, if it were determined that all data in Fact.Sale_CCI_PARTITIONED had an error starting on 1/1/2016, correcting it would be a challenge if left in place. Updating a columnstore index is a slow and expensive process that results in heavy fragmentation and is therefore worth avoiding if possible.

An alternative to this would be to switch out the partition, modify the data, and then reinsert the data back into the table. This allows the data that is to be modified to be isolated prior to making any changes. The script in Listing 11-11 creates a new staging table with the same schema as Fact.Sale_CCI_PARTITIONED that will be used as the target for a partition switch.
CREATE TABLE Fact.Sale_CCI_STAGING
(      [Sale Key] [bigint] NOT NULL,
       [City Key] [int] NOT NULL,
       [Customer Key] [int] NOT NULL,
       [Bill To Customer Key] [int] NOT NULL,
       [Stock Item Key] [int] NOT NULL,
       [Invoice Date Key] [date] NOT NULL,
       [Delivery Date Key] [date] NULL,
       [Salesperson Key] [int] NOT NULL,
       [WWI Invoice ID] [int] NOT NULL,
       [Description] [nvarchar](100) NOT NULL,
       [Package] [nvarchar](50) NOT NULL,
       [Quantity] [int] NOT NULL,
       [Unit Price] [decimal](18, 2) NOT NULL,
       [Tax Rate] [decimal](18, 3) NOT NULL,
       [Total Excluding Tax] [decimal](18, 2) NOT NULL,
       [Tax Amount] [decimal](18, 2) NOT NULL,
       [Profit] [decimal](18, 2) NOT NULL,
       [Total Including Tax] [decimal](18, 2) NOT NULL,
       [Total Dry Items] [int] NOT NULL,
       [Total Chiller Items] [int] NOT NULL,
              [Lineage Key] [int] NOT NULL)
ON WideWorldImportersDW_2016_fg;
CREATE CLUSTERED COLUMNSTORE INDEX CCI_fact_Sale_CCI_STAGING ON Fact.Sale_CCI_STAGING;
Listing 11-11

Script to Create a Staging Table for Use in a Partition Switching Operation

Note that the staging table is created on the same filegroup as the source data that is to be switched. If desired, the staging table could instead be created on the same partition scheme as Fact.Sale_CCI_PARTITIONED, which would allow a partition number to be specified on both sides of the partition switch rather than explicitly providing a filegroup in the CREATE TABLE statement. Either approach is valid; the choice comes down to which is more convenient for the operator to implement.
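For reference, if the staging table were created on the same partition scheme as the source, the switch would name a partition on both sides, as in this hypothetical sketch:

-- Assumes Fact.Sale_CCI_STAGING was created ON fact_Sale_CCI_years_scheme
-- rather than on a single filegroup:
ALTER TABLE Fact.Sale_CCI_PARTITIONED
SWITCH PARTITION 4 TO Fact.Sale_CCI_STAGING PARTITION 4;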

Once the staging table is created, a partition on the 2016 filegroup can be switched from the source table into the staging table using the script in Listing 11-12.
ALTER TABLE Fact.Sale_CCI_PARTITIONED SWITCH PARTITION 4 TO Fact.Sale_CCI_STAGING;
Listing 11-12

Script to Switch a Partition from a Partitioned Columnstore Index into a Staging Table

The partition switch executes in milliseconds as it is a minimally logged DDL operation. The results can be measured by validating the row count in the staging table, as shown in Listing 11-13.
SELECT COUNT(*) FROM Fact.Sale_CCI_STAGING;
Listing 11-13

Validating the Row Count in the Staging Table After the Partition Switch Operation Completes

The results of the count operation are shown in Figure 11-10.
Figure 11-10

Count of rows in the newly populated staging table

The result of the count query shows that over 3 million rows were switched from Fact.Sale_CCI_PARTITIONED to the staging table. From here, data can be modified in the staging table as needed to resolve whatever the identified issues are. Once the data is cleaned up, it can be inserted back into the original columnstore index, ensuring that the resulting data is clean with no fragmentation. Listing 11-14 shows a sample process for how this data modification could occur.
UPDATE Fact.Sale_CCI_STAGING
       SET [Total Dry Items] = [Total Dry Items] + 1;
INSERT INTO Fact.Sale_CCI_PARTITIONED
       ([Sale Key], [City Key], [Customer Key], [Bill To Customer Key], [Stock Item Key], [Invoice Date Key], [Delivery Date Key], [Salesperson Key],
     [WWI Invoice ID], Description, Package, Quantity, [Unit Price], [Tax Rate], [Total Excluding Tax], [Tax Amount], Profit, [Total Including Tax],
     [Total Dry Items], [Total Chiller Items], [Lineage Key])
SELECT
       [Sale Key], [City Key], [Customer Key], [Bill To Customer Key], [Stock Item Key], [Invoice Date Key], [Delivery Date Key], [Salesperson Key],
     [WWI Invoice ID], Description, Package, Quantity, [Unit Price], [Tax Rate], [Total Excluding Tax], [Tax Amount], Profit, [Total Including Tax],
     [Total Dry Items], [Total Chiller Items], [Lineage Key]
FROM Fact.Sale_CCI_STAGING;
DROP TABLE Fact.Sale_CCI_STAGING;
Listing 11-14

Example of Data Modification Using Staging Data. Data Is Moved Back into the Parent Table Using an INSERT Operation

The update is an expensive operation, but by isolating it to a staging table, the contention of the operation will not impact the larger columnstore index. Once the update is complete, the data can be inserted back into the partitioned table and the staging table dropped.

Alternatively, the INSERT operation can be replaced with another partition switch. The code in Listing 11-15 shows an alternative to the code in Listing 11-14.
UPDATE Fact.Sale_CCI_STAGING
       SET [Total Dry Items] = [Total Dry Items] + 1;
ALTER TABLE Fact.Sale_CCI_STAGING SWITCH TO Fact.Sale_CCI_PARTITIONED PARTITION 4;
DROP TABLE Fact.Sale_CCI_STAGING;
Listing 11-15

Example of Data Modification Using Staging Data. Data Is Moved Back into the Parent Table Using Partition Switching

This alternative code will be significantly faster as the partition switch is a speedy, metadata-only DDL operation, whereas the large INSERT in Listing 11-14 needs to incur the cost of writing all of this data back to Fact.Sale_CCI_PARTITIONED. Note that switching data into a partitioned table requires the target partition to be empty and a trusted constraint on the staging table that bounds its data to the target partition's range.

Partition switching is a versatile tool that can be used in a wide variety of data modification, archival, and deletion scenarios. Its use will vary between applications, but provides a fast way to shift data in and out of tables without the fragmentation and expense of a large write operation.

Partitioning Guidelines

While partitioning may sound like a win for any columnstore index, it should not be automatically implemented without research and foresight. Partitioning works well in larger tables, but does not provide value to small columnstore indexes. Therefore, capacity planning should be a step in the decision-making process to determine whether partitioning is a good fit or not for any given table.

Partition and Rowgroup Sizing

Because partitioning is applied to a table before rowgroups are created, partition boundaries will force rowgroups to be cut off. For example, if a nonpartitioned columnstore index contains 5,000,000 rows, those rows will likely reside in 5 rowgroups. If a new version of that table were created and partitioned on a date column that evenly spanned 10 years, then the result would be 500,000 rows per partition.

Rowgroups cannot span partitions, though. The new partitioned table would contain ten partitions, each of which consists of a single rowgroup. The result is a table with ten rowgroups instead of five.

Generally speaking, partitioning is not appropriate for a columnstore index unless the table contains at least tens or hundreds of millions of rows. Equally important, partitions need to be large enough to allow rowgroups to fill completely. Therefore, the partition function and partition scheme should ensure that each partition contains at least 2^20 (1,048,576) rows. Undersized partitions will result in undersized rowgroups, which will lead to fragmentation over time.

The larger a columnstore index, the more it will benefit from partitioning. Aligning partition boundaries to organizational needs can also help in determining how to implement partitioning on a columnstore index. If an organization archives a billion rows of data each quarter, then partitioning by quarter makes perfect sense. If instead, the archival is organized by year, then partitioning by year would be more relevant.

Partition Column Choice

The column chosen for the partition function should be the same column that the columnstore index is ordered by. Partitioning and rowgroup elimination work best when they operate on the same data sets and that can only be accomplished when they use the same sort criteria. Therefore, if a columnstore index is ordered by a particular date column, then partitioning should also be configured on that column. Partitioning by another column may result in worse performance as SQL Server needs to scan more rowgroups across more partitions to retrieve the data it needs.

Storage Choice

If data within a columnstore index is accessed differently, the data files used for the partitioned table can mirror that usage. If older data is rarely accessed and can tolerate more latency, then it can be moved onto slower/less expensive hardware. If newer data needs to be highly available with minimal latency, then it can be placed on faster and more expensive storage.

As a result, a large table can be split up to save money. Every terabyte of data that lands on cheaper storage represents cost savings that can be easily quantified. While partitioning’s primary purpose is not to save money, the ability to shift workloads via the strategic placement of data files can achieve this without significant effort.

When building a partitioned table, identify if the table has data that follows different usage patterns and assign data files based on that usage to slower or faster storage, when possible. If performance requirements change with time, data files can be moved between storage locations to ensure that SLAs are still met, even if data that was once rarely needed becomes critical to frequent analytic processes.

Additional Benefits

One of the greatest benefits of table partitioning is that it requires no application code changes. All partitioning structures are internal to SQL Server and have no bearing on an application beyond the performance experienced as the table is accessed. This also means that partitioning can be tested out for analytic data and kept or rolled back depending on the results of that testing. A “rollback” would consist of creating a second copy of data and swapping it into the production location, but is a reasonable process with analytic data where data load processes are well within the confines of data architecture. This allows partitioning to be tested with minimal impact on the code that developers write to consume this data.

Partitioning is an optional step when implementing a columnstore index, but can improve maintenance, speed up analytic workloads, and potentially save money. This feature should be targeted at larger tables with at least tens of millions of rows and that are expected to grow rapidly over time.
