© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
E. PollackAnalytics Optimization with Columnstore Indexes in Microsoft SQL Serverhttps://doi.org/10.1007/978-1-4842-8048-5_14

14. Columnstore Index Maintenance

Edward Pollack1  
(1)
Albany, NY, USA
 

Depending on its usage, a columnstore index may require no maintenance at all, infrequent maintenance, or regular maintenance to ensure optimal storage, resource consumption, and performance.

Understanding how and when to use index maintenance will ensure that analytic queries continue to perform well both when a columnstore index is first created and years into the future.

What Causes Fragmentation?

There are three scenarios that can be used to describe fragmentation in columnstore indexes:
  1. 1.

    Wasted space due to deleted rows

     
  2. 2.

    Data that is inserted out of order

     
  3. 3.

    Multiple delta rowgroups present in the index

     

When data is deleted in a columnstore index, no rows are actually removed. The cost to decompress, delete rows, and recompress rowgroups is prohibitively expensive at runtime. Instead, SQL Server flags the rows as deleted using the delete bitmap , and when queries access that data, they use the bitmap to identify any rows that are deleted and need to be skipped.

Chapter 9 provides extensive detail as to how DELETE and UPDATE operations perform on columnstore indexes and outlines ways of managing them to avoid persistent performance challenges and minimize fragmentation. The net impact of deletion on columnstore indexes will be
  • Space consumed by deleted data.

  • More data needs to be read than is needed at runtime.

  • Data is spread over more rowgroups than necessary.

The other cause of fragmentation is out-of-order data. A columnstore index should always be built and maintained with data that is ordered by a key column (or columns) that are used most frequently in filtering and aggregating queries.

Chapter 10 describes the importance of data order in columnstore indexes and how maintaining a data order that lines up with common query filters and aggregation can allow for exceptionally fast query performance.

Data that is inserted regularly in the desired columnstore index order will facilitate rowgroup elimination, thus allowing vast amounts of the index to be automatically skipped when filtered using the same column(s). When rows are inserted out of order, this forces more data to be read than should be necessary. Similarly, since an UPDATE operation is executed as the combination of a DELETE and INSERT, updating any older data in a columnstore index will result in it being reinserted at the tail end of the index alongside newer data.

Consider the example columnstore index in Figure 14-1, which is ordered by Order_Date.
Figure 14-1

A columnstore index that is ordered by the date dimension

This columnstore index is pristinely ordered by Order Year, with each set of rowgroups containing rows in ascending order by date. If a query is executed that calculates metrics for 2016, it can read the first ten rowgroups while ignoring the other 140.

Figure 14-2 shows the result of an UPDATE operation that alters data for 100,000 rows in 2016, 50,000 rows in 2017, and 25,000 rows in 2018.
Figure 14-2

Changes in rowgroup structure after an UPDATE operation

When rows were updated, they remained in the columnstore index in their previous locations, but were flagged as deleted. A new rowgroup was created that contains the new versions of the deleted rows. The result is that there is now an open rowgroup that contains data for 2016, 2017, and 2018. In addition, when new data is inserted into the table, it will also be added to the open rowgroup, resulting in another year’s worth of data being crammed into a single rowgroup.

Going forward, any query that requires data from 2016, 2017, 2018, or the current year will need to read this unordered rowgroup. If updates like this are common, then older rowgroups will quickly become logjammed with deleted rows, while newer rowgroups become unordered messes of data from many dates. The result will be wasted storage, wasted memory, and slower queries that need to read far more data than should be needed to return results.

Unordered inserts will have a similar impact as the INSERT portion of an UPDATE statement. Inserting data into the columnstore index in Figure 14-1 from 2015 would result in new rowgroups that contain both new and old data. Unordered inserts will, over a long period of time, result in the inability of SQL Server to take advantage of rowgroup elimination as more and more rowgroups contain data from all different time periods.

Delta rowgroups are a key part of columnstore index architecture and ensure that write operations can occur as quickly as possible. They slow down reads slightly, though, as rows reside in a b-tree structure and are not compressed with columnstore compression. The impact of the delta store on read performance is not significant, but administrators interested in maximizing columnstore read performance would be interested in compressing them as soon as possible after a data load process completes.

How Much Fragmentation Is Too Much?

Like with rowstore indexes, fragmentation does not require attention until it becomes significant enough to impact performance and waste a tangible amount of space. A columnstore index with 5 billion rows of which 1,000 are deleted should not be seen as problematic. Alternatively, if one billion of those rows were deleted, then the resulting fragmentation would be worth addressing.

Fragmentation should be broken out into two distinct calculations, of which each can be evaluated clinically and acted on appropriately. One is the percentage of a table that is comprised of deleted rows, and the other is a measure of unordered data within a columnstore index.

Quantifying Deleted Rows

Deleted rows are included in the view sys.column_store_row_groups and are relatively easy to report on. The query in Listing 14-1 returns a row per rowgroup within a single clustered columnstore index.
SELECT
       tables.name AS table_name,
       indexes.name AS index_name,
       partitions.partition_number,
       column_store_row_groups.row_group_id,
       column_store_row_groups.total_rows,
       column_store_row_groups.deleted_rows
FROM sys.column_store_row_groups
INNER JOIN sys.indexes
ON indexes.index_id = column_store_row_groups.index_id
AND indexes.object_id = column_store_row_groups.object_id
INNER JOIN sys.tables
ON tables.object_id = indexes.object_id
INNER JOIN sys.partitions
ON partitions.partition_number = column_store_row_groups.partition_number
AND partitions.index_id = indexes.index_id
AND partitions.object_id = tables.object_id
WHERE tables.name = 'Sale_CCI'
ORDER BY indexes.index_id, column_store_row_groups.row_group_id;
Listing 14-1

Query to Return Deleted Rows per Rowgroup

The results are shown in Figure 14-3.
Figure 14-3

Deleted row counts per rowgroup for a columnstore index

This detail shows how many rows are deleted per rowgroup. The results can be aggregated to show deleted rows per partition or for the entire table. If a table is partitioned, then knowing if the deleted rows exist only in one partition or all of them is useful for determining if all or only some partitions require attention.

In the example results outlined in Figure 14-3, about 5% of the columnstore index is comprised of deleted rows that are spread across the index relatively evenly. As a rough guideline, there is no urgent need for index maintenance to address this unless the percentage of deleted rows is at least 10% of the total rows in a partition or in the table.

Keep in mind that the performance impact of deleted rows is gradual over time. There will never be a scenario where a threshold is reached in which performance suddenly plummets. Therefore, automating index maintenance to occur when deleted rows exceed some set percentage is an effective way to avoid accidentally allowing an index to become absurdly fragmented .

Detailing Unordered Data

Understanding the effectiveness of ordered data is easy to see via querying segment metadata, but can be challenging to quantify. The query in Listing 14-2 shows how to retrieve the segment minimum and maximum data IDs for a given columnstore index and a single column.
SELECT
       tables.name AS table_name,
       indexes.name AS index_name,
       columns.name AS column_name,
       partitions.partition_number,
       column_store_segments.segment_id,
       column_store_segments.min_data_id,
       column_store_segments.max_data_id,
       column_store_segments.row_count
FROM sys.column_store_segments
INNER JOIN sys.partitions
ON column_store_segments.hobt_id = partitions.hobt_id
INNER JOIN sys.indexes
ON indexes.index_id = partitions.index_id
AND indexes.object_id = partitions.object_id
INNER JOIN sys.tables
ON tables.object_id = indexes.object_id
INNER JOIN sys.columns
ON tables.object_id = columns.object_id
AND column_store_segments.column_id = columns.column_id
WHERE tables.name = 'Sale_CCI'
AND columns.name = 'Invoice Date Key'
ORDER BY tables.name, columns.name, column_store_segments.segment_id;
Listing 14-2

Query to Retrieve Min/Max Data IDs for a Given Column in a Columnstore Index

The results of this query provide useful insight into the data order within this index, as shown in Figure 14-4.
Figure 14-4

Segment minimum/maximum value metadata for an unordered columnstore index

In the metadata, note that the values for min_data_id and max_data_id are the same for each rowgroup. This means that queries that filter on the column Invoice Date Key will be forced to scan the entire table to return results as any value could be found in any rowgroup.

Consider an ordered version of this table, where the results are as seen in Figure 14-5.
Figure 14-5

Segment minimum/maximum value metadata for an ordered columnstore index

This version of the table shows a progression of values for min_data_id and max_data_id that increases as the segment_id increases. Because each segment contains a distinct grouping of column values, this metadata can be effectively used to skip any rowgroups that contain values that are irrelevant to a query. The query in Listing 14-3 returns a complete list of all segments in a columnstore index that have any overlapping values. The inequalities are not inclusive as it is common that the start and end values in a rowgroup will overlap those in the next rowgroup.
WITH CTE_SEGMENTS AS (
       SELECT
              tables.name AS table_name,
              indexes.name AS index_name,
              columns.name AS column_name,
              partitions.partition_number,
              column_store_segments.segment_id,
              column_store_segments.min_data_id,
              column_store_segments.max_data_id,
              column_store_segments.row_count
       FROM sys.column_store_segments
       INNER JOIN sys.partitions
       ON column_store_segments.hobt_id = partitions.hobt_id
       INNER JOIN sys.indexes
       ON indexes.index_id = partitions.index_id
       AND indexes.object_id = partitions.object_id
       INNER JOIN sys.tables
       ON tables.object_id = indexes.object_id
       INNER JOIN sys.columns
       ON tables.object_id = columns.object_id
       AND column_store_segments.column_id = columns.column_id
       WHERE tables.name = 'Sale_CCI_ORDERED'
       AND columns.name = 'Invoice Date Key')
SELECT
       CTE_SEGMENTS.table_name,
       CTE_SEGMENTS.index_name,
       CTE_SEGMENTS.column_name,
       CTE_SEGMENTS.partition_number,
       CTE_SEGMENTS.segment_id,
       CTE_SEGMENTS.min_data_id,
       CTE_SEGMENTS.max_data_id,
       CTE_SEGMENTS.row_count,
       OVERLAPPING_SEGMENT.partition_number AS overlapping_partition_number,
       OVERLAPPING_SEGMENT.segment_id AS overlapping_segment_id,
       OVERLAPPING_SEGMENT.min_data_id AS overlapping_min_data_id,
       OVERLAPPING_SEGMENT.max_data_id AS overlapping_max_data_id
FROM CTE_SEGMENTS
INNER JOIN CTE_SEGMENTS OVERLAPPING_SEGMENT
ON (OVERLAPPING_SEGMENT.min_data_id > CTE_SEGMENTS.min_data_id
AND OVERLAPPING_SEGMENT.min_data_id < CTE_SEGMENTS.max_data_id)
OR (OVERLAPPING_SEGMENT.max_data_id > CTE_SEGMENTS.min_data_id
AND OVERLAPPING_SEGMENT.max_data_id < CTE_SEGMENTS.max_data_id)
OR (OVERLAPPING_SEGMENT.min_data_id < CTE_SEGMENTS.min_data_id
AND OVERLAPPING_SEGMENT.max_data_id > CTE_SEGMENTS.max_data_id)
ORDER BY CTE_SEGMENTS.partition_number, CTE_SEGMENTS.segment_id
Listing 14-3

Query to Return a List of Overlapping Segments

This query evaluates the boundaries for the minimum and maximum value of one column within each rowgroup and determines if any other rowgroups overlap those values.

The list returned by the query in Figure 14-6 may appear long at first glance, but it is important to note that any columnstore index that has been the target of UPDATE operations or unordered inserts will have entries here. Looking at the data returned, it can be seen that the out-of-order data is spread somewhat evenly across segments, with each segment containing 1–4 other segments that overlap at least one value with it.
Figure 14-6

List of overlapping values within rowgroups for the Invoice Date Key column

While there is no precise way to measure the percentage of unordered data in the same way that it was possible to measure the percentage of rows in a columnstore index that are deleted, it is possible to gauge how effectively data order impacts query performance by performing metadata tests using COUNT(*) queries against the columnstore index. This could be done for every date in the table, which would result in a very thorough experiment. For the sake of a simple demonstration, eight sample dates will be chosen at random to test, as given by the query in Listing 14-4.
SELECT
        [Invoice Date Key],
       COUNT(*) AS Sale_Count
FROM Fact.Sale_CCI_ORDERED
WHERE [Invoice Date Key] IN ('5/1/2013', '9/5/2013', '1/17/2014', '6/30/2014', '3/14/2015', '12/12/2015', '1/1/2016', '2/29/2016')
GROUP BY [Invoice Date Key]
ORDER BY [Invoice Date Key]
Listing 14-4

Sample Dates to Test How Ordered Data Is in a Columnstore Index

The results in Figure 14-7 provide row counts for each data.
Figure 14-7

List of sample dates for use in testing the effectiveness of columnstore data order

Each data point chosen contains at most 0.1% of the data in the table as there are 25,109,150 total rows, of which these encompass eight of its dates. Based on the row counts and table size, an ideal ordered table would only require reading 1–2 rowgroups to retrieve data for any of those given dates. The query in Listing 14-5 executes separate COUNT(*) queries for each date identified earlier.
SET NOCOUNT ON;
SELECT COUNT(*) FROM Fact.Sale_CCI_ORDERED WHERE [Invoice Date Key] = '5/1/2013';
SELECT COUNT(*) FROM Fact.Sale_CCI_ORDERED WHERE [Invoice Date Key] = '9/5/2013';
SELECT COUNT(*) FROM Fact.Sale_CCI_ORDERED WHERE [Invoice Date Key] = '1/17/2014';
SELECT COUNT(*) FROM Fact.Sale_CCI_ORDERED WHERE [Invoice Date Key] = '6/30/2014';
SELECT COUNT(*) FROM Fact.Sale_CCI_ORDERED WHERE [Invoice Date Key] = '3/14/2015';
SELECT COUNT(*) FROM Fact.Sale_CCI_ORDERED WHERE [Invoice Date Key] = '12/12/2015';
SELECT COUNT(*) FROM Fact.Sale_CCI_ORDERED WHERE [Invoice Date Key] = '1/1/2016';
SELECT COUNT(*) FROM Fact.Sale_CCI_ORDERED WHERE [Invoice Date Key] = '2/29/2016';
Listing 14-5

COUNT Queries to Test Rowgroup Reads

Figure 14-8 shows the STATISTICS IO output for each preceding query.
Figure 14-8

STATISTICS IO output for COUNT queries in Listing 14-5

For each of the eight COUNT(*) sample queries, 3–4 rowgroups are read in order to retrieve the count. Based on the knowledge that these queries should read 1–2 rowgroups, it can be deduced that queries are generally reading two to three times more rowgroups than is necessary. Depending on how often updates and unordered inserts occur on the table, this may be acceptable or it may be unusual. Knowledge of the table’s usage helps in understanding how extreme these numbers are and if reading three rowgroups instead of one is normal or worthy of attention.

Testing rowgroups read and skipped using STATISTICS IO is an effective way to gauge how ordered a columnstore index is. If there is a desire to be complete about this test and use count queries against many (or all) dates in the table, consider treating them as maintenance and performing that research at a predetermined time when running a lengthy maintenance script is acceptable.

The No-Maintenance Columnstore Index

An ideal columnstore index rarely requires maintenance. This is possible when the following are true:
  • Rows are never deleted.

  • Rows are never updated.

  • Rows are always inserted in order.

If those three conditions can be met, then a columnstore index can be spared of nearly all maintenance. Whether any is performed is up to the whim of its administrator. Realistically, the only suboptimal scenario that can arise when the columnstore index is not the target of deletes, updates, or unordered inserts is that rowgroups may be undersized due to the use of the delta store to process small INSERT operations.

The impact of undersized rowgroups resulting from delta store usage is minor and should not be seen as an urgent problem in need of an immediate resolution. In these scenarios, waiting for infrequent maintenance periods to perform columnstore index maintenance would be more than effective enough to ensure that undersized rowgroups are merged effectively. Quarterly or yearly is likely often enough.

Now that the causes of fragmentation have been thoroughly detailed, using index maintenance operations to resolve fragmentation can be discussed.

Columnstore Reorganize

The fastest and simplest operation available for a columnstore index is the REORGANIZE. This is an online operation that accomplishes the following tasks:
  • Combines undersized rowgroups via a columnstore index merge. This occurs when multiple rowgroups can combine to fit into a single new rowgroup (less than 1,024,576 or 220 rows).

  • May remove deleted rows via a self-merge. This only happens when more than 102,400 rows in a rowgroup are deleted.

If both of these tasks apply to a set of rowgroups, then a merge operation will be prioritized over a self-merge. Rowgroups that were trimmed due to dictionary pressure cannot be combined with other rowgroups, regardless of their row counts.

To demonstrate the merge and self-merge operations that can be used by a columnstore REORGANIZE operation, a large set of rows will be deleted from a columnstore index, as seen in Listing 14-6
DELETE
FROM Fact.Sale_CCI_ORDERED
WHERE [Invoice Date Key] <= '1/17/2013';
Listing 14-6

Query to Delete a Large Portion of a Rowgroup

Figure 14-9 shows the resulting set of deleted rows within the columnstore index’s rowgroup metadata using the same query as provided in Listing 14-1.
Figure 14-9

Deleted rows per rowgroup for a columnstore index

The second column to the right shows total rows in the rowgroup, whereas the rightmost column provides the deleted row count. Of the rowgroups with deleted rows, only rowgroups 23 and 24 have more than 102,400 rows deleted and would qualify for a self-merge operation. These rowgroups are also valid targets for a columnstore merge operation as they can be combined, with the resulting rowgroup containing less than the row cap (220 rows) for a columnstore rowgroup.

The syntax for a columnstore REORGANIZE operation is shown in Listing 14-7.
ALTER INDEX CCI_fact_Sale_CCI_ORDERED ON Fact.Sale_CCI_ORDERED REORGANIZE;
Listing 14-7

Syntax for a Columnstore Reorganize Operation

After the REORGANIZE operation is complete, the rowgroup metadata can be reviewed again, as shown in Figure 14-10.
Figure 14-10

Rowgroup metadata following an index REORGANIZE operation

Note that rowgroups 23 and 24 are now flagged as TOMBSTONE and will be cleaned up by the tuple mover at some point in the near future. Two new rowgroups were created (25 and 26) that replace them, with the deleted rows removed. The self-merge operation essentially creates new rowgroups, copies all nondeleted rows into them, and swaps them in as the active rowgroups while the previous versions are flagged for cleanup. The resulting rowgroups are free of the burden of deleted rows. Remember that the self-merge only occurs when more than 102,400 rows in a rowgroup are deleted.

After a minute passes, the rowgroup metadata confirms that the rowgroups labeled as TOMBSTONE are now removed from the columnstore index, as seen in Figure 14-11. This is an automatic cleanup process that requires no operator intervention to trigger.
Figure 14-11

Rowgroup metadata after the tuple mover removes TOMBSTONE rowgroups

Rowgroups can only be combined if the reason they are undersized is not related to dictionary pressure. Figure 14-12 shows additional metadata from sys.dm_db_column_store_row_group_physical_stats for this columnstore index.
Figure 14-12

Rowgroup metadata for a columnstore index, including rowgroup trim reason

Because rowgroup 25 was trimmed due to dictionary pressure, it cannot be combined with rowgroup 26, even though their combined row counts could fit them into a single rowgroup.

Reorganize to Remove Delta Rowgroups

An additional option is available when issuing a columnstore index REORGANIZE command: the COMPRESS_ALL_ROW_GROUPS option. When used, SQL Server will initialize the tuple mover to process the contents of the delta store and move them into compressed rowgroups. This effectively empties the delta store, moving all data into compressed rowgroups.

The upside of this operation is that it ensures faster read operations, as there is no need to read the rowstore structure of the delta store when processing queries. The downside is that it is an additional maintenance option that requires time and resources to execute. Consider the INSERT operation shown in Listing 14-8.
INSERT INTO Fact.Sale_CCI_ORDERED
    ([Sale Key], [City Key], [Customer Key], [Bill To Customer Key], [Stock Item Key], [Invoice Date Key], [Delivery Date Key],
     [Salesperson Key], [WWI Invoice ID], Description, Package, Quantity, [Unit Price], [Tax Rate], [Total Excluding Tax], [Tax Amount],
     Profit, [Total Including Tax], [Total Dry Items], [Total Chiller Items], [Lineage Key])
VALUES
(   6769, 69490, 0, 0, 26, '2013-02-10', '2013-02-11', 36, 2081, 'Coffee Mug', 'Each', 17, 12.42, 8.00, 211.14, 16.89, 75.00, 228.03,
       17, 0, 11);
Listing 14-8

Small INSERT Operation into a Columnstore Index

One row is inserted into the columnstore index. Rowgroup metadata can confirm the new row that resides in an open delta rowgroup, as seen in Figure 14-13, using the same query as in Listing 14-1.
Figure 14-13

Rowgroup metadata for a newly inserted row into a delta rowgroup

With a single row in the delta store, a REORGANIZE operation will be run using the COMPRESS_ALL_ROW_GROUPS option, as seen in Listing 14-9.
ALTER INDEX CCI_fact_Sale_CCI_ORDERED ON Fact.Sale_CCI_ORDERED REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON);
Listing 14-9

REORGANIZE Using the COMPRESS_ALL_ROW_GROUPS option

The resulting columnstore metadata after the index maintenance is shown in Figure 14-14.
Figure 14-14

Rowgroup metadata immediately after index maintenance

The metadata shows a single row in a compressed rowgroup, and the old delta store set to TOMBSTONE, awaiting cleanup. Figure 14-15 shows the rowgroup metadata after the TOMBSTONE rowgroup is cleaned up.
Figure 14-15

Rowgroup metadata after garbage collection

A compressed rowgroup of a single row is silly, so another REORGANIZE will be executed against the columnstore index. The resulting metadata can be seen in Figure 14-16.
Figure 14-16

Rowgroup metadata after undersized rowgroups are combined

Rowgroup 25 remains undersized due to dictionary pressure, but rowgroups 26 and 28 have been combined into the newly formed rowgroup 29. The trim reason description allows an operator to understand that this rowgroup was created and populated via an index REORGANIZE operation.

Index REORGANIZE operations are an excellent way to combine undersized rowgroups, assuming they do not suffer from dictionary size limitations. They can also remove deleted rows, assuming the number of deleted rows is greater than 102,400. Lastly, with the COMPRESS_ALL_ROW_GROUPS option, an index REORGANIZE can process the delta store contents into compressed rowgroups, which can later be combined via further REORGANIZE operations, if desired.

For a columnstore index that is not the target of UPDATE operations and does not experience out-of-order inserts, using index REORGANIZE operations will handle most fragmentation effectively and preclude the need for other regular index maintenance tasks. The simplest application of index maintenance is to run an index REORGANIZE operation after each data load and include the COMPRESS_ALL_ROW_GROUPS option. This will ensure that rowgroups remain relatively full and that delta stores are processed and not slowing down columnstore queries. For a stickler that wants perfection, a second REORGANIZE operation can be executed to further combine rowgroups resulting from processed delta stores. These are online, fast, and effective ways to manage fragmentation on an ongoing basis as a columnstore index grows via regular data load processes.

Columnstore Rebuild

Rebuilding a columnstore index functions similarly to rebuilding a rowstore index. When a REBUILD is issued, a completely new copy of the columnstore index is created, replacing the old index. While an expensive process, the results are
  • Rowgroups that are filled to capacity, whenever possible.

  • All deleted rows are eliminated.

  • All delta stores are processed and compressed.

  • Vertipaq optimization is applied, even if it previously was not.

In essence, a REBUILD is roughly equivalent to creating a columnstore index anew using a CREATE INDEX statement. The only difference is that the existing index structure will remain intact until the new index is built. For ONLINE rebuilds, this ensures that queries can continue to use the columnstore index, even as a rebuild operation is running.

Rebuilding a clustered columnstore index can only be accomplished as an online operation starting in SQL Server 2019. Nonclustered indexes can be rebuilt online regardless of SQL Server version.

Consider the columnstore index that has been tested recently in this chapter. The T-SQL in Listing 14-10 issues a REBUILD against that index.
ALTER INDEX CCI_fact_Sale_CCI_ORDERED ON Fact.Sale_CCI_ORDERED REBUILD;
Listing 14-10

REBUILD Statement Against a Clustered Columnstore Index

A REBUILD operation takes significantly longer than a REORGANIZE as the entire index must be re-created in its entirety. Alternatively, a REORGANIZE follows specific rules to determine what work to do and will usually only operate on a small portion of a columnstore index. Figure 14-17 shows a columnstore index after a REBUILD operation completes.
Figure 14-17

Rowgroup metadata after a columnstore index REBUILD

After a REBUILD, the columnstore index has no deleted rows, and rowgroups are mostly full. Oddly enough, the last two rowgroups (23 and 24) are undersized residuals and can be cleaned up via an index REORGANIZE operation. The metadata in Figure 14-18 shows the final results after the index is subject to an additional REORGANIZE after the REBUILD.
Figure 14-18

Rowgroup metadata after a columnstore index REBUILD and REORGANIZE

Finally, the index is pristine, with 23 rowgroups that are completely full and 1 additional rowgroup that is leftover from the index maintenance operations.

Index REBUILD operations should be used infrequently to manage one of a few scenarios:
  • Undersized rowgroups accumulated over a long period of time that cannot be combined via REORGANIZE operations.

  • Extensive deleted rows that number less than 102,400 rows per rowgroup and cannot be addressed by REORGANIZE operations.

  • Large amounts of out-of-order data exist in the columnstore index. This must be addressed in conjunction with work to reorder data, as shown in the next section.

Note that when an index REBUILD is issued, the compression type may be changed to or from columnstore archive compression, if needed. Because an index REBUILD is an expensive operation and is offline prior to SQL Server 2019, care should be taken to execute rebuilds during maintenance windows when such work is more tolerable to processes that use this data.

Sometimes, data can become heavily fragmented via software releases that make significant changes to the underlying schema and data. If this is anticipated, then scheduling an index REBUILD as a postprocessing step after the software release would be an excellent way to ensure that data continues to be efficiently delivered to analytic processes, even after a disruptive software release.

REBUILD operations may target a specific partition, thus allowing only data that is heavily fragmented to be rebuilt. For a large columnstore index in which only a small portion is actively written to, this is an excellent way to speed up rebuilds and minimize disruptions to the users of analytic data.

Columnstore Reorder and Rebuild

Clustered columnstore indexes do not enforce data order. It is up to the data architect that designs these tables to determine how to order data and to ensure that order is enforced via any processes that write to the table. The only way to effectively resolve the challenge of unordered columnstore data (without building new tables and structures) is to perform the following tasks:
  1. 1.

    Drop the columnstore index.

     
  2. 2.

    Create a clustered rowstore index with key columns that match the data order for the table.

     
  3. 3.

    Create a new clustered columnstore index with the DROP_EXISTING=ON option.

     

The clustered rowstore index is used to enforce a new data order on the contents of the table, whereas the new columnstore index replaces it, retaining the new data order.

This is an expensive and disruptive process that will result in analytic queries being unable to take advantage of the columnstore index from the point when the index is dropped until the new columnstore index is completely built. Therefore, it is a worthwhile process to implement during a maintenance window when causing trouble is more acceptable.

The script in Listing 14-11 performs these operations to reorder data from its current state to data that is ordered perfectly by Invoice Date Key.
/*     Step 1 */
DROP INDEX CCI_fact_Sale_CCI_ORDERED ON Fact.Sale_CCI_ORDERED;
/*     Step 2 */
CREATE CLUSTERED INDEX CCI_fact_Sale_CCI_ORDERED ON Fact.Sale_CCI_ORDERED ([Invoice Date Key]);
/*     Step 3 */
CREATE CLUSTERED COLUMNSTORE INDEX CCI_fact_Sale_CCI_ORDERED ON Fact.Sale_CCI_ORDERED WITH (DROP_EXISTING=ON, MAXDOP=1);
Listing 14-11

Process to Fix Poorly Ordered Data in a Clustered Columnstore Index

Note that dropping a clustered columnstore index is not a speedy operation. It requires converting the columnstore structure into a b-tree/heap structure, which requires more computing resources than dropping a clustered rowstore index.

The process to fix data order on a columnstore index should be reserved for scenarios where fragmentation is high enough to negatively impact analytic processes and should be an infrequent operation.

If data becomes unordered quickly in a columnstore index, then consider ways to alter how data load processes operate to reduce fragmentation via those data loads. The following are a few helpful guidelines for doing so:
  • Remove all UPDATE operations on columnstore indexes. Perform updates on staging tables instead (when possible).

  • Avoid inserting data out of order. Use temporary or staging tables to ensure that data is ordered correctly prior to inserting new data.

Despite its complexity, the process to fix unordered data can be executed on a partition-by-partition basis, thus allowing only active partitions containing warm/hot data to be targeted with expensive index maintenance operations. For a columnstore index with many partitions, this can save considerable time and reduce disruption and downtime to analytic processes.

Columnstore Index Maintenance by Partition

The syntax to rebuild a specific partition of a columnstore index is shown in Listing 14-12.
ALTER INDEX CCI_fact_Sale_CCI_PARTITIONED ON Fact.Sale_CCI_PARTITIONED REBUILD PARTITION = 5;
Listing 14-12

Query to Rebuild a Single Partition of a Columnstore Index on a Partitioned Table

In this example, if partition number 5 is identified as the only partition containing data that is being loaded/modified regularly, then rebuilding only it will save significant maintenance time as the other partitions can be skipped.

Index Maintenance in Nonclustered Columnstore Indexes

Nonclustered columnstore indexes are afforded less flexibility with regard to index maintenance. Data order is enforced by the clustered rowstore index, and therefore the quality of data order within a columnstore index is prescribed by its corresponding clustered rowstore index. If the table’s clustered rowstore index happens to be the ordering column(s) for the columnstore index and that table is not subject to UPDATE operations, then the nonclustered columnstore index will maintain data order effectively.

Otherwise, index maintenance for nonclustered columnstore indexes will operate similarly to maintenance on clustered columnstore indexes:
  • Use REORGANIZE operations to combine undersized rowgroups, process the delta store, and eliminate larger groups of deleted rows.

  • Use REBUILD operations to eliminate undersized rowgroups and eliminate unordered data due to UPDATE operations.

The primary difference here is that a REBUILD can eliminate out-of-order data resulting from UPDATE operations. When rebuilt, a nonclustered columnstore index will be created using the order prescribed by the clustered rowstore index, which is not negatively impacted by UPDATE operations in the same way as a columnstore index.

Both REORGANIZE and REBUILD operations are available as online operations for nonclustered indexes, providing more flexibility when trying to schedule recurring (or one-time) maintenance. This means that real-time operational analytics that target a nonclustered columnstore index can continue efficiently, even as that index is being rebuilt.

Like with clustered columnstore indexes, maintenance on nonclustered columnstore indexes can be issued against any or all partitions, allowing active data to be targeted with maintenance while unchanging data can be skipped.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.217.182.45