Depending on its usage, a columnstore index may require no maintenance at all, infrequent maintenance, or regular maintenance to ensure optimal storage, resource consumption, and performance.
Understanding how and when to use index maintenance will ensure that analytic queries continue to perform well both when a columnstore index is first created and years into the future.
What Causes Fragmentation?
- 1.
Wasted space due to deleted rows
- 2.
Data that is inserted out of order
- 3.
Multiple delta rowgroups present in the index
When data is deleted in a columnstore index, no rows are actually removed. The cost to decompress, delete rows, and recompress rowgroups is prohibitively expensive at runtime. Instead, SQL Server flags the rows as deleted using the delete bitmap , and when queries access that data, they use the bitmap to identify any rows that are deleted and need to be skipped.
Space consumed by deleted data.
More data needs to be read than is needed at runtime.
Data is spread over more rowgroups than necessary.
The other cause of fragmentation is out-of-order data. A columnstore index should always be built and maintained with data that is ordered by a key column (or columns) that are used most frequently in filtering and aggregating queries.
Chapter 10 describes the importance of data order in columnstore indexes and how maintaining a data order that lines up with common query filters and aggregation can allow for exceptionally fast query performance.
Data that is inserted regularly in the desired columnstore index order will facilitate rowgroup elimination, thus allowing vast amounts of the index to be automatically skipped when filtered using the same column(s). When rows are inserted out of order, this forces more data to be read than should be necessary. Similarly, since an UPDATE operation is executed as the combination of a DELETE and INSERT, updating any older data in a columnstore index will result in it being reinserted at the tail end of the index alongside newer data.
This columnstore index is pristinely ordered by Order Year, with each set of rowgroups containing rows in ascending order by date. If a query is executed that calculates metrics for 2016, it can read the first ten rowgroups while ignoring the other 140.
When rows were updated, they remained in the columnstore index in their previous locations, but were flagged as deleted. A new rowgroup was created that contains the new versions of the deleted rows. The result is that there is now an open rowgroup that contains data for 2016, 2017, and 2018. In addition, when new data is inserted into the table, it will also be added to the open rowgroup, resulting in another year’s worth of data being crammed into a single rowgroup.
Going forward, any query that requires data from 2016, 2017, 2018, or the current year will need to read this unordered rowgroup. If updates like this are common, then older rowgroups will quickly become logjammed with deleted rows, while newer rowgroups become unordered messes of data from many dates. The result will be wasted storage, wasted memory, and slower queries that need to read far more data than should be needed to return results.
Unordered inserts will have a similar impact as the INSERT portion of an UPDATE statement. Inserting data into the columnstore index in Figure 14-1 from 2015 would result in new rowgroups that contain both new and old data. Unordered inserts will, over a long period of time, result in the inability of SQL Server to take advantage of rowgroup elimination as more and more rowgroups contain data from all different time periods.
Delta rowgroups are a key part of columnstore index architecture and ensure that write operations can occur as quickly as possible. They slow down reads slightly, though, as rows reside in a b-tree structure and are not compressed with columnstore compression. The impact of the delta store on read performance is not significant, but administrators interested in maximizing columnstore read performance would be interested in compressing them as soon as possible after a data load process completes.
How Much Fragmentation Is Too Much?
Like with rowstore indexes, fragmentation does not require attention until it becomes significant enough to impact performance and waste a tangible amount of space. A columnstore index with 5 billion rows of which 1,000 are deleted should not be seen as problematic. Alternatively, if one billion of those rows were deleted, then the resulting fragmentation would be worth addressing.
Fragmentation should be broken out into two distinct calculations, of which each can be evaluated clinically and acted on appropriately. One is the percentage of a table that is comprised of deleted rows, and the other is a measure of unordered data within a columnstore index.
Quantifying Deleted Rows
Query to Return Deleted Rows per Rowgroup
This detail shows how many rows are deleted per rowgroup. The results can be aggregated to show deleted rows per partition or for the entire table. If a table is partitioned, then knowing if the deleted rows exist only in one partition or all of them is useful for determining if all or only some partitions require attention.
In the example results outlined in Figure 14-3, about 5% of the columnstore index is comprised of deleted rows that are spread across the index relatively evenly. As a rough guideline, there is no urgent need for index maintenance to address this unless the percentage of deleted rows is at least 10% of the total rows in a partition or in the table.
Keep in mind that the performance impact of deleted rows is gradual over time. There will never be a scenario where a threshold is reached in which performance suddenly plummets. Therefore, automating index maintenance to occur when deleted rows exceed some set percentage is an effective way to avoid accidentally allowing an index to become absurdly fragmented .
Detailing Unordered Data
Query to Retrieve Min/Max Data IDs for a Given Column in a Columnstore Index
In the metadata, note that the values for min_data_id and max_data_id are the same for each rowgroup. This means that queries that filter on the column Invoice Date Key will be forced to scan the entire table to return results as any value could be found in any rowgroup.
Query to Return a List of Overlapping Segments
This query evaluates the boundaries for the minimum and maximum value of one column within each rowgroup and determines if any other rowgroups overlap those values.
Sample Dates to Test How Ordered Data Is in a Columnstore Index
COUNT Queries to Test Rowgroup Reads
For each of the eight COUNT(*) sample queries, 3–4 rowgroups are read in order to retrieve the count. Based on the knowledge that these queries should read 1–2 rowgroups, it can be deduced that queries are generally reading two to three times more rowgroups than is necessary. Depending on how often updates and unordered inserts occur on the table, this may be acceptable or it may be unusual. Knowledge of the table’s usage helps in understanding how extreme these numbers are and if reading three rowgroups instead of one is normal or worthy of attention.
Testing rowgroups read and skipped using STATISTICS IO is an effective way to gauge how ordered a columnstore index is. If there is a desire to be complete about this test and use count queries against many (or all) dates in the table, consider treating them as maintenance and performing that research at a predetermined time when running a lengthy maintenance script is acceptable.
The No-Maintenance Columnstore Index
Rows are never deleted.
Rows are never updated.
Rows are always inserted in order.
If those three conditions can be met, then a columnstore index can be spared of nearly all maintenance. Whether any is performed is up to the whim of its administrator. Realistically, the only suboptimal scenario that can arise when the columnstore index is not the target of deletes, updates, or unordered inserts is that rowgroups may be undersized due to the use of the delta store to process small INSERT operations.
The impact of undersized rowgroups resulting from delta store usage is minor and should not be seen as an urgent problem in need of an immediate resolution. In these scenarios, waiting for infrequent maintenance periods to perform columnstore index maintenance would be more than effective enough to ensure that undersized rowgroups are merged effectively. Quarterly or yearly is likely often enough.
Now that the causes of fragmentation have been thoroughly detailed, using index maintenance operations to resolve fragmentation can be discussed.
Columnstore Reorganize
Combines undersized rowgroups via a columnstore index merge. This occurs when multiple rowgroups can combine to fit into a single new rowgroup (less than 1,024,576 or 220 rows).
May remove deleted rows via a self-merge. This only happens when more than 102,400 rows in a rowgroup are deleted.
If both of these tasks apply to a set of rowgroups, then a merge operation will be prioritized over a self-merge. Rowgroups that were trimmed due to dictionary pressure cannot be combined with other rowgroups, regardless of their row counts.
Query to Delete a Large Portion of a Rowgroup
The second column to the right shows total rows in the rowgroup, whereas the rightmost column provides the deleted row count. Of the rowgroups with deleted rows, only rowgroups 23 and 24 have more than 102,400 rows deleted and would qualify for a self-merge operation. These rowgroups are also valid targets for a columnstore merge operation as they can be combined, with the resulting rowgroup containing less than the row cap (220 rows) for a columnstore rowgroup.
Syntax for a Columnstore Reorganize Operation
Note that rowgroups 23 and 24 are now flagged as TOMBSTONE and will be cleaned up by the tuple mover at some point in the near future. Two new rowgroups were created (25 and 26) that replace them, with the deleted rows removed. The self-merge operation essentially creates new rowgroups, copies all nondeleted rows into them, and swaps them in as the active rowgroups while the previous versions are flagged for cleanup. The resulting rowgroups are free of the burden of deleted rows. Remember that the self-merge only occurs when more than 102,400 rows in a rowgroup are deleted.
Because rowgroup 25 was trimmed due to dictionary pressure, it cannot be combined with rowgroup 26, even though their combined row counts could fit them into a single rowgroup.
Reorganize to Remove Delta Rowgroups
An additional option is available when issuing a columnstore index REORGANIZE command: the COMPRESS_ALL_ROW_GROUPS option. When used, SQL Server will initialize the tuple mover to process the contents of the delta store and move them into compressed rowgroups. This effectively empties the delta store, moving all data into compressed rowgroups.
Small INSERT Operation into a Columnstore Index
REORGANIZE Using the COMPRESS_ALL_ROW_GROUPS option
Rowgroup 25 remains undersized due to dictionary pressure, but rowgroups 26 and 28 have been combined into the newly formed rowgroup 29. The trim reason description allows an operator to understand that this rowgroup was created and populated via an index REORGANIZE operation.
Index REORGANIZE operations are an excellent way to combine undersized rowgroups, assuming they do not suffer from dictionary size limitations. They can also remove deleted rows, assuming the number of deleted rows is greater than 102,400. Lastly, with the COMPRESS_ALL_ROW_GROUPS option, an index REORGANIZE can process the delta store contents into compressed rowgroups, which can later be combined via further REORGANIZE operations, if desired.
For a columnstore index that is not the target of UPDATE operations and does not experience out-of-order inserts, using index REORGANIZE operations will handle most fragmentation effectively and preclude the need for other regular index maintenance tasks. The simplest application of index maintenance is to run an index REORGANIZE operation after each data load and include the COMPRESS_ALL_ROW_GROUPS option. This will ensure that rowgroups remain relatively full and that delta stores are processed and not slowing down columnstore queries. For a stickler that wants perfection, a second REORGANIZE operation can be executed to further combine rowgroups resulting from processed delta stores. These are online, fast, and effective ways to manage fragmentation on an ongoing basis as a columnstore index grows via regular data load processes.
Columnstore Rebuild
Rowgroups that are filled to capacity, whenever possible.
All deleted rows are eliminated.
All delta stores are processed and compressed.
Vertipaq optimization is applied, even if it previously was not.
In essence, a REBUILD is roughly equivalent to creating a columnstore index anew using a CREATE INDEX statement. The only difference is that the existing index structure will remain intact until the new index is built. For ONLINE rebuilds, this ensures that queries can continue to use the columnstore index, even as a rebuild operation is running.
Rebuilding a clustered columnstore index can only be accomplished as an online operation starting in SQL Server 2019. Nonclustered indexes can be rebuilt online regardless of SQL Server version.
REBUILD Statement Against a Clustered Columnstore Index
Finally, the index is pristine, with 23 rowgroups that are completely full and 1 additional rowgroup that is leftover from the index maintenance operations.
Undersized rowgroups accumulated over a long period of time that cannot be combined via REORGANIZE operations.
Extensive deleted rows that number less than 102,400 rows per rowgroup and cannot be addressed by REORGANIZE operations.
Large amounts of out-of-order data exist in the columnstore index. This must be addressed in conjunction with work to reorder data, as shown in the next section.
Note that when an index REBUILD is issued, the compression type may be changed to or from columnstore archive compression, if needed. Because an index REBUILD is an expensive operation and is offline prior to SQL Server 2019, care should be taken to execute rebuilds during maintenance windows when such work is more tolerable to processes that use this data.
Sometimes, data can become heavily fragmented via software releases that make significant changes to the underlying schema and data. If this is anticipated, then scheduling an index REBUILD as a postprocessing step after the software release would be an excellent way to ensure that data continues to be efficiently delivered to analytic processes, even after a disruptive software release.
REBUILD operations may target a specific partition, thus allowing only data that is heavily fragmented to be rebuilt. For a large columnstore index in which only a small portion is actively written to, this is an excellent way to speed up rebuilds and minimize disruptions to the users of analytic data.
Columnstore Reorder and Rebuild
- 1.
Drop the columnstore index.
- 2.
Create a clustered rowstore index with key columns that match the data order for the table.
- 3.
Create a new clustered columnstore index with the DROP_EXISTING=ON option.
The clustered rowstore index is used to enforce a new data order on the contents of the table, whereas the new columnstore index replaces it, retaining the new data order.
This is an expensive and disruptive process that will result in analytic queries being unable to take advantage of the columnstore index from the point when the index is dropped until the new columnstore index is completely built. Therefore, it is a worthwhile process to implement during a maintenance window when causing trouble is more acceptable.
Process to Fix Poorly Ordered Data in a Clustered Columnstore Index
Note that dropping a clustered columnstore index is not a speedy operation. It requires converting the columnstore structure into a b-tree/heap structure, which requires more computing resources than dropping a clustered rowstore index.
The process to fix data order on a columnstore index should be reserved for scenarios where fragmentation is high enough to negatively impact analytic processes and should be an infrequent operation.
Remove all UPDATE operations on columnstore indexes. Perform updates on staging tables instead (when possible).
Avoid inserting data out of order. Use temporary or staging tables to ensure that data is ordered correctly prior to inserting new data.
Despite its complexity, the process to fix unordered data can be executed on a partition-by-partition basis, thus allowing only active partitions containing warm/hot data to be targeted with expensive index maintenance operations. For a columnstore index with many partitions, this can save considerable time and reduce disruption and downtime to analytic processes.
Columnstore Index Maintenance by Partition
Query to Rebuild a Single Partition of a Columnstore Index on a Partitioned Table
In this example, if partition number 5 is identified as the only partition containing data that is being loaded/modified regularly, then rebuilding only it will save significant maintenance time as the other partitions can be skipped.
Index Maintenance in Nonclustered Columnstore Indexes
Nonclustered columnstore indexes are afforded less flexibility with regard to index maintenance. Data order is enforced by the clustered rowstore index, and therefore the quality of data order within a columnstore index is prescribed by its corresponding clustered rowstore index. If the table’s clustered rowstore index happens to be the ordering column(s) for the columnstore index and that table is not subject to UPDATE operations, then the nonclustered columnstore index will maintain data order effectively.
Use REORGANIZE operations to combine undersized rowgroups, process the delta store, and eliminate larger groups of deleted rows.
Use REBUILD operations to eliminate undersized rowgroups and eliminate unordered data due to UPDATE operations.
The primary difference here is that a REBUILD can eliminate out-of-order data resulting from UPDATE operations. When rebuilt, a nonclustered columnstore index will be created using the order prescribed by the clustered rowstore index, which is not negatively impacted by UPDATE operations in the same way as a columnstore index.
Both REORGANIZE and REBUILD operations are available as online operations for nonclustered indexes, providing more flexibility when trying to schedule recurring (or one-time) maintenance. This means that real-time operational analytics that target a nonclustered columnstore index can continue efficiently, even as that index is being rebuilt.
Like with clustered columnstore indexes, maintenance on nonclustered columnstore indexes can be issued against any or all partitions, allowing active data to be targeted with maintenance while unchanging data can be skipped.