A key aspect of how columnstore indexes function is that every segment within the index is its own building block. Each may be read individually or in groups, and efficient architecture and optimal query patterns can reduce the number of segments any query reads. Reading fewer segments directly reduces IO, speeds up queries, and improves memory-related performance metrics, such as page life expectancy.
Segment Elimination
Each segment represents data for a single column over a set of rows. When a query is executed against a columnstore index, SQL Server determines which rowgroups are required to return the requested result set. Within those rowgroups, only the segments containing data for the selected columns are read. Therefore, queries that select fewer columns read fewer segments, thereby reducing IO, memory usage, and query runtimes.
By querying for only two of the possible eight columns, the number of segments read was reduced by 75%. This convention holds true for columnstore indexes with any number of columns. If this table had 24 columns, then reading only 2 of them would mean that the segments for the other 22 columns would not be read. The result of this feature is that the number of segments read by a columnstore index query is proportional to the number of columns involved in the query.
Sample Columnstore Query Using a Filter and Aggregate
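A minimal sketch of such a query, assuming a dbo.Sale_CCI table copied from the WideWorldImportersDW Fact.Sale table (the object names and date range here are illustrative, not taken from the original listing):

```sql
SELECT
	[City Key],
	COUNT(*) AS Sale_Count
FROM dbo.Sale_CCI
WHERE [Invoice Date Key] >= '2016-01-01'
AND [Invoice Date Key] < '2016-02-01'
GROUP BY [City Key];
```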
This analytic query calculates a count of sales per city for a given time period. Of the 21 columns in Sale_CCI, only 2 were required to complete this calculation: City Key and Invoice Date Key. As a result, 19 of the 21 columns' worth of segments will be ignored when executing this query.
Rowstore tables, by contrast, store rows sequentially, with every column for a given row stored together on the same pages. While reading fewer columns from a rowstore index can reduce the amount of data presented to an application, it does not reduce the number of pages read into memory to retrieve those columns. Therefore, reducing the number of columns queried in a columnstore index provides immediate performance benefits that are far less pronounced in rowstore indexes.
Segment elimination is a simple and powerful tool that can be distilled into a single optimization technique: write queries to include only the required columns. Since rowgroups can contain up to 2^20 (1,048,576) rows, the cost of querying unnecessary columns can be quite high. Whereas transactional queries against rowstore tables often operate on small numbers of rows, analytic queries can access millions of rows at a time. Therefore, the perceived convenience of SELECT * queries will hinder performance on a columnstore index.
A SELECT * Query Against a Columnstore Index
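A sketch of such a query, with SET STATISTICS IO ON included so that logical reads can be observed (the table name and date range are assumptions carried from the earlier example):

```sql
SET STATISTICS IO ON;

SELECT *
FROM dbo.Sale_CCI
WHERE [Invoice Date Key] >= '2016-01-01'
AND [Invoice Date Key] < '2016-02-01';
```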
A Query Against a Columnstore Index Requesting Only Three Columns
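A comparable three-column sketch; the specific columns chosen here are illustrative:

```sql
SET STATISTICS IO ON;

SELECT
	[City Key],
	[Invoice Date Key],
	Quantity
FROM dbo.Sale_CCI
WHERE [Invoice Date Key] >= '2016-01-01'
AND [Invoice Date Key] < '2016-02-01';
```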
When the column list was reduced to only those needed for the query, LOB logical reads were reduced from 68,030 to 31,689. The performance improvement for omitting columns will vary based on the data type, compression, and contents of each column. Omitting a text column with no repeated values will reduce IO far more than omitting a BIT column.
Segment elimination is an easy way to improve query speeds while also reducing resource consumption. Anyone writing queries against columnstore indexes should consider which columns are required for their workloads and ensure that no extra columns are returned.
Rowgroup Elimination
Unlike rowstore indexes, columnstore indexes have no built-in order. Data inserted into a columnstore index is added in the order in which it is received by SQL Server; as a result, the order of data within rowgroups is simply the order of insertion. Equally important, UPDATE operations will reorder a columnstore index over time, removing rows from one set of rowgroups and inserting the new versions into the currently open rowgroup(s).
Because compressing rowgroups is a computationally expensive process, the cost to intrinsically maintain any form of data order would be prohibitively high. This is one of the most important concepts when architecting a columnstore index. Since data order is not enforced by SQL Server, it is the responsibility of the architect to determine data order up front and ensure that both data load processes and common query patterns maintain that agreed-upon data order.
Query to Add Three New Employees to the Dimension.Employee Table
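A hedged sketch of the INSERT, keeping the Employee Key values discussed next; the remaining column list is abbreviated and the non-key values are placeholders, not the actual rows from the original listing:

```sql
INSERT INTO Dimension.Employee
	([Employee Key], [WWI Employee ID], Employee, [Preferred Name],
	 [Is Salesperson], [Valid From], [Valid To], [Lineage Key])
VALUES
	(-1,   -1,   N'Placeholder A', N'A', 0, SYSUTCDATETIME(), '9999-12-31 23:59:59.9999999', 0),
	(213,  213,  N'Placeholder B', N'B', 0, SYSUTCDATETIME(), '9999-12-31 23:59:59.9999999', 0),
	(1017, 1017, N'Placeholder C', N'C', 1, SYSUTCDATETIME(), '9999-12-31 23:59:59.9999999', 0);
```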
Three rows are inserted into the table, with clustered index key values of -1, 213, and 1017. When inserted, SQL Server places each row into the b-tree index in order with the rest of the rows, based on those Employee Key values. As a result, the table remains ordered by the clustered index after the INSERT operation.
Imagine for a moment that this table did not have a rowstore index, but instead had a clustered columnstore index. In that scenario, the three rows would be inserted at the end of the open rowgroup(s) without any regard to the value of Employee Key. A query that searches for a specific range of IDs will need to examine any rowgroup that contains the IDs.
Query to Examine Metadata for a Single Column of a Columnstore Index
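One way to view this metadata is via sys.column_store_segments. This sketch maps segments back to the table column by name; on a clustered columnstore index the segment's column_id generally lines up with the table's column_id, though a more robust mapping would go through sys.index_columns:

```sql
SELECT
	partitions.partition_number,
	column_store_segments.segment_id,
	column_store_segments.row_count,
	column_store_segments.min_data_id,
	column_store_segments.max_data_id
FROM sys.column_store_segments
INNER JOIN sys.partitions
	ON partitions.hobt_id = column_store_segments.hobt_id
INNER JOIN sys.columns
	ON columns.object_id = partitions.object_id
	AND columns.column_id = column_store_segments.column_id
WHERE partitions.object_id = OBJECT_ID('dbo.Sale_CCI')
	AND columns.name = 'Invoice Date Key'
ORDER BY column_store_segments.segment_id;
```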
Note that min_data_id and max_data_id are identical for each rowgroup. This means that the data for this column is unordered: every rowgroup contains values spanning the column's full range. If queries commonly filtered on Invoice Date Key, they would need to scan all rowgroups in the columnstore index to filter out the requested rows. As a columnstore index grows over time, the cost of scanning every rowgroup becomes burdensome. Even on a well-compressed columnstore index, queries will become slow, and the IO required to service an unfilterable query will be high.
Query to Illustrate Rowgroup Reads in a Columnstore Index Scan
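A sketch of such a query (the month chosen is illustrative). With SET STATISTICS IO ON, the messages output reports "segment reads" and "segments skipped"; at this level of the output, those counters actually refer to rowgroups:

```sql
SET STATISTICS IO ON;

SELECT
	SUM(Quantity) AS Total_Quantity
FROM dbo.Sale_CCI
WHERE [Invoice Date Key] >= '2016-05-01'
AND [Invoice Date Key] < '2016-06-01';
```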
This is a classic analytic query that calculates the total sales quantity for a given month.
While this hypothetical index contains six rowgroups, only a single one is required to satisfy the filter criteria of the query. The power of rowgroup elimination is that it scales effectively as a columnstore index grows in size. A query that requests a narrow week of analytic data from a table with a month of data will perform similarly to that same query against a table with 10 years of data. This is the primary feature that allows columnstore indexes to scale effectively, even when billions of rows are present in a table.
Query to Create a New Columnstore Index Ordered by Invoice Date Key
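A sketch of this process, assuming the source data comes from Fact.Sale; the new table name is illustrative:

```sql
-- Copy the source data into a new table.
SELECT *
INTO dbo.Sale_CCI_Ordered
FROM Fact.Sale;

-- Order the data first with a clustered rowstore index.
CREATE CLUSTERED INDEX CCI_Sale_CCI_Ordered
ON dbo.Sale_CCI_Ordered ([Invoice Date Key]);

-- Swap it for a clustered columnstore index. MAXDOP = 1 keeps a
-- single ordered insert stream so the rowgroups retain the order.
CREATE CLUSTERED COLUMNSTORE INDEX CCI_Sale_CCI_Ordered
ON dbo.Sale_CCI_Ordered
WITH (DROP_EXISTING = ON, MAXDOP = 1);
```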
Note that the data in the table created in Listing 10-7 is identical to the data demonstrated earlier in this chapter, but it was given a clustered rowstore index prior to the columnstore index. This additional step ensures that the initial data set is ordered by Invoice Date Key. MAXDOP is intentionally set to 1 to avoid parallelism, as parallel threads risk inserting data into the columnstore index in multiple ordered streams rather than a single ordered stream.
Going forward, new data would be regularly inserted into this table via standard data load processes. Assuming the new data contains the most recent values for Invoice Date Key, then the columnstore index will remain ordered in the future as new data is added to it.
The values of min_data_id and max_data_id for each rowgroup show a drastic change from that of an unordered columnstore index. Instead of the values being the same for each rowgroup, they progress from low values to high values as data progresses from the first rowgroup to the latter ones. To put this in perspective, if a hypothetical query required data for a data ID of 735270, it would only need to read the rowgroups associated with segment_id = 7 from this list of segments. Since the metadata indicates that the remaining segments do not contain this value, they (and their associated rowgroups) can automatically be skipped.
Query to Retrieve Data Space Used for Two Columnstore Indexes
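A sketch using sys.dm_db_partition_stats; the table names are assumptions carried from the earlier examples:

```sql
SELECT
	tables.name AS table_name,
	SUM(dm_db_partition_stats.reserved_page_count) * 8 AS reserved_kb
FROM sys.dm_db_partition_stats
INNER JOIN sys.tables
	ON tables.object_id = dm_db_partition_stats.object_id
WHERE tables.name IN ('Sale_CCI', 'Sale_CCI_Ordered')
GROUP BY tables.name;
```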
The data space used by each table is dramatically different, with the ordered table consuming less than 10% of the space of the unordered table. This is an exceptionally dramatic example of how ordered data can be stored more efficiently than unordered data. Ordered data saves space because data is typically more similar to other data captured within a short timeframe than to data collected years apart. Within any application, usage patterns change over time as new features are released, old features are retired, and user behavior changes; the more time that passes between two data samples, the more different they will look. When similar data is stored together, compression algorithms can take advantage of a data set with fewer distinct values for common dimensions. This also reduces dictionary size, which helps prevent dictionaries from filling up and forcing the creation of undersized rowgroups.
Real-world data may not compress as impressively as this sample, but expect nontrivial savings that positively impact both data load processes and analytic query speeds. It is important to remember that saving storage space also saves memory, as data remains compressed until needed by an application. Therefore, if an ordered data set decreases in size by 25%, columnstore index pages consume 25% less memory in the buffer pool. Furthermore, other common measures of server performance, such as page life expectancy and latching, improve, as smaller objects can be retrieved more quickly and displace less other data in memory than larger objects.
Query to Display Sample Row Counts for Two Columnstore Indexed Tables
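A sketch of such counts; the filter value is illustrative and would be chosen to match the UPDATE that follows:

```sql
SELECT COUNT(*) AS Sale_CCI_Row_Count
FROM dbo.Sale_CCI
WHERE [Invoice Date Key] = '2016-05-31';

SELECT COUNT(*) AS Sale_CCI_Ordered_Row_Count
FROM dbo.Sale_CCI_Ordered
WHERE [Invoice Date Key] = '2016-05-31';
```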
Query to Update 33,110 Rows in Two Columnstore Indexed Tables
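A sketch of the pair of UPDATE statements; the column updated and the filter value are illustrative:

```sql
UPDATE dbo.Sale_CCI
	SET Quantity = Quantity + 1
WHERE [Invoice Date Key] = '2016-05-31';

UPDATE dbo.Sale_CCI_Ordered
	SET Quantity = Quantity + 1
WHERE [Invoice Date Key] = '2016-05-31';
```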
When an UPDATE is executed against a columnstore index, SQL Server performs the following steps:

1. Locate all rows matching the filter criteria.
2. Read all columns for all rows matching the filter criteria.
3. Mark these rows as deleted in the delete bitmap.
4. Insert new versions of these rows into the delta store.
In order to insert new versions of the updated rows, SQL Server needs to read the existing rows in their entirety, which is not a trivial operation. Once read, those rows are marked as deleted and the new versions are inserted into the delta store. This is an expensive process, but ordered data allows for far fewer rowgroups to be read, thereby reducing the work needed to set up the necessary data for the insert into the delta store.
A DELETE, by contrast, requires only two steps:

1. Locate all rows matching the filter criteria.
2. Mark these rows as deleted in the delete bitmap.
In both cases, an ordered columnstore index immensely improves UPDATE and DELETE performance when the filter criteria honor the order used in the table. As an added bonus, an ordered columnstore index allows DELETE and UPDATE operations to cause less fragmentation: instead of flagging rows as deleted in most (or all) rowgroups, the deletes can be isolated to a small number of rowgroups.
Combining Segment and Rowgroup Elimination
Segment elimination allows analytic queries that use fewer columns to automatically read fewer segments, thereby reducing IO. Rowgroup elimination allows an ordered data set to use columnstore metadata to remove vast swaths of rowgroups from the scan, reducing IO significantly.
Only querying for columns A and B allows columns C–H to be automatically skipped and all segments for those columns eliminated (36 segments in total). Querying for a narrow date range that only requires rows in rowgroup 4 allows rowgroups 1–3 and 5–6 to also be automatically skipped, eliminating another 10 segments. The result is a query that required only 2 out of the possible 48 segments in the table!
Combining segment and rowgroup elimination allows columnstore index queries to scale effectively, even as data grows larger or as more columns are added to the table.