© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
E. PollackAnalytics Optimization with Columnstore Indexes in Microsoft SQL Serverhttps://doi.org/10.1007/978-1-4842-8048-5_10

10. Segment and Rowgroup Elimination

Edward Pollack1  
(1)
Albany, NY, USA
 

A key component to how columnstore indexes function is that every segment within the index is its own building block. Each may be read individually or in groups, but the number of segments read via any query can be reduced by efficient architecture and optimal query patterns. Reducing segments read directly reduces IO, improves query speed and memory-related performance metrics, such as page life expectancy.

Segment Elimination

Each segment represents data for a single column over a set of rows. When a query is executed against a columnstore index, SQL Server needs to determine which rowgroups are required to return the requested result set. Within those rowgroups, only the segments containing data for the selected columns will be read. Therefore, queries that select less columns will read fewer segments, thereby reducing IO, memory usage, and query runtimes.

Consider the columnstore index shown in Figure 10-1.
Figure 10-1

Sample columnstore index with six rowgroups and eight columns

With 6 rowgroups and 8 columns, this columnstore index contains a total of 48 segments. Because each column is broken into its own set of segments (one per rowgroup), it is not necessary to read segments for columns that are not needed for a query. If a sample query were to only read columns A and B, the result would be that the remaining segments can be automatically skipped, as shown in Figure 10-2.
Figure 10-2

Impact of segment elimination on a query for two columns

By querying for only two of the possible eight columns, the number of segments read was reduced by 75%. This convention holds true for columnstore indexes with any number of columns. If this table had 24 columns, then reading only 2 of them would mean that 22 columns would not be read. The result of this feature is that the number of segments read in a columnstore index query will be proportional to the number of columns involved in the query.

The columns needed to satisfy a query also include any referenced by the WHERE clause, as well as in aggregates (GROUP BY/HAVING). If a query includes views or functions, then their contents will be evaluated to determine which columns are required to execute a query. Listing 10-1 provides an example query against a columnstore index.
SELECT
       [City Key],
       COUNT(*)
FROM fact.Sale_CCI
WHERE [Invoice Date Key] >= '1/1/2016'
GROUP BY [City Key]
ORDER BY COUNT(*) DESC;
Listing 10-1

Sample Columnstore Query Using a Filter and Aggregate

This analytic query calculates a count of sales per city for a given time period. Of the 21 columns in Sale_CCI, only 2 were required to complete this calculation: City Key and Invoice Date Key. As a result, 19 of the 21 column’s worth of segments will be ignored when executing this query.

Alternatively, rowstore tables store rows sequentially with each column for each row stored together on pages. While reading less columns in rowstore indexes can reduce the amount of data presented to an application, it does not reduce the quantity of pages read into memory in order to retrieve the specific columns for a query. Therefore, reducing the number of columns queried in a columnstore index will provide immediate performance benefits that are not as pronounced in rowstore indexes.

Segment elimination is a simple and powerful tool that can be simplified into a single optimization technique: Write queries to only include the required columns. Since rowgroups can contain up to 220 rows, the cost of querying unnecessary columns can be quite high. Whereas transactional queries against rowstore tables often operate on small numbers of rows, analytic queries can access millions of rows at a time. Therefore, the perceived convenience of SELECT * queries will hinder performance on a columnstore index.

Consider the query in Listing 10-2.
SELECT
       *
FROM fact.Sale_CCI
WHERE [Invoice Date Key] = '2/17/2016';
Listing 10-2

A SELECT * Query Against a Columnstore Index

This SELECT * query returns all columns for sales on a specific date. While only 30,140 rows are returned, all 21 columns need to have their segments retrieved as part of the operation. Figure 10-3 provides the output of STATISTICS IO for this query.
Figure 10-3

IO for a sample SELECT * query

Note that a total of 68,030 LOB logical reads are recorded. Listing 10-3 contains an alternative query in which only a small set of columns required by an application are returned.
SELECT
       [Sale Key],
       [City Key],
       [Invoice Date Key]
FROM fact.Sale_CCI
WHERE [Invoice Date Key] = '2/17/2016';
Listing 10-3

A Query Against a Columnstore Index Requesting Only Three Columns

Here, only three of the columns are requested, instead of all of them. The resulting IO can be viewed in Figure 10-4.
Figure 10-4

IO for a sample query requesting data from three columns

When the column list was reduced to only those needed for the query, LOB logical reads were reduced from 68,030 to 31,689. The performance improvement for omitting columns will vary based on the data type, compression, and contents of each column. Omitting a text column with no repeated values will reduce IO far more than omitting a BIT column.

Segment elimination is an easy way to improve query speeds while also reducing resource consumption. Anyone writing queries against columnstore indexes should consider which columns are required for their workloads and ensure that no extra columns are returned.

Rowgroup Elimination

Unlike rowstore indexes, columnstore indexes have no built-in order. The data that is inserted into a columnstore index is added in the order it is received by SQL Server. As a result, the order of data within rowgroups is the direct result of the order it was inserted. Equally important is that UPDATE operations will reorder a columnstore index, removing rows from one set of rowgroups and inserting the new versions into the most current open rowgroups.

Because compressing rowgroups is a computationally expensive process, the cost to intrinsically maintain any form of data order would be prohibitively high. This is one of the most important concepts when architecting a columnstore index. Since data order is not enforced by SQL Server, it is the responsibility of the architect to determine data order up front and ensure that both data load processes and common query patterns maintain that agreed-upon data order.

Consider the table Dimension.Employee that contains a clustered rowstore index on the Employee Key column. If three new employees start work and are added to the table, the INSERT operations to add them could be represented by the T-SQL in Listing 10-4.
INSERT INTO Dimension.Employee
       ([Employee Key], [WWI Employee ID], Employee, [Preferred Name], [Is Salesperson], Photo, [Valid From], [Valid To], [Lineage Key])
VALUES
(   -1, -- Clustered Index
    289, N'Ebenezer Scrooge', N'Scrooge', 0, NULL, GETUTCDATE(), '9999-12-31 23:59:59.9999999', 3),
(   213, -- Clustered Index
    400, N'Captain Ahab', N'Captain', 0, NULL, GETUTCDATE(), '9999-12-31 23:59:59.9999999', 3),
(   1017, -- Clustered Index
    501, N'Holden Caulfield', N'Phony', 0, NULL, GETUTCDATE(), '9999-12-31 23:59:59.9999999', 3);
Listing 10-4

Query to Add Three New Employees to the Dimension.Employee Table

There are three rows inserted into the table, with clustered index ID values of -1, 213, and 1017. When inserted, SQL Server will place each row in the b-tree index in order with the rest of the rows, based on those Employee Key values. As a result, the table will remain ordered by the clustered index after the INSERT operation.

Imagine for a moment that this table did not have a rowstore index, but instead had a clustered columnstore index. In that scenario, the three rows would be inserted at the end of the open rowgroup(s) without any regard to the value of Employee Key. A query that searches for a specific range of IDs will need to examine any rowgroup that contains the IDs.

Columnstore metadata helps SQL Server locate rows based on the range of values for each column present in each rowgroup. Consider the metadata for the Invoice Date Key column in Fact.Sale_CCI using the query in Listing 10-5.
SELECT
       tables.name AS table_name,
       indexes.name AS index_name,
       columns.name AS column_name,
       partitions.partition_number,
       column_store_segments.segment_id,
       column_store_segments.min_data_id,
       column_store_segments.max_data_id,
       column_store_segments.row_count
FROM sys.column_store_segments
INNER JOIN sys.partitions
ON column_store_segments.hobt_id = partitions.hobt_id
INNER JOIN sys.indexes
ON indexes.index_id = partitions.index_id
AND indexes.object_id = partitions.object_id
INNER JOIN sys.tables
ON tables.object_id = indexes.object_id
INNER JOIN sys.columns
ON tables.object_id = columns.object_id
AND column_store_segments.column_id = columns.column_id
WHERE tables.name = 'Sale_CCI'
AND columns.name = 'Invoice Date Key'
ORDER BY tables.name, columns.name, column_store_segments.segment_id;
Listing 10-5

Query to Examine Metadata for a Single Column of a Columnstore Index

The results are shown in Figure 10-5.
Figure 10-5

Metadata for the Invoice Date Key column of a columnstore index

Note that min_data_id and max_data_id are identical for each rowgroup. This means that the data contained for that column is unordered. If queries commonly filtered using Invoice Date Key, they would need to scan all rowgroups in the columnstore index in order to appropriately filter out the requested rows. As a columnstore index grows over time, the cost to scan all rowgroups will become burdensome. Even on a well-compressed columnstore index, queries will become slow, and the IO required to service an unfilterable query will be high.

STATISTICS IO provides a useful guide to the number of rowgroups read as part of a columnstore index scan. To demonstrate this, the query in Listing 10-6 will be used.
SELECT
       SUM([Quantity])
FROM Fact.Sale_CCI
WHERE [Invoice Date Key] >= '1/1/2016'
AND [Invoice Date Key] < '2/1/2016';
Listing 10-6

Query to Illustrate Rowgroup Reads in a Columnstore Index Scan

This is a classic analytic query that calculates the total sales quantity for a given month.

When a columnstore index is scanned, STATISTICS IO will include a message in the output that indicates how many segments were read and skipped. This is a measure of the number of rowgroups that had to be scanned to determine which rows met the filter criteria and which could be automatically skipped. Figure 10-6 indicates that 26 segments were read and 0 were skipped. Because the underlying data is completely unordered, SQL Server cannot use columnstore metadata to skip any rowgroups. This represents a common real-world challenge, but one that is easy to solve.
Figure 10-6

STATISTICS IO for a sample unordered columnstore index scan

Columnstore metadata allows rowgroups to be skipped when query filters do not include rows present in them. This process is called rowgroup elimination and is key to optimizing columnstore index performance. The easiest way to achieve this is by ordering data and maintaining that order over time. Data may be ordered by one or more columns that will represent the most common filters used in analytic queries. The most common dimension to order OLAP data by is time. Analytics often filter, aggregate, and visualize data using time-based units, such as hours, days, weeks, months, quarters, and years. Figure 10-7 shows the impact of rowgroup elimination on a query that requires only a narrow range of data based on filter criteria that honors ordered data.
Figure 10-7

Impact of rowgroup elimination on a narrow analytic query

While this hypothetical index contains six rowgroups, only a single one is required to satisfy the filter criteria of the query. The power of rowgroup elimination is that it scales effectively as a columnstore index grows in size. A query that requests a narrow week of analytic data from a table with a month of data will perform similarly to that same query against a table with 10 years of data. This is the primary feature that allows columnstore indexes to scale effectively, even when billions of rows are present in a table.

In the example in Listing 10-6, a simple analytic query filtered on Invoice Date Key, but the unordered columnstore index data forced a full scan of the data to determine which rows met the filter criteria. If Invoice Date Key is the most common filter criteria for analysis of this data, then ordering by that column would allow for effective rowgroup elimination.
CREATE TABLE Fact.Sale_CCI_ORDERED
(      [Sale Key] [bigint] NOT NULL,
       [City Key] [int] NOT NULL,
       [Customer Key] [int] NOT NULL,
       [Bill To Customer Key] [int] NOT NULL,
       [Stock Item Key] [int] NOT NULL,
       [Invoice Date Key] [date] NOT NULL,
       [Delivery Date Key] [date] NULL,
       [Salesperson Key] [int] NOT NULL,
       [WWI Invoice ID] [int] NOT NULL,
       [Description] [nvarchar](100) NOT NULL,
       [Package] [nvarchar](50) NOT NULL,
       [Quantity] [int] NOT NULL,
       [Unit Price] [decimal](18, 2) NOT NULL,
       [Tax Rate] [decimal](18, 3) NOT NULL,
       [Total Excluding Tax] [decimal](18, 2) NOT NULL,
       [Tax Amount] [decimal](18, 2) NOT NULL,
       [Profit] [decimal](18, 2) NOT NULL,
       [Total Including Tax] [decimal](18, 2) NOT NULL,
       [Total Dry Items] [int] NOT NULL,
       [Total Chiller Items] [int] NOT NULL,
       [Lineage Key] [int] NOT NULL);
CREATE CLUSTERED INDEX CCI_fact_Sale_CCI_ORDERED ON Fact.Sale_CCI_ORDERED ([Invoice Date Key]);
INSERT INTO Fact.Sale_CCI_ORDERED
       ([Sale Key], [City Key], [Customer Key], [Bill To Customer Key], [Stock Item Key], [Invoice Date Key], [Delivery Date Key],
        [Salesperson Key], [WWI Invoice ID], Description, Package, Quantity, [Unit Price], [Tax Rate],
        [Total Excluding Tax], [Tax Amount], Profit, [Total Including Tax], [Total Dry Items],
        [Total Chiller Items], [Lineage Key])
SELECT
       Sale.[Sale Key], Sale.[City Key], Sale.[Customer Key], Sale.[Bill To Customer Key], Sale.[Stock Item Key], Sale.[Invoice Date Key], Sale.[Delivery Date Key],
       Sale.[Salesperson Key], Sale.[WWI Invoice ID], Sale.Description, Sale.Package, Sale.Quantity, Sale.[Unit Price], Sale.[Tax Rate],
       Sale.[Total Excluding Tax], Sale.[Tax Amount], Sale.Profit, Sale.[Total Including Tax], Sale.[Total Dry Items],
       Sale.[Total Chiller Items], Sale.[Lineage Key]
FROM fact.Sale
CROSS JOIN
Dimension.City
WHERE City.[City Key] >= 1 AND City.[City Key] <= 110;
-- Create a clustered columnstore index on the table, removing the existing clustered rowstore index.
CREATE CLUSTERED COLUMNSTORE INDEX CCI_fact_Sale_CCI_ORDERED ON Fact.Sale_CCI_ORDERED WITH (MAXDOP = 1, DROP_EXISTING = ON);
GO
Listing 10-7

Query to Create a New Columnstore Index Ordered by Invoice Date Key

Note that the data in the table created in Listing 10-7 is identical to the data demonstrated earlier in this chapter, but has been subject to a clustered rowstore index prior to being given a columnstore index. This additional step ensures that the initial data set is ordered by Invoice Date Key. MAXDOP is intentionally set to 1 to avoid parallelism as parallel threads may risk inserting data into the columnstore index in multiple ordered streams rather than a single ordered stream.

Going forward, new data would be regularly inserted into this table via standard data load processes. Assuming the new data contains the most recent values for Invoice Date Key, then the columnstore index will remain ordered in the future as new data is added to it.

To test the impact of data order on Fact.Sale_CCI_ORDERED, the query from Listing 10-6 will be executed against it, with the output tab displayed in Figure 10-8.
Figure 10-8

STATISTICS IO for a sample filtered columnstore index scan

Instead of reading every rowgroup in the columnstore index, SQL Server only needed to read two of them, with the remainder being skipped. Skipped segments in STATISTICS IO indicate that rowgroup elimination is being successfully implemented. The metadata query in Listing 10-5 can also be rerun against this ordered table to illustrate how data order impacts columnstore metadata, with the results being shown in Figure 10-9.
Figure 10-9

Metadata for the Invoice Date Key column of an ordered columnstore index

The values of min_data_id and max_data_id for each rowgroup show a drastic change from that of an unordered columnstore index. Instead of the values being the same for each rowgroup, they progress from low values to high values as data progresses from the first rowgroup to the latter ones. To put this in perspective, if a hypothetical query required data for a data ID of 735270, it would only need to read the rowgroups associated with segment_id = 7 from this list of segments. Since the metadata indicates that the remaining segments do not contain this value, they (and their associated rowgroups) can automatically be skipped.

Ordering data by a key column is an easy way to enable rowgroup elimination, thereby reducing query resource consumption and improving the speed of any queries that can make use of that data order. Effective data order doesn’t just reduce reads, but it can also save on storage space by improving the compression ratios of the underlying data. Listing 10-8 contains a script that will retrieve the data space used by the two columnstore indexes featured in this chapter.
CREATE TABLE #storage_data
(      table_name VARCHAR(MAX),
       rows_used BIGINT,
       reserved VARCHAR(50),
       data VARCHAR(50),
       index_size VARCHAR(50),
       unused VARCHAR(50));
INSERT INTO #storage_data
       (table_name, rows_used, reserved, data, index_size, unused)
EXEC sp_MSforeachtable "EXEC sp_spaceused '?'";
UPDATE #storage_data
       SET reserved = LEFT(reserved, LEN(reserved) - 3),
              data = LEFT(data, LEN(data) - 3),
              index_size = LEFT(index_size, LEN(index_size) - 3),
              unused = LEFT(unused, LEN(unused) - 3);
SELECT
       table_name,
       rows_used,
       reserved / 1024 AS data_space_reserved_mb,
       data / 1024 AS data_space_used_mb,
       index_size / 1024 AS index_size_mb,
       unused AS free_space_kb,
       CAST(CAST(data AS DECIMAL(24,2)) / CAST(rows_used AS DECIMAL(24,2)) AS DECIMAL(24,4)) AS kb_per_row
FROM #storage_data
WHERE rows_used > 0
AND table_name IN ('Sale_CCI', 'Sale_CCI_ORDERED')
ORDER BY CAST(reserved AS INT) DESC;
DROP TABLE #storage_data;
Listing 10-8

Query to Retrieve Data Space Used for Two Columnstore Indexes

The results are shown in Figure 10-10.
Figure 10-10

Space used by an ordered and unordered columnstore index

The data space used for each table is dramatically different, with the ordered table consuming less than 10% of the space that the unordered table uses. This is an exceptionally dramatic example of how ordered data can be stored more efficiently than unordered data. Ordered data saves space because, typically, data is more similar to other data captured within a short timeframe of it than when compared to data that was collected years apart. Within any application, usage patterns change over time as new features are released, old features are retired, and user behavior changes. Because of this, data samples will look more and more different as time passes between them. These similarities translate into compression algorithms being able to take advantage of a data set with less distinct values for common dimensions. This also reduces dictionary size, which also helps to prevent dictionaries from filling up and forcing the creation of undersized rowgroups.

Real-world data may not compress as impressively as the sample here, but expect nontrivial savings that will have a positive impact on data load processes and on analytics speeds. It is important to remember that saving storage space also saves memory as data remains compressed until needed by an application. Therefore, if an ordered data set were to decrease in size by 25%, that would result in 25% less memory being consumed in the buffer pool by columnstore index pages. Furthermore, other common measures of server performance such as page life expectancy and latching would improve as smaller objects can be retrieved more quickly and will impact other data in memory less than larger objects.

An ordered columnstore index also improves UPDATE and DELETE speeds by allowing those operations to target less pages. For example, consider the queries on the ordered and unordered sales tables shown in Listing 10-9.
SELECT COUNT(*) AS row_count FROM Fact.Sale_CCI WHERE [Invoice Date Key] = '1/1/2015';
SELECT COUNT(*) AS row_count FROM Fact.Sale_CCI_ORDERED WHERE [Invoice Date Key] = '1/1/2015';
Listing 10-9

Query to Display Sample Row Counts for Two Columnstore Indexed Tables

The results show that in each table, the count of rows affected is identical, as seen in Figure 10-11.
Figure 10-11

Row counts for a sample query against ordered and unordered columnstore indexes

For each table, 33,110 rows would be affected by an update using the same filter. Listing 10-10 provides a simple UPDATE statement against each table.
UPDATE Sale_CCI
       SET [Total Dry Items] = [Total Dry Items] - 1,
              [Total Chiller Items] = [Total Chiller Items] + 1
FROM Fact.Sale_CCI_ORDERED -- Unordered
WHERE [Invoice Date Key] = '1/1/2015';
UPDATE Sale_CCI
       SET [Total Dry Items] = [Total Dry Items] - 1,
              [Total Chiller Items] = [Total Chiller Items] + 1
FROM Fact.Sale_CCI_ORDERED -- Ordered
WHERE [Invoice Date Key] = '1/1/2015';
Listing 10-10

Query to Update 33,110 Rows in Two Columnstore Indexed Tables

The results in Figure 10-12 show the resulting IO and rowgroup usage for each UPDATE operation.
Figure 10-12

STATISTICS IO output for updates against an ordered and unordered columnstore index

Note the dramatic difference in IO, as well as segment reads for each operation. Because an UPDATE consists of both a DELETE and an INSERT, SQL Server had to perform the following tasks to complete each update:
  1. 1.

    Locate all rows matching the filter criteria.

     
  2. 2.

    Read all columns for all rows matching the filter criteria.

     
  3. 3.

    Mark these rows as deleted in the delete bitmap.

     
  4. 4.

    Insert new versions of these rows into the delta store.

     

In order to insert new versions of the updated rows, SQL Server needs to read the existing rows in their entirety, which is not a trivial operation. Once read, those rows are marked as deleted and the new versions are inserted into the delta store. This is an expensive process, but ordered data allows for far fewer rowgroups to be read, thereby reducing the work needed to set up the necessary data for the insert into the delta store.

DELETE operations are improved similarly, but require far less work as they simply need to
  1. 1.

    Locate all rows matching the filter criteria.

     
  2. 2.

    Mark these rows as deleted in the delete bitmap.

     

For both cases, an ordered columnstore index will immensely improve UPDATE and DELETE performance when the filter criteria honors the order used in the table. As an added bonus, an ordered columnstore index will allow for DELETE and UPDATE operations that cause less fragmentation. Instead of flagging rows as deleted in most (or all) rowgroups, the deletes can be isolated to a smaller number of rowgroups.

Combining Segment and Rowgroup Elimination

Segment elimination allows analytics that use less columns to automatically require less segments to be read, thereby reducing IO. Rowgroup elimination allows an ordered data set to facilitate analytic queries to use columnstore metadata to remove vast swaths of rowgroups from the result set, reducing IO significantly.

Optimal analytic workloads combine segment and rowgroup elimination to slice a table both vertically (rowgroup elimination) and horizontally (segment elimination). Consider the columnstore index introduced previously in Figure 10-1. If an analytic query were designed to only query for columns A and B and to also filter for a narrow date range, the resulting rowgroup reads would look similar to the results in Figure 10-13.
Figure 10-13

Impact of segment and rowgroup elimination on a columnstore index

Only querying for columns A and B allows columns C–H to be automatically skipped and all segments for those columns eliminated (36 segments in total). Querying for a narrow date range that only requires rows in rowgroup 4 allows rowgroups 1–3 and 5–6 to also be automatically skipped, eliminating another 10 segments. The result is a query that required only 2 out of the possible 48 segments in the table!

Combining segment and rowgroup elimination allows columnstore index queries to scale effectively, even as data grows larger or as more columns are added to the table.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.218.5.12