There are many features that allow columnstore indexes to perform exceptionally well for analytic workloads. Of those features, compression is the most significant driver in both performance and resource consumption. Understanding how SQL Server implements compression in columnstore indexes and how different algorithms are used to shrink the size of this data allows for optimal architecture and implementation of analytic data storage in SQL Server.
Basics of Columnstore Compression
Data written to segments within a columnstore index is heavily compressed prior to storage. The delta store buffers incoming writes to a columnstore index, ensuring that segments are not written so frequently that server performance suffers.
Analytic data tends to be highly repetitive, often consisting of values such as:

- Entity types
- Money
- Dates
- Counts
- Repetitive dimensions
Eleven rows are represented, and each column contains repeated values. The Date column contains only three distinct values, whereas Status contains four and Hours contains six. This repetition tends to scale well: if this were the data from a rowgroup containing one million rows, it would be reasonable to expect a column like Status to contain only a small number of values that are repeated frequently throughout the data set. The more repetitive the data is, the more effectively it will compress.
Different data types will have vastly different values. For example, the contents of a string column are not easily comparable to those of a money column.
Columns representing different dimensions will have different ranges of values that correspond to their typical usage. For example, a decimal column containing the tax rates charged on new cars would contain a very different set of common values than a decimal column containing the cost of those cars.
Columnstore indexes align perfectly with these conventions, whereas rowstore indexes struggle because row and page compression must account for every column in a row or page, spanning a wide variety of data types and dimensions.
Note that the delta store is not columnstore compressed. This data structure needs to be written to as quickly as possible at runtime, and therefore forgoing expensive compression removes the time and overhead required to process writes against it. This means that the delta store takes up more space than it would if columnstore compression were used. Since the delta store tends to be small in comparison to the compressed columnstore index contents and to contain a predictable number of rows, the effort needed to maintain it is constant and does not grow over time. Depending on SQL Server version, delta rowgroups may use row or page compression to limit their storage and memory footprint.
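For reference, delta store rowgroups can be observed alongside compressed rowgroups via the rowgroup physical stats DMV. A minimal sketch, assuming the table name Fact.Sale_CCI used later in this chapter:

```sql
-- A minimal sketch: viewing delta store rowgroups (OPEN/CLOSED) alongside
-- compressed rowgroups. The table name Fact.Sale_CCI is assumed for illustration.
SELECT OBJECT_NAME(object_id) AS table_name,
       row_group_id,
       state_desc,   -- OPEN/CLOSED = delta store; COMPRESSED = columnstore segments
       total_rows
FROM sys.dm_db_column_store_row_group_physical_stats
WHERE object_id = OBJECT_ID('Fact.Sale_CCI')
ORDER BY row_group_id;
```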
Columnstore Compression Algorithms
Columnstore indexes do not repeatedly store strings, decimals, or dates as-is in segments. Instead, SQL Server makes use of dictionary encoding and value encoding to shrink and compress data, even if the raw values are large.
At a high level, the compression process consists of three steps:

1. Encode all values prior to compression.
2. Calculate the optimal row order (aka Vertipaq optimization).
3. Individually compress each segment.
Encoding is an important step in the compression process. The following sections explain the different encoding algorithms, how they are used, and how they impact overall compression effectiveness.
Value Encoding
Sample Integer Column Data to Be Compressed
Values | Minimum Size in Bits |
---|---|
60 | 6 |
150 | 8 |
90 | 7 |
9000 | 14 |
630 | 10 |
300000 | 19 |
Integer Data Compressed Using a Negative Exponent
Values | Value * 10^-1 | Minimum Size in Bits |
---|---|---|
60 | 6 | 3 |
150 | 15 | 4 |
90 | 9 | 4 |
9000 | 900 | 10 |
630 | 63 | 6 |
300000 | 30000 | 15 |
Integer Data Compressed by Rebasing
Values | Value * 10^-1 | Value * 10^-1 - 6 | Minimum Size in Bits |
---|---|---|---|
60 | 6 | 0 | 0 |
150 | 15 | 9 | 4 |
90 | 9 | 3 | 2 |
9000 | 900 | 894 | 10 |
630 | 63 | 57 | 6 |
300000 | 30000 | 29994 | 15 |
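While the actual encoding happens inside the engine, the arithmetic above can be verified with a simple query. A sketch, not SQL Server's internal implementation:

```sql
-- A sketch reproducing the rebasing math above: scale each value by 10^-1,
-- subtract the minimum scaled value (6), and compute the minimum number of
-- bits needed to store the result.
SELECT v.Value,
       v.Value / 10 AS ScaledValue,          -- Value * 10^-1
       v.Value / 10 - 6 AS RebasedValue,     -- rebase against the minimum
       CASE WHEN v.Value / 10 - 6 = 0 THEN 0
            ELSE FLOOR(LOG(v.Value / 10 - 6, 2)) + 1
       END AS MinimumSizeInBits
FROM (VALUES (60), (150), (90), (9000), (630), (300000)) AS v(Value);
```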
Identical Integer Data Compressed by Rebasing
Values | Value - 1700000 | Original Size in Bits | Reduced Size in Bits |
---|---|---|---|
1700000 | 0 | 21 | 0 |
1700000 | 0 | 21 | 0 |
1700000 | 0 | 21 | 0 |
1700000 | 0 | 21 | 0 |
1700000 | 0 | 21 | 0 |
While this example may seem extreme, it is not that unusual, as dimensions often contain repeated values that fall within limited ranges.
Note that real-world transformations of data that occur in columnstore indexes are more complex than this, but the process to determine their details is essentially the same.
Value Compression for Decimals
Values | Value * 100 | Value / 5 | Value / 5 - 70 | Original Size in Bits | Reduced Size in Bits |
---|---|---|---|---|---|
15.2 | 1520 | 304 | 234 | 40 | 8 |
8.8 | 880 | 176 | 106 | 40 | 7 |
4.3 | 430 | 86 | 16 | 40 | 5 |
3.5 | 350 | 70 | 0 | 40 | 0 |
6.25 | 625 | 125 | 55 | 40 | 6 |
The original data is stored as DECIMAL(4,2), which consumes 5 bytes (40 bits) per value. By converting to integers prior to compression, multiple transformations are applied that reduce storage from 200 bits to 26 bits. Decimals may not seem like the most likely candidates for effective compression, but once they are converted into integer values, they can take advantage of any of the transformations that can be applied to integers.
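The same verification can be done for the decimal example. Again, this is illustrative rather than the engine's actual code:

```sql
-- A sketch of the decimal transformations above: scale the DECIMAL(4,2) values
-- to integers, divide by the common factor (5), and rebase against the minimum (70).
SELECT v.Value,
       CAST(v.Value * 100 AS INT) AS ScaledToInteger,
       CAST(v.Value * 100 AS INT) / 5 AS DividedByFive,
       CAST(v.Value * 100 AS INT) / 5 - 70 AS RebasedValue
FROM (VALUES (15.2), (8.8), (4.3), (3.5), (6.25)) AS v(Value);
```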
Dictionary Encoding
Sample Data Set Containing VARCHAR(50) String Data
Values | Original Size in Bytes |
---|---|
Taco | 4 |
Hamburger | 9 |
Fish Fry | 8 |
Taco | 4 |
Hamburger | 9 |
Taco | 4 |
Triple-Layer Chocolate Cake | 27 |
Triple-Layer Chocolate Cake | 27 |
Triple-Layer Chocolate Cake | 27 |
Cup of Coffee | 13 |
Example of a Dictionary That Is Generated from VARCHAR Data
Index ID | VARCHAR Value |
---|---|
0 | Fish Fry |
1 | Hamburger |
2 | Taco |
3 | Triple-Layer Chocolate Cake |
4 | Cup of Coffee |
The original data required 132 bytes in storage, whereas the encoded data required only 30 bits (3 bits per value), plus the space required to store the lookup values in the dictionary. The dictionary itself consumes space for each distinct value in the segment data, but as a bonus, a dictionary may be shared by multiple rowgroups. Therefore, a table with 100 rowgroups may reuse dictionaries for each column across all of those rowgroups.
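A rough way to visualize this encoding in T-SQL is to assign each distinct string a dense integer index. This is a hypothetical sketch; SQL Server's actual index assignment logic differs:

```sql
-- A hypothetical sketch of dictionary encoding: each distinct string receives a
-- compact integer index that is stored in place of the full value. Index
-- assignment here is alphabetical for simplicity.
SELECT v.Value,
       DENSE_RANK() OVER (ORDER BY v.Value) - 1 AS DictionaryIndex
FROM (VALUES ('Taco'), ('Hamburger'), ('Fish Fry'), ('Taco'), ('Hamburger'),
             ('Taco'), ('Triple-Layer Chocolate Cake'),
             ('Triple-Layer Chocolate Cake'), ('Triple-Layer Chocolate Cake'),
             ('Cup of Coffee')) AS v(Value);
```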
The key to dictionary compression is cardinality. The fewer distinct values present in the column, the more efficiently the data will compress. A CHAR(100) column with 5 distinct values will compress significantly better than a CHAR(10) column with 1,000 distinct values. The dictionary size can be roughly estimated as the product of the number of distinct values and their initial size. Lower cardinality means that there are fewer string values to store in the dictionary and that the index used to reference the dictionary will be smaller. Since the dictionary index IDs are repeated throughout rows in the table, a smaller size can greatly improve compression.
Query to Return Dictionary Metadata
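A query along these lines retrieves the relevant metadata from sys.column_store_dictionaries; the join shape and column list are a sketch:

```sql
-- A sketch of a dictionary metadata query built on sys.column_store_dictionaries.
SELECT OBJECT_NAME(p.object_id) AS table_name,
       c.name AS column_name,
       csd.dictionary_id,
       csd.entry_count,
       csd.on_disk_size
FROM sys.column_store_dictionaries csd
INNER JOIN sys.partitions p
    ON csd.hobt_id = p.hobt_id
INNER JOIN sys.columns c
    ON c.object_id = p.object_id
    AND c.column_id = csd.column_id
ORDER BY table_name, column_name;
```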
Dictionaries may be global or local. A local dictionary applies only to a single rowgroup (or, on rare occasions, to rowgroups compressed by the same process). A global dictionary is shared across many or all segments for a given column.
Dictionaries may also be created for numeric columns if SQL Server believes that doing so will provide better compression than value compression. As a result, some numeric columns will appear in this metadata and some will not. Typically, numeric columns with low cardinality will tend to use dictionary compression, whereas numeric columns with high cardinality will use value compression. In the preceding example, Sale Key, the primary key on the underlying data, is value compressed, whereas Bill To Customer Key (containing only three unique values) is dictionary compressed.
Additional metadata may be gleaned from sys.columns, if needed, such as the data type of the underlying data, its length, precision, and scale.
Query to Return More Detailed Dictionary Metadata
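The following sketch adds sys.types for column-level detail and labels each dictionary as global or local; a dictionary_id of 0 denotes the global dictionary:

```sql
-- A sketch of a more detailed dictionary metadata query, distinguishing global
-- dictionaries (dictionary_id = 0) from local dictionaries.
SELECT OBJECT_NAME(p.object_id) AS table_name,
       c.name AS column_name,
       t.name AS data_type,
       c.max_length,
       c.precision,
       c.scale,
       CASE WHEN csd.dictionary_id = 0 THEN 'Global' ELSE 'Local' END
           AS dictionary_scope,
       csd.entry_count,
       csd.on_disk_size
FROM sys.column_store_dictionaries csd
INNER JOIN sys.partitions p
    ON csd.hobt_id = p.hobt_id
INNER JOIN sys.columns c
    ON c.object_id = p.object_id
    AND c.column_id = csd.column_id
INNER JOIN sys.types t
    ON t.user_type_id = c.user_type_id
ORDER BY table_name, column_name;
```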
This detail differentiates between local and global dictionaries, as well as adds some column-level detail. Based on the results, the integer column [Bill To Customer Key] compresses quite well, as it contains only three distinct values in its dictionary. Sixty-eight bytes for a dictionary is tiny and will provide immense savings for a column that previously consumed 4 bytes per row across 25 million rows. Each value is reduced from 4 bytes to 2 bits, allowing it to be compressed at a ratio of 16:1!
Dictionaries are limited in size to 16 megabytes. Two scenarios can push a dictionary toward this limit:

1. Very long columns may force dictionary splits and result in undersized rowgroups. This will reduce compression efficiency and increase the number of dictionaries required to service the table.
2. Columns with very high cardinality may also approach this limit and result in dictionary splits that lead to the creation of undersized rowgroups.
Numeric columns with high cardinality will typically use value compression to avoid this problem, whereas string columns that are long and/or have a high cardinality may not be able to avoid this issue. Sixteen megabytes may sound small when compared to massive OLAP tables, but dictionaries themselves are compressed and typical analytic data will not approach the dictionary size limit. Testing should always be used to confirm either of these scenarios, rather than assuming they will or will not happen.
Should String Data Be Normalized?
Before advancing to the remainder of the compression process, a question that arises from the efficiency of dictionary compression is whether normalization of dimensions is necessary or useful for columnstore data that uses dictionary compression.
In the world of transactional data, dimensions are often normalized. This saves space and memory and allows relational integrity, uniqueness, and other constraints to be enforced at runtime. In analytic data, though, data validation is typically performed during data load processes, rather than via triggers, keys, or check constraints. As a result, normalization is used more to save resources or to provide convenient lookup tables for use by visualization or analytics applications.
Table Structure for Fact.Sale_CCI with a Normalized Description
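A simplified sketch of what such a normalization might look like; table, column, and constraint names are assumed for illustration:

```sql
-- A hypothetical sketch: the Description column normalized out of the fact
-- table into a small dimension table referenced by a SMALLINT key.
CREATE TABLE Fact.Description_Dim
(
    [Description Key] SMALLINT NOT NULL
        CONSTRAINT PK_Description_Dim PRIMARY KEY,
    [Description] VARCHAR(100) NOT NULL
);

-- In Fact.Sale_CCI, the original column:
--     [Description] VARCHAR(100) NOT NULL
-- is replaced by the far narrower:
--     [Description Key] SMALLINT NOT NULL
```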
The original table stores the full Description string inline, whereas the normalized version replaces it with a SMALLINT key. In this specific example, normalization allowed for significant space savings: roughly 15:1 compared to the original table!
Generally speaking, normalization helps most when the number of distinct values is low and the length of the column is high, as dictionary encoding for a SMALLINT column produces superior compression to dictionary encoding for a large VARCHAR column.
The downside to normalization is that data load processes will take longer, as joining to additional dimension tables requires time and resources. The decision of whether to normalize should be carefully considered, and testing conducted to determine if sufficient gains are realized to justify the change.
Ultimately, business logic and organizational needs will drive this decision at a high level, whereas technical considerations will allow for further tuning as needed. There is no right or wrong answer to this question; therefore, consider either approach valid until hands-on testing proves otherwise.
Row Order (Vertipaq) Optimization
Because a columnstore index is organized by column, rather than by row, it is possible for SQL Server to rearrange the order of rows within each rowgroup at the time each segment is compressed. This optimization is used whenever possible and attempts to arrange rows such that like values are adjacent to each other.
Columnstore compression is highly sensitive to repeated sequences of values; therefore, this optimization can provide exceptional compression and performance improvements. Vertipaq optimization is the algorithm that seeks to rearrange rows into the order that results in the most effective compression of the underlying data. This algorithm is used in other Microsoft storage technologies, such as SQL Server Analysis Services, PowerPivot, and Power BI.
Vertipaq optimization is an expensive process, and the more columns a columnstore index has, the longer it takes to find the optimal row order. Therefore, SQL Server will limit the search time to ensure that building the index doesn’t take an excessive amount of time. As a result, Vertipaq optimization will be somewhat less effective on wider tables. This fact should not affect database architecture decisions, but can be a useful troubleshooting step if a columnstore index on a wide table begins to experience suboptimal compression as more columns continue to be added.
There are two scenarios in which Vertipaq optimization is not used:

1. When the tuple mover merges rows from the delta store into a clustered columnstore index that has at least one nonclustered rowstore index
2. For columnstore indexes that are built on memory-optimized tables
This does not mean that nonclustered rowstore indexes should never be used on clustered columnstore indexes, nor that memory-optimized columnstore indexes should be avoided. Instead, the fact that Vertipaq optimization will not operate on these structures should be an additional input into the architectural decision-making process. If a nonclustered rowstore index is required to service a specific set of critical queries, then it should be used. If its use is optional, then consider the negative impact it may have on compression.
Query That Returns Whether or Not Vertipaq Optimization Is Used
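A query along these lines surfaces the flag via sys.dm_db_column_store_row_group_physical_stats; the table name Fact.Sale_CCI is assumed:

```sql
-- A sketch: checking the Vertipaq optimization flag for each rowgroup.
SELECT OBJECT_NAME(object_id) AS table_name,
       index_id,
       row_group_id,
       has_vertipaq_optimization,
       total_rows
FROM sys.dm_db_column_store_row_group_physical_stats
WHERE object_id = OBJECT_ID('Fact.Sale_CCI')
ORDER BY row_group_id;
```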
In this scenario, Vertipaq optimization is in use for all rowgroups in this columnstore index.
Other Compression Algorithms
After segments have been encoded and rowgroups have been reordered, the final step in columnstore compression is to apply any remaining compression algorithms to the resulting data. At this point, it is possible that no further compression gains are possible for a given segment. This will be most often true with values that do not repeat. The details of how columnstore indexes compress data are provided here to assist in more advanced optimization of compression and performance, when possible.
One form of compression used by columnstore indexes is run-length encoding, which groups like values together and stores each value with a count, rather than listing repetitive values individually in a segment. Vertipaq optimization is a prerequisite for this step; without it, run-length encoding will be significantly less effective.
Each value is paired with a count, shown in parentheses after it. There is a single instance of the value 0, followed by two instances of 1, three instances of 2, three instances of 3, and one instance of 4. This algorithm thrives on repetitive data and works exceptionally well in conjunction with dictionary encoding and Vertipaq optimization.
While this may seem like a complicated transformation, the resulting ones and zeros are quite compact and provide a structure that can be quickly decompressed.
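A toy illustration of those counts, hedged as a sketch rather than the engine's actual storage format:

```sql
-- A toy run-length encoding sketch: group adjacent identical values and store
-- each value once with its count. The VALUES list mirrors the sequence
-- described above, producing 0(1), 1(2), 2(3), 3(3), 4(1).
SELECT v.Value,
       COUNT(*) AS RunLength
FROM (VALUES (0), (1), (1), (2), (2), (2), (3), (3), (3), (4)) AS v(Value)
GROUP BY v.Value
ORDER BY v.Value;
```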
There are a handful of other more complex compression algorithms that may be used to progressively shrink the size of each segment of data. The algorithms used can vary by segment and may be used often, sometimes, or not at all. When this process is complete, the final level of compression is applied by SQL Server, which utilizes its xVelocity compression algorithm.
Simple Analytic Query to Demonstrate Columnstore IO
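A representative query of this kind is sketched below; the column names are assumed:

```sql
-- A sketch of a simple analytic aggregation. With STATISTICS IO enabled, the
-- Messages tab reports LOB logical reads for the columnstore segments read.
SET STATISTICS IO ON;

SELECT [Bill To Customer Key],
       COUNT(*) AS sale_count
FROM Fact.Sale_CCI
GROUP BY [Bill To Customer Key];
```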
There is quite a bit of information returned, but for the moment, the LOB logical reads provide an indication of the reads required to satisfy this query. Columnstore indexes do not report reads as logical reads, but instead as LOB logical reads, which indicates that their segments are stored as large objects rather than as traditional SQL Server data structures. If this number of reads seems high, that is because it is! Later chapters in this book will provide additional optimizations that reduce storage and memory footprint, as well as greatly improve query performance.
Columnstore Archive Compression
Archive compression is intended for scenarios such as:

- Cold or warm storage
- Data that is rarely accessed
- Data used for applications where performance is not critical
Columnstore archive compression uses the Microsoft Xpress compression algorithm. While this algorithm is complex, it is publicly documented and is not a Microsoft trade secret. This algorithm relies on compressing data with repeated byte sequences and combines the LZ77 and Huffman compression algorithms to further shrink the size of columnstore data. The details of these compression algorithms are too extensive for the pages of this book, but are publicly available to anyone interested.
The type of compression used in a columnstore index may be specified on a partition-by-partition basis, allowing older/less used data to effectively use archive compression, whereas newer data can use standard columnstore compression. Chapter 11 delves into how partitioning can be combined with columnstore indexes to improve data storage, maintenance, and performance.
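For example, a rebuild along these lines applies archive compression to older partitions only; the partition numbers are illustrative:

```sql
-- A sketch: archive compression on older partitions, standard columnstore
-- compression on the current one. Partition numbers are assumed.
ALTER TABLE Fact.Sale_CCI REBUILD PARTITION = ALL
WITH (DATA_COMPRESSION = COLUMNSTORE_ARCHIVE ON PARTITIONS (1 TO 3),
      DATA_COMPRESSION = COLUMNSTORE ON PARTITIONS (4));
```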
Table Created with Columnstore Archive Compression
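A minimal sketch of creating such a table follows; the structure is abbreviated and the names are assumed:

```sql
-- A minimal sketch: a table whose clustered columnstore index is created with
-- archive compression.
CREATE TABLE Fact.Sale_CCI_Archive
(
    [Sale Key] BIGINT NOT NULL,
    [Bill To Customer Key] INT NOT NULL,
    [Invoice Date Key] DATE NOT NULL,
    Quantity INT NOT NULL
);

CREATE CLUSTERED COLUMNSTORE INDEX CCI_Sale_CCI_Archive
ON Fact.Sale_CCI_Archive
WITH (DATA_COMPRESSION = COLUMNSTORE_ARCHIVE);
```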
Columnstore archive compression provided an immense space savings when compared to standard columnstore compression. This may seem absurdly impressive, but some of this savings can be achieved by optimizing data order, which will be discussed in detail in Chapter 10.
In practice with typical production data loads, archive compression will deliver an additional 15–30% savings over standard columnstore compression. For a large analytic table, this can equate to a significant reduction in its data footprint! In general, only consider archive compression for data that will not be written to or read often, as the cost to compress and decompress it is nontrivial. For data that is rarely accessed, it is an easy way to save on both storage and memory resources.
The Compression Life Cycle
A key to all compression within SQL Server is that compressed pages remain compressed until needed by a query. Pages are compressed in storage and remain compressed there for the lifetime of an index. When pages are read into memory, they remain compressed. Therefore, any storage space saved by the use of compression translates into memory savings when the data is needed.
This fact is critical to the efficiency of columnstore indexes. Compressed segments remain compressed on disk as well as in memory and are not decompressed until a query requires their data. Any optimizations that further decrease the size of compressed segments will translate directly into memory savings. The remainder of this book will focus on optimizations, best practices, and features that allow columnstore indexes to be used as efficiently as possible, thus making the most of their compression and ensuring they can scale for any data, regardless of how big it gets.
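As a closing illustration, the columnstore object pool tracks compressed structures resident in memory; a sketch of how to summarize it:

```sql
-- A sketch: compressed columnstore structures currently resident in memory,
-- summarized from the columnstore object pool.
SELECT OBJECT_NAME(object_id) AS table_name,
       object_type_desc,        -- segments, dictionaries, delete bitmaps, etc.
       COUNT(*) AS object_count,
       SUM(memory_used_in_bytes) / 1048576 AS memory_used_mb
FROM sys.dm_column_store_object_pool
GROUP BY object_id, object_type_desc
ORDER BY memory_used_mb DESC;
```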