There are many features that allow columnstore indexes to perform exceptionally well for analytic workloads. Of those features, compression is the most significant driver in both performance and resource consumption. Understanding how SQL Server implements compression in columnstore indexes and how different algorithms are used to shrink the size of this data allows for optimal architecture and implementation of analytic data storage in SQL Server.
Basics of Columnstore Compression
Data written to segments within a columnstore index is heavily compressed prior to storage. The delta store buffers incoming writes to a columnstore index, ensuring that segments are not written so frequently that server performance suffers.
Analytic data tends to be highly repetitive, often consisting of values such as:

- Entity types
- Money
- Dates
- Counts
- Repetitive dimensions
Eleven rows are represented, and each column contains repeated values. The Date column contains only three distinct values, whereas Status contains four and Hours contains six. This repetition tends to scale well: if this were the data from a rowgroup containing one million rows, it would be reasonable to expect a column like Status to contain only a small number of values that are repeated frequently throughout the data set. The more repetitive the data is, the more effectively it will compress.
Different data types will have vastly different values. For example, the contents of a string column are not easily comparable to those of a money column.
Columns representing different dimensions will have different ranges of values that correspond to their typical usage. For example, a decimal column containing the tax rates charged on new cars would contain a very different set of common values than a decimal column containing the cost of those cars.
Columnstore indexes align perfectly with these conventions, whereas rowstore indexes struggle because row and page compression must account for every column in a row or page, spanning a wide variety of data types and dimensions.
Note that the delta store is not columnstore compressed. This data structure needs to be written to as quickly as possible at runtime, and therefore forgoing expensive compression removes the time and overhead required to process writes against it. This means that the delta store takes up more space than it would if columnstore compression were used. Since the delta store tends to be small in comparison to the compressed columnstore index contents and to contain a predictable number of rows, the effort needed to maintain it is constant and does not grow over time. Depending on SQL Server version, delta rowgroups may use row or page compression to limit their storage and memory footprint.
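For reference, delta store rowgroups can be observed alongside compressed rowgroups via the rowgroup physical stats DMV. A minimal sketch, assuming the table name Fact.Sale_CCI used later in this chapter:

```sql
-- A minimal sketch: viewing delta store rowgroups (OPEN/CLOSED) alongside
-- compressed rowgroups. The table name Fact.Sale_CCI is assumed for illustration.
SELECT OBJECT_NAME(object_id) AS table_name,
       row_group_id,
       state_desc,   -- OPEN/CLOSED = delta store; COMPRESSED = columnstore segments
       total_rows
FROM sys.dm_db_column_store_row_group_physical_stats
WHERE object_id = OBJECT_ID('Fact.Sale_CCI')
ORDER BY row_group_id;
```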
Columnstore Compression Algorithms
Columnstore indexes do not repeatedly store strings, decimals, or dates as-is in segments. Instead, SQL Server makes use of dictionary encoding and value encoding to shrink and compress data, even if the raw values are large.
At a high level, the compression process consists of three steps:

1. Encode all values prior to compression.
2. Calculate the optimal row order (aka Vertipaq optimization).
3. Individually compress each segment.
Encoding is an important step in the compression process. The following sections explain the different encoding algorithms, how they are used, and how they impact overall compression effectiveness.
Value Encoding
Sample Integer Column Data to Be Compressed
Values | Minimum Size in Bits |
---|---|
60 | 6 |
150 | 8 |
90 | 7 |
9000 | 14 |
630 | 10 |
300000 | 19 |
Integer Data Compressed Using a Negative Exponent
Values | Value * 10^-1 | Minimum Size in Bits |
---|---|---|
60 | 6 | 3 |
150 | 15 | 4 |
90 | 9 | 4 |
9000 | 900 | 10 |
630 | 63 | 6 |
300000 | 30000 | 15 |
Integer Data Compressed by Rebasing
Values | Value * 10^-1 | Value * 10^-1 - 6 | Minimum Size in Bits |
---|---|---|---|
60 | 6 | 0 | 0 |
150 | 15 | 9 | 4 |
90 | 9 | 3 | 2 |
9000 | 900 | 894 | 10 |
630 | 63 | 57 | 6 |
300000 | 30000 | 29994 | 15 |
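While the actual encoding happens inside the engine, the arithmetic above can be verified with a simple query. A sketch, not SQL Server's internal implementation:

```sql
-- A sketch reproducing the rebasing math above: scale each value by 10^-1,
-- subtract the minimum scaled value (6), and compute the minimum number of
-- bits needed to store the result.
SELECT v.Value,
       v.Value / 10 AS ScaledValue,          -- Value * 10^-1
       v.Value / 10 - 6 AS RebasedValue,     -- rebase against the minimum
       CASE WHEN v.Value / 10 - 6 = 0 THEN 0
            ELSE FLOOR(LOG(v.Value / 10 - 6, 2)) + 1
       END AS MinimumSizeInBits
FROM (VALUES (60), (150), (90), (9000), (630), (300000)) AS v(Value);
```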
Identical Integer Data Compressed by Rebasing
Values | Value - 1700000 | Original Size in Bits | Reduced Size in Bits |
---|---|---|---|
1700000 | 0 | 21 | 0 |
1700000 | 0 | 21 | 0 |
1700000 | 0 | 21 | 0 |
1700000 | 0 | 21 | 0 |
1700000 | 0 | 21 | 0 |
While this example may seem extreme, it is not that unusual, as dimensions often contain repeated values that fall within limited ranges.
Note that real-world transformations of data that occur in columnstore indexes are more complex than this, but the process to determine their details is essentially the same.
Value Compression for Decimals
Values | Value * 100 | Value / 5 | Value / 5 - 70 | Original Size in Bits | Reduced Size in Bits |
---|---|---|---|---|---|
15.2 | 1520 | 304 | 234 | 40 | 8 |
8.8 | 880 | 176 | 106 | 40 | 7 |
4.3 | 430 | 86 | 16 | 40 | 5 |
3.5 | 350 | 70 | 0 | 40 | 0 |
6.25 | 625 | 125 | 55 | 40 | 6 |
The original data is stored as DECIMAL(4,2), which consumes 5 bytes (40 bits) per value. By converting to integers prior to compression, multiple transformations are applied that reduce storage from 200 bits to 26 bits. Decimals may not seem like the most likely candidates for effective compression, but once they are converted into integer values, they can take advantage of any of the transformations that can be applied to integers.
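The same verification can be done for the decimal example. Again, this is illustrative rather than the engine's actual code:

```sql
-- A sketch of the decimal transformations above: scale the DECIMAL(4,2) values
-- to integers, divide by the common factor (5), and rebase against the minimum (70).
SELECT v.Value,
       CAST(v.Value * 100 AS INT) AS ScaledToInteger,
       CAST(v.Value * 100 AS INT) / 5 AS DividedByFive,
       CAST(v.Value * 100 AS INT) / 5 - 70 AS RebasedValue
FROM (VALUES (15.2), (8.8), (4.3), (3.5), (6.25)) AS v(Value);
```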
Dictionary Encoding
Sample Data Set Containing VARCHAR(50) String Data
Values | Original Size in Bytes |
---|---|
Taco | 4 |
Hamburger | 9 |
Fish Fry | 8 |
Taco | 4 |
Hamburger | 9 |
Taco | 4 |
Triple-Layer Chocolate Cake | 27 |
Triple-Layer Chocolate Cake | 27 |
Triple-Layer Chocolate Cake | 27 |
Cup of Coffee | 13 |
Example of a Dictionary That Is Generated from VARCHAR Data
Index ID | VARCHAR Value |
---|---|
0 | Fish Fry |
1 | Hamburger |
2 | Taco |
3 | Triple-Layer Chocolate Cake |
4 | Cup of Coffee |
The original data required 132 bytes in storage, whereas the encoded data required only 30 bits (3 bits per value), plus the space required to store the lookup values in the dictionary. The dictionary itself consumes space for each distinct value in the segment data, but as a bonus, a dictionary may be shared by multiple rowgroups. Therefore, a table with 100 rowgroups may reuse dictionaries for each column across all of those rowgroups.
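A rough way to visualize this encoding in T-SQL is to assign each distinct string a dense integer index. This is a hypothetical sketch; SQL Server's actual index assignment logic differs:

```sql
-- A hypothetical sketch of dictionary encoding: each distinct string receives a
-- compact integer index that is stored in place of the full value. Index
-- assignment here is alphabetical for simplicity.
SELECT v.Value,
       DENSE_RANK() OVER (ORDER BY v.Value) - 1 AS DictionaryIndex
FROM (VALUES ('Taco'), ('Hamburger'), ('Fish Fry'), ('Taco'), ('Hamburger'),
             ('Taco'), ('Triple-Layer Chocolate Cake'),
             ('Triple-Layer Chocolate Cake'), ('Triple-Layer Chocolate Cake'),
             ('Cup of Coffee')) AS v(Value);
```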
The key to dictionary compression is cardinality. The fewer distinct values present in the column, the more efficiently the data will compress. A CHAR(100) column with 5 distinct values will compress significantly better than a CHAR(10) column with 1,000 distinct values. The dictionary size can be roughly estimated as the product of the number of distinct values and their initial size. Lower cardinality means that there are fewer string values to store in the dictionary and that the index used to reference the dictionary will be smaller. Since the dictionary index IDs are repeated throughout rows in the table, a smaller size can greatly improve compression.
Query to Return Dictionary Metadata
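A query along these lines retrieves the relevant metadata from sys.column_store_dictionaries; the join shape and column list are a sketch:

```sql
-- A sketch of a dictionary metadata query built on sys.column_store_dictionaries.
SELECT OBJECT_NAME(p.object_id) AS table_name,
       c.name AS column_name,
       csd.dictionary_id,
       csd.entry_count,
       csd.on_disk_size
FROM sys.column_store_dictionaries csd
INNER JOIN sys.partitions p
    ON csd.hobt_id = p.hobt_id
INNER JOIN sys.columns c
    ON c.object_id = p.object_id
    AND c.column_id = csd.column_id
ORDER BY table_name, column_name;
```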
Dictionaries may be global or local. A local dictionary applies only to a single rowgroup (or, on rare occasions, to rowgroups compressed by the same process). A global dictionary is shared across many or all segments for a given column.
Dictionaries may also be created for numeric columns if SQL Server believes that doing so will provide better compression than value compression. As a result, some numeric columns will appear in this metadata and some will not. Typically, numeric columns with low cardinality will tend to use dictionary compression, whereas numeric columns with high cardinality will use value compression. In the preceding example, Sale Key, the primary key on the underlying data, is value compressed, whereas Bill To Customer Key (containing only three unique values) is dictionary compressed.
Additional metadata may be gleaned from sys.columns, if needed, such as the data type of the underlying data, its length, precision, and scale.
Query to Return More Detailed Dictionary Metadata
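The following sketch adds sys.types for column-level detail and labels each dictionary as global or local; a dictionary_id of 0 denotes the global dictionary:

```sql
-- A sketch of a more detailed dictionary metadata query, distinguishing global
-- dictionaries (dictionary_id = 0) from local dictionaries.
SELECT OBJECT_NAME(p.object_id) AS table_name,
       c.name AS column_name,
       t.name AS data_type,
       c.max_length,
       c.precision,
       c.scale,
       CASE WHEN csd.dictionary_id = 0 THEN 'Global' ELSE 'Local' END
           AS dictionary_scope,
       csd.entry_count,
       csd.on_disk_size
FROM sys.column_store_dictionaries csd
INNER JOIN sys.partitions p
    ON csd.hobt_id = p.hobt_id
INNER JOIN sys.columns c
    ON c.object_id = p.object_id
    AND c.column_id = csd.column_id
INNER JOIN sys.types t
    ON t.user_type_id = c.user_type_id
ORDER BY table_name, column_name;
```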
This detail differentiates between local and global dictionaries, as well as adds some column-level detail. Based on the results, the integer column [Bill To Customer Key] compresses quite well, as it contains only three distinct values in its dictionary. Sixty-eight bytes for a dictionary is tiny and will provide immense savings for a column that previously consumed 4 bytes per row across 25 million rows. Each value is reduced from 4 bytes to 2 bits, allowing it to be compressed at a ratio of 16:1!
Dictionaries are limited in size to 16 megabytes. Two scenarios can push a dictionary toward this limit:

1. Very long columns may force dictionary splits and result in undersized rowgroups. This will reduce compression efficiency and increase the number of dictionaries required to service the table.
2. Columns with very high cardinality may also approach this limit and result in dictionary splits that lead to the creation of undersized rowgroups.
Numeric columns with high cardinality will typically use value compression to avoid this problem, whereas string columns that are long and/or have a high cardinality may not be able to avoid this issue. Sixteen megabytes may sound small when compared to massive OLAP tables, but dictionaries themselves are compressed and typical analytic data will not approach the dictionary size limit. Testing should always be used to confirm either of these scenarios, rather than assuming they will or will not happen.
Should String Data Be Normalized?
Before advancing to the remainder of the compression process, a question that arises from the efficiency of dictionary compression is whether normalization of dimensions is necessary or useful for columnstore data that uses dictionary compression.
In the world of transactional data, dimensions are often normalized. This saves space and memory and allows relational integrity, uniqueness, and other constraints to be enforced at runtime. In analytic data, though, data validation is typically performed during data load processes, rather than via triggers, keys, or check constraints. As a result, normalization is used more to save resources or to provide convenient lookup tables for use by visualization or analytics applications.
Table Structure for Fact.Sale_CCI with a Normalized Description
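A simplified sketch of what such a normalization might look like; table, column, and constraint names are assumed for illustration:

```sql
-- A hypothetical sketch: the Description column normalized out of the fact
-- table into a small dimension table referenced by a SMALLINT key.
CREATE TABLE Fact.Description_Dim
(
    [Description Key] SMALLINT NOT NULL
        CONSTRAINT PK_Description_Dim PRIMARY KEY,
    [Description] VARCHAR(100) NOT NULL
);

-- In Fact.Sale_CCI, the original column:
--     [Description] VARCHAR(100) NOT NULL
-- is replaced by the far narrower:
--     [Description Key] SMALLINT NOT NULL
```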
The original table stores the full Description string inline, whereas the normalized version replaces it with a SMALLINT key. In this specific example, normalization allowed for significant space savings: roughly 15:1 compared to the original table!
Generally speaking, normalization helps most when the number of distinct values is low and the length of the column is high, as dictionary encoding for a SMALLINT column produces superior compression to dictionary encoding for a large VARCHAR column.
The downside to normalization is that data load processes will take longer, as joining to additional dimension tables requires time and resources. The decision of whether to normalize should be carefully considered, and testing conducted to determine if sufficient gains are realized to justify the change.
Ultimately, business logic and organizational needs will drive this decision at a high level, whereas technical considerations will allow for further tuning as needed. There is no right or wrong answer to this question; therefore, consider either approach valid until hands-on testing proves otherwise.
Row Order (Vertipaq) Optimization
Because a columnstore index is organized by column, rather than by row, it is possible for SQL Server to rearrange the order of rows within each rowgroup at the time each segment is compressed. This optimization is used whenever possible and attempts to arrange rows such that like values are adjacent to each other.
Columnstore compression is highly sensitive to repeated sequences of values; therefore, this optimization can provide exceptional compression and performance improvements. Vertipaq optimization is the algorithm that seeks to rearrange rows into the order that results in the most effective compression of the underlying data. This algorithm is used in other Microsoft storage technologies, such as SQL Server Analysis Services, PowerPivot, and Power BI.
Vertipaq optimization is an expensive process, and the more columns a columnstore index has, the longer it takes to find the optimal row order. Therefore, SQL Server will limit the search time to ensure that building the index doesn’t take an excessive amount of time. As a result, Vertipaq optimization will be somewhat less effective on wider tables. This fact should not affect database architecture decisions, but can be a useful troubleshooting step if a columnstore index on a wide table begins to experience suboptimal compression as more columns continue to be added.
There are two scenarios in which Vertipaq optimization is not used:

1. When the tuple mover merges rows from the delta store into a clustered columnstore index that has at least one nonclustered rowstore index
2. For columnstore indexes that are built on memory-optimized tables
This does not mean that nonclustered rowstore indexes should never be used on clustered columnstore indexes, nor that memory-optimized columnstore indexes should be avoided. Instead, the fact that Vertipaq optimization will not operate on these structures should be an additional input into the architectural decision-making process. If a nonclustered rowstore index is required to service a specific set of critical queries, then it should be used. If its use is optional, then consider the negative impact it may have on compression.
Query That Returns Whether or Not Vertipaq Optimization Is Used
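A query along these lines surfaces the flag via sys.dm_db_column_store_row_group_physical_stats; the table name Fact.Sale_CCI is assumed:

```sql
-- A sketch: checking the Vertipaq optimization flag for each rowgroup.
SELECT OBJECT_NAME(object_id) AS table_name,
       index_id,
       row_group_id,
       has_vertipaq_optimization,
       total_rows
FROM sys.dm_db_column_store_row_group_physical_stats
WHERE object_id = OBJECT_ID('Fact.Sale_CCI')
ORDER BY row_group_id;
```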
In this scenario, Vertipaq optimization is in use for all rowgroups in this columnstore index.
Other Compression Algorithms
After segments have been encoded and rowgroups have been reordered, the final step in columnstore compression is to apply any remaining compression algorithms to the resulting data. At this point, it is possible that no further compression gains are possible for a given segment. This will be most often true with values that do not repeat. The details of how columnstore indexes compress data are provided here to assist in more advanced optimization of compression and performance, when possible.
One form of compression used by columnstore indexes is run-length encoding, which groups like values together and stores each value with a count, rather than listing repetitive values individually in a segment. Vertipaq optimization is a prerequisite for this step; without it, run-length encoding will be significantly less effective.
Each value is paired with a count, shown in parentheses after it. There is a single instance of the value 0, followed by two instances of 1, three instances of 2, three instances of 3, and one instance of 4. This algorithm thrives on repetitive data and works exceptionally well in conjunction with dictionary encoding and Vertipaq optimization.
While this may seem like a complicated transformation, the resulting ones and zeros are quite compact and provide a structure that can be quickly decompressed.
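A toy illustration of those counts, hedged as a sketch rather than the engine's actual storage format:

```sql
-- A toy run-length encoding sketch: group adjacent identical values and store
-- each value once with its count. The VALUES list mirrors the sequence
-- described above, producing 0(1), 1(2), 2(3), 3(3), 4(1).
SELECT v.Value,
       COUNT(*) AS RunLength
FROM (VALUES (0), (1), (1), (2), (2), (2), (3), (3), (3), (4)) AS v(Value)
GROUP BY v.Value
ORDER BY v.Value;
```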
There are a handful of other more complex compression algorithms that may be used to progressively shrink the size of each segment of data. The algorithms used can vary by segment and may be used often, sometimes, or not at all. When this process is complete, the final level of compression is applied by SQL Server, which utilizes its xVelocity compression algorithm.
Simple Analytic Query to Demonstrate Columnstore IO
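A representative query of this kind is sketched below; the column names are assumed:

```sql
-- A sketch of a simple analytic aggregation. With STATISTICS IO enabled, the
-- Messages tab reports LOB logical reads for the columnstore segments read.
SET STATISTICS IO ON;

SELECT [Bill To Customer Key],
       COUNT(*) AS sale_count
FROM Fact.Sale_CCI
GROUP BY [Bill To Customer Key];
```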
There is quite a bit of information returned, but for the moment, the LOB logical reads provide an indication of the reads required to satisfy this query. Columnstore indexes do not report reads as logical reads, but instead as LOB logical reads, which indicates that their segments are stored as large objects rather than as traditional SQL Server data structures. If this number of reads seems high, that is because it is! Later chapters in this book will provide additional optimizations that reduce storage and memory footprint, as well as greatly improve query performance.
Columnstore Archive Compression
Archive compression is intended for scenarios such as:

- Cold or warm storage
- Data that is rarely accessed
- Data used for applications where performance is not critical
Columnstore archive compression uses the Microsoft Xpress compression algorithm. While this algorithm is complex, it is publicly documented and is not a Microsoft trade secret. This algorithm relies on compressing data with repeated byte sequences and combines the LZ77 and Huffman compression algorithms to further shrink the size of columnstore data. The details of these compression algorithms are too extensive for the pages of this book, but are publicly available to anyone interested.
The type of compression used in a columnstore index may be specified on a partition-by-partition basis, allowing older/less used data to effectively use archive compression, whereas newer data can use standard columnstore compression. Chapter 11 delves into how partitioning can be combined with columnstore indexes to improve data storage, maintenance, and performance.
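For example, a rebuild along these lines applies archive compression to older partitions only; the partition numbers are illustrative:

```sql
-- A sketch: archive compression on older partitions, standard columnstore
-- compression on the current one. Partition numbers are assumed.
ALTER TABLE Fact.Sale_CCI REBUILD PARTITION = ALL
WITH (DATA_COMPRESSION = COLUMNSTORE_ARCHIVE ON PARTITIONS (1 TO 3),
      DATA_COMPRESSION = COLUMNSTORE ON PARTITIONS (4));
```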
Table Created with Columnstore Archive Compression
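A minimal sketch of creating such a table follows; the structure is abbreviated and the names are assumed:

```sql
-- A minimal sketch: a table whose clustered columnstore index is created with
-- archive compression.
CREATE TABLE Fact.Sale_CCI_Archive
(
    [Sale Key] BIGINT NOT NULL,
    [Bill To Customer Key] INT NOT NULL,
    [Invoice Date Key] DATE NOT NULL,
    Quantity INT NOT NULL
);

CREATE CLUSTERED COLUMNSTORE INDEX CCI_Sale_CCI_Archive
ON Fact.Sale_CCI_Archive
WITH (DATA_COMPRESSION = COLUMNSTORE_ARCHIVE);
```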
Columnstore archive compression provided an immense space savings when compared to standard columnstore compression. This may seem absurdly impressive, but some of this savings can be achieved by optimizing data order, which will be discussed in detail in Chapter 10.
In practice with typical production data loads, archive compression will deliver an additional 15–30% savings over standard columnstore compression. For a large analytic table, this can equate to a significant reduction in its data footprint! In general, only consider archive compression for data that will not be written to or read often, as the cost to compress and decompress it is nontrivial. For data that is rarely accessed, it is an easy way to save on both storage and memory resources.
The Compression Life Cycle
A key to all compression within SQL Server is that compressed pages remain compressed until needed by a query. Pages are compressed in storage and remain compressed there for the lifetime of an index. When pages are read into memory, they remain compressed. Therefore, any storage space saved by the use of compression translates into memory savings when the data is needed.
This fact is critical to the efficiency of columnstore indexes. Compressed segments remain compressed on disk as well as in memory and are not decompressed until a query requires their data. Any optimizations that further decrease the size of compressed segments will translate directly into memory savings. The remainder of this book will focus on optimizations, best practices, and features that allow columnstore indexes to be used as efficiently as possible, thus making the most of their compression and ensuring they can scale for any data, regardless of how big it gets.
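As a closing illustration, the columnstore object pool tracks compressed structures resident in memory; a sketch of how to summarize it:

```sql
-- A sketch: compressed columnstore structures currently resident in memory,
-- summarized from the columnstore object pool.
SELECT OBJECT_NAME(object_id) AS table_name,
       object_type_desc,        -- segments, dictionaries, delete bitmaps, etc.
       COUNT(*) AS object_count,
       SUM(memory_used_in_bytes) / 1048576 AS memory_used_mb
FROM sys.dm_column_store_object_pool
GROUP BY object_id, object_type_desc
ORDER BY memory_used_mb DESC;
```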