Columnar storage creation process

SQL Server creates columnar storage by first splitting the data into rowgroups. The maximum number of rows per rowgroup is 1,048,576. The idea is that the time-consuming row rearranging is done on smaller datasets, much like the hash join algorithm splits the data into buckets and then performs smaller joins on portions of the data. SQL Server rearranges the rows in each rowgroup separately. Then SQL Server encodes and compresses each column. The data of a single column in a single rowgroup is called a segment. SQL Server stores segments as blobs in the database files, so columnar storage reuses the existing storage infrastructure. The following figure shows the process:

Figure: How SQL Server creates columnar storage
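After a columnstore index is created, you can inspect the rowgroups this process produced through the sys.column_store_row_groups catalog view. The following query is a minimal sketch; the table name dbo.FactSales is a placeholder for any table with a columnstore index.

-- List the rowgroups of a (hypothetical) table dbo.FactSales,
-- including their state, row counts, and compressed size
SELECT rg.row_group_id,
       rg.state_description,
       rg.total_rows,
       rg.size_in_bytes
FROM sys.column_store_row_groups AS rg
WHERE rg.object_id = OBJECT_ID(N'dbo.FactSales')
ORDER BY rg.row_group_id;

Each compressed rowgroup should report at most 1,048,576 rows in the total_rows column.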

SQL Server implements several compression algorithms:

  • SQL Server performs bit-packing. Bit-packing is similar to row compression, just pushed one level further. Instead of storing the minimal number of bytes, SQL Server stores the minimal number of bits that can represent the value. For example, with row compression, the value 5 stored in a four-byte integer column takes only one byte. With bit-packing, SQL Server stores this value using only three bits (101).
  • Then SQL Server encodes the values to integers with value encoding and dictionary encoding. Value encoding is similar to prefix encoding in page compression, and dictionary encoding works the same way as it does in page compression. Therefore, this part of compression uses the ideas of page compression. However, dictionaries for columnar storage are much more efficient because they are built on more values than dictionaries in page compression. With page compression, you get a separate dictionary for each 8 KB page. With columnar storage, each column gets local dictionaries per rowgroup plus one global dictionary over all rowgroups (see the dictionary metadata query after this list).
  • Because the rows in each rowgroup are partially ordered, columns contain long runs of repeated values, so SQL Server also uses the run-length encoding (RLE) algorithm.
  • Finally, SQL Server can also use the LZ77 algorithm to compress columnar data.
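To see the local and global dictionaries at work, you can query the sys.column_store_dictionaries catalog view. The following sketch again assumes the hypothetical dbo.FactSales table; dictionary_id 0 denotes a column's global dictionary, while other IDs denote local dictionaries.

-- Show the dictionaries built for each column of dbo.FactSales (placeholder name)
SELECT p.partition_number,
       d.column_id,
       d.dictionary_id,    -- 0 = global dictionary, otherwise local
       d.entry_count,
       d.on_disk_size
FROM sys.column_store_dictionaries AS d
JOIN sys.partitions AS p
  ON p.hobt_id = d.hobt_id
WHERE p.object_id = OBJECT_ID(N'dbo.FactSales')
ORDER BY d.column_id, d.dictionary_id;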

All of these compression algorithms except LZ77 are applied automatically when you create a columnstore index. This is called COLUMNSTORE compression. You must turn on LZ77 compression manually to get the so-called COLUMNSTORE_ARCHIVE compression, as the following example shows.
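The following sketch shows both compression levels on the hypothetical dbo.FactSales table. Creating the clustered columnstore index applies COLUMNSTORE compression automatically; rebuilding with DATA_COMPRESSION = COLUMNSTORE_ARCHIVE adds the LZ77 algorithm on top. The index name CCI_FactSales is also a placeholder.

-- Create a clustered columnstore index; COLUMNSTORE compression is applied automatically
CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactSales
ON dbo.FactSales;

-- Rebuild with archive compression to add the LZ77 algorithm
ALTER INDEX CCI_FactSales ON dbo.FactSales
REBUILD WITH (DATA_COMPRESSION = COLUMNSTORE_ARCHIVE);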

With all of these compression algorithms applied, you can count on a compression ratio of at least 10:1 compared to the original, uncompressed row storage. In practice, you can achieve much better ratios, especially when you also implement archive compression with the LZ77 algorithm.
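A simple way to check the ratio you achieve is to compare the space a table uses before and after you create the columnstore index, for example with the sp_spaceused system procedure (the table name is again a placeholder):

-- Report reserved and data space for the (hypothetical) table; run before and
-- after creating the columnstore index and compare the results
EXEC sys.sp_spaceused N'dbo.FactSales';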

However, compression is not the only advantage columnar storage offers for large scans. Because each column is stored separately, SQL Server can retrieve only the columns a query references. This is similar to having a covering nonclustered index. Each segment also has additional metadata associated with it, including the minimum and maximum value in the segment. The query optimizer can use this metadata for early segment elimination, just as SQL Server can perform early partition elimination when a table is partitioned. Finally, you can combine partitioning with columnstore indexes to maintain even very large tables.
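You can inspect this per-segment metadata through the sys.column_store_segments catalog view. Note that min_data_id and max_data_id hold encoded value IDs rather than raw column values; the table name is once more a placeholder.

-- Show per-segment row counts and the min/max metadata used for segment elimination
SELECT p.partition_number,
       s.column_id,
       s.segment_id,
       s.row_count,
       s.min_data_id,   -- encoded minimum value in the segment
       s.max_data_id    -- encoded maximum value in the segment
FROM sys.column_store_segments AS s
JOIN sys.partitions AS p
  ON p.hobt_id = s.hobt_id
WHERE p.object_id = OBJECT_ID(N'dbo.FactSales')
ORDER BY s.column_id, s.segment_id;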
