Columnar storage and compression

Storing the data column by column instead of row by row gives you the opportunity to store each column in a sorted way. Imagine that you have every column totally sorted. Then for every equijoin, the merge join algorithm could be used, which is, as you already know, a very efficient algorithm. In addition, with sorted data, you get one more type of compression for free—run-length encoding (RLE) compression.

The following figure explains the idea graphically:

Sorted columnar storage and RLE

An RDBMS in the first step reorders every single column. Then RLE compression is implemented. For example, if you look at the Color column, you don't need to repeat the value Red three times; you can store it only once and store either the frequency of the value or the index in the form from position to position, as shown in the figure.

Note that SQL Server does not implement total sorting. Total sorting of every single column would simply be too expensive. Creating such storage would take too much time and too many resources. SQL Server uses its own patented row-rearranging algorithm, which rearranges the rows in the most optimal way to order all columns as best as possible, with a single pass through the data. This means that SQL Server does not totally sort any column; however, all columns are at least partially sorted. Therefore, SQL Server does not target the merge join algorithm; the hash join algorithm is preferred. Partial sorts still optimizes the use of bitmap filters because fewer false positives are returned from a bitmap filter when compared to randomly organized values. RLE compression can still reduce the size of the data substantially.

Table of Contents for Columnar storage and compression

Create new playlist

Sign In

Sign Up

Table of Contents for
Columnar storage and compression