© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
E. Pollack, Analytics Optimization with Columnstore Indexes in Microsoft SQL Server, https://doi.org/10.1007/978-1-4842-8048-5_3

3. What Are Columnstore Indexes?

Edward Pollack
Albany, NY, USA

Thus far, a compelling case has been presented to separate OLTP and OLAP workloads into unique data structures and optimize each for their respective use cases. Columnstore indexes are a SQL Server feature that provides native support for large analytic data. This chapter will dive into what they are and why they are an effective solution to analytic data challenges.

The Limits of Transactional Data Storage

Tables that contain millions or billions of rows of analytic data do not scale well when implemented as classic transactional tables. To fully understand why this is the case, it is helpful to compare how data is stored in both transactional and analytic tables. The following example table contains five columns. Figure 3-1 shows how data is stored in an OLTP (rowstore) version of this table.
Figure 3-1

Illustration of data storage in a clustered rowstore index

In a clustered rowstore index, rows are stored on pages sequentially, one after another, in the order prescribed by the clustered index key. If a query requires many columns from a single row or a small subset of rows, this is an exceptionally efficient storage structure. For example, retrieving all columns from row 1 through row 5 requires little effort, as the data is contiguous and ordered. Figure 3-2 highlights the data needed to satisfy such a query.
Figure 3-2

How an OLTP query retrieving five rows would read a sample table

A table with sequentially stored rows models how typical transactional queries operate, such as the following:
SELECT
       OrderID, --  An identity/primary key column
       CustomerID,
       SalespersonPersonID,
       ContactPersonID,
       OrderDate
FROM Sales.Orders
WHERE OrderId = 289;

Transactional queries read or write single rows or small ranges of rows, often via a lookup on a single index. Here, a numeric identity column is used to filter the rows of interest out of a larger table.

The basic unit of SQL Server storage is the page, an 8-kilobyte structure. Pages contain all data and index storage for a given table. If a query requires a single row, the entire page that row is stored on is read into memory, even if the rest of the data on the page is not needed. Transactional queries therefore rely on clustered or nonclustered indexes to ensure that index seeks can return data in an ordered fashion, as in the example in Figure 3-2. In this scenario, reading those five rows will not require much more effort, even if the table grows significantly larger.

Analytic queries, though, are quite different. They often aggregate a select few columns, but do so across many rows. Consider a typical OLAP query against the example table where a single column is aggregated across a large portion of the table that happens to include all of the rows shown here. Figure 3-3 shows how this would look.
Figure 3-3

An analytic query aggregating a single column from a sample table

The physical layout of rows means that a query requiring only column 2 must also read the adjacent columns, even though they are not needed. The following query accesses a large amount of data in the table but aggregates only a single column.
SELECT
       SUM(Quantity) AS Total_Quantity
FROM Sales.OrderLines
WHERE OrderID >= 1
AND OrderID < 10000;

Even though only a single column is summed, every page that contains values for that column must be read into memory. Even a covering nonclustered index would still require reading a value for each row, though it would reduce the number of pages read.
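For illustration, a covering nonclustered index for the aggregation query above might look like the following sketch (the index name is hypothetical):

```sql
-- Hypothetical covering index for the Quantity aggregation.
-- Even with this index, SQL Server still reads one Quantity value
-- per row in the OrderID range; it only avoids reading the other,
-- unneeded columns stored in the clustered index.
CREATE NONCLUSTERED INDEX IX_OrderLines_OrderID_Quantity
ON Sales.OrderLines (OrderID)
INCLUDE (Quantity);
```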

Now consider a larger table with 50 million rows that is stored on 250,000 pages using a clustered rowstore index. The transactional query demonstrated in Figure 3-2 would still only read 5 rows that are stored consecutively and could therefore ignore most of the remaining 249,995 rows and the pages they are stored on. Additional rows would be read that happened to be stored on the same page(s) as those five rows, but the added burden of that data is measured in kilobytes and is comparatively insignificant.

The analytic query presented in Figure 3-3 aggregates only a single column, but does so across many rows. If the requested order data spans one-fourth of the table, then processing this query in the transactional table would force SQL Server to read about one-fourth of the pages in the table, since the values for column 2 are dispersed within each row throughout the table. Regardless of the filter used that reduces the row count needed for the query, every page in the range specified by the filter would need to be read.

As an analytic table continues to grow and spans billions of rows and/or terabytes of storage (or more), the ability to read large swaths of its data becomes too slow and resource intensive to be realistic. A better solution is needed that allows analytic queries to read column data without the rest of the underlying columns being brought along for the ride.

Introducing Columnstore Indexes

The technology needed to solve this problem has existed in some form for a long time. Applications such as SQL Server Analysis Services and PowerPivot have used columnar data storage formats for years, but their implementation was hidden from view of the user.

Columnstore indexes allow data to be stored in tables in a format optimized for analytics, providing many valuable benefits, such as
  • The ability to scale to any table size

  • Query speeds 10–100 times faster

  • Exceptional compression ratios, saving storage and memory

  • Native support in SQL Server

  • Bulk loading for fast write speeds

While this may sound like a sales pitch, there are no exaggerations here. Comparing a large analytic data set stored in a classic rowstore clustered index vs. a columnstore indexed table yields vast differences in storage and performance that will be demonstrated and quantified throughout the remainder of this book.

Note that all demonstrations of columnstore indexes in this book are tested on SQL Server 2019. Those running an earlier version should test thoroughly before implementing any suggestions in this book as the features available may be different.

Figure 3-4 illustrates how data is stored in a columnstore index, using the same table from Figure 3-1.
Figure 3-4

Illustration of data storage in a clustered columnstore index

Note the critical difference: data is organized by column rather than by row. Every value for column 1 is stored together in a single structure, and each other column is likewise stored in its own set of pages. In the rowstore table shown in Figure 3-3, a single column was aggregated, but every page had to be read to return data for that one column.

Figure 3-5 shows the same query against an analytic columnstore indexed table.
Figure 3-5

Aggregating a single column against a columnstore index

Because data is grouped physically by column, it is no longer necessary to read the entire table to return a single aggregated column; the other four columns can be ignored. This greatly reduces reads against the storage system and reduces memory usage, as fewer pages need to be read into memory. For each new row inserted into this table, the query needs to read only one additional value: the one for the second column.

A significant benefit of this data structure is that compression becomes more effective. A set of values for a single column will tend to have more repeated values that will more easily compress, whereas values from different columns are less likely to overlap and compress well. Better compression means more data can fit on pages, saving additional storage and memory.

An important clarification on columnstore indexes is that they are not “just another index.” They are not comparable to nonclustered rowstore indexes, XML indexes, spatial indexes, memory-optimized indexes, or other types of indexing included in SQL Server. They comprise a unique architecture that provides benefits well beyond what a typical index can deliver.

Note that throughout this book, when not otherwise specified, all demonstrations and discussion will reference clustered columnstore indexes. Chapter 11 will delve into nonclustered columnstore indexes in more detail.

Benefits of Columnstore Indexes in SQL Server

There are many reasons why columnstore indexes can be an ideal solution for storing large analytic data. They cross a variety of areas from cost to convenience to speed and illustrate how a seemingly simple feature can provide exceptional value with a low time and resource cost to implement. This is not a complete list of the benefits of columnstore indexes, but highlights the keys to performance and efficiency that make them attractive for use with analytic data. The remainder of this book will dive into greater detail about the benefits and optimal use cases for columnstore indexes.

Native Analytic Data in SQL Server

One of the greatest benefits of columnstore indexes is that they store analytic data directly in SQL Server without any added licensing or configuration changes. Similarly, no hardware or software changes are required to begin testing or implementing columnstore indexes.

A rowstore table containing analytic data could be converted to use a columnstore index, with the entire process taking place on the same SQL Server database instance or even as an in-place index swap. This can allow for easy testing, validation, and implementation, when ready.
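As a sketch, such a conversion can be as simple as creating a clustered columnstore index on the table once any existing clustered rowstore index has been removed (the table and index names here are hypothetical):

```sql
-- Sketch: convert a rowstore table to columnstore storage.
-- Assumes dbo.OrderHistory currently has no clustered index, or
-- that its clustered rowstore index has already been dropped.
CREATE CLUSTERED COLUMNSTORE INDEX CCI_OrderHistory
ON dbo.OrderHistory;
```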

Because columnstore indexes are core to analytic data storage in SQL Server, they are updated with each new version. Each update provides new functionality and ways in which reading or writing columnstore indexes can become faster and more efficient. The following Microsoft document outlines in detail the columnstore index features available in each version of SQL Server since their inception:

https://docs.microsoft.com/en-us/sql/relational-databases/indexes/columnstore-indexes-what-s-new

The list is quite extensive and shows how columnstore indexes have evolved from an inflexible read-only structure into one rich in features and optimizations.

Storing data natively in SQL Server means there is no need for third-party products, no costly migrations, and no need to configure new hardware and software. The time required to implement them is less guided by technology and more by typical development and quality assurance needs. When building a project plan for the storage and maintenance of analytic data, factoring in these resource costs can help in making an accurate, fact-based decision. The following is a summary of these considerations:
  • Need to purchase/license new hardware to support analytic data

  • Cost of software licenses for analytics software and any supporting software

  • Cost of computing resources, whether on premises or in the cloud

  • Time required to architect, develop, test, and implement the analytic data solution

  • Resources needed to educate people on how to use the new solution

Quantifying each of these factors can assist in comparing and contrasting different solutions and will generally provide favorable results for an organization that already uses SQL Server for its transactional data storage.

Scalability

Given the rapid rate at which analytic data can grow, any data structure used to store it must be capable of efficiently servicing OLAP workloads, even when data depth or width increases unexpectedly fast over time.

The ability for an analytic data storage solution to scale has many important benefits, including
  • Ensure high performance, even during periods of rapid data growth

  • Avoid costly migrations to new technologies

  • Avoid disruptions to reporting/analytic services

  • Reduce maintenance and downtime

  • Reduce the need for hardware upgrades

For an OLAP solution to be effective, it needs to be fast and efficient with one thousand rows, one million rows, one billion rows, or more. If a system is destined to become inefficient when it gets large, then it is also destined to fail.

The growth of analytic data is a metric worthy of measuring and revisiting periodically to fully understand its long-term resource requirements. OLAP data rarely gets smaller, and its growth rate rarely decreases. This growth is ultimately tied to two measurements for any given organization:
  • Growth

  • Complexity

As an organization grows and serves more customers, those customers will generate more data. Similarly, organizational growth almost always leads to both technical and nontechnical processes becoming more complex. This complexity may manifest itself in increased software features, more processes that require tracking, or requests for more types of data to maintain and crunch over time.

The rate of growth of data can be summarized as follows:
  • 1.00 * N * G * C

In this representation, the letters denote the following:
  • N = Natural Growth: This is the growth of data that occurs currently. If nothing else changes and no external influences adjust the data growth rate, then this would be the only factor needed to predict future data growth. This can be measured in rows per unit time and also as an increase in physical data size as average bytes per row times rows per unit time.

  • G = Data Growth: Natural growth provides a baseline of how much data will be added to a table or database over time. This growth is rarely linear, though. New customers, increased sample frequency, and other factors will account for additional growth above and beyond the current baseline. This can also be measured in rows per unit time or as physical data usage per unit time.

  • C = Increased Complexity: This accounts for more columns being added to tables as well as new tables being created and populated. This also increases the amount of data that needs to flow (via more entities) from transactional systems into the target analytic system. This is challenging to measure, but can be estimated as a measure of new metrics over time. That is, on average how many new metrics per unit time will be added to an analytic data source. This can be approximated in storage units as soon as the data types for new metrics are known.

Each of these factors is defined here as linear in nature, but may not be linear in actuality. Therefore, they should be revisited regularly to ensure that unexpected changes in growth are accounted for when predicting future data size. Linear approximations per unit time can be used to approximate nonlinear growth so long as those approximations remain updated with current and future trends.

To provide a sample of the preceding formula, consider an analytic table that contains 100,000,000 rows, currently grows by 250,000 rows per day, and is expected to see its growth accelerate by an additional 25% per year, but not see any new dimensions added in the foreseeable future. The estimated row count in 1 year would be given by
  • 1.00 * N * G * C * Row Count = 1.00 * 1.9125 * 1.25 * 1.00 * 100,000,000 = 239,062,500 rows for the upcoming year.

  • 1.00 * N * G * C * Row Count = 1.00 * 1.9125 * 1.25 * 1.00 * 239,062,500 = 571,508,789 rows for the 2nd year.

  • 1.00 * N * G * C * Row Count = 1.00 * 1.9125 * 1.25 * 1.00 * 571,508,789 = 1,366,263,198 rows for the 3rd year.

Note how quickly the added acceleration grew this data – from 100 million rows to over 1.3 billion rows in 3 years! The natural growth factor of 1.9125 was calculated by multiplying 250,000 rows per day by 365 days per year (91,250,000 rows, or 91.25% of the original 100 million rows) and adding that 0.9125 to the baseline of 1.00.
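The projection above can be sketched as a short calculation; decimal arithmetic is used here so the truncated row counts match the figures in the text exactly:

```python
from decimal import Decimal

# Factors from the chapter's example:
# N = natural growth: 1.00 baseline + (250,000 rows/day * 365 days)
#     / 100,000,000 existing rows = 1.9125
# G = 25% growth acceleration per year; C = no added complexity
N, G, C = Decimal("1.9125"), Decimal("1.25"), Decimal("1.00")

rows = Decimal(100_000_000)
for year in range(1, 4):
    rows = rows * N * G * C
    print(f"Year {year}: {int(rows):,} rows")
# Year 1: 239,062,500 rows
# Year 2: 571,508,789 rows
# Year 3: 1,366,263,198 rows
```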

To summarize, any technology that manages analytic data needs to be capable of efficient data access given that typical data growth can cause data sizes to balloon far faster than conventional wisdom might suggest. The architecture of columnstore indexes will be discussed in detail in Chapter 4 and will lay the foundation for explaining why they can scale so effectively for rapidly growing data.

Exceptional Compression

Any analytic data store needs to take full advantage of compression to make its data as compact as possible. Columnstore indexes use multiple compression algorithms to achieve impressively high compression ratios. This can result in data that is 10–100 times smaller than it would be in an uncompressed table.

Compressing data by column rather than row is inherently more efficient for a number of reasons:
  • A single column contains values of the same data type, improving the chances of values being repeated.

  • Dimensions often have many repeated values and will compress exceptionally well.

  • Data is typically created sequentially, ordered by time. Rows in close proximity to rows from similar time periods tend to contain similar data and compress better than data from much earlier or later in time.

Chapter 5 dives into more detail on columnstore index compression and demonstrates how effective these compression algorithms are and why they are crucial to optimal analytic query performance.
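To gauge these savings before committing to a conversion, SQL Server 2019 and later allow sp_estimate_data_compression_savings to estimate columnstore compression; a sketch against the sample table might look like this (parameter values are illustrative):

```sql
-- Sketch: estimate the size of Sales.OrderLines if rebuilt with
-- columnstore compression (the COLUMNSTORE option is accepted by
-- this procedure starting with SQL Server 2019).
EXEC sys.sp_estimate_data_compression_savings
     @schema_name = N'Sales',
     @object_name = N'OrderLines',
     @index_id = NULL,
     @partition_number = NULL,
     @data_compression = N'COLUMNSTORE';
```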

Faster Analytic Reads

First and foremost, columnstore indexes are built to provide highly performant analytic query speeds. Whether accessing data from structured stored procedures and code or from ad hoc analytics and visualization, the compact and segmented columnstore index structure allows for the requested data to be returned quickly and efficiently. This is true even when large row counts are scanned.

Faster Data Loads

When data is written millions of rows at a time, processes need to exist that allow for that data to be written as quickly as possible. A fully logged process optimized for transactional data will not perform adequately for data of this size.

Columnstore indexes can make full use of the SQL Server bulk load API and can achieve write speeds far faster than other processes available in SQL Server, and in many other products. Data loads can execute faster, improving OLAP availability and allowing new data to appear more quickly. In addition, the impact of data loads on system resources such as the transaction log and on backups is minimized, ensuring that large volumes of data can be written in short spans of time without the pitfalls of doing so on a fully logged OLTP table.
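As a sketch, a load that takes advantage of this API can be as simple as an INSERT...SELECT with a TABLOCK hint; batches of at least 102,400 rows can be compressed directly into the columnstore, bypassing the delta store (the table names here are hypothetical):

```sql
-- Sketch: minimally logged bulk load into a columnstore table.
-- dbo.Fact_Sale and dbo.Fact_Sale_Staging are hypothetical names.
INSERT INTO dbo.Fact_Sale WITH (TABLOCK)
SELECT SaleID, CustomerID, Quantity, SaleDate
FROM dbo.Fact_Sale_Staging;
```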

Chapter 8 will fully explore how data is written to columnstore indexes, including performance measurements, resource consumption details, and best practices for loading data as efficiently as possible.

Analytic data requires a versatile, scalable, and performant solution. Columnstore indexes provide an ideal data structure for storing and analyzing analytic data, one that can manage data growth over time with ease.

The remainder of this book will discuss columnstore indexes in exhaustive detail, providing architectural details, demonstrations, best practices, and tools that can improve their use.
