© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2023
P. A. Carter, Pro SQL Server 2022 Administration, https://doi.org/10.1007/978-1-4842-8864-1_8

8. Indexes and Statistics

Peter A. Carter, Southampton, UK

Recent versions of SQL Server support many different types of index that are used to enhance query performance. These include traditional clustered and nonclustered indexes, which are built on B-tree (balanced-tree) structures and enhance read performance on disk-based tables. There are also indexes that support complex data types, such as XML, JSON, and geospatial data types. These advanced data type indexes are beyond the scope of this book, but a full discussion can be found in the Apress title SQL Server Advanced Data Types, which can be found at www.apress.com/gp/book/9781484239001. DBAs can also create Columnstore indexes to support data warehouse–style queries, where analysis is performed on very large tables. SQL Server also supports in-memory indexes, which enhance the performance of tables that are stored using In-Memory OLTP. This chapter discusses many of the available index types inside the Database Engine.

SQL Server maintains statistics on index and table columns to enhance query performance by improving cardinality estimates. This allows the Query Optimizer to create an efficient query plan. This chapter also discusses how to use and maintain statistics.

Clustered Indexes

A B-tree is a data structure you can use to organize key values so a user can search for the data they are looking for much more quickly than if they had to read the entire table. It is a tree-based structure where each node is allowed more than two child nodes. The tree is balanced, meaning there will always be the same number of steps to retrieve any single row of data.

A clustered index is a B-tree structure that causes the data pages of a table to be logically stored in the order of the clustered index key. The clustered index key can be a single column or a set of columns that enforce uniqueness of each row in the table. This key is often the table’s primary key, and although this is the most typical usage, in some circumstances, you will want to use a different column. This is discussed in more detail later in this chapter.

Tables Without a Clustered Index

When a table exists without a clustered index, it is known as a heap. A heap consists of one or more IAM (index allocation map) pages and a series of data pages that are not linked together or stored in order. The only way SQL Server can determine which pages belong to the table is by reading the IAM page(s). When a table is stored as a heap and has no nonclustered indexes, every time the table is accessed, SQL Server must read every single page in the table, even if you only want to return one row. The diagram in Figure 8-1 illustrates how a heap is structured.
Figure 8-1

Heap structure

When data is stored on a heap, SQL Server needs to maintain a unique identifier for each row. It does this by creating a RID (row identifier). A RID has a format of FileID: Page ID: Slot Number, which is a physical location. Even if a table has nonclustered indexes, it is still stored as a heap, unless there is a clustered index. When nonclustered indexes are created on a heap, the RID is used as a pointer so that nonclustered indexes can link back to the correct row in the base table.

Tables with a Clustered Index

When you create a clustered index on a table, a B-tree structure is created. This B-tree is based on the values of the clustered key, and if the clustered index is not unique, it also includes a uniquifier. A uniquifier is a value used to identify rows if their key values are the same. This allows SQL Server to perform more efficient search operations by creating a tiered set of pointers to the data, as illustrated in Figure 8-2. The page at the top level of this hierarchy is called the root node. The bottom level of the structure is called the leaf level, and with a clustered index, the leaf level consists of the actual data pages of the table. B-tree structures can have one or more intermediate levels, depending on the size of the table.
Figure 8-2

Clustered index structure

Figure 8-2 shows that although the leaf level is the data itself, the levels above contain pointers to the pages below them in the tree. This allows SQL Server to perform a seek operation, which is a very efficient method of returning a small number of rows. It works by navigating its way down the B-tree, using the pointers, to find the row(s) it requires. In this figure, we can see that if required, SQL Server can still scan all pages of the table in order to retrieve the required rows—this is known as a clustered index scan. Alternatively, SQL Server may decide to combine these two methods to perform a range scan. Here, SQL Server seeks the first value of the required range and then scans the leaf level until it encounters the first value that is not required. SQL Server can do this, because the table is ordered by the index key, which means that it can guarantee that no other matching values appear later in the table.
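To put these operations in context, the queries below sketch the kind of predicate that typically leads to each one, using a hypothetical dbo.Orders table that is clustered on an OrderID column (the table name, column names, and values are illustrative only; the operator actually chosen always depends on the optimizer's costing).
--Clustered index seek - navigate the B-tree to a single key value
SELECT OrderID, OrderDate
FROM dbo.Orders
WHERE OrderID = 1000 ;
--Range scan - seek to the first matching key, then scan along the leaf level
SELECT OrderID, OrderDate
FROM dbo.Orders
WHERE OrderID BETWEEN 1000 AND 2000 ;
--Clustered index scan - no filter on the index key, so every page is read
SELECT OrderID, OrderDate
FROM dbo.Orders ;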

Clustering the Primary Key

The primary key of a table is often the natural choice for the clustered index, because many OLTP applications access 99% of data through the primary key. In fact, by default, unless you specify otherwise, or unless a clustered index already exists on the table, creating a primary key automatically generates a clustered index on that key. There are circumstances when the primary key is not the correct choice for the clustered index. An example of this that I have witnessed is a third-party application that requires the primary key of the table to be a GUID.

Creating a clustered index on a GUID introduces two major problems if the clustered index is to be built on the primary key. The first is size. A GUID is 16 bytes long. When a table has nonclustered indexes, the clustered index key is stored in every nonclustered index. For unique nonclustered indexes, it is stored for every row at the leaf level, and for nonunique nonclustered indexes, it is also stored at every row in the root and intermediate levels of the index. When you multiply 16 bytes by millions of rows, this drastically increases the size of the indexes, making them less efficient.

The second issue is that when a GUID is generated, it is a random value. Because the data in your table is stored in the order of the clustered index key for good performance, you need the values of this key to be generated in sequential order. Generating random values for your clustered index key results in the index becoming more and more fragmented every time you insert a new row. Fragmentation is discussed later in this chapter.

There is a workaround for the second issue, however. SQL Server has a function called NEWSEQUENTIALID(). This function always generates a GUID value that is higher than previous values generated on the server. Therefore, if you use this function in the default constraint of your primary key, you can enforce sequential inserts.
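As a minimal sketch of this approach, the following table definition (the table and constraint names are hypothetical) uses NEWSEQUENTIALID() in a default constraint so that new primary key values are generated in sequential order:
CREATE TABLE dbo.SequentialGuidDemo
(
        ID              UNIQUEIDENTIFIER    NOT NULL
                            CONSTRAINT DF_SequentialGuidDemo_ID DEFAULT NEWSEQUENTIALID(),
        DummyText       VARCHAR(30),
        CONSTRAINT PK_SequentialGuidDemo PRIMARY KEY CLUSTERED(ID)
) ;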

Caution

After the server has been restarted, NEWSEQUENTIALID() can start with a lower value. This may lead to fragmentation.

If the primary key must be a GUID or another wide column, such as a Social Security Number, or if it must be a set of columns that form a natural key, such as Customer ID, Order Date, and Product ID, then it is highly recommended that you create an additional column in your table. You can make this column an INT or BIGINT, depending on the number of rows you expect the table to have, and you can use either the IDENTITY property or a SEQUENCE in order to create a narrow, sequential key for your clustered index.

Tip

Remember that a narrow clustered key is important because it is included in every other index on the table.

Administering Clustered Indexes

You can create a clustered index by using the CREATE CLUSTERED INDEX statement, as shown in Listing 8-1. Other methods you can use to create a clustered index are the ALTER TABLE statement with a PRIMARY KEY clause and the INDEX clause of the CREATE TABLE statement, as long as you are using SQL Server 2014 or higher; both alternatives are sketched after Listing 8-1. This script creates a database called Chapter8 and then a table called CIDemo. Finally, it creates a clustered index on the ID column of this table.

Note

Remember to change the file locations to match your own configuration.

--Create Chapter8 Database
CREATE DATABASE Chapter8
 ON  PRIMARY
( NAME = N'Chapter8', FILENAME =
    N'F:\Program Files\Microsoft SQL Server\MSSQL15.PROSQLADMIN\MSSQL\DATA\Chapter8.mdf'),
 FILEGROUP [MEM] CONTAINS MEMORY_OPTIMIZED_DATA  DEFAULT
( NAME = N'MEM', FILENAME = N'H:\DATA\CH08')
 LOG ON
( NAME = N'Chapter8_log', FILENAME =
    N'E:\Program Files\Microsoft SQL Server\MSSQL15.PROSQLADMIN\MSSQL\DATA\Chapter8_log.ldf') ;
GO
USE Chapter8
GO
--Create CIDemo table
CREATE TABLE dbo.CIDemo
(
        ID                INT                IDENTITY,
        DummyText        VARCHAR(30)
) ;
GO
--Create clustered index
CREATE UNIQUE CLUSTERED INDEX CI_CIDemo ON dbo.CIDemo([ID]) ;
GO
Listing 8-1

Creating a Clustered Index
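For completeness, the sketch below shows the two alternative methods mentioned earlier, using hypothetical table names so that they do not clash with CIDemo: adding a clustered primary key with ALTER TABLE, and declaring the index inline with the INDEX clause of CREATE TABLE.
--Clustered index created implicitly, by adding a clustered primary key
CREATE TABLE dbo.CIDemoPK
(
        ID               INT                IDENTITY,
        DummyText        VARCHAR(30)
) ;
GO
ALTER TABLE dbo.CIDemoPK ADD CONSTRAINT PK_CIDemoPK PRIMARY KEY CLUSTERED(ID) ;
GO
--Clustered index declared inline, using the INDEX clause of CREATE TABLE
CREATE TABLE dbo.CIDemoInline
(
        ID               INT                IDENTITY,
        DummyText        VARCHAR(30),
        INDEX CI_CIDemoInline CLUSTERED(ID)
) ;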

When creating an index, you have a number of WITH options that you can specify. These options are outlined in Table 8-1.
Table 8-1

Relational Index WITH Options

Option

Description

MAXDOP

Specifies how many cores are used to build the index. Each core that is used builds its own portion of the index. The trade-off is that a higher MAXDOP builds the index faster, but a lower MAXDOP means the index is built with less fragmentation.

FILLFACTOR

Specifies how much free space should be left in each page of the leaf level of the index. This can help reduce fragmentation caused by inserts at the expense of having a larger index, which requires more I/O to read. For a clustered index, with a nonchanging, ever-increasing key, always set this to 0, which means 100% full minus enough space for one row.

PAD_INDEX

Applies the fill factor percentage to the intermediate levels of the B-tree.

STATISTICS_NORECOMPUTE

Turns on or off the automatic updating of distribution statistics. Statistics are discussed later in this chapter.

SORT_IN_TEMPDB

Specifies that the intermediate sort results of the index should be stored in TempDB. When you use this option, you can offload I/O to the spindles hosting TempDB, but this is at the expense of using more disk space. Cannot be ON if RESUMABLE is ON.

STATISTICS_INCREMENTAL

Specifies if statistics should be created per partition. Limitations to this are discussed later in this chapter.

DROP_EXISTING

Used to drop and rebuild the existing index with the same name.

IGNORE_DUP_KEY

When you enable this option, an INSERT statement that tries to insert a duplicate key value into a unique index will not fail. Instead, a warning is generated and only the rows that break the unique constraint fail.

ONLINE

Can be set as ON or OFF, with a default of OFF. Specifies whether the entire table and its indexes are locked for the duration of the index build or rebuild. When ON, queries are still able to access the table during the operation, at the expense of the time it takes to build the index. For clustered indexes, this option is not available if the table contains LOB data.*

OPTIMIZE_FOR_SEQUENTIAL_KEY

Optimizes high concurrency inserts, where the index key is sequential. Introduced in SQL Server 2019, this feature is designed for indexes that suffer from last-page insert contention.

RESUMABLE

Can be set as ON or OFF, with a default of OFF. Specifies if the index creation or rebuild can be paused and resumed, or can be resumed after a failure. Can only be set to ON if ONLINE is set to ON.

MAX_DURATION

Specifies, in minutes, the maximum duration that an index create or rebuild will execute for, before pausing. Can only be specified if ONLINE is set to ON and RESUMABLE is set to ON.

ALLOW_ROW_LOCKS

Specifies that row locks can be taken when accessing the table. This does not mean that they definitely will be taken.

ALLOW_PAGE_LOCKS

Specifies that page locks can be taken when accessing the table. This does not mean that they definitely will be taken.

DATA_COMPRESSION

Allows you to specify ROW or PAGE, which denotes the type of compression that will be used for the index. Compression is discussed in Chapter 7. The default is NONE, which denotes that the index will not be compressed.

XML_COMPRESSION

Specifies if the index should use XML compression. XML Compression is discussed in Chapter 7.

*Spatial data is regarded as LOB data.
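To see how several of these options combine in practice, the sketch below re-creates the CI_CIDemo index from Listing 8-1 using DROP_EXISTING. The option values are purely illustrative, and ONLINE = ON assumes an edition that supports online index operations.
CREATE UNIQUE CLUSTERED INDEX CI_CIDemo ON dbo.CIDemo(ID)
    WITH( FILLFACTOR = 90,
          PAD_INDEX = ON,
          SORT_IN_TEMPDB = ON,
          MAXDOP = 4,
          ONLINE = ON,
          DROP_EXISTING = ON ) ;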

As mentioned earlier in this chapter, if you create a primary key on a table, then unless you specify the NONCLUSTERED keyword, or a clustered index already exists, a clustered index is created automatically to cover the column(s) of the primary key. Also, remember that at times you may wish to move the clustered index to a more suitable column if the primary key is wide or if it is not ever-increasing.

In order to achieve this, you need to drop the primary key constraint and then re-create it using the NONCLUSTERED keyword. This forces SQL Server to cover the primary key with a unique nonclustered index. Once this is complete, you are able to create the clustered index on the column of your choosing.
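As a sketch of this process, assume a hypothetical dbo.ThirdPartyOrders table whose GUID primary key, PK_ThirdPartyOrders on the OrderGUID column, currently owns the clustered index, and which also contains a narrow OrderID identity column:
--Drop the existing clustered primary key
ALTER TABLE dbo.ThirdPartyOrders DROP CONSTRAINT PK_ThirdPartyOrders ;
GO
--Re-create the primary key, covered by a unique nonclustered index
ALTER TABLE dbo.ThirdPartyOrders
    ADD CONSTRAINT PK_ThirdPartyOrders PRIMARY KEY NONCLUSTERED(OrderGUID) ;
GO
--Create the clustered index on the narrow, ever-increasing surrogate column
CREATE UNIQUE CLUSTERED INDEX CI_ThirdPartyOrders ON dbo.ThirdPartyOrders(OrderID) ;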

If you need to remove a clustered index that is not covering a primary key, you can do so by using the DROP INDEX statement, as demonstrated in Listing 8-2, which drops the clustered index that we created in the previous example.
DROP INDEX CI_CIDemo ON dbo.CIDemo ;
Listing 8-2

Dropping the Index

Nonclustered Indexes

A nonclustered index is based on a B-tree structure in the same way that a clustered index is. The difference is that the leaf level of a nonclustered index contains pointers to the data pages of the table, as opposed to being the data pages of the table, as illustrated in Figure 8-3. This means that a table can have multiple nonclustered indexes to support query performance.
Figure 8-3

Nonclustered index structure

Just like a clustered index, a nonclustered index supports seek, scan, and range scan operations in order to find the required data. If the index key of the nonclustered index includes all columns that need to be accessed during a query, then SQL Server does not need to access the underlying table. This also holds true if the only columns accessed are in the nonclustered index and the clustered index key. This is because the leaf level of a nonclustered index always contains the clustered index key. This is referred to as an index covering the query, which is discussed in the next section.

If the query needs to return columns that are not included in the nonclustered index or clustered index key, SQL Server needs to find the matching rows in the base table. This is done through a process called a key lookup. A key lookup operation accesses the rows required from the base table using either the clustered index key value or the RID if the table does not have a clustered index.

This can be efficient for a small number of rows, but it quickly becomes expensive if many rows are returned by the query. This means that if many rows will be returned, SQL Server may decide that it is less expensive to ignore the nonclustered index and use the clustered index or heap instead. This decision is known as the tipping point of the index. The tipping point varies from table to table, but it is generally between 0.5% and 2% of the table.

Covering Indexes

Although having all required columns within the nonclustered index means that you do not have to retrieve data from the underlying table, the trade-off is that having many columns within a nonclustered index can lead to very wide, inefficient indexes. In order to gain a better balance, SQL Server offers you the option of included columns.

Included columns are included at the leaf level of the index only, as opposed to the index key values, which continue to be included at every level of the B-tree. This feature can help you cover your queries while maintaining the narrowest index keys possible. This concept is illustrated in Figure 8-4. This diagram illustrates that the index has been built using Balance as the index key, but the FirstName and LastName columns have also been included at the leaf level. You can see that CustomerID has also been included at all levels; this is because CustomerID is the clustered index key. Because the clustered index key is included at all levels, this implies that the index is not unique. If it is unique, then the clustered key is only included at the leaf level of the B-tree. This means that unique, nonclustered indexes are always narrower than their nonunique equivalents. This index is perfect for a query that filters on Balance in the WHERE clause and returns the FirstName and LastName columns. It also covers queries that return CustomerID in the results.

Tip

If both the clustered and nonclustered indexes are nonunique, each level of the nonclustered B-tree includes the clustering uniquifier, as well as the clustered key.

Figure 8-4

Nonclustered index with included columns

You can also use the index illustrated in Figure 8-4 to cover queries that filter on FirstName or LastName in the WHERE clause, provided that no other columns from the table are returned. To process such a query, however, SQL Server needs to perform an index scan, as opposed to an index seek or range scan, which is, of course, less efficient.

Administering Nonclustered Indexes

You can create nonclustered indexes using the CREATE NONCLUSTERED INDEX T-SQL statement. The script in Listing 8-3 creates a table called CustomersDisk and a table called OrdersDisk within the Chapter8 database. It then creates a foreign key constraint on the CustomerID column. Finally, a nonclustered index is created on the Balance column of the CustomersDisk table. Clustered indexes are created automatically on the primary key columns of each table.
USE Chapter8
GO
--Create and populate numbers table
DECLARE @Numbers TABLE
(
        Number        INT
)
;WITH CTE(Number)
AS
(
        SELECT 1 Number
        UNION ALL
        SELECT Number + 1
        FROM CTE
        WHERE Number < 100
)
INSERT INTO @Numbers
SELECT Number FROM CTE ;
--Create and populate name pieces
DECLARE @Names TABLE
(
        FirstName        VARCHAR(30),
        LastName        VARCHAR(30)
) ;
INSERT INTO @Names
VALUES('Peter', 'Carter'),
                 ('Michael', 'Smith'),
                 ('Danielle', 'Mead'),
                 ('Reuben', 'Roberts'),
                 ('Iris', 'Jones'),
                 ('Sylvia', 'Davies'),
                 ('Finola', 'Wright'),
                 ('Edward', 'James'),
                 ('Marie', 'Andrews'),
                 ('Jennifer', 'Abraham') ;
--Create and populate CustomersDisk table
CREATE TABLE dbo.CustomersDisk
(
        CustomerID           INT                NOT NULL    IDENTITY    PRIMARY KEY,
        FirstName            VARCHAR(30)        NOT NULL,
        LastName             VARCHAR(30)        NOT NULL,
        BillingAddressID     INT                NOT NULL,
        DeliveryAddressID    INT                NOT NULL,
        CreditLimit          MONEY              NOT NULL,
        Balance              MONEY              NOT NULL
) ;
SELECT * INTO #CustomersDisk
FROM
         (SELECT
                 (SELECT TOP 1 FirstName FROM @Names ORDER BY NEWID()) FirstName,
                 (SELECT TOP 1 LastName FROM @Names ORDER BY NEWID()) LastName,
                 (SELECT TOP 1 Number FROM @Numbers ORDER BY NEWID()) BillingAddressID,
                 (SELECT TOP 1 Number FROM @Numbers ORDER BY NEWID()) DeliveryAddressID,
                 (SELECT TOP 1
                      CAST(RAND() * Number AS INT) * 10000
                      FROM @Numbers
                      ORDER BY NEWID()) CreditLimit,
                 (SELECT TOP 1
                      CAST(RAND() * Number AS INT) * 9000
                      FROM @Numbers
                      ORDER BY NEWID()) Balance
        FROM @Numbers a
        CROSS JOIN @Numbers b
) a ;
INSERT INTO dbo.CustomersDisk
SELECT * FROM #CustomersDisk ;
GO
--Create Numbers table
DECLARE @Numbers TABLE
(
        Number        INT
)
;WITH CTE(Number)
AS
(
        SELECT 1 Number
        UNION ALL
        SELECT Number + 1
        FROM CTE
        WHERE Number < 100
)
INSERT INTO @Numbers
SELECT Number FROM CTE ;
--Create the OrdersDisk table
CREATE TABLE dbo.OrdersDisk
         (
        OrderNumber     INT      NOT NULL        IDENTITY        PRIMARY KEY,
        OrderDate       DATE     NOT NULL,
        CustomerID      INT      NOT NULL,
        ProductID       INT      NOT NULL,
        Quantity        INT      NOT NULL,
        NetAmount       MONEY    NOT NULL,
        DeliveryDate    DATE        NULL
        )  ON [PRIMARY] ;
--Populate Orders with data
SELECT * INTO #OrdersDisk
FROM
         (SELECT
                 (SELECT CAST(DATEADD(dd,(SELECT TOP 1 Number
                                              FROM @Numbers
                                              ORDER BY NEWID()),GETDATE())as DATE)) OrderDate,
                 (SELECT TOP 1 CustomerID FROM CustomersDisk ORDER BY NEWID()) CustomerID,
                 (SELECT TOP 1 Number FROM @Numbers ORDER BY NEWID()) ProductID,
                 (SELECT TOP 1 Number FROM @Numbers ORDER BY NEWID()) Quantity,
                 (SELECT TOP 1 CAST(RAND() * Number AS INT) +10 * 100
                      FROM @Numbers
                      ORDER BY NEWID()) NetAmount,
                 (SELECT CAST(DATEADD(dd,(SELECT TOP 1 Number - 10
                                            FROM @Numbers
                                            ORDER BY NEWID()),GETDATE()) as DATE)) DeliveryDate
        FROM @Numbers a
        CROSS JOIN @Numbers b
        CROSS JOIN @Numbers c
) a ;
INSERT INTO OrdersDisk
SELECT * FROM #OrdersDisk ;
--Clean-up Temp Tables
DROP TABLE #CustomersDisk ;
DROP TABLE #OrdersDisk ;
--Add foreign key on CustomerID
ALTER TABLE dbo.OrdersDisk ADD CONSTRAINT
        FK_OrdersDisk_CustomersDisk FOREIGN KEY
         (
        CustomerID
        ) REFERENCES dbo.CustomersDisk
         (
        CustomerID
        ) ON UPDATE  NO ACTION
         ON DELETE  NO ACTION ;
--Create a nonclustered index on Balance
CREATE NONCLUSTERED INDEX NCI_Balance ON dbo.CustomersDisk(Balance) ;
Listing 8-3

Creating Tables and Then Adding a Nonclustered Index

We can change the definition of the NCI_Balance index to include the FirstName and LastName columns by using the CREATE NONCLUSTERED INDEX statement and specifying the DROP_EXISTING option as demonstrated in Listing 8-4.
CREATE NONCLUSTERED INDEX NCI_Balance ON dbo.CustomersDisk(Balance)
    INCLUDE(LastName, FirstName)
    WITH(DROP_EXISTING = ON) ;
Listing 8-4

Altering the Index to Include Columns

You can drop the index in the same way that we dropped the clustered index earlier in this chapter—using a DROP INDEX statement. In this case, the full statement would be DROP INDEX NCI_Balance ON dbo.CustomersDisk.

Filtered Indexes

A filtered index is an index built on a subset of the data stored within a table, as opposed to one that is built on all of the data in the table. Because the indexes are smaller, they can lead to improved query performance and reduced storage cost. They also have the potential to cause less overhead for DML operations, since they only need to be updated if the DML operation affects the data within the index. For example, if an index was filtered on OrderDate >= '2019-01-01' AND OrderDate <= '2019-12-31' and a statement subsequently updated all rows in the table where OrderDate >= '2020-01-01', then the performance of the update would be the same as if the index did not exist.

Filtered indexes are constructed by using a WHERE clause on index creation. There are many things that you can do in the WHERE clause, such as filter on NULL or NOT NULL values; use equality and inequality operators, such as =, >, <, and IN; and use logical operators, such as AND and OR. There are also limitations, however. For example, you cannot use BETWEEN, CASE, or NOT IN. Also, you can only use simple predicates so, for example, using a date/time function is prohibited, so creating a rolling filter is not possible. You also cannot compare a column to other columns.

The statement in Listing 8-5 creates a filtered index on DeliveryDate, where the value is NULL. This allows you to improve the performance of queries that are run to determine which orders are yet to have their delivery scheduled.
CREATE NONCLUSTERED INDEX NonDeliveredItems ON dbo.OrdersDisk(DeliveryDate)
        WHERE DeliveryDate IS NULL ;
Listing 8-5

Creating Filtered Index

Indexes for Specialized Applications

In addition to traditional B-tree indexes, SQL Server also provides several types of special indexes to help query performance against memory-optimized tables and Columnstore indexes that help query performance in data warehouse scenarios. The following sections discuss these special indexes. Although beyond the scope of this book, SQL Server also offers special indexes for geospatial data, XML, and JSON.

Columnstore Indexes

As you have seen, traditional indexes store rows of data on data pages. This is known as a rowstore. SQL Server also supports Columnstore indexes. These indexes flip data around and use a page to store a column, as opposed to a set of rows. This is illustrated in Figure 8-5.
Figure 8-5

Columnstore index structure

A Columnstore index slices the rows of a table into chunks of between 102,400 and 1,048,576 rows each. Each slice is called a rowgroup. Data in each rowgroup is then split down into columns and compressed using VertiPaq technology. Each column within a rowgroup is called a column segment.

Columnstore indexes offer several benefits over traditional indexes, given appropriate usage scenarios. First, because they are highly compressed, they can improve I/O efficiency and reduce memory overhead. They can achieve such a high compression rate because data within a single column is often very similar between rows. Also, because a query is able to retrieve just the data pages of the column it requires, I/O can again be reduced. This is helped even further by the fact that each column segment contains a header with metadata about the data within the segment. This means that SQL Server can access just the segments it needs, as opposed to the whole column. A new query execution mechanism has also been introduced to support Columnstore indexes. It is called batch execution mode, and it allows data to be processed in chunks of 1000 rows, as opposed to on a row-by-row basis. This means that CPU usage is much more efficient. Columnstore indexes are not a magic bullet, however, and are designed to be optimal for data warehouse–style queries that perform read-only operations on very large tables. OLTP-style queries are not likely to see any benefit and, in some cases, may actually execute slower. SQL Server supports both clustered and nonclustered Columnstore indexes, and these are discussed in the following sections.

Clustered Columnstore Indexes

Clustered Columnstore indexes cause the entire table to be stored in a Columnstore format. There is no traditional rowstore storage for a table with a clustered Columnstore index; however, new rows that are inserted into the table may temporarily be placed into a rowstore table, called a deltastore. This is to prevent the Columnstore index from becoming fragmented and to enhance performance for DML operations. The diagram in Figure 8-6 illustrates this.
Figure 8-6

Clustered columnstore index with deltastores

The diagram shows that when data is inserted into a clustered Columnstore index, SQL Server assesses the number of rows. If the number of rows is high enough to achieve a good compression rate, SQL Server treats them as a rowgroup or rowgroups and immediately compresses them and adds them to the Columnstore index. If there are too few rows, however, SQL Server inserts them into the internal deltastore structure. When you run a query against the table, the database engine seamlessly joins the structures together and returns the results as one. Once there are enough rows, the deltastore is marked as closed and a background process called the tuple mover compresses the rows into a rowgroup in the Columnstore index.

There can be multiple deltastores for each clustered Columnstore index. This is because when SQL Server determines that an insert warrants using a deltastore, it attempts to access the existing deltastores. If all existing deltastores are locked, however, then a new one is created, instead of the query being forced to wait for a lock to be released.

When a row is deleted in a clustered Columnstore index, then the row is only logically removed. The data still physically stays in the rowgroup until the next time the index is reorganized or until the next time a background thread runs. The background thread was introduced in SQL Server 2019 and provides improved index quality, by compressing small deltastores and merging small rowgroups. SQL Server maintains a B-tree structure of pointers to deleted rows in order to easily identify them. If the row being deleted is located in a deltastore, as opposed to the index itself, then it is immediately deleted, both logically and physically. When you update a row in a clustered Columnstore index, then SQL Server marks the row as being logically deleted and inserts a new row into a deltastore, which contains the new values for the row.

You can create clustered Columnstore indexes using a CREATE CLUSTERED COLUMNSTORE INDEX statement. The script in Listing 8-6 copies the contents of the OrdersDisk table to a new table called OrdersColumnstore and then creates a clustered Columnstore index on the table. When you create the index, you do not need to specify a key column; this is because all of the columns are added to column segments within the Columnstore index. Your queries can then use the index to search on whichever column(s) they need to satisfy the query. The clustered Columnstore index is the only Columnstore index allowed on the table, so you cannot also add a nonclustered Columnstore index. Note that the restriction in earlier versions of SQL Server that prevented such tables from having primary key, foreign key, or unique constraints was removed in SQL Server 2016.
SELECT * INTO dbo.OrdersColumnstore
FROM dbo.OrdersDisk ;
GO
CREATE CLUSTERED COLUMNSTORE INDEX CCI_OrdersColumnstore ON dbo.OrdersColumnstore ;
Listing 8-6

Creating a Clustered Columnstore Index
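If you want to observe the deltastore behavior described earlier, one option is to query the sys.dm_db_column_store_row_group_physical_stats DMV, as sketched below. OPEN and CLOSED rowgroups are deltastores, while COMPRESSED rowgroups have already been compressed into the Columnstore index, and deleted_rows shows rows that have been logically, but not yet physically, removed.
SELECT
    OBJECT_NAME(rg.object_id) AS TableName,
    rg.row_group_id,
    rg.state_desc,
    rg.total_rows,
    rg.deleted_rows,
    rg.size_in_bytes
FROM sys.dm_db_column_store_row_group_physical_stats AS rg
WHERE rg.object_id = OBJECT_ID('dbo.OrdersColumnstore') ;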

It is also possible to create nonclustered B-Tree indexes on a table with a Clustered Columnstore index. This provides efficient seek operations against the underlying columnar data.

SQL Server 2022 introduces a performance optimization for clustered Columnstore indexes, called Ordered Clustered Columnstore Indexes. This allows improved performance by offering better elimination of segments, by ordering the data before it is compressed. The script in Listing 8-7 creates the same index as the script in Listing 8-6, except that it uses the ORDER clause to cause the data to be ordered by the OrderDate column.

Tip

If you have run the script in Listing 8-6, you will need to drop the OrdersColumnstore table before running the script in Listing 8-7, because the script re-creates the table using SELECT...INTO.

SELECT * INTO dbo.OrdersColumnstore
FROM dbo.OrdersDisk ;
GO
CREATE CLUSTERED COLUMNSTORE INDEX CCI_OrdersColumnstore ON dbo.OrdersColumnstore ORDER (OrderDate) ;
Listing 8-7

Create an Ordered Clustered Columnstore Index

Note

You cannot order clustered columnstore indexes on string columns.

Nonclustered Columnstore Indexes

A Nonclustered Columnstore index is essentially the same as a Clustered Columnstore index. The difference is that a copy of the data is made from the columns that the index covers, instead of converting the data that is stored within the table to a columnar format, meaning that a table can have both B-Tree (clustered and nonclustered) indexes, as well as nonclustered Columnstore indexes. Obviously, this increases the storage space required for the data, but it provides the benefit of allowing you to target operational workloads at a clustered B-Tree while targeting analytical workloads against the nonclustered Columnstore. Given that a ten times compression ratio is typical within a Columnstore index, it will likely be worth the additional storage.

The statement in Listing 8-8 creates a nonclustered Columnstore index on the FirstName, LastName, Balance, and CustomerID columns of the CustomersDisk table.
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_FirstName_LastName_Balance_CustomerID
    ON dbo.CustomersDisk(FirstName,LastName,Balance,CustomerID) ;
Listing 8-8

Creating Nonclustered Columnstore Indexes

In-Memory Indexes

As we saw in Chapter 7, SQL Server provides two types of index for memory-optimized tables: nonclustered and nonclustered hash. Every memory-optimized table must have a minimum of one index. All in-memory indexes cover all columns in the table because they use a memory pointer to link to the data row.

Indexes on memory-optimized tables must be created in the CREATE TABLE statement. There is no CREATE INDEX statement for in-memory indexes. Indexes built on memory-optimized tables are always stored in memory only and are never persisted to disk, regardless of your table’s durability setting. They are then re-created after the instance restarts, from the table’s underlying data. You do not need to worry about fragmentation of in-memory indexes, since they never have a disk-based structure.

In-Memory Nonclustered Hash Indexes

A nonclustered hash index consists of an array of buckets. A hash function is run on each of the index keys, and then the hashed key values are placed into the buckets. The hashing algorithm used is deterministic, meaning that index keys with the same value always have the same hash value. This is important because repeated hash values are always placed in the same hash bucket. When many keys are in the same hash bucket, performance of the index can degrade, because the whole chain of duplicates needs to be scanned to find the correct key. Therefore, if you are building a hash index on a nonunique column with many repeated keys, you should create the index with a much larger number of buckets. This should be in the realm of 20 to 100 times the number of distinct key values, as opposed to 2 times the number of unique keys that is usually recommended for unique indexes. Alternatively, using a nonclustered index on a nonunique column may offer a better solution. The second consequence of the hash function being deterministic is that different versions of the same row are always stored in the same hash bucket.

Even in the case of a unique index, where only a single, current row version exists, the distribution of hashed values into buckets is not even, and if there are an equal number of buckets to unique key values, then approximately one-third of the buckets are empty, one-third contain a single value, and one-third contain multiple values. When multiple values share a bucket, it is known as a hash collision, and a large number of hash collisions can lead to reduced performance. Hence, the recommendation for the number of buckets in a unique index is twice the number of unique values expected in the table.

Tip

When you have a unique nonclustered hash index, in some cases, many unique values may hash to the same bucket. If you experience this, then increasing the number of buckets helps, in the same way that it does for a nonunique index.

As an example, if your table has one million rows, and the indexed column is unique, the optimum number of buckets, known as the BUCKET_COUNT, is two million. If you know that you expect your table to grow to two million rows, however, then it may be prudent to create four million hash buckets. This number of buckets is low enough to not have an impact on memory. It also still allows for the expected increase in rows, without there being too few buckets, which would impair performance. An illustration of potential mappings between index values and hash buckets is illustrated in Figure 8-7.
Figure 8-7

Mappings to a nonclustered hash index

Tip

The amount of memory used by a nonclustered hash index always remains static, since the number of buckets does not change.

Hash indexes are optimized for seek operations with the = predicate. For the seek operation, however, the full index key must be present in the predicate evaluation. If it is not, a full index scan is required. An index scan is also required if inequality predicates, such as < or >, are used. Also, because the index is not ordered, the index cannot return the data in the sort order of the index key.

Note

You may remember that in Chapter 7, we witnessed better performance from a disk-based table than from a memory-optimized table. This is explained by the fact that we used the > predicate in our query; this meant that although the disk-based index was able to perform an index seek, our memory-optimized hash index had to perform an index scan.

Let’s now create a memory-optimized version of our OrdersDisk table, which includes a nonclustered hash index on the OrderNumber column, using the script in Listing 8-9. Initially, this table has one million rows, but we expect the number to grow to two million, so we use a BUCKET_COUNT of four million.
CREATE TABLE dbo.OrdersMemHash
(
        OrderNumber    INT    NOT NULL    IDENTITY    PRIMARY KEY
                                 NONCLUSTERED HASH WITH(BUCKET_COUNT = 4000000),
        OrderDate      DATE    NOT NULL,
        CustomerID     INT     NOT NULL,
        ProductID      INT     NOT NULL,
        Quantity       INT     NOT NULL,
        NetAmount      MONEY   NOT NULL,
        DeliveryDate   DATE    NULL
) WITH(MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA) ;
INSERT INTO dbo.OrdersMemHash(OrderDate,CustomerID,ProductID,Quantity,NetAmount,DeliveryDate)
SELECT OrderDate
        ,CustomerID
        ,ProductID
        ,Quantity
        ,NetAmount
        ,DeliveryDate
FROM dbo.OrdersDisk ;
Listing 8-9

Creating a Table with a Nonclustered Hash Index
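With the table populated, the two example queries below illustrate the predicate behavior described earlier; you can confirm the chosen operator by examining the execution plans. The first query uses an equality predicate on the full index key, so it can seek the hash index; the second uses an inequality predicate, which forces an index scan.
--Equality predicate on the full index key: hash index seek
SELECT OrderNumber, OrderDate, NetAmount
FROM dbo.OrdersMemHash
WHERE OrderNumber = 100 ;
--Inequality predicate: the hash cannot be used to seek, so the index is scanned
SELECT OrderNumber, OrderDate, NetAmount
FROM dbo.OrdersMemHash
WHERE OrderNumber > 999900 ;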

If we now wish to add an additional index to the table, we need to drop and re-create it. We already have data in the table, however, so we first need to create a temp table and copy the data in so that we can drop and re-create the memory-optimized table. The script in Listing 8-10 adds a nonclustered index to the OrderDate column.
--Create and populate temp table
SELECT * INTO #OrdersMemHash
FROM dbo.OrdersMemHash ;
--Drop existing table
DROP TABLE dbo.OrdersMemHash ;
--Re-create the table with the new index
CREATE TABLE dbo.OrdersMemHash
(
        OrderNumber     INT     NOT NULL    IDENTITY    PRIMARY KEY
                                      NONCLUSTERED HASH WITH(BUCKET_COUNT = 4000000),
        OrderDate       DATE    NOT NULL        INDEX NCI_OrderDate NONCLUSTERED,
        CustomerID      INT     NOT NULL,
        ProductID       INT     NOT NULL,
        Quantity        INT     NOT NULL,
        NetAmount       MONEY   NOT NULL,
        DeliveryDate    DATE    NULL
) WITH(MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA) ;
GO
--Allow values to be inserted into the identity column
SET IDENTITY_INSERT OrdersMemHash ON ;
GO
--Repopulate the table
INSERT INTO
dbo.OrdersMemHash(OrderNumber,OrderDate,CustomerID,ProductID,Quantity,NetAmount,DeliveryDate)
SELECT *
FROM #OrdersMemHash ;
--Stop further inserts to the identity column and clean up temp table
SET IDENTITY_INSERT OrdersMemHash OFF ;
DROP TABLE #OrdersMemHash ;
Listing 8-10

Adding an Index to a Memory-Optimized Table

We can examine the distribution of the values in our hash index by interrogating the sys.dm_db_xtp_hash_index_stats DMV. The query in Listing 8-11 demonstrates using this DMV to view the number of hash collisions and calculate the percentage of empty buckets.
SELECT
  OBJECT_SCHEMA_NAME(HIS.OBJECT_ID) + '.' + OBJECT_NAME(HIS.OBJECT_ID) 'Table Name',
  I.name as 'Index Name',
  HIS.total_bucket_count,
  HIS.empty_bucket_count,
  FLOOR((CAST(empty_bucket_count AS FLOAT)/total_bucket_count) * 100) 'Empty Bucket Percentage',
  total_bucket_count - empty_bucket_count 'Used Bucket Count',
  HIS.avg_chain_length,
  HIS.max_chain_length
FROM sys.dm_db_xtp_hash_index_stats AS HIS
INNER JOIN sys.indexes AS I
        ON HIS.object_id = I.object_id
                AND HIS.index_id = I.index_id ;
Listing 8-11

sys.dm_db_xtp_hash_index_stats

From the results in Figure 8-8, we can see that for our hash index, 78% of the buckets are empty. The percentage is this high because we specified a large BUCKET_COUNT with table growth in mind. If the percentage was less than 33%, we would want to specify a higher number of buckets to avoid hash collisions. We can also see that we have an average chain length of 1, with a maximum chain length of 5. This is healthy. If the average chain length increases, then performance begins to tail off, since SQL Server has to scan multiple values to find the correct key. If the average chain length reaches 10 or higher, then the implication is that the key is nonunique and there are too many duplicate values in the key to make a hash index viable. At this point, we should either drop and re-create the table with a higher bucket count for the index or, ideally, look to implement a nonclustered index instead.
Figure 8-8

sys.dm_db_xtp_hash_index_stats results

In-Memory Nonclustered Indexes

In-memory nonclustered indexes are based on a structure called a Bw-tree, which is similar to the B-tree used by a disk-based nonclustered index. This structure uses a page-mapping table, as opposed to pointers, and is traversed using less than, as opposed to greater than, which is used when traversing disk-based indexes. The leaf level of the index is a singly linked list. Nonclustered indexes perform better than nonclustered hash indexes where a query uses inequality predicates, such as BETWEEN, >, or <. In-memory nonclustered indexes also perform better than a nonclustered hash index where the = predicate is used but not all of the columns in the key are used in the filter. Nonclustered indexes can also return the data in the sort order of the index key. Unlike disk-based indexes, however, these indexes cannot return the results in the reverse order of the index key.

Maintaining Indexes

Once indexes have been created, a DBA’s work is not complete. Indexes need to be maintained on an ongoing basis. The following sections discuss considerations for index maintenance.

Missing Indexes

When you run queries, the Database Engine keeps track of any indexes that it would like to use when building a plan to aid your query performance. When you view an execution plan in SSMS, you are provided with advice on missing indexes, but the data is also available later through DMVs.

Tip

Because the suggestions are based on a single plan, you should review them as opposed to implementing them blindly.

In order to demonstrate this functionality, we can execute the query in Listing 8-12 and choose to include the actual execution plan.

Tip

You can see missing index information by viewing the estimated query plan.

SELECT SUM(c.creditlimit) TotalExposure, SUM(o.netamount) 'TotalOrdersValue'
FROM dbo.CustomersDisk c
INNER JOIN dbo.OrdersDisk o
        ON c.CustomerID = o.CustomerID ;
Listing 8-12

Generating Missing Index Details

Once we have run this query, we can examine the execution plan and see what it tells us. The execution plan for this query is shown in Figure 8-9.
Figure 8-9

Execution plan showing missing indexes

At the top of the execution plan in Figure 8-9, you can see that SQL Server is recommending that we create an index on the CustomerID column of the OrdersDisk table and include the NetAmount column at the leaf level. We are also advised that this should provide a 75% performance improvement to the query.
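If we decided to act on this recommendation, the statement would look something like the sketch below. The index name is our own invention, and we do not actually create it here, because a similar index (NCI_CustomerID) is created later in the chapter for the fragmentation examples.
CREATE NONCLUSTERED INDEX NCI_OrdersDisk_CustomerID_Suggested
    ON dbo.OrdersDisk(CustomerID)
    INCLUDE(NetAmount) ;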

As mentioned, SQL Server also makes this information available through DMVs. The sys.dm_db_missing_index_details DMV joins to the sys.dm_db_missing_index_group_stats DMV through the intermediate DMV sys.dm_db_missing_index_groups, which resolves the many-to-many relationship. The script in Listing 8-13 demonstrates how we can use these DMVs to return details on missing indexes.
SELECT
        mid.statement TableName
        ,ISNULL(mid.equality_columns, '')
            + ','
            + ISNULL(mid.inequality_columns, '') IndexKeyColumns
        ,mid.included_columns
        ,migs.unique_compiles
        ,migs.user_seeks
        ,migs.user_scans
        ,migs.avg_total_user_cost
        ,migs.avg_user_impact
FROM sys.dm_db_missing_index_details mid
INNER JOIN sys.dm_db_missing_index_groups mig
        ON mid.index_handle = mig.index_handle
        INNER JOIN sys.dm_db_missing_index_group_stats migs
                ON mig.index_group_handle = migs.group_handle ;
Listing 8-13

Missing Index DMVs

The results of this query are shown in Figure 8-10. They show the following: the name of the table with the missing index; the column(s) that SQL Server recommends should form the index key; the columns that SQL Server recommends should be added as included columns at the leaf level of the B-tree; the number of times that queries that would have benefited from the index have been compiled; how many seeks would have been performed against the index, if it existed; the number of times that the index has been scanned if it existed; the average cost that would have been saved by using the index; and the average percentage cost that would have been saved by using the index. In our case, we can see that the query would have been 95% less expensive if the index existed when we ran our query.
Figure 8-10

Missing index results

Index Fragmentation

Disk-based indexes are subject to fragmentation. Two forms of fragmentation can occur in B-trees: internal fragmentation and external fragmentation. Internal fragmentation refers to pages having lots of free space. If pages have lots of free space, then SQL Server needs to read more pages than is necessary to return all of the required rows for a query. External fragmentation refers to the pages of the index becoming out of physical order. This can reduce performance, since the data cannot be read sequentially from disk.

For example, imagine that you have a table with one million rows of data and that all of these data rows fit into 5000 pages when the data pages are 100% full. This means that SQL Server needs to read just over 39MB of data in order to scan the entire table (8KB * 5000). If the pages of the table are only 50% full, however, this increases the number of pages in use to 10,000, which also increases the amount of data that needs to be read to 78MB. This is internal fragmentation.

Internal fragmentation can occur naturally when DELETE statements are issued and when DML statements, such as inserting a key value that is not ever-increasing, occur. This is because SQL Server may respond to this situation by performing a page split. A page split creates a new page, moves half of the data from the existing page to the new page, and leaves the other half on the existing page, thus creating 50% free space on both pages. Internal fragmentation can also occur artificially, however, through the misuse of the FILLFACTOR and PAD_INDEX settings.

FILLFACTOR controls how much free space is left on each leaf level page of an index when it is created or rebuilt. By default, the FILLFACTOR is set to 0, which means that it leaves enough space on the page for exactly one row. In some cases, however, when a high number of page splits is occurring due to DML operations, a DBA may be able to reduce fragmentation by altering the FILLFACTOR. Setting a FILLFACTOR of 80, for example, leaves 20% free space in the page, meaning that new rows can be added to the page without page splits occurring. Many DBAs change the FILLFACTOR when they are not required to, however, which automatically causes internal fragmentation as soon as the index is built. PAD_INDEX can be applied only when FILLFACTOR is used, and it applies the same percentage of free space to the intermediate levels of the B-tree.
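For example, if page splits were causing excessive fragmentation in the NCI_Balance index, it could be rebuilt with a lower fill factor, as sketched below; the value of 80 is purely illustrative and should be chosen based on observation of the workload.
ALTER INDEX NCI_Balance ON dbo.CustomersDisk
    REBUILD WITH(FILLFACTOR = 80, PAD_INDEX = ON) ;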

External fragmentation is also caused by page splits and refers to the logical order of pages, as ordered by the index key, being out of sequence when compared to the physical order of the pages on disk. External fragmentation means that SQL Server is less able to perform scan operations using sequential reads, because the disk head needs to move backward and forward over the disk to locate the pages within the file.

Note

This is not the same as fragmentation at the file system level where a data file can be split over multiple, unordered disk sectors.

Detecting Fragmentation

You can identify fragmentation of indexes by using the sys.dm_db_index_physical_stats DMF. This function accepts the parameters listed in Table 8-2.
Table 8-2

sys.dm_db_index_physical_stats Parameters

Parameter

Description

Database_ID

The ID of the database that you want to run the function against. If you do not know it, you can pass in DB_ID('MyDatabase') where MyDatabase is the name of your database.

Object_ID

The Object ID of the table that you want to run the function against. If you do not know it, pass in OBJECT_ID('MyTable') where MyTable is the name of your table. Pass in NULL to run the function against all tables in the database.

Index_ID

The index ID of the index you want to run the function against. This is always 1 for a clustered index. Pass in NULL to run the function against all indexes on the table.

Partition_Number

The ID of the partition that you want to run the function against. Pass in NULL if you want to run the function against all partitions, or if the table is not partitioned.

Mode

Choose LIMITED, SAMPLED, or DETAILED. LIMITED only scans the non–leaf levels of an index. SAMPLED scans 1% of pages in the table, unless the table has 10,000 pages or less, in which case DETAILED mode is used. DETAILED mode scans 100% of the pages in the table. For very large tables, SAMPLED is often preferred due to the length of time it can take to return data in DETAILED mode.

Listing 8-14 demonstrates how we can use sys.dm_db_index_physical_stats to check the fragmentation levels of our OrdersDisk table.
USE Chapter8
GO
SELECT
i.name
,IPS.index_type_desc
,IPS.index_level
,IPS.avg_fragmentation_in_percent
,IPS.avg_page_space_used_in_percent
,i.fill_factor
,CASE
    WHEN i.fill_factor = 0
        THEN 100-IPS.avg_page_space_used_in_percent
    ELSE i.fill_factor-ips.avg_page_space_used_in_percent
END Internal_Frag_With_Fillfactor_Offset
,IPS.fragment_count
,IPS.avg_fragment_size_in_pages
FROM sys.dm_db_index_physical_stats(DB_ID('Chapter8'),OBJECT_ID('dbo.OrdersDisk'),NULL,NULL,'DETAILED') IPS
INNER JOIN sys.indexes i
        ON IPS.Object_id = i.object_id
                AND IPS.index_id = i.index_id ;
Listing 8-14

sys.dm_db_index_physical_stats

You can see, from the results of this query, that one row is returned for every level of each B-tree. If the table was partitioned, this would also be broken down by partition. The index_level column indicates which level of the B-tree is represented by the row. Level 0 implies the leaf level of the B-tree, whereas Level 1 is either the lowest intermediate level or the root level if no intermediate levels exist, and so on, with the highest number always reflecting the root node. The avg_fragmentation_in_percent column tells us how much external fragmentation is present. We want this value to be as close to zero as possible. The avg_page_space_used_in_percent column tells us how much internal fragmentation is present, so we want this value to be as close to 100 as possible. The Internal_Frag_With_Fillfactor_Offset column also tells us how much internal fragmentation is present, but this time, it applies an offset to allow for the fill factor that has been applied to the index. The fragment_count column indicates how many chunks of contiguous pages exist for the index level, so we want this value to be as low as possible. The avg_fragment_size_in_pages column tells us the average size of each fragment, so obviously this number should be as high as possible.

Removing Fragmentation

You can remove fragmentation by either reorganizing or rebuilding an index. When you reorganize an index, SQL Server reorganizes the data within the leaf level of the index. It looks to see if there is free space on a page that it can use. If there is, then it moves rows from the next page onto this page. If there are empty pages at the end of this process, then they are removed. SQL Server only fills pages to the level of the FILLFACTOR specified. Once this is complete, the data within the leaf-level pages is shuffled so that their physical order is a closer match to their logical, key order. Reorganizing an index is always an ONLINE operation, meaning that the index can still be used by other processes while the operation is in progress. Note, however, that a reorganize will fail if the ALLOW_PAGE_LOCKS option is turned off. The process of reorganizing an index is suitable for removing internal fragmentation and low levels of external fragmentation of 30% or less. However, it makes no guarantees, even with this usage profile, that there will not be fragmentation left after the operation completes.

The script in Listing 8-15 creates an index called NCI_CustomerID on the OrdersDisk table and then demonstrates how we can reorganize it.
--Create the index that will be used in the examples, for the following sections
CREATE NONCLUSTERED INDEX NCI_CustomerID ON dbo.OrdersDisk(CustomerID) ;
GO
--Reorganize the index
ALTER INDEX NCI_CustomerID ON dbo.OrdersDisk REORGANIZE ;
Listing 8-15

Reorganizing an Index

When you rebuild an index, the existing index is dropped and then completely rebuilt. This, by definition, removes internal and external fragmentation since the index is built from scratch. It is important to note, however, that you are still not guaranteed to be 100% fragmentation free after this operation. This is because SQL Server assigns different chunks of the index to each CPU core that is involved in the rebuild. Each CPU core should build its own section in the perfect sequence, but when the pieces are synchronized, there may be a small amount of fragmentation. You can minimize this issue by specifying MAXDOP = 1. Even when you set this option, you may still encounter fragmentation in some cases. For example, if ALLOW_PAGE_LOCKS is configured as OFF, then the workers share the allocation cache, which can cause fragmentation. Additionally, when you set MAXDOP = 1, it is at the expense of the time it takes to rebuild the index.

You can rebuild an index by performing either an ONLINE or OFFLINE operation. If you choose to rebuild the index as an ONLINE operation, then the original version of the index is still accessible while the operation takes place. The ONLINE operation comes at the expense of both time and resource utilization. You need to enable ALLOW_PAGE_LOCKS to make your ONLINE rebuild successful.

The script in Listing 8-16 demonstrates how we can rebuild the NCI_CustomerID index on the OrdersDisk table. Because we have not specified ONLINE = ON, it uses the default setting of ONLINE = OFF, and the index is locked for the entire operation. Because we specify MAXDOP = 1, the operation is slower, but fragmentation is minimized.
ALTER INDEX NCI_CustomerID ON dbo.OrdersDisk REBUILD WITH(MAXDOP = 1) ;
Listing 8-16

Rebuilding an Index

If you create a maintenance plan to rebuild or reorganize indexes, then all indexes within the specified database are rebuilt, regardless of whether they need to be—this can be time-consuming and eat resources. You can resolve this issue by using the sys.dm_db_index_physical_stats DMF to create an intelligent script that you can run from SQL Server Agent and use to reorganize or rebuild only those indexes that require it. This is discussed in more detail in Chapter 17.
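As a minimal sketch of that approach, the query below uses sys.dm_db_index_physical_stats to generate, rather than execute, a REORGANIZE or REBUILD statement for each index in the current database that exceeds a fragmentation threshold. The 5% and 30% thresholds are common starting points rather than fixed rules, and a production solution, such as the one discussed in Chapter 17, would also execute the commands and add error handling.
SELECT
    'ALTER INDEX ' + QUOTENAME(i.name)
        + ' ON ' + QUOTENAME(OBJECT_SCHEMA_NAME(ips.object_id))
        + '.' + QUOTENAME(OBJECT_NAME(ips.object_id))
        + CASE
              WHEN ips.avg_fragmentation_in_percent > 30 THEN ' REBUILD ;'
              ELSE ' REORGANIZE ;'
          END AS MaintenanceCommand
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') ips
INNER JOIN sys.indexes i
    ON ips.object_id = i.object_id
        AND ips.index_id = i.index_id
WHERE ips.avg_fragmentation_in_percent > 5
    AND ips.index_id > 0 ;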

Tip

There is a myth that using SSDs removes the issue of index fragmentation. This is not correct. Although SSDs reduce the performance impact of out-of-order pages, they do not remove it. They also have no impact on internal fragmentation.

Resumable Index Operations

SQL Server 2019 supports resumable online index creation and index rebuilds for both traditional (clustered and nonclustered) indexes and Columnstore indexes. Resumable index operations allow you to pause an online index operation (build or rebuild), in order to free up system resources and then restart it again, from where it left off, when resource utilization is no longer an issue. These operations also allow an online index operation to be restarted, after it has failed for common reasons, such as lack of disk space.

Some of the advantages that this functionality brings to a DBA are clear. For example, if a large index rebuild needs to fit inside a short maintenance window, then the rebuild can be paused at the end of one maintenance window and restarted at the beginning of the next, as opposed to having to abort the operation to free up system resources. There are also other, hidden benefits, however. For example, resumable index operations do not consume large amounts of log space, even when performed on large indexes. This is because all data required to restart the index operation is stored inside the database. A side benefit of this is that the index operation does not hold a long-running transaction while paused.

For the most part, there are very few drawbacks to using resumable index operations. The quality of defragmentation achieved is comparable to that of a standard online index operation, and there is no real difference in speed between resumable and standard online index operations (excluding the potential pause, of course). As always, however, there is no such thing as a free lunch, and while a resumable operation is paused, there will be some degradation of write performance against the affected tables and indexes, because two versions of the index need to be maintained. This degradation should not be more than 10% in most cases. There should be no impact on read operations during the pause, as they continue to use the original version of the index until the operation has completed.

The command in Listing 8-17 demonstrates how to rebuild the NCI_CustomerID index on the OrdersDisk table as a resumable operation.
ALTER INDEX NCI_CustomerID ON dbo.OrdersDisk REBUILD WITH(MAXDOP = 1, ONLINE=ON, RESUMABLE=ON) ;
Listing 8-17

Resumable Index Rebuild

The command in Listing 8-18 will pause the index rebuild, started in Listing 8-17.

Tip

The command in Listing 8-18 will only work if the execution of the command in Listing 8-17 has not yet completed.

ALTER INDEX NCI_CustomerID ON dbo.OrdersDisk PAUSE
Listing 8-18

Pause an Index Rebuild

After running this script, the index rebuild operation pauses, and the message shown in Figure 8-11 is displayed.
Figure 8-11

Message displayed when an index operation is paused

The script in Listing 8-19 will either resume or abort the index rebuild, based on the value assigned to the @Action variable.
DECLARE @Action NVARCHAR(6) = 'Resume'
IF (@Action = 'Resume')
BEGIN
      ALTER INDEX NCI_CustomerID ON dbo.OrdersDisk RESUME
END
ELSE
BEGIN
      ALTER INDEX NCI_CustomerID ON dbo.OrdersDisk ABORT
END
Listing 8-19

Resume or Abort an Index Operation
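While a resumable operation is running or paused, you can check its state and progress by querying the sys.index_resumable_operations catalog view. A minimal sketch, assuming the rebuild started in Listing 8-17:
--Check the state and progress of resumable index operations
SELECT name,
       state_desc,
       percent_complete,
       last_pause_time
FROM sys.index_resumable_operations ;
GO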

Instead of turning on the ONLINE and RESUMABLE options for each individual index operation, you can turn them on globally, at the database level, by using database scoped configurations. The ELEVATE_ONLINE configuration changes the default value of ONLINE to ON for supported index operations within the database. The ELEVATE_RESUMABLE configuration defaults the value of RESUMABLE to ON.

Both ELEVATE_ONLINE and ELEVATE_RESUMABLE can be configured as OFF (the default behavior), WHEN_SUPPORTED, or FAIL_UNSUPPORTED. When set to WHEN_SUPPORTED, noncompatible operations, such as rebuilding XML indexes, are performed as offline, nonresumable operations. If set to FAIL_UNSUPPORTED, however, such operations fail, throwing an error.

The script in Listing 8-20 demonstrates how to set ELEVATE_ONLINE and ELEVATE_RESUMABLE to WHEN_SUPPORTED, for the Chapter8 Database.
USE Chapter8
GO
ALTER DATABASE SCOPED CONFIGURATION SET ELEVATE_ONLINE = WHEN_SUPPORTED ;
GO
ALTER DATABASE SCOPED CONFIGURATION SET ELEVATE_RESUMABLE = WHEN_SUPPORTED ;
GO
Listing 8-20

Default to ONLINE and RESUMABLE

Partitioned Indexes

As mentioned in Chapter 7, it is possible to partition indexes as well as tables. A clustered index always shares the same partition scheme as the underlying table, because the leaf level of the clustered index is made up of the actual data pages of the table. Nonclustered indexes, on the other hand, can either be aligned with the table or not. Indexes are aligned if they share the same partition scheme or if they are created on an identical partition scheme.

In most cases, it is good practice to align nonclustered indexes with the base table, but on occasion, you may wish to deviate from this strategy. For example, if the base table is not partitioned, you can still partition an index for performance. Also, if your index key is unique and does not contain the partitioning key, then it needs to be unaligned. There is also an opportunity to gain a performance boost from unaligned nonclustered indexes if the table is involved in collocated joins with other tables on different columns.
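As an illustration, once the OrdersDisk table has been partitioned (as in Listing 8-21, later in this section), an index created directly on a filegroup, rather than on the partition scheme, is unaligned. The following is a minimal sketch, using a hypothetical index name and assuming the OrderDate column exists on the table:
--Create an unaligned nonclustered index, stored on the PRIMARY filegroup
CREATE NONCLUSTERED INDEX NCI_Unaligned_OrderDate ON dbo.OrdersDisk(OrderDate)
    ON [PRIMARY] ;
GO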

You can create a partitioned index by using the ON clause to specify the partition scheme in the same way that you create a partitioned table. If the index already exists, you can rebuild it, specifying the partition scheme in the ON clause. The script in Listing 8-21 creates a partition function and a partition scheme. It then rebuilds the clustered index of the OrdersDisk table to move it to the new partition scheme. Finally, it creates a new nonclustered index, which is partition aligned with the table.

Tip

Before running the script, change the name of the primary key to match your own.

--Create partition function
CREATE PARTITION FUNCTION OrdersPartFunc(int)
AS RANGE LEFT
FOR VALUES(250000,500000,750000) ;
GO
--Create partition scheme
CREATE PARTITION SCHEME OrdersPartScheme
AS PARTITION OrdersPartFunc
ALL TO([PRIMARY]) ;
GO
--Partition OrdersDisk table
ALTER TABLE dbo.OrdersDisk DROP CONSTRAINT PK__OrdersDi__CAC5E7420B016A9F ;
GO
ALTER TABLE dbo.OrdersDisk
ADD PRIMARY KEY CLUSTERED(OrderNumber) ON OrdersPartScheme(OrderNumber) ;
GO
--Create partition aligned nonclustered index
CREATE NONCLUSTERED INDEX NCI_Part_CustID ON dbo.OrdersDisk(CustomerID, OrderNumber)
    ON OrdersPartScheme(OrderNumber) ;
Listing 8-21

Rebuilding and Creating Partitioned Indexes

When you rebuild an index, you can also specify that only a certain partition is rebuilt. The example in Listing 8-22 rebuilds only Partition 1 of the NCI_Part_CustID index.
ALTER INDEX NCI_Part_CustID ON dbo.OrdersDisk REBUILD PARTITION = 1 ;
Listing 8-22

Rebuilding a Specific Partition
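To decide which partitions are worth rebuilding, you can query sys.dm_db_index_physical_stats and inspect the partition_number column. A minimal sketch, assuming the NCI_Part_CustID index created in Listing 8-21:
--Check fragmentation for each partition of the NCI_Part_CustID index
SELECT ips.partition_number,
       ips.avg_fragmentation_in_percent,
       ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.OrdersDisk'), NULL, NULL, 'LIMITED') ips
INNER JOIN sys.indexes i
    ON i.object_id = ips.object_id
    AND i.index_id = ips.index_id
WHERE i.name = 'NCI_Part_CustID' ;
GO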

Statistics

Cardinality refers to the number of rows the Query Optimizer expects to be returned by a query and is a key part of how the optimizer chooses the optimal plan for a given query. The primary method that the optimizer uses to estimate cardinality is statistics.

SQL Server maintains statistics regarding the distribution of data within a column or set of columns. These columns can either be within a table or a nonclustered index. When the statistics are built on a set of columns, then they also include correlation statistics between the distributions of values in those columns. The Query Optimizer can then use these statistics to build efficient query plans based on the number of rows that it expects a query to return (cardinality). A lack of statistics can lead to inefficient plans being generated. For example, the Query Optimizer may decide to perform an index scan when a seek operation would be more efficient.

You can allow SQL Server to manage statistics automatically. A database-level option called AUTO_CREATE_STATISTICS automatically generates single-column statistics where SQL Server believes better cardinality estimates will help query performance. There are limitations to this, however. For example, filtered statistics or multicolumn statistics cannot be created automatically.

Tip

The only exception to this is when an index is created. When you create an index, statistics are always generated, even multicolumn statistics, to cover the index key. This also includes filtered statistics on filtered indexes. This is regardless of the AUTO_CREATE_STATISTICS setting.
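You can see which statistics SQL Server has created automatically by querying sys.stats, where auto-created column statistics are flagged by the auto_created column (and typically have names prefixed with _WA_Sys_). The following is a minimal sketch against the OrdersDisk table in the Chapter8 database:
USE Chapter8
GO
--List automatically created statistics and the columns they cover
SELECT s.name AS StatisticName,
       c.name AS ColumnName
FROM sys.stats s
INNER JOIN sys.stats_columns sc
    ON sc.object_id = s.object_id
    AND sc.stats_id = s.stats_id
INNER JOIN sys.columns c
    ON c.object_id = sc.object_id
    AND c.column_id = sc.column_id
WHERE s.object_id = OBJECT_ID('dbo.OrdersDisk')
    AND s.auto_created = 1 ;
GO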

Auto Create Incremental Stats causes statistics on partitioned tables to be automatically created on a per-partition basis, as opposed to being generated for the whole table. This can reduce contention by stopping a scan of the full table from being required.

Statistics become out of date as DML operations are performed against a table. The database-level option, AUTO_UPDATE_STATISTICS, rebuilds statistics when they become outdated. The rules in Table 8-3 are used to determine if statistics are out of date.
Table 8-3

Statistics Update Algorithms (No of Rows in Table: Rule)

0: Table has greater than 0 rows.

<= 500: 500 or more values in the first column of the statistics object have changed.

> 500: 500 + 20% or more values in the first column of the statistics object have changed.

Partitioned table with INCREMENTAL statistics: 20% or more of values in the first column of the statistics object for a specific partition have changed.

The AUTO_UPDATE_STATISTICS process is very useful, and it is normally a good idea to use it. An issue can arise, however, because the process is synchronous and blocking. When a query is run, SQL Server checks to see if the statistics need to be updated. If they do, SQL Server updates them, but this blocks the query, and any other queries that require the same statistics, until the operation completes. During times of high read/write load, such as an ETL process against very large tables, this can cause performance problems. The workaround for this is another database-level option, called AUTO_UPDATE_STATISTICS_ASYNC. Even when this option is turned on, it only takes effect if AUTO_UPDATE_STATISTICS is also turned on. When enabled, AUTO_UPDATE_STATISTICS_ASYNC forces the update of the statistics object to run as an asynchronous background process. This means that the query that caused it to run, and other queries, are not blocked, providing that a schema stability lock is not required. The trade-off, however, is that these queries do not benefit from the updated statistics.

The options mentioned earlier can be configured on the Options page of the Database Properties dialog box. Alternatively, you can configure them using ALTER DATABASE commands, as demonstrated in Listing 8-23.
--Turn on Auto_Create_Stats
ALTER DATABASE Chapter8 SET AUTO_CREATE_STATISTICS ON ;
GO
--Turn on Auto_Create_Incremental_Stats
ALTER DATABASE Chapter8 SET AUTO_CREATE_STATISTICS ON  (INCREMENTAL=ON) ;
GO
--Turn on Auto_Update_Stats
ALTER DATABASE Chapter8 SET AUTO_UPDATE_STATISTICS ON WITH NO_WAIT ;
GO
--Turn on Auto_Update_Stats_Async
ALTER DATABASE Chapter8 SET AUTO_UPDATE_STATISTICS_ASYNC ON WITH NO_WAIT ;
GO
Listing 8-23

Toggling Automatic Statistics Options

SQL Server 2022 introduces a new database scoped configuration (see Chapter 5 for further details about database scoped configurations), which avoids blocking other transactions that require a schema stability lock. The ASYNC_STATS_UPDATE_WAIT_AT_LOW_PRIORITY option, which can be toggled on and off using the script in Listing 8-24, works by placing the lock request on a low-priority queue. It only has an effect if AUTO_UPDATE_STATISTICS_ASYNC is turned on for the database.
--Turn On
ALTER DATABASE SCOPED CONFIGURATION SET ASYNC_STATS_UPDATE_WAIT_AT_LOW_PRIORITY = ON
GO
--Turn Off
ALTER DATABASE SCOPED CONFIGURATION SET ASYNC_STATS_UPDATE_WAIT_AT_LOW_PRIORITY = OFF
GO
Listing 8-24

Toggle ASYNC_STATS_UPDATE_WAIT_AT_LOW_PRIORITY

Filtered Statistics

Filtered statistics allow you to create statistics on a subset of data within a column through the use of a WHERE clause in the statistic creation. This allows the Query Optimizer to generate an even better plan, since the statistics only contain the distribution of values within the well-defined subset of data. For example, if we create filtered statistics on the NetAmount column of our OrdersDisk table filtered by OrderDate being greater than 1 Jan 2019, then the statistics will not include rows that contain old orders, allowing us to search for large, recent orders more efficiently.

Incremental Statistics

Incremental statistics can help reduce table scans caused by statistics updates on large partitioned tables. When enabled, statistics are created and updated on a per-partition basis, as opposed to globally, for the entire table. This can significantly reduce the amount of time you need to update statistics on large partitioned tables, since partitions where the statistics are not outdated are not touched, therefore reducing unnecessary overhead.

Incremental statistics are not supported in all scenarios, however. A warning is generated and the setting is ignored if the option is used with the following types of statistics:
  • Statistics on views

  • Statistics on XML columns

  • Statistics on Geography or Geometry columns

  • Statistics on filtered indexes

  • Statistics for indexes that are not partition aligned

Additionally, you cannot use incremental statistics on read-only databases or on databases that are participating in an AlwaysOn Availability Group as a readable secondary replica.
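A minimal sketch of incremental statistics follows; it assumes the partitioned OrdersDisk table from Listing 8-21 and uses an illustrative statistic name. The statistic is created with INCREMENTAL = ON and then refreshed for a single partition only, rather than for the whole table:
--Create an incremental statistic on the partitioned OrdersDisk table
CREATE STATISTICS Stat_NetAmount_Incremental ON dbo.OrdersDisk(NetAmount)
WITH INCREMENTAL = ON ;
GO
--Update the statistic for partition 2 only; the result is merged into the global statistics
UPDATE STATISTICS dbo.OrdersDisk Stat_NetAmount_Incremental
WITH RESAMPLE ON PARTITIONS(2) ;
GO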

Managing Statistics

In addition to being automatically created and updated by SQL Server, you can also create and update statistics manually using the CREATE STATISTICS statement. If you wish to create filtered statistics, add a WHERE clause at the end of the statement. The script in Listing 8-25 creates a multicolumn statistic on the FirstName and LastName columns of the CustomersDisk table. It then creates a filtered statistic on the NetAmount column of the OrdersDisk table, built only on rows where the OrderDate is greater than 1st Jan 2019.
USE Chapter8
GO
--Create multicolumn statistic on FirstName and LastName
CREATE STATISTICS Stat_FirstName_LastName ON dbo.CustomersDisk(FirstName, LastName) ;
GO
--Create filtered statistic on NetAmount
CREATE STATISTICS Stat_NetAmount_Filter_OrderDate ON dbo.OrdersDisk(NetAmount)
WHERE OrderDate > '2019-01-01' ;
GO
Listing 8-25

Creating Statistics

When creating statistics, you can use the options detailed in Table 8-4.
Table 8-4

Creating Statistics Options (Option: Description)

FULLSCAN: Creates the statistic object on a sample of 100% of rows in the table. This option creates the most accurate statistics but takes the longest time to generate.

SAMPLE: Specifies the number of rows or percentage of rows to use to build the statistic object. The larger the sample, the more accurate the statistic, but the longer it takes to generate. Specifying 0 creates the statistic but does not populate it.

NORECOMPUTE: Excludes the statistic object from being automatically updated by AUTO_UPDATE_STATISTICS.

INCREMENTAL: Overrides the database-level setting for incremental statistics.
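For example, the following is a minimal sketch that combines two of these options, building a statistic from a 25 percent sample and excluding it from automatic updates (the statistic name and column are illustrative):
--Create a sampled statistic that is excluded from automatic updates
CREATE STATISTICS Stat_OrderDate_Sampled ON dbo.OrdersDisk(OrderDate)
WITH SAMPLE 25 PERCENT, NORECOMPUTE ;
GO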

Individual statistics, or all statistics on an individual table, can be updated by using the UPDATE STATISTICS statement. The script in Listing 8-26 first updates the Stat_NetAmount_Filter_OrderDate statistics object that we created on the OrdersDisk table and then updates all statistics on the CustomersDisk table.
--Update a single statistics object
UPDATE STATISTICS dbo.OrdersDisk Stat_NetAmount_Filter_OrderDate ;
GO
--Update all statistics on a table
UPDATE STATISTICS dbo.CustomersDisk ;
GO
Listing 8-26

Updating Statistics

When using UPDATE STATISTICS, all of the options detailed in Table 8-4 for creating statistics remain valid, and the additional options detailed in Table 8-5 are also available.
Table 8-5

Updating Statistics Options (Option: Description)

RESAMPLE: Uses the most recent sample rate to update the statistics.

ON PARTITIONS: Causes statistics to be generated for the partitions listed and then merges them together to create global statistics.

ALL | COLUMNS | INDEX: Specifies if statistics should be updated for just columns, just indexes, or both. The default is ALL.

You can also update statistics for an entire database by using the sp_updatestats system stored procedure. This procedure updates out-of-date statistics on disk-based tables and all statistics on memory-optimized tables regardless of whether they are out of date or not. Listing 8-27 demonstrates this system stored procedure’s usage to update statistics in the Chapter8 database. Passing in the RESAMPLE parameter causes the most recent sample rate to be used. Omitting this parameter causes the default sample rate to be used.
EXEC sp_updatestats 'RESAMPLE' ;
Listing 8-27

Sp_updatestats

Note

Updating statistics causes queries that use those statistics to be recompiled the next time they run. The only time this is not the case is if there is only one possible plan for the tables and indexes referenced. For example, SELECT * FROM MyTable always performs a clustered index scan, assuming that the table has a clustered index.

SQL Server 2019 introduces additional metadata information, to help diagnose issues that are caused by queries waiting for synchronous statistics updates to occur. Firstly, a new wait type has been added, called WAIT_ON_SYNC_STATISTICS_REFRESH. This wait type denotes the amount of time that queries have spent waiting on the completion of synchronous statistics updates. Secondly, a new command type, called SELECT (STATMAN), has been added to the sys.dm_exec_requests DMV. This command type indicates that a SELECT statement is currently waiting for a synchronous statistics update to complete, before it can continue.
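The following is a minimal sketch of how you might surface this information, assuming sufficient permissions to query the DMVs:
--Check cumulative time spent waiting on synchronous statistics updates
SELECT wait_type,
       waiting_tasks_count,
       wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = 'WAIT_ON_SYNC_STATISTICS_REFRESH' ;
GO
--Find requests that are currently waiting on a synchronous statistics update
SELECT session_id,
       command,
       wait_type,
       wait_time
FROM sys.dm_exec_requests
WHERE command = 'SELECT (STATMAN)' ;
GO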

Managing Cardinality

Because cardinality estimation is critical to plan optimization, if the cardinality estimator makes the wrong assumptions, it can have serious implications for query performance. Therefore, major changes to the cardinality estimator are not made often. There was a major update in SQL Server 7.0; the next large update was in SQL Server 2014.

The update to the cardinality estimator in SQL Server 2014 had two main impacts on behavior. Firstly, the assumption that columns are independent was replaced with an assumption that there will be correlation between values in different columns.

Secondly, the behavior for estimating whether data will be returned when joining multiple tables has also changed. The original behavior was to estimate the selectivity of the predicates on each table before joining the histograms to estimate join selectivity. The new behavior is to estimate join selectivity from the base tables before estimating predicate selectivity. This is known as base containment.

The issue is that the new cardinality estimations may be suboptimal for your workloads. If this is the case, then you can force a query to use legacy cardinality estimations by using a plan hint. New in SQL Server 2022, you can also use a Query Store hint to force the behavior without changing the query. This will be discussed in Chapter 22.
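For example, the following is a minimal sketch of forcing legacy cardinality estimation for a single query by using the USE HINT query hint; the query itself is purely illustrative:
--Force legacy cardinality estimation for this query only
SELECT CustomerID,
       COUNT(*) AS OrderCount
FROM dbo.OrdersDisk
GROUP BY CustomerID
OPTION(USE HINT('FORCE_LEGACY_CARDINALITY_ESTIMATION')) ;
GO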

There are also Database Scoped Configurations that allow you to change the behavior at the database level. You can use the commands in Listing 8-28 to toggle between legacy and standard cardinality estimations for all queries within the database.
--Turn Legacy Cardinality Estimations On
ALTER DATABASE SCOPED CONFIGURATION SET LEGACY_CARDINALITY_ESTIMATION = ON
GO
--Turn Legacy Cardinality Estimations Off
ALTER DATABASE SCOPED CONFIGURATION SET LEGACY_CARDINALITY_ESTIMATION = OFF
GO
Listing 8-28

Toggle Legacy Cardinality Estimations

In SQL Server 2022, providing that Query Store is enabled, you can also use feedback to help the optimizer determine what cardinality assumptions should be used for a given query, based on previous optimizations of the same query. You can use the commands in Listing 8-29 to toggle this functionality on or off, within a database.

Tip

Query store is discussed in Chapter 22.

--Turn Cardinality Feedback On
ALTER DATABASE SCOPED CONFIGURATION SET CE_FEEDBACK = ON
GO
--Turn Cardinality Feedback Off
ALTER DATABASE SCOPED CONFIGURATION SET CE_FEEDBACK = OFF
GO
Listing 8-29

Toggle Cardinality Feedback

Summary

A table that does not have a clustered index is called a heap and the data pages of the table are stored in no particular order. Clustered indexes build a B-tree structure, based on the clustered index key, and cause the data within the table to be ordered by that key. There can only ever be one clustered index on a table because the leaf level of the clustered index is the actual data pages of the table, and the pages can only be physically ordered in one way. The natural choice of key for a clustered index is the primary key of the table and, by default, SQL Server automatically creates a clustered index on the primary key. There are situations, however, when you may choose to use a different column as the clustered index key. This is usually when the primary key of the table is very wide, is updateable, or is not ever-increasing.

Nonclustered indexes are also B-tree structures built on other columns within a table. The difference is that the leaf level of a nonclustered index contains pointers to the data pages of the table, as opposed to the data pages themselves. Because a nonclustered index does not order the actual data pages of a table, you can create multiple nonclustered indexes. These can improve query performance when you create them on columns that are used in WHERE, JOIN, and GROUP BY clauses. You can also include other columns at the leaf level of the B-tree of a nonclustered index in order to cover a query. A query is covered by a nonclustered index when the index contains all of the columns the query needs, so you do not need to read the data from the underlying table. You can also filter a nonclustered index by adding a WHERE clause to the definition. This allows for improved query performance for queries that use a well-defined subset of data.

Columnstore indexes compress data and store each column in a distinct set of pages. This can significantly improve the performance of data warehouse–style queries, which perform analysis on large datasets, since only the required columns need to be accessed, as opposed to the entire row. Each column is also split into segments, with each segment containing a header with metadata about the data, in order to further improve performance by allowing SQL Server to only access the relevant segments, in order to satisfy a query. Nonclustered Columnstore indexes are similar to Clustered Columnstore indexes, except that they make a copy of the data. This provides large performance enhancements at the expense of space.

You can create two types of index on memory-optimized tables: nonclustered indexes and nonclustered hash indexes. Nonclustered hash indexes are very efficient for point lookups, but they can be much less efficient when you must perform a range scan. Nonclustered indexes perform better for operations such as inequality comparisons, and they are also able to return the data in the sort order of the index key.

Indexes need to be maintained over time. They become fragmented due to DML statements causing page splits and can be reorganized or rebuilt to reduce or remove fragmentation. When pages become out of sequence, this is known as external fragmentation, and when pages have lots of free space, this is known as internal fragmentation. SQL Server stores metadata regarding index fragmentation and can display this through a DMF called sys.dm_db_index_physical_stats. SQL Server also maintains information on indexes that it regards as missing. A missing index is an index that does not exist in the database but would improve query performance if it were created. DBAs can use this data to help them improve their indexing strategies.

SQL Server maintains statistics about the distribution of values within a column or set of columns to help improve the quality of query plans. Without good-quality statistics, SQL Server may make the wrong choice about which index or index operator to use in order to satisfy a query. For example, it may choose to perform an index scan when an index seek would have been more appropriate. You can update statistics manually or automatically, but either way causes queries to be recompiled. SQL Server also supports incremental statistics, which allow statistics to be created on a per-partition basis, as opposed to globally for an entire table.

As well as managing statistics, SQL Server 2022 also allows you a limited degree of control over the assumptions used for cardinality estimation. The CE_FEEDBACK and LEGACY_CARDINALITY_ESTIMATION Database Scoped Configurations can be used. Legacy cardinality estimation can also be set for a specific query, using a plan hint or a Query Store hint.
