Chapter 10 Understanding and designing indexes

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Chapter 10
Understanding and designing indexes

Understanding indexing in memory-optimized tables

Understanding other types of indexes

In this chapter, we dive into indexing of all kinds—not just clustered and nonclustered indexes—including practical development techniques for designing indexes. We spend time on memory-optimized tables throughout, including hash indexes for extreme writes, and Columnstore indexes for extreme reads. We review “missing” indexes and index usage, and special types of indexes for niche uses. Finally, we explain statistics, how they are created and updated, and important performance-related options for statistics objects.

Designing clustered indexes

Let’s be clear about what the clustered index is and then state the case for why every table in a relational database should have one, with very few exceptions. In this section, we describe the process for choosing the best clustered index key as well as the case against deliberately choosing not to create a clustered index.

Whether you are inheriting and maintaining a database or designing the objects within it, there are important facts to know about clustered indexes. In the case of rowstore and Columnstore tables, the clustered index stores the data rows for all columns in the table, sorted by the clustered index key. Memory-optimized tables don’t have a clustered index structure inherent to their design but could have a clustered Columnstore index created for them.

Choosing a proper clustered index key

When designing the clustered index key for a table, keep in mind four marks of a good clustered index key, or in the case of a compound clustered index key, the first column listed. Let’s review four key factors that will help you understand what role the clustered index key serves and how best to design one:

Increasing sequential value. A value that increases (such as 1,2,3…, or an increasing point in time, or an increasing alphanumeric) is valuable from a page organization standpoint. This means that the insert pattern of the data as it comes in from the business will match the loading of rows into the table.

A column with the IDENTITY property or populated by a SEQUENCE value matches this perfectly. Use date and time data only if it is highly unlikely to repeat, and then strongly consider using the datetimeoffset data type to avoid repeated data during daylight savings time changes.

Unique. A clustered index key does not need to be unique, but it really should be. (That said, the clustered key does not need to be the Primary Key of the table, or the only uniqueness enforced in the table.) A unique (or near-unique) clustered index means efficient seeks, and if your application will be searching for individual rows out of this table regularly, you and the business should know what makes those searches unique.

Unique constraints, whether they be nonclustered or clustered, can improve performance on the same data, and create a more efficient structure. If a clustered index is declared without the UNIQUE property, a second key value is added in the background, a four-byte integer uniquifier column. Microsoft SQL Server must have some way to uniquely identify each row.
Surrogate. The clustered index key shouldn’t be visible to end-user applications or reports. In general, when the end users can see data, they will eventually see fit to change that data. You do not want clustered index keys to ever change (much less PRIMARY KEY key columns). A system-generated column of sequential values (again, an IDENTITY column is perfect) or one that combines system or application-generated fields such as dates and times, or numbers, is ideal.

The negative impact of changing the clustering key includes the possibility that the first two aforementioned guidelines would be broken. If the clustered key is also a Primary Key, updating the key’s values could also require cascading updates to enforce referential integrity. It is much easier for everyone involved if only columns with business value are exposed to end users and, therefore, could be changed by end users. In normalized database design, we would call these “natural keys” as opposed to “surrogate keys.”
Narrow data type. This is not listed last as an indication that it is least important; quite to the contrary, the decision of data type for your clustered index key can have a large impact on table size, the cost of index maintenance, and the efficiency of queries at scale. The clustered index key value is also stored with every nonclustered index key value, meaning that an unnecessarily wide clustered index key will also cause unnecessarily wide nonclustered indexes on the table. This can have a very large impact on storage on drives and in memory at scale.

The narrow data type guidance should also steer you away from using the uniqueidentifier field, which is 16 bytes per row, or four times the size of an integer column per row, and twice as large as a bigint. It also steers away from using wide strings, such as names, addresses, or URLs.

Inside OUT

Why are uniqueidentifiers an overall bad choice for the clustered index key, even for the “oil rig problem”?

There is a common design challenge to store rows from multiple (perhaps disconnected) data sources in the same table; for example, oil rigs, medical devices, or a supervisory control and data acquisition (SCADA) system. Each source of the data must create unique values for itself, but those values must then be combined into a single table. The uniqueidentifier data type and newid() function appear to be an option because they will generate values uniquely across multiple servers.

This is not a good design for scale, however, because uniqueidentifiers are random, meaning that inserts will perfectly fragment a table with each new row. This will cause page splits (an expensive I/O operation) as the rows naturally merge into the rest with each insert in the “middle” rather than at the end, inserting sequentially. (you can mitigate this, though not completely, by altering the fill factor of each index that uses the uniqueidentier as a key. However, this is also not desirable, because it will further increase the space to store the same data.)

Even the newsequentialid() function, to create sequential uniqueidentifiers, has fatal flaws. After a server restart, the sequence starts again with, meaning that eventually, you will be back to writing new rows in the middle of existing rows, causing page splits again.

Consider instead a solution using multiple integers, one that autoincrements and one that identifies the data source, if you are considering the uniqueidentifier field. Even two four-byte integers are half the size of a uniqueidentifier, and they compress better.

This design problem usually involves these devices merging their data periodically—not continuously. In the case of continuous connected application integration into a single table, consider using the SEQUENCE feature of SQL Server, introduced in SQL Server 2012, instead of a uniqueidentifier. Using the SEQUENCE object will allow for multiple database connections write rows using a unique, autoincrementing, ascending, procedurally generated integer.

It is ironic that a number of Microsoft-developed platforms use uniqueidentifiers heavily, and sometimes to very public failures, for example, the Windows 7 RC download page. (Read Paul Randal’s blog, “Why did the Windows 7 RC download failure happen?” https://www.sqlskills.com/blogs/paul/why-did-the-windows-7-rc-download-failure-happen/) But systems like Microsoft SharePoint and even SQL Server’s own merge replication needed to be developed for utility and versatility across unlimited client environments and a wide array of user expertise. When designing your own systems, take advantage of your knowledge of the business environment to design better clustered index keys that escape the inefficiencies of the uniqueidentifier data type.

The clustered index is an important decision in the structure of a new table. For the vast majority of tables designed for relational database systems, however, the decision is fairly easy. An IDENTITY column with an integer or bigint data type is the ideal key for a clustered index because it satisfies the aforementioned four recommended qualities of an ideal clustered index. A procedurally generated timestamp or other incrementing time-related value, combined with a unique, autoincrementing number also provides for a common albeit less-narrow clustered index key design.

When a table is created with a Primary Key constraint and no other mention of a clustered index, the Primary Key’s columns become the clustered index’s key. This is typically safe, but a table with a compound Primary Key or a Primary Key that does not begin with a sequential column, could result in a suboptimal clustered index. It is important to note that the Primary Key does not need to be the clustered index key. It is possible to create nonunique clustered indexes or to have multiple unique columns or column combinations in a table.

When combining multiple columns into the clustered index key of an index, keep in mind that the column order of an index, clustered or nonclustered, does matter. If you decide to use multiple columns to create a clustered index key, the first column should still align as closely to the other three rules, even if it alone is not unique.

In sys.indexes, the clustered index is always identified as index_id = 1. If the table is a heap, there will instead be a row with index_id = 0. This row represents the heap data.

The case against intentionally designing heaps

Without a clustered index, a rowstore table is known colloquially as a heap. The Database Engine uses a structure known as row identifier (RID), which is set up to uniquely identify every row for internal purposes. The structure of the heap has no order when it is logically stored.

Scans are the only method of access to read from a heap structure. It is not possible to perform a seek against a heap; however, it is possible to perform a seek against a nonclustered index that has been added to a heap. In this way, a nonclustered index can provide an ordered copy for some of the table data in a separate structure.

Of the edge cases for designing a table purposely without a clustered index, a case can be made for the situation in which you would only ever insert into a table. Without any order to the data, you might reap some benefits from rapid, massive data inserts into a heap. Other types of writes to the table (deletes and updates) will likely require table scans to complete and likely be far less efficient than the same writes against a table with a clustered index.

Deletes and updates will also probably leave wasted space within the heap’s structure, which cannot be reclaimed even with an Index Rebuild operation. To reclaim wasted space inside of a heap, you must ironically create a clustered index on the table and drop the clustered index.

The perceived advantage of heaps for workloads exclusively involving inserts can be easily outweighed by the significant disadvantages whenever accessing that data—when query performance would necessitate the creation of a clustered and/or nonclustered index. Table scans and RID lookups are likely to dominate the cost of any execution plan accessing the heap. Without a clustered index, queries reading from a table large enough to gain significant advantage from its inserts would perform poorly.

With Microsoft’s expansion into modern unstructured data platforms such as SQL Server integration with Hadoop, or Microsoft Azure Data Lake, other architectures are likely to be more appropriate when rapid, massive data inserts are required. This is especially true for when you will be continuously collecting massive amounts of data and then only ever analyzing the data in aggregate. It’s likely that these alternatives, integrated with the Database Engine starting with SQL Server 2016 or a focus of new Azure development, would be a superior alternative.

Further, adding a clustered index to optimize the eventual retrieval of data from a heap is nontrivial. Behind the scenes, the Database Engine must write the entire contents of the heap into the new clustered index structure. If any nonclustered indexes exist on the heap, they also will be re-created, using the clustered key instead of the RID. This will likely result in a large amount of transaction log activity and TempDB space consumed.

Designing nonclustered indexes

Although each table should have a clustered index that assumes the organization of the data in the table, nonclustered indexes provide additional copies of the data in vertically-filtered sets, sorted by nonprimary columns.

You should approach the design of nonclustered indexes in response to application query usage, and then verify over time that you are benefitting from indexes (you can read more about index usage statistics later in this chapter.)

Let’s review the properties of good nonclustered indexes:

Broad enough serve multiple queries, not just designed to suit one
Well-ordered keys that eliminate unnecessary sorting in high-value queries
Well-stocked INCLUDE sections prevent Lookups in high-value queries
Proven beneficial usage over time in the sys.dm_db_index_usage_stats dynamic management view (DMV)
Unique when possible (keep in mind a table can have multiple uniqueness criteria)
Key order matters, so the most selective (most distinct) columns should be listed first
The index key list doesn’t overlap other nonclustered indexes

This is the purpose of a nonclustered index; to provide more than one picture of the data in a table. Nonclustered indexes are copies of a rowstore table that take up space on a disk and in memory (when cached). You must back up and maintain them. They are kept transactionally consistent with the data in the table, serving a limited, reordered set of the data in a table. Because the nonclustered index reflects the same data per columns that the clustered index has, it also has a degree of overhead in SQL Server. All writes to the table data must also be written to the nonclustered index (in the case of updates, when any indexed column is modified), to keep it up to date.

The positive benefit they can have on SELECT queries, however, is potentially very significant. Keep in mind also that some write queries might appear to perform faster because accessing the data that is being changed can be optimized, as well, just as accessing the data in a SELECT query. It is not a rule that your applications’ writes will slow, though adding many nonclustered indexes will certain add up to poor write performance.

You should not create nonclustered indexes haphazardly or clumsily; you should plan, modify, and combine them with one another when appropriate, and review them regularly to make sure they are still useful. Nonclustered indexes represent a significant source of potential performance tuning, however, that every developer and database administration should be aware of, especially in transactional databases. Remember always to create indexes when looking at the “big picture”—rarely does a single query rise to the importance level of justifying its own indexes.

Understanding nonclustered index design

Let’s talk about what we meant a moment ago when we said, “you should not create indexes haphazardly or clumsily.” When should you create a nonclustered index, and how should you design them? How many should you add to a table?

Even though adding nonclustered indexes on Foreign Key columns can be beneficial if those referencing columns will frequently be used in queries, it’s rare that a useful nonclustered index will be properly designed with a single column in mind. This is because outside of joins on foreign keys, it is rare that queries will be designed to seek and return a single column from a table.

Choosing a proper nonclustered index key

Earlier, we mentioned that nonclustered index keys shouldn’t overlap with other indexes in the same table. When creating indexes, perhaps one that perfectly suits a very important query, you must always compare the index to existing indexes. The order of the key of the index matters. In Transact-SQL (T-SQL), this looks like this:

Click here to view code image

CREATE INDEX IDX_NC_InvoiceLines_InvoiceID_StockItemID
ON [Sales].[InvoiceLines] (InvoiceID, StockItemID);

In this index, InvoiceID and InvoiceLineID are defined as the key. Via Object Explorer in Management Studio, you can view the index properties to see the same information. This nonclustered index represents a copy of the data of the InvoiceLines table, sorted by the column InvoiceID first, and then the StockItemID.

To emphasize that the order of key columns in a nonclustered index matters, the two indexes that follow are completely different structures, and will best serve different queries. It’s not likely that a single query would have much use for both, though SQL Server can still choose to use an nonclustered index with less than optimal key order than to scan a clustered index:

Click here to view code image

CREATE INDEX IDX_NC_InvoiceLines_InvoiceID_StockItemID
ON [Sales].[InvoiceLines] (InvoiceID, StockItemID);
CREATE INDEX IDX_NC_InvoiceLines_StockItemID_InvoiceID
ON [Sales].[InvoiceLines] (StockItemID, InvoiceID);

The columns with the most distinct values are more selective and will best serve queries if they are listed before less-selective columns in the index order. Note, though, that the order of columns in the INCLUDE portion of a nonclustered index (more on that later) does not matter.

Remember also from the previous section on clustered indexes that the clustered index key is already inside the key of the nonclustered index. There might be scenarios when the Missing Indexes feature (more on this later) suggests adding a clustered key column to your nonclustered index. It does not change the size of a nonclustered index to do this—the clustered key is already in nonclustered index. The only caveat is that the order of the nonclustered index keys still determines the sort order of the index. So, having the clustered index key column(s) in your nonclustered index key won’t change the index’s size, but could change the sort order of the keys, creating what is essentially a different index when compared to an index that doesn’t include the clustered index key column(s).

Implied in this T-SQL is the sort order of each column which by default is ascending. If queries frequently call for data to be sorted by a column in descending order, you could provide that key value like this:

Click here to view code image

CREATE INDEX IDX_NC_InvoiceLines_InvoiceID_StockItemID
ON [Sales].[InvoiceLines] (InvoiceID DESC, StockItemID);

Creating the key’s sort order incorrectly might not matter to some queries. Nested Loops do not require data to be sorted, so different sort orders in the keys of a nonclustered index might not make a significant impact to the execution plan. A Merge Join requires sorted data, however, so changing the sort order of the keys of an index, especially the first key, could simplify an execution plan by eliminating unnecessary sort operators. This is among the strategies of index tuning to consider. Remember to review the query plan performance data that the Query Store collects, to observe the impact of index changes on multiple queries.

Understanding redundant indexes

Because index key order matters, we need to be aware of what is and what isn’t an overlapping index. Consider the following two nonclustered indexes on the same table:

Click here to view code image

CREATE INDEX IDX_NC_InvoiceLines_InvoiceID_StockItemID_UnitPrice_Quantity
ON [Sales].[InvoiceLines] (InvoiceID, StockItemID, UnitPrice, Quantity);

CREATE INDEX IDX_NC_InvoiceLines_InvoiceID_StockItemID
ON [Sales].[InvoiceLines] (InvoiceID, StockItemID);

Both indexes lead with InvoiceID and StockItemID. The first index includes additional data. The second index is completely overlapped. Queries can still use the second index, but because the leading key columns match the other index, the other index will provide very similar performance gains with less to maintain. The space it requires, the space in memory it consumes when used, and the effort it takes to keep the index up-to-date and maintained could all be considered redundant. The index IDX_NC_InvoiceLines_InvoiceID_StockItemID isn’t needed and should be dropped, and queries that used it will use IDX_NC_InvoiceLines_InvoiceID_StockItemID_UnitPrice_Quantity.

Consider then the following two indexes:

Click here to view code image

CREATE INDEX IDX_NC_InvoiceLines_InvoiceID_StockItemID_UnitPrice_Quantity
ON [Sales].[InvoiceLines] (InvoiceID, StockItemID, UnitPrice, Quantity);

CREATE INDEX IDX_NC_InvoiceLines _StockItemID_InvoiceID
ON [Sales].[InvoiceLines] (StockItemID, InvoiceID);

Note that the second index’s keys are in a different order. This is physically and logically a different structure than the first index.

Does that mean both of these indexes are needed? Probably. Some queries might perform best using keys in the second index’s order. The query optimizer can still use an index with columns in a suboptimal order; for example, to scan the smaller structure rather than the entire table. The query optimizer might instead find that an Index Seek and a Key Lookup on a different index is faster than using an index with the columns in the wrong order.

The Query Store can be an invaluable tool to discover queries that have regressed because of changes to indexes that have been dropped, reordered, or resorted.

Understanding the INCLUDE list of an index

In the B-tree structure of a rowstore nonclustered index, key columns are stored through the two major sections of the index object: the branch levels and the leaf levels. The branch levels are where the logic of seeks happen, starting at a narrow “top” where key data is stored so that it can be traversed by SQL Server using binary decisions. A seek moves “down” the tree via binary decisions. The leaf levels are where the seek ends and data is retrieved. Adding a column to the INCLUDE list of a rowstore nonclustered index adds that data only to the leaf level.

Inside OUT

How can I see the properties and storage for each level of the index’s B-tree?

You can view the page_count, record_count, space_used statistics, and more for each level of a B-tree by using the DETAILED mode of the sys.dm_db_index_physical_stats dynamic management function. Only the leaf level (index_level = 0) is visible in other modes. The mode parameter is the fifth parameter passed in, as demonstrated in the following code:

Click here to view code image

USE WideWorldImporters;
GO
SELECT *
FROM sys.dm_db_index_physical_stats
(db_id(), object_id('Sales.Invoices'), null , null, 'DETAILED');

Table 10-1 shows the results.

Table 10-1 Results of sys.dm_db_index_physical_stats to show index levels

`index_id`	`index_level`	`page_count`	`record_count`
1	0	11,357	70,510
1	1	41	11,357
1	2	1	41
2	0	164	70,510
2	1	1	164
3	0	152	70,510
3	1	1	152
4	0	237	70,510
4	1	1	237
5	0	122	70,510
5	1	1	122
6	0	164	70,510
6	1	1	164
7	0	152	70,510
7	1	1	152
8	0	136	70,510
8	1	1	136
9	0	142	70,510
9	1	1	142
10	0	434	70,510
10	1	2	434
10	2	1	2

Notice that the leaf level of each index has the same number of rows, but has a different number of pages (because each index has different columns). Note that the branch levels (where index_level > 0) of each index’s B-tree contain less data. Indexes with more columns and on larger tables will require more levels. Adding a column to the INCLUDE list of a rowstore nonclustered index only adds that data to the leaf level.

The INCLUDE statement of an index allows for data to be retrievable in the leaf level only, but not stored in the branch level. This reduced the overall size and complexity of the index. Consider the following query and execution plan (Figure 10-1) from the WideWorldImporters database:

Click here to view code image

SELECT CustomerID, AccountsPersonID
FROM [Sales].[Invoices]
WHERE CustomerID = 832;

Figure 10-1 This execution plan shows an Index Seek and a Key Lookup on the same table; the Key Lookup represents 99% of the cost of the query.

Assuming that this is a high-priority query, important enough to be considered when modifying indexes in the database, we should consider a change to the nonclustered index being used here. Let’s take a look at the properties of the Index Seek in Figure 10-2.

Figure 10-2 The properties of the Index Seek in the previous example script. Note that CustomerID is in the Seek Predicate and also in the Output List, but that AccountsPersonID is not listed in the Output List.

Note that CustomerID is in the Seek Predicate and also in the Output List, but that AccountsPersonID is not listed in the Output List. Our query is searching for and returning CustomerID (it appears in both the SELECT and WHERE clauses), but our query also returns AccountsPersonID, which is not contained in the index FK_Sales_Invoices_CustomerID. Here is the code of the nonclustered index FK_Sales_Invoices_CustomerID, named because it is for CustomerID, a foreign key reference:

Click here to view code image

CREATE NONCLUSTERED INDEX [FK_Sales_Invoices_CustomerID] ON [Sales].[Invoices]
( [CustomerID] ASC )
ON [USERDATA];

To remove the Key Lookup, let’s add an included column to the nonclustered index, so that the query can retrieve all the data it needs from a single object:

Click here to view code image

DROP INDEX IF EXISTS [FK_Sales_Invoices_CustomerID] ON [Sales].[Invoices];
GO
CREATE NONCLUSTERED INDEX [FK_Sales_Invoices_CustomerID] ON [Sales].[Invoices]
( [CustomerID] ASC )
INCLUDE ( [AccountsPersonID] )
ON [USERDATA];
GO

Let’s run our sample query again (see also Figure 10-3):

Click here to view code image

SELECT CustomerID, AccountsPersonID
FROM [Sales].[Invoices]
WHERE CustomerID = 832;

Figure 10-3 The execution plan now shows only an Index Seek; the Key Lookup that appeared in Figure 10-2 has been eliminated from the execution plan.

The Key Lookup has been eliminated. The query was able to retrieve both CustomerID and AccountsPersonID from the same index and required no second pass through the table for the column AccountsPersonID. The estimated subtree cost, in the properties of the SELECT operator, is now 0.0034015, compared to 0.355919 when the Key Lookup was present. Although this query was a small example for demonstration purposes, eliminating the Key Lookup represents a significant improvement to query performance without changing the query.

Just as you do not want to add too many nonclustered indexes, you also do not want to add too many columns unnecessarily to the INCLUDE list of nonclustered indexes. Columns in the INCLUDE list, as we saw in the previous code example, still require storage space. For small, infrequent queries, the key lookup operator is probably not alone worth the cost of storing additional columns in the INCLUDE list of an index.

In summary, you should craft nonclustered indexes to serve many queries smartly, you should always try to avoid creating overlapping or redundant indexes, and you should regularly review to verify that indexes are still being used as applications or report queries change. Keep this guidance in mind as we move into the next section!

Creating “missing” nonclustered indexes

The concept of combining many similar indexes into one super-index is core to the utility of using SQL Server’s built-in Missing Indexes feature. First introduced in SQL 2005, the Missing Indexes feature revolutionized the ability to see the “big picture” when crafting nonclustered indexes. The missing indexes feature is passive, on by default on any database since SQL Server 2005, as well as in Azure SQL Database.

The Missing Indexes feature collects information from actual query usage. SQL Server passively records when it would have been better to have a nonclustered index; for example, to replace a Scan for a Seek or to eliminate a Lookup. The Missing Indexes feature then aggregates these requests together, counts how many times they have happened, calculates the cost of the statement operations that could be improved, and estimates the percentage of that cost that would be eliminated (this percentage is labeled the “impact”). Think of the Missing Indexes feature as a wish list of nonclustered indexes.

You can look at missing indexes any time, with no performance overhead to the server, by querying a set of DMVs dedicated to this feature. You can find the following query, which concatenates together the CREATE INDEX statement for you, according to a simple, self-explanatory naming convention. As you can see from the use of system views, this query is intended to be run in a single database:

Click here to view code image

SELECT
  mid.[statement]
, create_index_statement = 'CREATE NONCLUSTERED INDEX IDX_NC_'
--The new Translate syntax is supported for SQL 2017+
      + TRANSLATE(ISNULL(replace(mid.equality_columns, ' ' ,''),'') , '],[' ,' _ ')
      + TRANSLATE(ISNULL(replace(mid.inequality_columns, ' ' ,''),''), '],[' ,' _ ')
      + ' ON ' + [statement]
      + ' (' + ISNULL (mid.equality_columns,'')
      + CASE WHEN mid.equality_columns IS NOT NULL
        AND mid.inequality_columns IS NOT NULL THEN ',' ELSE ''
         END
      + ISNULL (mid.inequality_columns, '')  + ')'
      + ISNULL (' INCLUDE (' + mid.included_columns + ')',
       '')  COLLATE SQL_Latin1_General_CP1_CI_AS
, unique_compiles, migs.user_seeks, migs.user_scans
, last_user_seek, migs.avg_total_user_cost
, avg_user_impact, mid.equality_columns
,  mid.inequality_columns, mid.included_columns
FROM sys.dm_db_missing_index_groups mig
INNER JOIN sys.dm_db_missing_index_group_stats migs
ON migs.group_handle = mig.index_group_handle
INNER JOIN sys.dm_db_missing_index_details mid
ON mig.index_handle = mid.index_handle
INNER JOIN sys.tables t
ON t.object_id = mid.object_id
INNER JOIN sys.schemas s
ON s.schema_id = t.schema_id
WHERE mid.database_id = db_id()
-- count of query compilations that needed this proposed index
--AND               migs.unique_compiles > 10
-- count of query seeks that needed this proposed index
--AND               migs.user_seeks > 10
-- average percentage of cost that could be alleviated with this proposed index
--AND               migs.avg_user_impact > 75
-- Sort by indexes that will have the most impact to the costliest queries
ORDER BY avg_user_impact * avg_total_user_cost desc;

This script is available in the “missing indexes.sql” script in the accompanying downloads for this book, which is available at https://aka.ms/SQLServ2017Admin/downloads.

At the bottom of this query are a series of filters that you can use to find only the most-used, highest-value index suggestions. If you have hundreds or thousands of rows returned by this query, consider spending an afternoon crafting together indexes to improve the performance of the actual user activity that generated this data.

This wish list, however, will likely include many overlapping (but not duplicate) suggestions. Each with a very high percentage of impacting high-cost queries. Some indexes returned by the missing indexes queries might not be worth creating because they have a very low impact, or have been part of only one query compilation. For example, you might see these three index suggestions:

Click here to view code image

  CREATE NONCLUSTERED INDEX IDX_NC_Gamelog_Team1 ON dbo.gamelog
(Team1)
INCLUDE (GameYear, GameWeek, Team1Score, Team2Score);

  CREATE NONCLUSTERED INDEX IDX_NC_Gamelog_Team1_GameWeek_GameYear ON dbo.gamelog
(Team1, GameWeek, GameYear)
INCLUDE (Team1Score);

  CREATE NONCLUSTERED INDEX IDX_NC_Gamelog_Team1_GameWeek_GameYear_Team2 ON dbo.gamelog
(Team1, GameWeek, GameYear, Team2)
INCLUDE (GameWeek);

You should not create all three of these indexes. Instead, you should combine them into a single index that matches the order of the needed key columns and covers all the included columns, as well. Here is the properly combined index suggestion:

Click here to view code image

CREATE NONCLUSTERED INDEX IDX_NC_Gamelog_Team1_GameWeek_GameYear_Team2 ON dbo.gamelog
(Team1, GameWeek, GameYear, Team2)
INCLUDE (Team1Score, Team2Score);

This last index is a good combination of the previous suggestions. It will deliver maximum positive benefit to the most queries, and it minimizes the negative impact to writes, storage, and maintenance. Note that the Key columns list overlaps and is in the correct order for each of the previous index suggestions, and that the INCLUDE columns list also covers all the columns needed in the index suggestions. If a column is in the key of the index, it does not need to exist in the INCLUDE of the index.

However, don’t create this index yet. You should still review existing indexes on the table before creating any missing indexes. Perhaps you can combine a new missing index and an existing index, in the Key column list or the INCLUDE column list, further increasing the value of a single index.

Finally, after combining missing index suggestions with one another and with existing indexes, you are ready to create the index and see it in action. Remember always to create indexes when looking at the “big picture”: remember, rarely does a single query rise to the importance level of justifying its own indexes. For example, in SQL Server Management Studio, you will sometimes see green text suggesting a missing index for this query, as illustrated in Figure 10-4 (note that in this figure the text is gray).

Figure 10-4 In the execution plan tab, in the header of each execution plan, text starting with "Missing Index" will alert you to the possible impact. Do not create this index on the
spot!

That’s valuable, but do not create that index on the spot! Always refer to the complete set of index suggestions and other existing indexes on the table, combining overlapping indexes when possible. Consider the green missing index alert in SQL Server Management Studio as only a flag that indicates you should spend time investigating new missing indexes.

So, to recap, when creating nonclustered indexes for performance tuning, you should do the following:

Use the Missing Indexes DMVs to identify new “big picture” nonclustered indexes:
1. Don’t create indexes that will likely only help out a single query—few queries are important enough to deserve their own indexes.
2. Consider nonclustered Columnstore indexes instead for very large rowcount tables with very large rowcount queries. (You can read more on Columnstore indexes later in this chapter.)
Combine Missing Index suggestions, being aware of key order and INCLUDE lists.
Compare new index suggestions with existing indexes; perhaps you can combine them.
Remember to review index usage statistics, as well, to verify whether indexes are helping you. (See the next section for more on the index usage statistics DMV.)

Inside OUT

Does the Missing Indexes feature suggest only nonclustered index?

Yes, only nonclustered indexes.

The missing indexes feature can’t help you with proper clustered index design—that’s up to you, the informed database designer. It can provide some insight into usage after a time, but that would mean running a typical production workload against a heap, and suffering the performance issues likely to arise.

Here’s another limitation to the missing indexes feature: it is not aware of clustered or nonclustered Columnstore indexes, which are incredibly powerful structures to add for massive row count queries on large tables. The Missing Indexes feature cannot suggest Columnstore indexes, and it will even suggest an index to replace a well-behaving Columnstore index. Be aware of all indexes in your table, including Column-store indexes, when considering new indexes.

Therefore, when you have created a Columnstore index on a table, you will need to ignore index suggestions that look like the same workloads that are currently benefitting from the Columnstore. For a query that requires a scan on many rows in the table, the query optimizer is unlikely to pick a nonclustered index over a nonclustered Columnstore index. The Columnstore index will vastly outperform a nonclustered index for massive row count queries, though the missing index feature might still count this as a new nonclustered index suggestion.

Understanding when Missing Index suggestions are removed

Missing Index suggestions are cleared out for any change to the tables; for example, if you add or remove columns, or if you add or remove indexes. Missing Index suggestions are also cleared out when the SQL Server service is started, and cannot be manually cleared easily. (You can take the database offline and back online, which would clear out the Missing Index suggestions, but this seems like overkill.)

Logically, make sure the missing index data that you have collected is also based on a significant sample actual production user activity over time spanning at least one business cycle. Missing index suggestions based on development activity might not be a useful representation of intended application activity, though suggestions based on end-user testing or training could be.

Understanding and proving index usage statistics

You’ve added indexes to your database, and they are used over time, but meanwhile the query patterns of applications and reports change. Columns are added to the database, new tables are added, and although you add new indexes to suit new functionality, how does a database administrator ensure that existing indexes are still worth keeping?

SQL Server tracks this information for you automatically with yet another valuable DMV: sys. dm_db_index_usage_stats. Following is a script that measures index usage within a database, combining sys.dm_db_index_usage_stats with other system views and DMVs to return valuable information. Note that the ORDER BY clause will place indexes with the fewest read operations (seeks, scans, lookups) and the most write operations (updates) at the top of the list.

Click here to view code image

SELECT  TableName        = sc.name + '.' + o.name
    ,  IndexName         = i.name
    ,  s.user_seeks
    ,  s.user_scans
    ,  s.user_lookups
    ,  s.user_updates
    ,  ps.row_count
    ,  SizeMb            = (ps.in_row_reserved_page_count*8.)/1024.
    ,  s.last_user_lookup
    ,  s.last_user_scan
    ,  s.last_user_seek
    ,  s.last_user_update
FROM sys.dm_db_index_usage_stats AS s
INNER JOIN sys.indexes AS i

ON i.object_id = s.object_id AND i.index_id = s.index_id
INNER JOIN sys.objects AS o

ON o.object_id=i.object_id
INNER JOIN sys.schemas AS sc

ON sc.schema_id = o.schema_id
INNER JOIN sys.partitions AS pr

ON pr.object_id = i.object_id AND pr.index_id = i.index_id
INNER JOIN sys.dm_db_partition_stats AS ps

ON ps.object_id = i.object_id AND ps.partition_id = pr.partition_id
WHERE    o.is_ms_shipped = 0
ORDER BY user_seeks + user_scans + user_lookups  asc,  s.user_updates desc;

This script is available in the “index usage.sql” script in the accompanying downloads for this book, which is available at https://aka.ms/SQLServ2017Admin/downloads.

Any indexes that rise to the top of the preceding query should be considered for removal or redesign, given the following caveat:

Before justifying dropping any indexes, you should ensure that you have collected data from the index usage stats DMV that spans at least one business cycle. The index usage stats DMV is cleared when the SQL Server service is restarted. You cannot manually clear it. If your applications have week-end and month-end reporting, you might have indexes present and tuned specifically for those critical performance periods. Like many DMVs that are cleared when the SQL Server service restarts, consider a strategy of capturing data and storing it periodically in persistent tables.

Logically, verify that the index usage data that you have collected is also based on actual production user activity. Index usage data based on testing or development activity would not be a useful representation of intended application activity.

Again, the Query Store can be an invaluable tool to monitor for query regression after indexing changes.

Chapter 9 discusses the Query Store feature.

Like many server-level DMVs, the index usage stats data requires the VIEW SERVER STATE permission in SQL Server. In Azure SQL Database Premium tier, the VIEW DATABASE STATE permission is required, but only the server admin or Azure Active Directory admin accounts can access this data in standard and basic tiers.

Designing Columnstore indexes

Columnstore indexes were first introduced in SQL Server 2012, making a splash in their ability to far outperform clustered and nonclustered indexes when it comes to massive table reads. They were typically used in the scenario of nightly-refreshed data warehouses, but now they have beneficial applications on transactional systems, including on memory-optimized tables. Columnstore indexes are superior to rowstore data storage for performance in appropriate situations.

Big changes opened up Columnstore indexes further in SQL Server 2016:

Prior to SQL Server 2016, the presence of a nonclustered Columnstore index made the table read-only. This drawback and others were removed in SQL Server 2016, and now Columnstore indexes are fully-featured and quite useful.
With SQL Server 2016 SP1, Columnstore indexes are even available below Enterprise edition licenses of SQL Server (though with a limitation on Columnstore memory utilization).
Prior to SQL Server 2016, using read committed snapshot and snapshot isolation level were not supported with Columnstore indexes. Starting with SQL Server 2016, snapshot isolation and Columnstore indexes are fully compatible.
You can place a clustered Columnstore index on a memory-optimized table, providing the ability to do analytics on live real-time Online Transaction Processing (OLTP) data.

These key improvements opened Columnstore indexes to be used in transactional systems, when tables with millions of rows are read, resulting in million-row result sets. This is when Columnstore indexes really shine.

For now, Columnstore indexes are the only objects for which SQL Server can use Batch Mode execution. You’ll see “Batch” (instead of the default “Row”) in the Actual Execution Mode of an execution plan operator when this faster method is in use. Batch Mode processing benefits queries that process millions of rows or more. This isn’t a rule, however, as Columnstore indexes can use both Batch Mode and Row Mode execution for a variety of operators.

For more information on which operators can use Batch Mode execution, go to https://docs.microsoft.com/sql/relational-databases/indexes/columnstore-indexes-query-performance.

Columnstore indexes are not a B-tree; instead, they contain highly compressed data (on disk and in memory), stored in a different architecture from the traditional clustered and nonclustered indexes. You can create “clustered” or “nonclustered” Columnstore indexes, though this terminology is used more to indicate what the Columnstore index is replacing, not what it resembles behind the scenes.

You can also create nonclustered rowstore indexes on tables with a clustered Columnstore index, which is potentially useful to enforce uniqueness. Columnstore indexes cannot be unique, and so cannot replace the table’s unique constraint or Primary Key.

You can combine nonclustered and nonclustered Columnstore indexes on the same table, but you can have only one Columnstore index on a table, including clustered and nonclustered Columnstore indexes. You can even create nonclustered rowstore and nonclustered Columnstore indexes on the same columns. Perhaps you create both because you want to filter on the column value in one set of queries, and aggregate in another. Or, perhaps you create both only temporarily, for comparison.

One significant difference when you choose Columnstore index keys is that the order doesn’t matter. You can add columns in any order to satisfy many different queries, greatly increasing the versatility of the Columnstore index in your table. This is because the order of the data is not important to how Columnstore indexes work.

Demonstrating the power of Columnstore indexes

To demonstrate the power of this fully operational Columnstore index, let’s review an example scenario in which more than 14 million rows are added to the WideWorldImporters.Sales.InvoiceLines table. About half of the rows in the table now contain InvoiceID = 3932. This script is available in the “power of columnstore.sql” file in the accompanying downloads for this book, which is available at https://aka.ms/SQLServ2017Admin/downloads.

We dropped the existing WideWorldImporters-provided nonclustered Columnstore index and added a new nonclustered index we’ve created here, which performs an Index Scan on it to return the data. Remember that InvoiceID = 3932 is roughly half the table, so this isn’t a “needle in a haystack” situation; this isn’t a seek. If the query can use a seek operator, the nonclustered rowstore index would likely be better. When the query must scan, Columnstore is king.

We’ll use the following script as a test:

Click here to view code image

CREATE INDEX IDX_NC_InvoiceLines_InvoiceID_StockItemID_Quantity
ON [Sales].[InvoiceLines] (InvoiceID, StockItemID, Quantity);
GO
SELECT il.StockItemID, AvgQuantity = AVG(il.quantity)
FROM [Sales].[InvoiceLines] as il
WHERE il.InvoiceID = 3932
GROUP BY il.StockItemID;

The query returns 227 rows in 538 ms using the Index Scan operator on IDX_NC_InvoiceLines_InvoiceID_StockItemID_Quantity, the only operator in the execution plan. The total subtree cost of the plan is 18.9083. Next, we add a Columnstore index with the same key. In fact, the syntax is identical, save for the COLUMNSTORE keyword.

Click here to view code image

CREATE COLUMNSTORE INDEX IDX_CS_InvoiceLines_InvoiceID_StockItemID_quantity
ON [Sales].[InvoiceLines] (InvoiceID, StockItemID, Quantity);

The query returns 227 rows in 46 ms using the Columnstore Index Scan operator on IDX_CS_InvoiceLines_InvoiceID_StockItemID_quantity, the only operator in the execution plan. The total subtree cost of the plan is 2.56539.

Using compression delay on Columnstore indexes

We haven’t done a deep dive into the internals of Columnstore indexes, but we’ll touch on it here to discuss a potentially significant configuration option. The COMPRESSION_DELAY option for both nonclustered and clustered Columnstore indexes has to do with how long it takes changed data to be written from the delta store to the highly compressed Columnstore.

The delta store is an ephemeral location where changed data is stored in a clustered B-tree rowstore format. When certain thresholds are reached, specifically 1,048,576 rows, or when a Columnstore index is rebuilt, a group of delta store data is “closed” and then compressed into the Columnstore.

The COMPRESSION_DELAY option does not affect the 1,048,576 number, but rather how long it takes SQL Server to move the data into Columnstore. By setting the COMPRESSION_DELAY option to 10 minutes, data will remain in delta store for an extra 10 minutes before SQL Server compresses it.

The advantage of COMPRESSION_DELAY is noticeable for some write workloads, but not all. If the table is only ever inserted into, COMPRESSION_DELAY doesn’t really help. But if a block of recent data is updated and deleted for a period before finally settling in after a time, implementing COMPRESSION_DELAY can speed up the write transactions to the data and reduce the maintenance and storage footprint of the Columnstore index.

Changing the COMPRESSION_DELAY setting of the index, unlike many other index settings, does not require a rebuild of the index, and you can change it at any time; for example:

Click here to view code image

USE WideWorldImporters;
go
ALTER INDEX [IDX_cs_Sales_InvoiceLines]
ON [Sales].[InvoiceLines]
SET (COMPRESSION_DELAY = 10 MINUTES);

Understanding indexing in memory-optimized tables

Memory-optimized tables, first introduced in SQL Server 2014 and then greatly enhanced in SQL Server 2016, throw most of the rules of locks and concurrency out the window. They provide for table performance less bound (or in the case of nondurable data, unbound) by I/O constraints. Memory-optimized tables don’t use the locking mechanics of pessimistic concurrency, as is discussed in Chapter 9.

Memory-optimized tables have two types of indexes: hash and nonclustered. You must choose one of them to be the structure behind the Primary Key of the table if both data and schema are to be durable. If you do not include Primary Key constraints when a table is created with DURABILITY = SCHEMA_AND_DATA, you will receive an error and the table will not be created.

You can choose to have only the schema of the table be durable, but not the data. This has utility in certain scenarios as a staging table to receive data that will be moved to a durable disk-based or memory-optimized table. You should be aware of the potential for data loss. If only the schema of the memory-optimized table is durable, you do not need to declare a Primary Key. However, in the CREATE TABLE statement, you must still define at least one index or a Primary Key for a table by using DURABILITY = SCHEMA_ONLY.

Keep in mind that adding indexes to memory-optimized tables increases the amount of server memory needed. There is otherwise no limit to the size of memory-optimized tables in Enterprise edition; however, in Standard edition you are limited to 32 GB of memory-optimized tables per database.

Although there is no concept of a clustered index in memory-optimized tables, you can add a clustered Columnstore index to a memory-optimized table, dramatically improving your ability to query the data in aggregate. Because Columnstore indexes cannot be unique, they cannot serve as the Primary Key for a memory-optimized table.

Let’s go over the basics of using nonclustered hash and traditional B-tree nonclustered indexes on memory-optimized tables.

Understanding hash indexes for memory-optimized tables

Nonclustered hash indexes are an alternative to the typical B-tree internal architecture for index data storage. Hash indexes are best for queries that look for the needle in the haystack, but they are not effective at range lookups or queries that need a different sort order than the hash index. One other limitation of the hash index is that if you don’t query all the columns in a hash index, they are generally not as useful as a nonclustered index (see the next section).

Unlike B-tree–based nonclustered indexes, hash indexes also do not perform as well when there are multiple columns in the key of the indexes but not all columns, or even just the first column, are queried. Hash indexes are currently available only for memory-optimized tables, not disk-based tables. You can declare them by using the UNIQUE keyword, but they default to a non-unique key, similar to how B-tree nonclustered indexes are created. Just as with B-tree nonclustered indexes, you can also create more than one nonclustered hash index.

There is an additional unique consideration for creating hash indexes. Estimating the best number for the BUCKET_COUNT parameter can have a significant impact. The number should be as close to as possible to the number of unique key values that are expected. BUCKET_COUNT should be between 1 and 2 times this number. Hash indexes always use the same amount of space for the same-sized bucket count, regardless of the rowcount within.

For example, if you expect the table to have 100,000 unique values in it, the ideal BUCKET_COUNT value would be between 100,000 and 200,000.

Having too many or too few buckets in a hash index can result in poor performance. More buckets will increase the amount of memory needed and the number of those buckets that are empty. Too few buckets will result in queries needing to access more buckets in a chain to access the same information.

Hash indexes work best when the key values are mostly unique. If the ratio of total rows to unique key values is too high (10:1 is a general upper limit), a hash index is not recommended, and will perform poorly. Ideally, a hash index can be declared unique.

You should periodically and proactively compare the number of unique key values to the total number of rows in the table and then maintain the number of buckets in a memory-optimized hash index by using the ALTER TABLE/ALTER INDEX/REBUILD commands; for example:

Click here to view code image

ALTER TABLE [dbo].[Transactions]
ALTER INDEX [IDX_NC_H Transactions_1]
REBUILD WITH (BUCKET_COUNT = 200000)

Understanding nonclustered indexes for memory-optimized tables

Nonclustered indexes for memory-optimized tables behave similarly on memory-optimized tables as they do for disk-based tables. They will outperform hash indexes for queries that perform sorting on the key value(s) of the index, or when the index must be scanned. Further, if you don’t query all the columns in a hash index, they are generally not as useful as a nonclustered index.

You can declare nonclustered indexes on memory-optimized tables UNIQUE, however, the CREATE INDEX syntax is not supported. You must use the ALTER TABLE/ADD INDEX commands, or include them in the CREATE TABLE script.

Neither hash indexes nor nonclustered indexes can serve queries on memory-optimized tables for which the keys are sorted in the reverse order from how they are defined in the index. These types of queries simply can’t be serviced efficiently right now from memory-optimized indexes.

Moving to memory-optimized tables

When you’re considering moving a table from disk-based to memory-optimized, you can use the built-in Memory Optimization Advisor in SQL Server Management Studio. To do so, right-click any disk-based table to start the advisor. The advisor alerts you to table features that are not supported; warns you about some Memory Optimization Caveats, but will not make or recommend changes to applications that will need to incorporate error handling; and retries logic to support optimistic concurrency, if they do not already.

For more information on configuration of memory-optimized tables, see Chapter 8.

Understanding other types of indexes

There are other types of indexes that you should be aware of, each with specific, limited uses for certain SQL Server features; for example, the Full-Text Search engine, spatial data types, and the xml data type.

Understanding full-text indexes

If you have optionally chosen to install the Full-Text Search feature of SQL Server, you can take advantage of the full-text service (fdhost.exe) and query vast amounts of data using special full-text syntax, looking for word forms, phrases, thesaurus lookups, word proximity, and more.

Because they have specific uses for particular architectures and applications, we won’t spend much time on them in this reference. The Full-Text Engine is quite powerful and has a syntax of its own.

By design, full-text indexes require a unique nonclustered or clustered rowstore index on the table in which they are created, with a single column in the key. We recommend that this index have an integer key for performance reasons, such as an IDENTITY column. Full-text indexes are usually placed on varchar or nvarchar columns, often with large lengths, but you can also place them on xml and varbinary columns.

It is also an important to understand the two viable options to updating the full-text index. You can configure Change Tracking on the table that hosts the full-text index, which makes it possible for it to propagate changes to the base table into the full-text index, asynchronously keeping the full-text data synchronized with the table, with minimal overhead. Another option is using a column with the rowversion data type in the table and then periodically updating the full-text index. Consider both strategies, along with your requirements for frequency of updates to the full-text index. Both are superior to frequent full populations.

Understanding spatial Indexes

A spatial index is a special B-tree index that uses a suite of special code and geometry methods to perform spatial and geometry calculations. Developers can use these data structures for non-Euclidean geometry calculations, distance and area calculations on spheres, and more. Spatial indexes can improve the performance of queries with spatial operations.

You can create these indexes only on columns that use the spatial data types geometry or geography, and you can create different types of indexes on the same spatial column to server different calculations. To create a spatial index, the table must already have a Primary Key.

You create spatial indexes by using bounding boxes or tessellation schemes for geometry and geography data types. Consult the documentation and the developers’ intended use of spatial data when creating these indexes.

Understanding XML indexes

Eponymous XML indexes are created for much the same benefit for which we use nonclustered indexes: You use them to prevent the runtime shredding of XML files each time they are accessed, and to instead provide a persistent row set of the XML data’s tags, values, and paths.

Because the xml data type is stored as a BLOB and has an upper limit of 2 GB of data per row, XML data can be massive, and XML indexes can be extremely beneficial to reads. Like nonclustered indexes, they also incur an overhead to writes.

Primary XML indexes prevent the on-demand shredding of the data by providing a reference to the tags, values, and paths. On large XML documents, this can be a major performance improvement. Secondary XML indexes enhance the performance of primary XML indexes. Secondary XML indexes are created on either path, value, or property data in the primary XML index and benefit a read workload that heavily uses one of those three methods of querying XML data. Consult the documentation and the developers’ intended use of XML data when creating XML indexes.

Understanding index statistics

When we talk about statistics in SQL Server, we do not mean the term generically. Statistics on tables and views are created to describe the distribution of data within indexes and heaps; they are created as needed by the query optimizer.

Statistics are important to the query optimizer to help it make query plan decisions, and they are heavily involved in the concept of cardinality estimation. The SQL Server Cardinality Estimator provides accurate estimations of the number of rows that queries will return, a big part of producing query plans.

Making sure statistics are available and up to date is essential for choosing a well-performing query plan. “Stale” statistics that have evaded updates for too long contain information that is quite different from the current state of the table and will likely cause poor execution plans.

There are a number of options in each database regarding statistics. We reviewed some in Chapter 4 but we present them again here in the context of performance tuning.

Manually creating and updating statistics

You can also create statistics manually during troubleshooting or performance tuning by using the CREATE STATISTICS statement, but, generally, you create statistics as needed.

You can consider manually creating statistics for large tables, and with design principals similar to how nonclustered indexes should be created. The order of the keys in statistics does matter, and you should choose columns that are regularly queried together to provide the most value to queries.

When venturing into creating your own statistics objects, consider using filtered statistics, which can also be helpful if you are trying to carry out advanced performance tuning on queries with a static filter or specific range of data. Like filtered indexes, or even filtered views and queries, you can create statistics with a similar WHERE clause, which are typically needed for the same reasons filtered indexes are: to limit the scope of the objects to ubiquitous data subsets, such as IsActive=1 or IsDeleted=0 or IsCurrent=1. Filtered statistics are never automatically created.

You can manually verify that indexes are being kept up to date by the query optimizer. The STATS_DATE() function accepts an object_id and stats_id, which is functionally the same as the index_id, if the statistics object corresponds to an index. Not all statistics are associated with an index; for example, indexes that are automatically created. There will generally be more stats objects than index objects. The STATS_DATE() function returns a datetime value of the last time the statistics object was updated. This function works in SQL Server and Azure SQL Database.

You can also use a pair of DMVs to return the statistics object properties including the last_updated date. For tables and indexed views, use sys.dm_db_stats_properties, and for partitioned tables, use sys.dm_db_incremental_stats_properties.

Automatically creating and updating statistics

When the database option _CREATE_STATISTICS is turned on, SQL Server can create single-column statistics objects, based on query need. These can make a big difference in performance. You can determine that a statistics object was created by the AUTO_CREATE_STATISTICS = ON behavior because it will have the name prefix _WA. The behavior that creates statistics for indexes (with a matching name) happens automatically, regardless of the AUTO_CREATE_STATISTICS database option.

Statistics are not automatically created for Columnstore indexes, and instead will use statistics objects that exist on the heap or the clustered index of the table. Like any index, a statistics object of the same name is created; however, for Columnstore indexes it is blank, and in place for logistical reasons only.

As you can imagine, statistics must also be kept up to date with the data in the table. SQL Server has an option in each database for AUTO_UPDATE_STATISTICS, which is ON by default and should almost always remain on.

You should only ever turn off both AUTO_CREATE_STATISTICS and AUTO_UPDATE_STATISTICS when requested by highly complex application designs, with variable schema usage, and a separate regular process that creates and maintains statistics, such as SharePoint. On-premises SharePoint installations include a set of stored procedures that periodically run to create and update the statistics objects for the wide, dynamically assigned table structures within. If you have not designed your application to intelligently create and update statistics using a separate process from that of the SQL Server engine, we recommend that you never turn off these options.

Important performance options for statistics

You can also create and update statistics incrementally, by taking advantage of table partitioning. First introduced in SQL Server 2014, statistics that use table partitioning reduce the overhead of statistics creation. You should turn on the database setting for INCREMENTAL to allow statistics to take advantage of partitioning. It has no impact on statistics for tables that are not partitioned.

Finally, you can update statistics asynchronously, and this is as large a performance gain as has ever been introduced to the concept of statistics. When you turn on the AUTO_UPDATE_STATISTICS_ASYNC database option, running a query will continue even when the query optimizer has identified an out-of-date statistics object. The statistics will be updated afterward, instead of forcing a user query to wait.

Inside OUT

Should I turn on Auto Update Statistics and Auto Update Statistics Asynchronously in SQL 2017?

Yes! (Again, unless an application specifically recommends that you do not, such as SharePoint.)

Starting in SQL Server 2016 (and with database compatibility mode 130), the ratio of data modifications to rows in the table that helps identify out-of-date statistics has been aggressively lowered, causing statistics to be automatically updated more frequently. This is especially evident in large tables in which many rows were regularly updated. In SQL Server 2014 and before, this more aggressive behavior was not turned on by default, but could be turned on via Trace Flag 2371, starting with SQL 2008 R2 SP1.

It is more important now than ever to turn on Auto Update Statistics Asynchronously, which can dramatically reduce the overhead involved in automatic statistics maintenance.

Understanding statistics on memory-optimized tables

Statistics are created and updated automatically on memory-optimized tables. Memory-optimized tables require at least one index to be created, (a Primary Key if durability is set to SCHEMA_AND_DATA), and a matching statistics object is created for that index object.

NOTE

If a memory-optimized table was created in SQL 2014 compatibility level, you must manually update the statistics object yourself by using the UPDATE STATISTICS command. Then, if the AUTO_UPDATE_STATISTICS database option is turned on, statistics will update as normal. Statistics for new memory-optimized tables are not automatically updated when the database compatibility level is below 130 (SQL 2016) when the tables were created. It is always recommended to create memory-optimized tables in databases with the highest compatibility level.

For more on memory-optimized tables, see Chapter 8.

For more on updating statistics, see Chapter 13.

Understanding statistics on external tables

You can also create statistics on external tables; that is, tables that do not exist in the SQL Server database but instead are transparent references to data stored in a Hadoop cluster or in Azure Blob Storage.

You can create indexes on external tables, but currently, you cannot update them. Creating the index involved copying the external data into the SQL Server database only temporarily, and then calculating statistics. To update statistics for these datasets, you must drop them and re-create them. Because of the data sizes typically involved with external tables, using the FULLSCAN method to update statistics is not recommended.

For more on external tables, see Chapter 8.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Chapter 10 Understanding and designing indexes

Create new playlist

Sign In

Sign Up

Chapter 10Understanding and designing indexes

Designing clustered indexes

Choosing a proper clustered index key

The case against intentionally designing heaps

Designing nonclustered indexes

Understanding nonclustered index design

Choosing a proper nonclustered index key

Understanding redundant indexes

Understanding the INCLUDE list of an index

Creating “missing” nonclustered indexes

Understanding when Missing Index suggestions are removed

Understanding and proving index usage statistics

Designing Columnstore indexes

Demonstrating the power of Columnstore indexes

Using compression delay on Columnstore indexes

Understanding indexing in memory-optimized tables

Understanding hash indexes for memory-optimized tables

Understanding nonclustered indexes for memory-optimized tables

Moving to memory-optimized tables

Understanding other types of indexes

Understanding full-text indexes

Understanding spatial Indexes

Understanding XML indexes

Understanding index statistics

Manually creating and updating statistics

Automatically creating and updating statistics

Important performance options for statistics

Understanding statistics on memory-optimized tables

Understanding statistics on external tables

Table of Contents for
Chapter 10 Understanding and designing indexes

Chapter 10
Understanding and designing indexes