The right index on the right column, or columns, is the basis on which query tuning begins. On the other hand, a missing index or an index placed on the wrong column, or columns, can be the basis for all performance problems starting with basic data access, continuing through joins, and ending in filtering clauses. For these reasons, it is extremely important for everyone—not just the DBA—to understand the different indexing techniques that can be used to optimize the database design.
In this chapter, I cover the following topics:
What an index is
The benefits and overhead of an index
General recommendations for index design
Clustered and nonclustered index behavior and comparisons
Advanced indexing techniques: covering indexes, index intersections, index joins, filtered indexes, distributed indexes, indexed views, and index compression
Special index types
Additional characteristics of indexes
One of the best ways to reduce disk I/O and logical reads is to use an index. An index allows SQL Server to find data in a table without scanning the entire table. An index in a database is analogous to an index in a book. Say, for example, that you wanted to look up the phrase table scan in this book. Without the index at the back of the book, you would have to peruse the entire book to find the text you needed. With the index, you know exactly where the information you want is stored.
While tuning a database for performance, you create indexes on the different columns used in a query to help SQL Server find data quickly. For example, the following query against the Production.Product
table results in the data shown in Figure 4-1 (the first six of 500+ rows):
SELECT p.ProductID, p.[Name], p.StandardCost, p.[Weight], ROW_NUMBER() OVER (ORDER BY p.Name DESC) AS RowNumber FROM Production.Product p;
If you need to retrieve all the products where StandardCost
is greater than 150, without an index the table will have to be scanned, checking the value of StandardCost
at each row to determine which rows contain a value greater than 150. An index on the StandardCost
column could speed up this process by providing a mechanism that allows a structured search against the data rather than a row-by-row check. You can take two different, and fundamental, approaches for creating this index:
Like a dictionary: A dictionary is a distinct listing of words in alphabetical order. An index can be stored in a similar fashion. The data is ordered, although it will still have duplicates. The first six rows, ordered by StandardCost
instead of by Name
would look like the data shown in Figure 4-2. Notice the RowNumber
column shows the original placement of the row when ordering by Name
.
So now if you wanted to find all the rows where StandardCost
is greater than 150, the index would allow you to find them immediately by moving down to the first value greater than 150. An index that orders the data stored based on the index order is known as a clustered index. Because of how SQL Server stores data, this is one of the most important indexes in your database design. I explain this in detail later in the chapter.
Like a book's index: An ordered list can be created without altering the layout of the table, similar to the way the index of a book is created. Just like the keyword index of a book lists the keywords in a separate section with a page number to refer to the main content of the book, the list of StandardCost
values is created as a separate structure and refers to the corresponding row in the Product
table through a pointer. For the example, I'll use RowNumber
as the pointer. Table 4-1 shows the structure of the manufacturer index.
Table 4.1. Structure of the Manufacturer Index
StandardCost | RowNumber |
---|---|
| 365 |
| 375 |
| 327 |
| 498 |
| 474 |
SQL Server can scan the manufacturer index to find rows where StandardCost
is greater than 150. Since the StandardCost
values are arranged in a sorted order, SQL Server can stop scanning as soon as it encounters the row with a value of 150. This type of index is called a nonclustered index, and I explain it in detail later in the chapter.
In either case, SQL Server will be able to find all the products where StandardCost
is greater than 150 more quickly than without an index under most circumstances.
You can create indexes on either a single column (as described previously) or a combination of columns in a table. SQL Server automatically creates indexes for certain types of constraints (for example, PRIMARY KEY
and UNIQUE
constraints).
SQL Server has to be able to find data, even when no index is present on a table. When no clustered index is present to establish a storage order for the data, the storage engine will simply read through the entire table to find what it needs. A table without a clustered index is called a heap table. A heap is just a crude stack of data with a row identifier as a pointer to the storage location. This data is not ordered or searchable except by walking through the data, row-by-row, in a process called a scan. When a clustered index is placed on a table, the key values of the index establish an order for the data. Further, with a clustered index, the data is stored with the index so that the data itself is now ordered. When a clustered index is present, the pointer on the nonclustered index consists of the values that define the clustered index. This is a big part of what makes clustered indexes so important.
Since a page has a limited amount of space, it can store a larger number of rows if the rows contain a fewer number of columns. The nonclustered index usually doesn't contain all the columns of the table; it usually contains only a limited number of the columns. Therefore, a page will be able to store more rows of a nonclustered index than rows of the table itself, which contains all the columns. Consequently, SQL Server will be able to read more values for a column from a page representing a nonclustered index on the column than from a page representing the table that contains the column.
Another benefit of the nonclustered index is that, because it is in a separate structure from the data table, it can be put in a different filegroup, with a different I/O path, as explained in Chapter 2. This means that SQL Server can access the index and table concurrently, making searches even faster.
Indexes store their information in a B-tree structure, so the number of reads required to find a particular row is minimized. The following example shows the benefit of a B-tree structure.
Consider a single-column table with 27 rows in a random order and only 3 rows per leaf page. Suppose the layout of the rows in the pages is as shown in Figure 4-3.
To search the row (or rows) for the column value of 5, SQL Server has to scan all the rows and the pages, since even the last row in the last page may have the value 5. Because the number of reads depends on the number of pages accessed, nine read operations have to be performed without an index on the column. This content can be ordered by creating an index on the column, with the resultant layout of the rows and pages shown in Figure 4-4.
Indexing the column arranges the content in a sorted fashion. This allows SQL Server to determine the possible value for a row position in the column with respect to the value of another row position in the column. For example, in Figure 4-4, when SQL Server finds the first row with the column value 6, it can be sure that there are no more rows with the column value 5. Thus, only two read operations are required to fetch the rows with the value 5 when the content is indexed. However, what happens if you want to search for the column value 25? This will require nine read operations! This problem is solved by implementing indexes using the B-tree structure.
A B-tree consists of a starting node (or page) called a root node with branch nodes (or pages) growing out of it (or linked to it). All keys are stored in the leaves. Contained in each interior node (above the leaf nodes) are pointers to its branch nodes and values representing the smallest value found in the branch node. Keys are kept in sorted order within each node. B-trees use a balanced tree structure for efficient record retrieval—a B-tree is balanced when the leaf nodes are all at the same level from the root node. For example, creating an index on the preceding content will generate the balanced B-tree structure shown in Figure 4-5.
The B-tree algorithm minimizes the number of pages to be accessed to locate a desired key, thereby speeding up the data access process. For example, in Figure 4-5, the search for the key value 5 starts at the top root node. Since the key value is between 1 and 10, the search process follows the left branch to the next node. As the key value 5 falls between the values 4 and 7, the search process follows the middle branch to the next node with the starting key value of 4. The search process retrieves the key value 5 from this leaf page. If the key value 5 doesn't exist in this page, the search process will stop since it's the leaf page. Similarly, the key value 25 can also be searched using the same number of reads.
The performance benefit of indexes does come at a cost. Tables with indexes require more storage and memory space for the index pages in addition to the data pages of the table. Data manipulation queries (INSERT
, UPDATE
, and DELETE
statements, or the CUD part of create, read, update, delete [CRUD]) can take longer, and more processing time is required to maintain the indexes of constantly changing tables. This is because, unlike a SELECT
statement, data manipulation queries modify the data content of a table. If an INSERT
statement adds a row to the table, then it also has to add a row in the index structure. If the index is a clustered index, the overhead is greater still, because the row has to be added to the data pages themselves in the right order, which may require other data rows to be repositioned below the entry position of the new row. The UPDATE
and DELETE
data manipulation queries change the index pages in a similar manner.
When designing indexes, you'll be operating from two different points of view: the existing system, already in production, where you need to measure the overall impact of an index, and the tactical approach where all you worry about is the immediate benefits of an index, usually when initially designing a system. When you have to deal with the existing system, you should ensure that the performance benefits of an index outweigh the extra cost in processing resources. You can do this by using the Profiler tool (explained in Chapter 3) to do an overall workload optimization (explained in Chapter 16). When you're focused exclusively on the immediate benefits of an index, SQL Server supplies a series of dynamic management views that provide detailed information about the performance of indexes, sys.dm_db_index_operational_stats
or sys.dm_db_index_usage_stats
. The view sys.dm_db_index_operational stats
shows the low-level activity, such as locks and I/O, on an index that is in use. The view sys.dm_db_index_usage_stats
returns statistical counts of the various index operations that have occurred to an index over time. Both of these will be used more extensively in Chapter 12 when I discuss blocking.
To understand the overhead cost of an index on data manipulation queries, consider the following example. First, create a test table with 10,000 rows (create_test.sql
in the download):
IF (SELECT OBJECT_ID('t1') ) IS NOT NULL DROP TABLE dbo.t1 ; GO CREATE TABLE dbo.t1 (c1 INT, c2 INT, c3 CHAR(50)) ; SELECT TOP 10000 IDENTITY( INT,1,1 ) AS n INTO #Nums
FROM Master.dbo.SysColumns sc1 ,Master.dbo.SysColumns sc2 ; INSERT INTO dbo.t1 (c1, c2, c3) SELECT n ,n ,'C3' FROM #Nums ; DROP TABLE #Nums ;
If you then run an UPDATE
statement, like so:
UPDATE dbo.t1 SET c1 = 1, c2 = 1 WHERE c2 = 1 ;
the number of logical reads reported by SET STATISTICS IO
is as follows:
Table 't1'. Scan count 1, logical reads 87
After adding an index on column c1
like so:
CREATE CLUSTERED INDEX i1 ON t1(c1)
the resultant number of logical reads for the same UPDATE
statement increases from 87 to 95:
Table 't1'. Scan count 1, logical reads 95
Even though it is true that the amount of overhead required to maintain indexes increases for data manipulation queries, be aware that SQL Server must first find a row before it can update or delete it; therefore, indexes can be helpful for UPDATE
and DELETE
statements with complex WHERE
clauses as well. The increased efficiency in using the index to locate a row usually offsets the extra overhead needed to update the indexes, unless the table has a lot of indexes.
To understand how an index can benefit even data modification queries, let's build on the example. Create another index on table t1
This time, create the index on column c2
referred to in the WHERE
clause of the UPDATE
statement:
CREATE INDEX i2 ON t1(c2);
After adding this new index, run the UPDATE
command again:
UPDATE t1 SET c1 = 1, c2 = 1 WHERE c2 = 1;
the total number of logical reads for this UPDATE
statement decreases from 95 to 20 (= 15 + 5):
Table 't1'. Scan count 1, logical reads15
Table 'Worktable'. Scan count 1, logical reads5
A worktable is a temporary table used internally by SQL Server to process the intermediate results of a query. Worktables are created in the tempdb
database and are dropped automatically after query execution.
The examples in this section have demonstrated that although having an index adds some overhead cost to action queries, the overall result is a decrease in cost because of the beneficial effect of indexes on searching.
The main recommendations for index design are as follows:
Examine the WHERE
clause and join criteria columns.
Use narrow indexes.
Examine column uniqueness.
Examine the column data type.
Consider column order.
Consider the type of index (clustered vs. nonclustered).
Let's consider each of these recommendations in turn.
When a query is submitted to SQL Server, the query optimizer tries to find the best data access mechanism for every table referred to in the query. Here is how it does this:
The optimizer identifies the columns included in the WHERE
clause and the join criteria.
The optimizer then examines indexes on those columns.
The optimizer assesses the usefulness of each index by determining the selectivity of the clause (that is, how many rows will be returned) from statistics maintained on the index.
Finally, the optimizer estimates the least costly method of retrieving the qualifying rows, based on the information gathered in the previous steps.
Chapter 7 covers statistics in more depth.
To understand the significance of a WHERE
clause column in a query, let's consider an example. Let's return to the original code listing that helped you understand what an index is; the query consisted of a SELECT
statement without any WHERE
clause, as follows:
SELECT p.ProductID, p.[Name], p.StandardCost, p.[Weight] FROM Production.Product p;
The query optimizer performs a clustered index scan, the equivalent of a table scan against a heap on a table that has a clustered index, to read the rows as shown in Figure 4-6 (switch on the Show Execution Plan option by using Query Analyzer's Query menu, as well as the Set Statistics IO option by using Query Analyzer's Tools
The number of logical reads reported by SET STATISTICS IO
for the SELECT
statement is as follows:
Table 'Product'. Scan count 1, logical reads 15
To understand the effect of a WHERE
clause column on the query optimizer's decision, let's add a WHERE
clause to retrieve a single row:
SELECT p.ProductID, p.[Name], p.StandardCost, p.[Weight] FROM Production.Product p WHERE p.ProductID = 738;
With the WHERE
clause in place, the query optimizer examines the WHERE
clause column ProductID
identifies the availability of index PK_Product_ProductId
on column ProductId
assesses a high selectivity (that is, only one row will be returned) for the WHERE
clause from the statistics on index PK_Product_ProductId
and decides to use that index on column ProductId
as shown in Figure 4-7.
The resultant number of logical reads is as follows:
Table 'Product'. Scan count 0, logical reads 2
The behavior of the query optimizer shows that the WHERE
clause column helps the optimizer choose an optimal indexing operation for a query. This is also applicable for a column used in the join criteria between two tables. The optimizer looks for the indexes on the WHERE
clause column or the join criterion column and, if available, considers using the index to retrieve the rows from the table. The query optimizer considers index(es) on the WHERE
clause column(s) and the join criteria column(s) while executing a query. Therefore, having indexes on the frequently used columns in the WHERE
clause and the join criteria of a SQL query helps the optimizer avoid scanning a base table.
When the amount of data inside a table is so small that it fits onto a single page (8KB), a table scan may work better than an index seek. If you have a good index in place but you're still getting a scan, consider this issue.
You can create indexes on a combination of columns in a table. For the best performance, use as few columns in an index as you can. You should also avoid very wide data type columns in an index. Columns with string data types (CHAR
, VARCHAR
, NCHAR
, and NVARCHAR
sometimes can be quite wide as can binary; unless they are absolutely necessary, minimize the use of wide data type columns with large sizes in an index.
A narrow index can accommodate more rows in an 8KB index page than a wide index. This has the following effects:
Reduces I/O (by having to read fewer 8KB pages)
Makes database caching more effective, because SQL Server can cache fewer index pages, consequently reducing the logical reads required for the index pages in the memory
Reduces the storage space for the database
To understand how a narrow index can reduce the number of logical reads, create a test table with 20 rows and an index (narrowIDX_t1.sql
in the download):
IF(SELECT OBJECT_ID('t1')) IS NOT NULL DROP TABLE dbo.t1 GO CREATE TABLE dbo.t1(c1 INT, c2 INT); WITH Nums AS (SELECT 1 AS n UNION ALL SELECT n + 1 FROM Nums WHERE n < 20 ) INSERT INTO t1 (c1,c2) SELECT n,2 FROM Nums CREATE INDEX i1 ON t1(c1)
Since the indexed column is narrow (the INT
data type is 4 bytes), all the index rows can be accommodated in one 8KB index page. As shown in Figure 4-8, you can confirm this in the dynamic management views associated with indexes (base_indx.sql
in the download):
SELECT i.Name ,i.type_desc ,s.page_count ,s.record_count , s.index_level FROM sys.indexes i JOIN sys.dm_db_index_physical_stats(DB_ID(N'AdventureWorks2008'), OBJECT_ID(N'dbo.T1'), NULL, NULL, 'DETAILED') AS s ON i.index_id = s.index_id WHERE i.OBJECT_ID = OBJECT_ID(N'dbo.T1')
The sys.indexes
system table is stored in each database and contains the basic information on every index in the database. The dynamic management function, sys.dm_db_index_ physical_stats
contains the more detailed information about the statistics on the index (you'll learn more about this DMF in Chapter 7). To understand the disadvantage of a wide index key, modify the data type of the indexed column c1
from INT
to CHAR(500)
(narrow_alter.sql
in the download):
DROP INDEX t1.i1 ALTER TABLE t1 ALTER COLUMN c1 CHAR(500) CREATE INDEX i1 ON t1(c1)
The width of a column with the INT
data type is 4 bytes, and the width of a column with the CHAR(500)
data type is 500 bytes. Because of the large width of the indexed column, two index pages are required to contain all 20 index rows. You can confirm this in the sys.indexes
system table by running sysindex_select.sql
again (see Figure 4-9).
A large index key size increases the number of index pages, thereby increasing the amount of memory and disk activities required for the index. It is always recommended that the index key size is as narrow as possible.
Drop the test table before continuing:
DROP TABLE dbo.t1
Creating an index on columns with a very low range of possible values (such as gender) will not benefit performance, because the query optimizer will not be able to use the index to effectively narrow down the rows to be returned. Consider a Gender
column with only two unique values: M
and F
When you execute a query with the Gender
column in the WHERE
clause, you end up with a large number of rows from the table (assuming the distribution of M
and F
is even), resulting in a costly table or clustered index scan. It is always preferable to have columns in the WHERE
clause with lots of unique rows (or high selectivity) to limit the number of rows accessed. You should create an index on those column(s) to help the optimizer access a small result set.
Furthermore, while creating an index on multiple columns, which is also referred to as a composite index, column order matters. In some cases, using the most selective column first will help filter the index rows more efficiently.
The importance of column order in a composite index is explained later in the chapter in the "Consider Column Order" section.
From this, you can see that it is important to know the selectivity of a column before creating an index on it. You can find this by executing a query like this one; just substitute the table and column name:
SELECT COUNT(DISTINCT Gender) AS DistinctColValues ,COUNT(Gender) AS NumberOfRows ,(CAST(COUNT(DISTINCT Gender) AS DECIMAL) / CAST(COUNT(Gender) AS DECIMAL)) AS Selectivity FROM HumanResources.Employee
The column with the highest number of unique values (or selectivity) can be the best candidate for indexing when referred to in a WHERE
clause or a join criterion.
To understand how the selectivity of an index key column affects the use of the index, take a look at the Gender
column in the HumanResources.Employee
table. If you run the previous query, you'll see that it contains only 2 distinct values in more than 290 rows, which is a selectivity of .006. A query to look only for a Gender
of F
would look like this:
SELECT * FROM HumanResources.Employee WHERE Gender = 'F' AND SickLeaveHours = 59 AND MaritalStatus = 'M'
This results in the execution plan in Figure 4-10 and the following I/O and elapsed time:
Table 'Employee'. Scan count 1, logical reads 9 CPU time = 16 ms, elapsed time = 103 ms.
The data is returned by scanning the clustered index (where the data is stored) to find the appropriate values where Gender = 'F'
(The other operators will be covered in Chapter 9.) If you were to place an index on the column, like so:
CREATE INDEX IX_Employee_Test ON HumanResources.Employee (Gender)
and run the query again, the execution plan remains the same. The data is just not selective enough for the index to be used, let alone be useful. If instead you use a composite index that looks like this:
CREATE INDEX IX_Employee_Test ON HumanResources.Employee (SickLeaveHours, Gender, MaritalStatus) WITH (DROP_EXISTING = ON)
and then rerun the query to see the execution plan in Figure 4-11 and the performance results, you get this:
Table 'Employee'. Scan count 1, logical reads 6 CPU time = 0 ms, elapsed time = 32 ms.
Now you're doing better than you were with the clustered index scan. A nice clean Index Seek
operation takes about half the time to gather the data. The rest is spent in the Key Lookup
operation. A Key Lookup
operation is commonly referred to as a bookmark lookup.
You will learn more about bookmark lookups in Chapter 6.
Although none of the columns in question would probably be selective enough on their own to make a decent index, together they provide enough selectivity for the optimizer to take advantage of the index offered.
It is possible to attempt to force the query to use the first test index you created. If you drop the compound index, create the original again, and then modify the query as follows by using a query hint to force the use of the original index:
SELECT * FROM HumanResources.Employee WITH (INDEX (IX_Employee_Test)) WHERE SickLeaveHours = 59 AND Gender = 'F' AND MaritalStatus = 'M'
then the results and execution plan shown in Figure 4-12, while similar, are not the same:
Table 'Employee'. Scan count 1, logical reads 14 CPU time = 0 ms, elapsed time = 29 ms.
You see the same index seek, but the number of reads has more than doubled, and the estimated costs within the execution plan have changed. Although forcing the optimizer to choose an index is possible, it clearly isn't always an optimal approach.
Another way to force a different behavior on SQL Server 2008 is the FORCESEEK
query hint. FORCESEEK
makes it so the optimizer will choose only Index Seek
operations. If the query were rewritten like this:
SELECT * FROM HumanResources.Employee WITH (FORCESEEK) WHERE SickLeaveHours = 59 AND Gender = 'F' AND MaritalStatus = 'M'
which changes the I/O, execution time, and execution plan results yet again (Figure 4-13), you end up with these results:
Table 'Employee'. Scan count 1, logical reads 170 CPU time = 0 ms, elapsed time = 39 ms.
Limiting the options of the optimizer and forcing behaviors can in some situations help, but frequently, as shown with the results here, an increase in execution time and the number of reads is not helpful.
The data type of an index matters. For example, an index search on integer keys is very fast because of the small size and easy arithmetic manipulation of the INTEGER
(or INT
data type. You can also use other variations of integer data types (BIGINT
, SMALLINT
, and TINYINT
for index columns, whereas string data types (CHAR
, VARCHAR
, NCHAR
, and NVARCHAR
require a string match operation, which is usually costlier than an integer match operation.
Suppose you want to create an index on one column and you have two candidate columns—one with an INTEGER
data type and the other with a CHAR(4)
data type. Even though the size of both data types is 4 bytes in SQL Server 2008, you will still prefer the INTEGER
data type index. Look at arithmetic operations as an example. The value 1 in the CHAR(4)
data type is actually stored as 1 followed by three spaces, a combination of the following four bytes: 0x35
, 0x20
, 0x20
and 0x20.
The CPU doesn't understand how to perform arithmetic operations on this data, and therefore it converts to an integer data type before the arithmetic operations, whereas the value 1 in an INTEGER
data type is saved as 0x00000001
The CPU can easily perform arithmetic operations on this data.
Of course, most of the time, you won't have the simple choice between identically sized data types, allowing you to choose the more optimal type. Keep this information in mind when designing and building your indexes.
An index key is sorted on the first column of the index and then subsorted on the next column within each value of the previous column. The first column in a compound index is frequently referred to as the leading edge of the index. For example, consider Table 4-2.
If a composite index is created on the columns (c1
, c2
, then the index will be ordered as shown in Table 4-3.
As shown in Table 4-3, the data is sorted on the first column (c1
) in the composite index. Within each value of the first column, the data is further sorted on the second column (c2
).
Therefore, the column order in a composite index is an important factor in the effectiveness of the index. You can see this by considering the following:
Column uniqueness
Column width
Column data type
For example, suppose most of your queries on table t1
are similar to the following:
SELECT * FROM t1 WHERE c2=12 SELECT * FROM t1 WHERE c2=12 AND c1=11
An index on (c2
, c1
will benefit both the queries. But an index on (c1
, c2
will not be appropriate, because it will sort the data initially on column c1
whereas the first SELECT
statement needs the data to be sorted on column c2
.
To understand the importance of column ordering in an index, consider the following example. In the Production.Product
table, there is a column for StandardCost
and another for ListPrice
Create a temporary index on the table like this:
CREATE INDEX IX_Test ON Person.Address (City, PostalCode)
A simple SELECT
statement run against the table that will use this new index will look something like this:
SELECT * FROM Person.Address AS a WHERE City = 'Warrington'
The I/O and execution time for the query is as follows:
Table 'Address'. Scan count 1, logical reads 188 CPU time = 0 ms, elapsed time = 167 ms.
And the execution plan in Figure 4-14 shows the use of the index.
So, this query is taking advantage of the leading edge of the index to perform a Seek
operation to retrieve the data. If, instead of querying using the leading edge, you use another column in the index like the following query:
SELECT * FROM Person.Address AS a WHERE a.PostalCode = 'WA3 7BH'
then the results are as follows:
Table 'Address'. Scan count 1, logical reads 173
And the execution plan is clearly different, as you can see in Figure 4-15.
The reads for the second query are slightly lower than the first, but when you take into account that the first query returned 86 rows worth of data and the second query returned only 31, you begin to see the difference between the Index Seek
operation in Figure 4-14 and the Index Scan
operation in Figure 4-15. Also note that because it had to perform a scan, the optimizer marked the column as possibly missing an index.
Finally, to see the order of the index really shine, change the query to this:
SELECT a.AddressID ,a.City ,a.PostalCode FROM Person.Address AS a WHERE a.City = 'Warrington' AND a.PostalCode = 'WA3 7BH'
Executing this query will return the same 31 rows as the previous query, resulting in the following:
Table 'Address'. Scan count 1, logical reads 2
with the execution plan visible in Figure 4-16.
The radical changes in I/O and execution plan represent the real use of a compound index, the covering index. This is covered in detail in the section "Covering Indexes" later in the chapter.
When finished, drop the index:
DROP INDEX Person.Address.IX_Test
In SQL Server, you have two main index types: clustered and nonclustered. Both types have a B-tree structure. The main difference between the two types is that the leaf pages in a clustered index are the data pages of the table and are therefore in the same order as the data to which they point. This means that the clustered index is the table. As you proceed, you will see that the difference at the leaf level between the two index types becomes very important when determining the type of index to use.
The leaf pages of a clustered index and the data pages of the table the index is on are one and the same. Because of this, table rows are physically sorted on the clustered index column, and since there can be only one physical order of the table data, a table can have only one clustered index.
When you create a primary key constraint, SQL Server automatically creates it as a unique clustered index on the primary key if one does not already exist and if it is not explicitly specified that the index should be a unique nonclustered index. This is not a requirement; it's just default behavior. You can change it prior to creating the table.
As mentioned earlier in the chapter, a table with no clustered index is called a heap table. The data rows of a heap table are not stored in any particular order or linked to the adjacent pages in the table. This unorganized structure of the heap table usually increases the overhead of accessing a large heap table when compared to accessing a large nonheap table (a table with a clustered index).
There is an interesting relationship between a clustered index and the nonclustered indexes in SQL Server. An index row of a nonclustered index contains a pointer to the corresponding data row of the table. This pointer is called a row locator. The value of the row locator depends on whether the data pages are stored in a heap or are clustered. For a nonclustered index, the row locator is a pointer to the RID
for the data row in a heap. For a table with a clustered index, the row locator is the clustered index key value.
For example, say you have a heap table with no clustered index, as shown in Table 4-4.
A nonclustered index on column c1
in a heap will cause the row locator for the index rows to contain a pointer to the corresponding data row in the database table, as shown in Table 4-5.
Table 4.5. Nonclustered Index Page with No Clustered Index
c1 | Row Locator |
---|---|
A1 | Pointer to |
B1 | Pointer to |
On creating a clustered index on column c2
the row locator values of the nonclustered index rows are changed. The new value of the row locator will contain the clustered index key value, as shown in Table 4-6.
To verify this dependency between a clustered and a nonclustered index, let's consider an example. In the AdventureWorks2008 database, the table dbo.DatabaseLog
contains no clustered index, just a nonclustered primary key. If a query is run against it like the following:
SELECT dl.DatabaseLogId ,dl.PostTime FROM dbo.DatabaseLog AS dl WHERE DatabaseLogId = 115
then the execution will look like Figure 4-17.
As you can see, the index was used in a Seek
operation. But because the data is stored separately from the nonclustered index, an additional operation, the RID Lookup
operation, is required in order to retrieve the data, which is then joined back to the information from the Index Seek
operation through a Nested Loop
operation. This is a classic example of what is known as a bookmark lookup, which is explained in more detail in the "Defining the Bookmark Lookup" section. A similar query run against a table with a clustered index in place will look like this:
SELECT d.DepartmentID ,d.ModifiedDate FROM HumanResources.Department AS d WHERE d.DepartmentID = 10
Figure 4-18 shows this execution plan returned.
Although the primary key is used in the same way as the previous query, this time it's against a clustered index. As you now know, this means the data is stored with the index, so the additional column doesn't require a lookup operation to get the data. Everything is returned by the simple clustered Index Seek
operation.
To navigate from a nonclustered index row to a data row, this relationship between the two index types requires an additional indirection for navigating the B-tree structure of the clustered index. Without the clustered index, the row locator of the nonclustered index would be able to navigate directly from the nonclustered index row to the data row in the base table. The presence of the clustered index causes the navigation from the nonclustered index row to the data row to go through the B-tree structure of the clustered index, since the new row locator value points to the clustered index key.
On the other hand, consider inserting an intermediate row in the clustered index key order or expanding the content of an intermediate row. For example, imagine a clustered index table containing four rows per page, with clustered index column values of 1, 2, 4, and 5. Adding a new row in the table with the clustered index value 3 will require space in the page between values 2 and 4. If enough space is not available in that position, a page split will occur on the data page (or clustered index leaf page). Even though the data page split will cause relocation of the data rows, the nonclustered index row locator values need not be updated. These row locators continue to point to the same logical key values of the clustered index key, even though the data rows have physically moved to a different location. In the case of a data page split, the row locators of the nonclustered indexes need not be updated. This is an important point, since tables often have a large number of nonclustered indexes.
Page splits and their effect on performance are explained in more detail in Chapter 8.
The relationship between a clustered index and a nonclustered index imposes some considerations on the clustered index, which are explained in the sections that follow.
Create the Clustered Index First
Since all nonclustered indexes hold clustered index keys within their index rows, the order of nonclustered and clustered index creation is very important. For example, if the nonclustered indexes are built before the clustered index is created, then the nonclustered index row locator will contain a pointer to the corresponding RID
of the table. Creating the clustered index later will modify all the nonclustered indexes to contain clustered index keys as the new row locator value. This effectively rebuilds all the nonclustered indexes.
For the best performance, I recommend that you create the clustered index before you create any nonclustered index. This allows the nonclustered indexes to have their row locator set to the clustered index keys at the time of creation. This does not have any effect on the final performance, but rebuilding the indexes may be quite a large job.
Keep Indexes Narrow
Since all nonclustered indexes hold the clustered keys as their row locator, for the best performance keep the overall byte size of the clustered index as small as possible. If you create a wide clustered index, say CHAR(500)
this will add 500 bytes to every nonclustered index. Thus, keep the number of columns in the clustered index to a minimum, and carefully consider the byte size of each column to be included in the clustered index. A column of the INTEGER
data type usually makes a good candidate for a clustered index, whereas a string data type column will be a less-than-optimal choice.
To understand the effect of a wide clustered index on a nonclustered index, consider this example. Create a small test table with a clustered index and a nonclustered index (clust_ nonclust.sql
in the download):
IF (SELECT OBJECT_ID('t1')) IS NOT NULL DROP TABLE t1 ; GO CREATE TABLE t1 (c1 INT, c2 INT) ; WITH Nums AS (SELECT 1 AS n UNION ALL SELECT n + 1 FROM Nums WHERE n < 20 ) INSERT INTO t1 (c1, c2) SELECT n ,n + 1 FROM Nums CREATE CLUSTERED INDEX icl ON t1 (c2) ; CREATE NONCLUSTERED INDEX incl ON t1 (c1) ;
Since the table has a clustered index, the row locator of the nonclustered index contains the clustered index key value. Therefore:
Width of the nonclustered index row = Width of the nonclustered index column + Width of the clustered index column = size of INT data type + Size of INT data type |
= 4 bytes + 4 bytes = 8 bytes |
With this small size of a nonclustered index row, all the rows can be stored in one index page. You can confirm this by querying against the index statistics (base_index.sql
, as shown in Figure 4-19:
SELECT i.Name ,i.type_desc ,s.page_count ,s.record_count ,s.index_level
FROM sys.indexes i JOIN sys.dm_db_index_physical_stats(DB_ID(N'AdventureWorks2008'), OBJECT_ID(N'dbo.t1'), NULL, NULL, 'DETAILED') AS s ON i.index_id = s.index_id WHERE i.OBJECT_ID = OBJECT_ID(N'dbo.t1')
To understand the effect of a wide clustered index on a nonclustered index, modify the data type of the clustered indexed column c2
from INT
to CHAR(500)
DROP INDEX t1.icl ALTER TABLE t1 ALTER COLUMN c2 CHAR(500) CREATE CLUSTERED INDEX icl ON t1(c2)
Running sysindex_select2.sql
again returns the result in Figure 4-20.
You can see that a wide clustered index increases the width of the nonclustered index row size. Because of the large width of the nonclustered index row, one 8KB index page can't accommodate all the index rows. Instead, two index pages will be required to store all 20 index rows. In the case of a large table, an unreasonable expansion in the size of the nonclustered indexes because of a large clustered index key size can significantly increase the number of pages of the nonclustered indexes.
Therefore, a large clustered index key size not only affects its own width but also widens all nonclustered indexes on the table. This increases the number of index pages for all the indexes on the table, increasing the logical reads and disk I/Os required for the indexes.
Rebuild the Clustered Index in a Single Step
Because of the dependency of nonclustered indexes on the clustered index, rebuilding the clustered index as separate DROP INDEX
and CREATE INDEX
statements causes all the nonclustered indexes to be rebuilt twice. To avoid this, use the DROP_EXISTING
clause of the CREATE INDEX
statement to rebuild the clustered index in a single atomic step. Similarly, you can also use the DROP_EXISTING
clause with a nonclustered index.
When to Use a Clustered Index
In certain situations, using a clustered index is very helpful. I discuss these in the sections that follow.
Retrieving a Range of Data
Since the leaf pages of a clustered index and the data pages of the table are the same, the order of the clustered index column not only orders the rows of the clustered index but also physically orders the data rows. If the physical order of the data rows matches the order of data requested by a query, then the disk head can read all the rows sequentially, without much disk head movement. For example, if a query requests all the employee records belonging to the database group and the corresponding Employees
table has a clustered index on the Group
column, then all the relevant Employee
rows will be physically arranged together on the disk. This allows the disk head to move to the position of the first row on the disk and then electronically read all the data sequentially with minimal physical movement of the disk head. On the other hand, if the rows are not sorted on the disk in the correct physical order, the disk head has to move randomly from one location to another to fetch all the relevant rows. Since physical movement of the disk head constitutes a major portion of the cost of a disk operation, sorting the rows in the proper physical order on the disk (using a clustered index) optimizes the I/O cost.
One "range" of data that is accessed frequently in relational systems is the foreign key from another table. This data, depending on the access mechanisms of the application, is a great candidate for inclusion in the clustered index.
Retrieving Presorted Data
Clustered indexes are particularly efficient when the data retrieval needs to be sorted. If you create a clustered index on the column or columns that you may need to sort by, then the rows will be physically stored in that order, eliminating the overhead of sorting the data after it is retrieved.
Let's see this in action. Create a test table as follows (create_sort.sql
in the download):
IF (SELECT OBJECT_ID('od') ) IS NOT NULL DROP TABLE dbo.od ; GO SELECT * INTO dbo.od FROM Purchasing.PurchaseOrderDetail AS pod
The new table od
is created with data only. It doesn't have any indexes. You can verify the indexes on the table by executing the following, which returns nothing:
sp_helpindex dbo.od
To understand the use of a clustered index, fetch a large range of rows ordered on a certain column:
SELECT * FROM dbo.od WHERE od.ProductID BETWEEN 500 AND 510 ORDER BY od.ProductID
You can obtain the cost of executing this query (without any indexes) from the STATISTICS IO
output:
Table 'od'. Scan count 1, logical reads 79
To improve the performance of this query, you should create an index on the WHERE
clause column. This query requires both a range of rows and a sorted output. The result set requirement of this query meets the recommendations for a clustered index. Therefore, create a clustered index as follows, and reexamine the cost of the query:
CREATE CLUSTERED INDEX i1 ON od(ProductID)
When you run the query again, the resultant cost of the query (with a clustered index) is as follows:
Table 'od'. Scan count 1, logical reads 8
Creating the clustered index reduced the number of logical reads and therefore should contribute to the query performance improvement.
On the other hand, if you create a nonclustered index (instead of a clustered index) on the candidate column, then the query performance may be affected adversely. Let's verify the effect of a nonclustered index in this case:
DROP INDEX od.i1 CREATE NONCLUSTERED INDEX i1 on dbo.od(ProductID)
The resultant cost of the query (with a nonclustered index) is as follows:
Table 'od'. Scan count 1, logical reads 87
The nonclustered index significantly increases the number of logical reads, affecting the query performance accordingly. Drop the test table when you're done:
DROP TABLE dbo.od
For a query that retrieves a large range of rows and/or an ordered output, a clustered index is usually a better choice than a nonclustered index.
When Not to Use a Clustered Index
In certain situations, you are better off not using a clustered index. I discuss these in the sections that follow.
Frequently Updatable Columns
If the clustered index columns are frequently updated, this will cause the row locator of all the nonclustered indexes to be updated accordingly, significantly increasing the cost of the relevant action queries. This also affects database concurrency by blocking all other queries referring to the same part of the table and the nonclustered indexes during that period. Therefore, avoid creating a clustered index on columns that are highly updatable.
Chapter 12 covers blocking in more depth.
To understand how the cost of an UPDATE
statement that modifies only a clustered key column is increased by the presence of nonclustered indexes on the table, consider the following example. The Sales.SpecialOfferProduct
table has a composite clustered index on the primary key, which is also the foreign key from two different tables; this is a classic many-to-many join. In this example, I update one of the two columns using the following statement (note the use of the transaction to keep the test data intact):
BEGIN TRAN SET STATISTICS IO ON; UPDATE sales.SpecialOfferProduct SET ProductID = 345 WHERE SpecialOfferID = 1 AND productid = 720; SET STATISTICS IO OFF; ROLLBACK TRAN
The STATISTICS IO
output shows the reads necessary:
Table 'Product'. Scan count 0, logical reads 2 Table 'SalesOrderDetail'. Scan count 1, logical reads 1240 Table 'SpecialOfferProduct'. Scan count 0, logical reads 10
If you added a nonclustered index to the table, you would see the reads increase:
CREATE NONCLUSTERED INDEX ixTest ON sales.SpecialOfferProduct (ModifiedDate)
When you run the same query again, the output of STATISTICS IO
changes for the SpecialOfferProduct
table:
Table 'Product'. Scan count 0, logical reads 2 Table 'SalesOrderDetail'. Scan count 1, logical reads 1240 Table 'SpecialOfferProduct'. Scan count 0, logical reads 19
As you can see, the number of reads caused by the update of the clustered index is increased with the addition of the nonclustered index. Be sure to drop the index:
DROP INDEX Sales.SpecialOfferProduct.ixTest
Wide Keys
Since all nonclustered indexes hold the clustered keys as their row locator, for performance reasons you should avoid creating a clustered index on a very wide column (or columns) or on too many columns. As explained in the preceding section, a clustered index must be as narrow as possible.
Too Many Concurrent Inserts in Sequential Order
If you want to add many new rows concurrently, then it may be better for performance to distribute them across the data pages of the table. However, if you add all the rows in the same order as that imposed by the clustered index, then all the inserts will be attempted on the last page of the table. This may cause a huge "hot spot" on the corresponding sector of the disk. To avoid this disk hot spot, you should not arrange the data rows in the same order as their physical locations. The inserts can be randomized throughout the table by creating a clustered index on another column that doesn't arrange the rows in the same order as that of the new rows. This is an issue only with a large number of simultaneous inserts.
There is a caveat to this recommendation. Allowing inserts on the bottom of the table prevents page splits on the intermediate pages that are required to accommodate the new rows in those pages. If the number of concurrent inserts is low, then ordering the data rows (using a clustered index) in the order of the new rows will prevent intermediate page splits. However, if the disk hot spot becomes a performance bottleneck, then new rows can be accommodated in intermediate pages without causing page splits by reducing the fill factor of the table. In addition, the "hot" pages will be in memory, which also benefits performance.
Chapter 8 covers the fill factor in depth.
A nonclustered index does not affect the order of the data in the table pages, because the leaf pages of a nonclustered index and the data pages of the table are separate. A pointer (the row locator) is required to navigate from an index row to the data row. As you learned in the earlier "Clustered Indexes" section, the structure of the row locator depends on whether the data pages are stored in a heap or a clustered index. For a heap, the row locator is a pointer to the RID
for the data row; for a table with a clustered index, the row locator is the clustered index key.
The row locator value of the nonclustered indexes continues to have the same clustered index value, even when the clustered index rows are physically relocated.
To optimize this maintenance cost, SQL Server adds a pointer to the old data page to point to the new data page after a page split, instead of updating the row locator of all the relevant nonclustered indexes. Although this reduces the maintenance cost of the nonclustered indexes, it increases the navigation cost from the nonclustered index row to the data row, since an extra link is added between the old data page and the new data page. Therefore, having a clustered index as the row locator decreases this overhead associated with the nonclustered index.
When a query requests columns that are not part of the nonclustered index chosen by the optimizer, a lookup is required. This may be a key lookup when going against a clustered index or an RID
lookup when performed against a heap. The common term for these lookups comes from the old definition name, bookmark lookup. The lookup fetches the corresponding data row from the table by following the row locator value from the index row, requiring a logical read on the data page besides the logical read on the index page. However, if all the columns required by the query are available in the index itself, then access to the data page is not required. This is known as a covering index.
These bookmark lookups are the reason that large result sets are better served with a clustered index. A clustered index doesn't require a bookmark lookup, since the leaf pages and data pages for a clustered index are the same.
Chapter 6 covers bookmark lookups in more detail.
Since a table can have only one clustered index, you can use the flexibility of multiple nonclustered indexes to help improve performance. I explain the factors that decide the use of a nonclustered index in the following sections.
When to Use a Nonclustered Index
A nonclustered index is most useful when all you want to do is retrieve a small number of rows from a large table. As the number of rows to be retrieved increases, the overhead cost of the bookmark lookup rises proportionately. To retrieve a small number of rows from a table, the indexed column should have a very high selectivity.
Furthermore, there will be indexing requirements that won't be suitable for a clustered index, as explained in the "Clustered Indexes" section:
Frequently updatable columns
Wide keys
In these cases, you can use a nonclustered index, since, unlike a clustered index, it doesn't affect other indexes in the table. A nonclustered index on a frequently updatable column isn't as costly as having a clustered index on that column. The UPDATE
operation on a nonclustered index is limited to the base table and the nonclustered index. It doesn't affect any other nonclustered indexes on the table. Similarly, a nonclustered index on a wide column (or set of columns) doesn't increase the size of any other index, unlike that with a clustered index. However, remain cautious, even while creating a nonclustered index on a highly updatable column or a wide column (or set of columns), since this can increase the cost of action queries, as explained earlier in the chapter.
A nonclustered index can also help resolve blocking and deadlock issues. I cover this in more depth in Chapters 12 and 13.
When Not to Use a Nonclustered Index
Nonclustered indexes are not suitable for queries that retrieve a large number of rows. Such queries are better served with a clustered index, which doesn't require a separate bookmark lookup to retrieve a data row. Since a bookmark lookup requires an additional logical read on the data page besides the logical read on the nonclustered index page, the cost of a query using a nonclustered index increases significantly for a large number of rows. The SQL Server query optimizer takes this cost into effect and accordingly discards the nonclustered index when retrieving a large result set.
If your requirement is to retrieve a large result set from a table, then having a nonclustered index on the filter criterion (or the join criterion) column will not be useful unless you use a special type of nonclustered index called a covering index. I describe this index type in detail later in the chapter.
The main considerations in choosing between a clustered and a nonclustered index are as follows:
Number of rows to be retrieved
Data-ordering requirement
Index key width
Column update frequency
Bookmark cost
Any disk hot spots
When deciding upon a type of index on a table with no indexes, the clustered index is usually the preferred choice. Because the index page and the data pages are the same, the clustered index doesn't have to jump from the index row to the base row as required in the case of a nonclustered index.
To understand how a clustered index can outperform a nonclustered index in most circumstances, even in retrieving small number of rows, create a test table with a high selectivity for one column (cluster_bene.sql
in the download):
IF(SELECT OBJECT_ID('t1')) IS NOT NULL DROP TABLE t1 GO CREATE TABLE t1(c1 INT, c2 INT); SELECT TOP 10000 IDENTITY( INT,1,1 ) AS n INTO #Nums FROM Master.dbo.SysColumns sc1 ,Master.dbo.SysColumns sc2 ; INSERT INTO t1 (c1,c2) SELECT n,2 FROM #Nums DROP TABLE #Nums ;
The following SELECT
statement fetches only 1 out of 10,000 rows from the table:
SELECT c1 ,c2 FROM t1 WHERE c1 = 1
with the graphical execution plan shown in Figure 4-21 and the output of SET STATISTICS IO
as follows:
Table 't1'. Scan count 1, logical reads 22
Considering the small size of the result set retrieved by the preceding SELECT
statement, a nonclustered column on c1
can be a good choice:
CREATE NONCLUSTERED INDEX incl ON t1(c1)
You can run the same SELECT
command again. Since retrieving a small number of rows through a nonclustered index is more economical than a table scan, the optimizer used the nonclustered index on column c1
as shown in Figure 4-22. The number of logical reads reported by STATISTICS IO
is as follows:
Table 't1'. Scan count 1, logical reads 3
Even though retrieving a small result set using a column with high selectivity is a good pointer toward creating a nonclustered index on the column, a clustered index on the same column can be equally beneficial or even better. To evaluate how the clustered index can be more beneficial than the nonclustered index, create a clustered index on the same column:
CREATE CLUSTERED INDEX icl ON t1(c1)
Run the same SELECT
command again. From the resultant execution plan (see Figure 4-22) of the preceding SELECT
statement, you can see that the optimizer used the clustered index (instead of the nonclustered index) even for a small result set. The number of logical reads for the SELECT
statement decreased from three to two (Figure 4-23):
Table 't1'. Scan count 1, logical reads 2
Even though a clustered index can outperform a nonclustered index in many instances of data retrieval, a table can have only one clustered index. Therefore, reserve the clustered index for a situation in which it can be of the greatest benefit.
As you learned in the previous section, a nonclustered index is preferred over a clustered index in the following situations:
When the index key size is large.
To avoid the overhead cost associated with a clustered index since rebuilding the clustered index rebuilds all the nonclustered indexes of the table.
To resolve blocking by having a database reader work on the pages of a nonclustered index, while a database writer modifies other columns (not included in the nonclustered index) in the data page. In this case, the writer working on the data page won't block a reader that can get all the required column values from the nonclustered index without hitting the base table. I'll explain this in detail in Chapter 12.
When all the columns (from a table) referred to by a query can be safely accommodated in the nonclustered index itself, as explained in this section.
As already established, the data-retrieval performance when using a nonclustered index is generally poorer than that when using a clustered index, because of the cost associated in jumping from the nonclustered index rows to the data rows in the base table. In cases where the jump to the data rows is not required, the performance of a nonclustered index should be just as good as—or even better than—a clustered index. This is possible if the nonclustered index key includes all the columns required from the table.
To understand the situation where a nonclustered index can outperform a clustered index, consider the following example. Assume for our purposes that you need to examine the credit cards that are expiring between the months of June and September of 2008. You may have a query that returns a large number of rows and looks like this:
SELECT cc.CreditCardID ,cc.CardNumber ,cc.ExpMonth ,cc.ExpYear FROM Sales.CreditCard cc WHERE cc.ExpMonth BETWEEN 6 AND 9 AND cc.ExpYear = 2008 ORDER BY cc.ExpMonth
The following are the I/O results, and Figure 4-24 shows the execution plan:
Table 'CreditCard'. Scan count 1, logical reads 189
The clustered index is on the primary key, and although most access against the table may be through that key, making the index useful, the cluster in this instance is just not performing in the way you need. Although you could expand the definition of the index to include all the other columns in the query, they're not really needed to make the cluster function, and they would interfere with the operation of the primary key. In this instance, creating a different index is in order:
CREATE NONCLUSTERED INDEX ixTest ON Sales.CreditCard (ExpMonth, ExpYear, CardNumber)
Now when the query is run again, this is the result:
Table 'CreditCard'. Scan count 1, logical reads 30
Figure 4-25 shows the corresponding execution plan.
In this case, the SELECT
statement doesn't include any column that requires a jump from the nonclustered index page to the data page of the table, which is what usually makes a nonclustered index costlier than a clustered index for a large result set and/or sorted output. This kind of nonclustered index is called a covering index.
Clean up the index after the testing is done:
DROP INDEX Sales.CreditCard.ixTest
A few of the more advanced indexing techniques that you can also consider are as follows:
Covering indexes: These were introduced in the preceding section.
Index intersections: Use multiple nonclustered indexes to satisfy all the column requirements (from a table) for a query.
Index joins: Use the index intersection and covering index techniques to avoid hitting the base table.
Filtered indexes: To be able to index fields with odd data distributions or sparse columns, a filter can be applied to an index so that it indexes only some data.
Indexed views: This materializes the output of a view on disk.
I cover these topics in more detail in the following sections.
A covering index is a nonclustered index built upon all the columns required to satisfy a SQL query without going to the base table. If a query encounters an index and does not need to refer to the underlying data table at all, then the index can be considered a covering index. For example, in the following SELECT
statement, irrespective of where the columns are referred, all the columns (StateProvinceId
and PostalCode
should be included in the nonclustered index to cover the query fully:
SELECT a.PostalCode FROM Person.Address AS a WHERE a.StateProvinceID = 42
Then all the required data for the query can be obtained from the nonclustered index page, without accessing the data page. This helps SQL Server save logical and physical reads. If you run the query, you'll get the following I/O and execution time as well as the execution plan in Figure 4-26:
Table 'Address'. Scan count 1, logical reads 18 CPU time = 15 ms, elapsed time = 32 ms.
Here you have a classic bookmark lookup with the Key Lookup
operator pulling the PostalCode
data from the clustered index and joining it with the Index Seek
operator against the IX_Address_StateProvinceId
index.
Although you can re-create the index with both key columns, another way to make an index a covering index is to use the new INCLUDE
operator. This stores data with the index without changing the structure of the index itself. Use the following to re-create the index:
CREATE NONCLUSTERED INDEX [IX_Address_StateProvinceID] ON [Person].[Address] ([StateProvinceID] ASC) INCLUDE (PostalCode) WITH ( DROP_EXISTING = ON)
If you rerun the query, the execution plan (Figure 4-27), I/O, and execution time change:
Table 'Address'. Scan count 1, logical reads 2 CPU time = 0 ms, elapsed time = 0 ms.
The reads have dropped from 18 to 2, and the execution plan is just about as simple as it's possible to be; it's a single Index Seek
operation against the new and improved index, which is now covering. A covering index is a useful technique for reducing the number of logical reads of a query. Adding columns using the INCLUDE
statement makes this functionality easier to achieve without adding to the number of columns in an index or the size of the index key since the included columns are stored only at the leaf level of the index.
The INCLUDE
is best used in the following cases:
You don't want to increase the size of the index keys, but you still want to make the index a covering index.
You're indexing a data type that can't be indexed (except text, ntext, and images).
You've already exceeded the maximum number of key columns for an index (although this is a problem best avoided).
A Pseudoclustered Index
The covering index physically organizes the data of all the indexed columns in a sequential order. Thus, from a disk I/O perspective, a covering index that doesn't use included columns becomes a clustered index for all queries satisfied completely by the columns in the covering index. If the result set of a query requires a sorted output, then the covering index can be used to physically maintain the column data in the same order as required by the result set—it can then be used in the same way as a clustered index for sorted output. As shown in the previous example, covering indexes can give better performance than clustered indexes for queries requesting a range of rows and/or sorted output. The included columns are not part of the key and therefore wouldn't offer the same benefits for ordering as the key columns of the index.
Recommendations
To take advantage of covering indexes, be careful with the column list in SELECT
statements. Use as few columns as possible to keep the index key size small for the covering indexes. Add columns using the INCLUDE
statement in places where it makes sense. Since a covering index includes all columns used in a query, it has a tendency to be very wide, increasing the maintenance cost of the covering indexes. You must balance the maintenance cost with the performance gain that the covering index brings. If the number of bytes from all the columns in the index is small compared to the number of bytes in a single data row of that table and you are certain the query taking advantage of the covered index will be executed frequently, then it may be beneficial to use a covering index.
Covering indexes can also help resolve blocking and deadlocks, as you will see in Chapters 12 and 13.
Before building a lot of covering indexes, consider how SQL Server can effectively and automatically create covering indexes for queries on the fly using index intersection.
If a table has multiple indexes, then SQL Server can use multiple indexes to execute a query. SQL Server can take advantage of multiple indexes, selecting small subsets of data based on each index and then performing an intersection of the two subsets (that is, returning only those rows that meet all the criteria). SQL Server can exploit multiple indexes on a table and then employ a join algorithm to obtain the index intersection between the two subsets.
In the following SELECT
statement, for the WHERE
clause columns the table has a nonclustered index on the SalesPersonID
column, but it has no index on the OrderDate
column:
SELECT soh.* FROM Sales.SalesOrderHeader AS soh WHERE soh.SalesPersonID = 276 AND soh.OrderDate BETWEEN '4/1/2002' and '7/1/2002'
Figure 4-28 shows the execution plan for this query.
As you can see, the optimizer didn't use the nonclustered index on the SalesPersonID
column. Since the value of the OrderDate
column is also required, the optimizer chose the clustered index to fetch the value of all the referred columns.
To improve the performance of the query, the OrderDate
column can be added to the nonclustered index on the SalesPersonId
column or defined as an include column on the same index. But in this real-world scenario, you may have to consider the following while modifying an existing index:
It may not be permissible to modify an existing index for various reasons.
The existing nonclustered index key may be already quite wide.
The cost of the queries using the existing index will be affected by the modification.
In such cases, you can create a new nonclustered index on the OrderDate
column:
CREATE NONCLUSTERED INDEX IX_Test ON Sales.SalesOrderHeader (OrderDate)
Run your SELECT
command again.
Figure 4-29 shows the resultant execution plan of the SELECT
statement.
As you can see, SQL Server exploited both the nonclustered indexes as index seeks (rather than scans) and then employed an intersection algorithm to obtain the index intersection of the two subsets. It then did a Key Lookup
from the resulting data to retrieve the rest of the data not included in the indexes.
To improve the performance of a query, SQL Server can use multiple indexes on a table. Therefore, instead of creating wide index keys, consider creating multiple narrow indexes. SQL Server will be able to use them together where required, and when not required, queries benefit from narrow indexes. While creating a covering index, determine whether the width of the index will be acceptable and whether using include columns will get the job done. If not, then identify the existing nonclustered indexes that include most of the columns required by the covering index. You may already have two existing nonclustered indexes that jointly serve all the columns required by the covering index. If it is possible, rearrange the column order of the existing nonclustered indexes appropriately, allowing the optimizer to consider an index intersection between the two nonclustered indexes.
At times, it is possible that you may have to create a separate nonclustered index for the following reasons:
Reordering the columns in one of the existing indexes is not allowed.
Some of the columns required by the covering index may not be included in the existing nonclustered indexes.
The total number of columns in the two existing nonclustered indexes may be more than the number of columns required by the covering index.
In such cases, you can create a nonclustered index on the remaining columns. If the combined column order of the new index and an existing nonclustered index meets the requirement of the covering index, the optimizer will be able to use index intersection. While identifying the columns and their order for the new index, try to maximize their benefit by keeping an eye on other queries, too.
Drop the index that was created for the tests:
DROP INDEX Sales.SalesOrderHeader.IX_Test
The index join is a variation of index intersection, where the covering index technique is applied to the index intersection. If no single index covers a query but multiple indexes together can cover the query, SQL Server can use an index join to satisfy the query fully without going to the base table.
Let's look at this indexing technique at work. Make a slight modification to the query from the "Index Intersections" section like this:
SELECT soh.SalesPersonID,soh.OrderDate FROM Sales.SalesOrderHeader AS soh WHERE soh.SalesPersonID = 276 AND soh.OrderDate BETWEEN '4/1/2002' and '7/1/2002'
The execution plan for this query is shown in Figure 4-30, and the reads are as follows:
Table 'SalesOrderHeader'. Scan count 1, logical reads 686
As shown in Figure 4-30, the optimizer didn't use the existing nonclustered index on the SalesPersonID
column. Since the query requires the value of the OrderDate
column also, the optimizer selected the clustered index to retrieve values for all the columns referred to in the query. If an index is created on the OrderDate
column like this:
CREATE NONCLUSTERED INDEX [IX_Test] ON [Sales].[SalesOrderHeader] ([OrderDate] ASC);
and the query is rerun, then Figure 4-31 shows the result, and you can see the reads here:
Table 'Worktable'. Scan count 0, logical reads 0 Table 'SalesOrderHeader'. Scan count 2, logical reads 8
The combination of the two indexes acts like a covering index reducing the reads against the table from 686 to 8 because it's using two Index Seek
operations joined together instead of a clustered index scan.
But what if the WHERE
clause didn't result in both indexes being used? Instead, you know that both indexes exist and that a seek against each would work like the previous query, so you choose to use an index hint:
SELECT soh.SalesPersonID ,soh.OrderDate FROM Sales.SalesOrderHeader AS soh WITH (INDEX (IX_Test, IX_SalesOrderHeader_SalesPersonId)) WHERE soh.OrderDate BETWEEN '4/1/2002' AND '7/1/2002';
The results of this new query are shown in Figure 4-32, and the I/O is as follows:
Table 'Worktable'. Scan count 0, logical reads 0 Table 'SalesOrderHeader'. Scan count 2, logical reads 62
The reads have clearly increased, and the estimated costs against the very same indexes that had been picked by the query optimizer are now much higher. Most of the time, the optimizer makes very good choices when it comes to indexes and execution plans. Although query hints are available to allow you to take control from the optimizer, this control can cause as many problems as it solves. In attempting to force an index join as a performance benefit, instead the forced selection of indexes slowed down the execution of the query.
While generating a query execution plan, the SQL Server optimizer goes through the optimization phases not only to determine the type of index and join strategy to be used but also to evaluate the advanced indexing techniques such as index intersection and index join. Therefore, instead of creating wide covering indexes, consider creating multiple narrow indexes. SQL Server can use them together to serve as a covering index yet use them separately where required.
A filtered index is a nonclustered index that uses a filter, basically a WHERE
clause, to create a highly selective set of keys against a column or columns that may not have good selectivity otherwise. For example, a column with a large number of null
values may be stored as a sparse column to reduce the overhead of those null
values. Adding a filtered index to the column will allow you to have an index available on the data that is not null
The best way to understand this is to see it in action.
The Sales.SalesOrderHeader
table has more than 30,000 rows. Of those rows, 27,000+ have a null
value in the PurchaseOrderNumber
column and the SalesPersonId
column. If you wanted to get a simple list of purchase order numbers, the query might look like this:
SELECT soh.PurchaseOrderNumber ,soh.OrderDate ,soh.ShipDate ,soh.SalesPersonID FROM Sales.SalesOrderHeader AS soh WHERE PurchaseOrderNumber LIKE 'PO5%' AND soh.SalesPersonID IS NOT NULL;
Running the query results in, as you might expect, a clustered index scan, and the following I/O and execution time, as shown in Figure 4-33:
Table 'SalesOrderHeader'. Scan count 1, logical reads 686 CPU time = 0 ms, elapsed time = 619 ms.
To fix this, it is possible to create an index and include some of the columns from the query to make this a covering index:
CREATE NONCLUSTERED INDEX IX_Test ONSales.SalesOrderHeader(PurchaseOrderNumber
,SalesPersonId)
INCLUDE (OrderDate,ShipDate);
When you rerun the query, the performance improvement is fairly radical (see Figure 4-33 and the I/O and time in the following result):
Table 'SalesOrderHeader'. Scan count 1, logical reads 5 CPU time = 0 ms, elapsed time = 139 ms.
As you can see, the covering index dropped the reads from 686 down to 5 and the time from 619 ms to 139 ms. Normally, this would be enough. Assume for a moment that this query has to be called frequently. Very frequently. Now, every bit of speed you can wring from it will pay dividends. Knowing that so much of the data in the indexed columns is null
you can adjust the index so that it filters out the null
values, which aren't used by the index anyway, reducing the size of the tree and therefore the amount of searching required:
CREATE NONCLUSTERED INDEX IX_Test ON Sales.SalesOrderHeader(PurchaseOrderNumber ,SalesPersonId) INCLUDE (OrderDate,ShipDate) WHERE PurchaseOrderNumber IS NOT NULL AND SalesPersonId IS NOT NULL WITH (DROP_EXISTING = ON);
The final run of the query is visible in the following result and in Figure 4-35:
Table 'SalesOrderHeader'. Scan count 1, logical reads 4 CPU time = 0 ms, elapsed time = 36 ms.
Although in terms of sheer numbers reducing the reads from 5 to 4 isn't much, it is a 20 percent reduction in the I/O cost of the query, and if this query were running hundreds or even thousands of times in a minute, as some queries do, that 20 percent reduction would be a great payoff indeed. The real payoff is visible in the execution time, which dropped again from 139 ms to 36 ms.
Filtered indexes pay off in many ways:
Improving the efficiency of queries by reducing the size of the index
Reducing storage costs by making smaller indexes
Cutting down on the costs of index maintenance because of the reduced size
One of the first places suggested for their use is just like the previous example, eliminating null
values from the index. You can also isolate frequently accessed sets of data with a special index so that the queries against that data perform much faster. You can use the WHERE
clause to filter data in a fashion similar to creating an indexed view (covered in more detail in the "Indexed Views" section) without the data maintenance headaches associated with indexed views by creating a filtered index that is a covering index, just like the earlier example.
Filtered indexes require a specific set of ANSI settings when they are accessed or created:
ON: ANSI_NULLS, ANSI_PADDING, ANSI_WARNINGS, ARITHABORT, CONCAT_NULL_YIELDS_NULL, QUOTED_IDENTIFIER
OFF: NUMERIC_ROUNDABORT
When completed, drop the testing index:
DROP INDEX Sales.SalesOrderHeader.IX_Test;
A database view in SQL Server is like a virtual table that represents the output of a SELECT
statement. You create a view using the CREATE VIEW
statement, and you can query it exactly like a table. In general, a view doesn't store any data—only the SELECT
statement associated with it. Every time a view is queried, it further queries the underlying tables by executing its associated SELECT
statement.
A database view can be materialized on the disk by creating a unique clustered index on the view. Such a view is referred to as an indexed view or a materialized view. After a unique clustered index is created on the view, the view's result set is materialized immediately and persisted in physical storage in the database, saving the overhead of performing costly operations during query execution. After the view is materialized, multiple nonclustered indexes can be created on the indexed view.
Benefit
You can use an indexed view to increase the performance of a query in the following ways:
Aggregations can be precomputed and stored in the indexed view to minimize expensive computations during query execution.
Tables can be prejoined, and the resulting data set can be materialized.
Combinations of joins or aggregations can be materialized.
Overhead
Indexed views can produce major overhead on an OLTP database. Some of the overheads of indexed views are as follows:
Any change in the base table(s) has to be reflected in the indexed view by executing the view's SELECT
statement.
Any changes to a base table on which an indexed view is defined may initiate one or more changes in the nonclustered indexes of the indexed view. The clustered index will also have to be changed if the clustering key is updated.
The indexed view adds to the ongoing maintenance overhead of the database.
Additional storage is required in the database.
The restrictions on creating an indexed view include the following:
The first index on the view must be a unique clustered index.
Nonclustered indexes on an indexed view can be created only after the unique clustered index is created.
The view definition must be deterministic—that is, it is able to return only one possible result for a given query. (A list of deterministic and nondeterministic functions is provided in SQL Server Books Online.)
The indexed view must reference only base tables in the same database, not other views.
The indexed view may contain float columns. However, such columns cannot be included in the clustered index key.
The indexed view must be a schema bound to the tables referred to in the view to prevent modifications of the table schema.
There are several restrictions on the syntax of the view definition. (A list of the syntax limitations on the view definition is provided in SQL Server Books Online.)
The list of SET
options that must be fixed are as follows:
ON: ARITHABORT, CONCAT_NULL_YIELDS_NULL, QUOTED_IDENTIFIER, ANSI_NULLS, ANSI_PADDING, and ANSI_WARNING
OFF: NUMERIC_ROUNDABORT
Usage Scenarios
Reporting systems benefit the most from indexed views. OLTP systems with frequent writes may not be able to take advantage of the indexed views because of the increased maintenance cost associated with updating both the view and the underlying base tables. The net performance improvement provided by an indexed view is the difference between the total query execution savings offered by the view and the cost of storing and maintaining the view.
An indexed view need not be referenced in the query for the query optimizer to use it during query execution. This allows existing applications to benefit from the newly created indexed views without changing those applications. The query optimizer considers indexed views only for queries with nontrivial cost.
Let's see how indexed views work with the following example. Consider the three queries in Listings 4-1 through 4-3.
Example 4.1. Query1.sql
SELECT p.[Name] AS ProductName ,SUM(pod.OrderQty) AS OrderQty ,SUM(pod.ReceivedQty) AS ReceivedQty ,SUM(pod.RejectedQty) AS RejectedQty FROM Purchasing.PurchaseOrderDetail AS pod JOIN Production.Product AS p ON p.ProductID = pod.ProductID GROUP BY p.[Name]
Example 4.2. Query2.sql
SELECT p.[Name] AS ProductName ,SUM(pod.OrderQty) AS OrderQty ,SUM(pod.ReceivedQty) AS ReceivedQty ,SUM(pod.RejectedQty) AS RejectedQty FROM Purchasing.PurchaseOrderDetail AS pod JOIN Production.Product AS p ON p.ProductID = pod.ProductID GROUP BY p.[Name] HAVING (SUM(pod.RejectedQty)/SUM(pod.ReceivedQty)) > .08
Example 4.3. Query3.sql
SELECT p.[Name] AS ProductName ,SUM(pod.OrderQty) AS OrderQty ,SUM(pod.ReceivedQty) AS ReceivedQty ,SUM(pod.RejectedQty) AS RejectedQty FROM Purchasing.PurchaseOrderDetail AS pod JOIN Production.Product AS p ON p.ProductID = pod.ProductID WHERE p.[Name] LIKE 'Chain%' GROUP BY p.[Name]
All the three queries use the aggregation function SUM
on columns of the PurchaseOrderDetail
table. Therefore, you can create an indexed view to precompute these aggregations and minimize the cost of these complex computations during query execution.
Listings 4-4 through 4-6 show the number of logical reads performed by these queries to access the appropriate tables.
Example 4.4. Logical Reads by Query1
Table 'Worktable'. Scan count 0, logical reads 0 Table 'Product'. Scan count 1, logical reads 5 Table 'PurchaseOrderDetail'. Scan count 1, logical reads 66 CPU time = 15 ms, elapsed time = 100 ms.
Example 4.5. Logical Reads by Query2
Table 'Worktable'. Scan count 0, logical reads 0 Table 'Product'. Scan count 1, logical reads 5 Table 'PurchaseOrderDetail'. Scan count 1, logical reads 66 CPU time = 16 ms, elapsed time = 10 ms.
Example 4.6. Logical Reads by Query3
Table 'PurchaseOrderDetail'. Scan count 5, logical reads 894 Table 'Product'. Scan count 1, logical reads 2 CPU time = 0 ms, elapsed time = 46 ms.
Create an indexed view to precompute the costly computations and join the tables (odView.sql
in the download):
IF EXISTS ( SELECT * FROM sys.views WHERE object_id = OBJECT_ID(N'[Purchasing].[IndexedView]') ) DROP VIEW [Purchasing].[IndexedView] GO CREATE VIEW Purchasing.IndexedView WITH SCHEMABINDING
AS SELECT pod.ProductID ,SUM(pod.OrderQty) AS OrderQty ,SUM(pod.ReceivedQty) AS ReceivedQty ,SUM(pod.RejectedQty) AS RejectedQty ,COUNT_BIG(*) AS [Count] FROM Purchasing.PurchaseOrderDetail AS pod GROUP BY pod.ProductID GO CREATE UNIQUE CLUSTERED INDEX iv ON Purchasing.IndexedView (ProductId) ; GO
Certain constructs such as AVG
are disallowed. (For the complete list of disallowed constructs, refer to SQL Server Books Online.) If aggregates are included in the view, as in this one, you must include COUNT_BIG
by default.
The indexed view materializes the output of the aggregate functions on the disk. This eliminates the need for computing the aggregate functions during the execution of a query interested in the aggregate outputs. For example, Query3.sql
requests the sum of ReceivedQty
and RejectedQty
for certain products from the PurchaseOrderDetail
table. Because these values are materialized in the indexed view for every product in the PurchaseOrderDetail
table, you can fetch these preaggregated values using the following SELECT
statement on the indexed view:
SELECT ProductID, ReceivedQty, RejectedQty FROM Purchasing.IndexedView
As shown in the execution plan in Figure 4-36, the SELECT
statement retrieves the values directly from the indexed view without accessing the base table (PurchaseOrderDetail
.
The indexed view benefits not only the queries based on the view directly but also other queries that may be interested in the materialized data. For example, with the indexed view in place, the three queries on PurchaseOrderDetail
benefit without being rewritten (see the execution plan in Figure 4-37 for the execution plan from the first query), and the number of logical reads decreases, as shown in Listings 4-7 through 4-9.
Example 4.7. Logical Reads by Query1
with an Indexed View
Table 'Product'. Scan count 1, logical reads 13 Table 'IndexedView'. Scan count 1, logical reads 4 CPU time = 0 ms, elapsed time = 48 ms.
Example 4.8. Logical Reads by Query2
with an Indexed View
Table 'Product'. Scan count 1, logical reads 13 Table 'IndexedView'. Scan count 1, logical reads 4 CPU time = 0 ms, elapsed time = 0 ms.
Example 4.9. Logical Reads by Query3
with an Indexed View
Table 'IndexedView'. Scan count 0, logical reads 10 Table 'Product'. Scan count 1, logical reads 2 CPU time = 0 ms, elapsed time = 16 ms.
Even though the queries are not modified to refer to the new indexed view, the optimizer still uses the indexed view to improve performance. Thus, even existing queries in the database application can benefit from new indexed views without any modifications to the queries.
Data and index compression was introduced in SQL Server 2008 (available in the Enterprise and Developer Editions). Compressing an index means getting more key information onto a single page. This can lead to serious performance improvements because fewer pages and fewer index levels are needed to store the index. There will be overhead in the CPU and memory as the key values in the index are compressed and decompressed, so this may not be a solution for all indexes.
By default, an index will be not be compressed. You have to explicitly call for the index to be compressed when you create the index. There are two types of compression: row- and page-level compression. Nonleaf pages in an index receive no compression under the page type. To see it in action, consider the following index (don't run this code since the index already exists):
CREATE NONCLUSTERED INDEX [IX_Test] ON [Person].[Address] ([City] ASC, [PostalCode] ASC)
This index was created earlier in the chapter. If you were to re-create it as defined here
CREATE NONCLUSTERED INDEX IX_Comp_Test ON Person.Address ([City],[PostalCode]) WITH ( DATA_COMPRESSION = ROW ) ;
this creates a row type of compression on an index with the same two columns as the first test index IX_Test
Create one more index:
CREATE NONCLUSTERED INDEX IX_Comp_Page_Test ON Person.Address ([City],[PostalCode]) WITH ( DATA_COMPRESSION = PAGE ) ;
To examine the indexes being stored, modify the original query against sys.dm_db_index_physical_stats
to add another column, compressed_page_count:
SELECT i.Name ,i.type_desc ,s.page_count ,s.record_count ,s.index_level ,compressed_page_count FROM sys.indexes i JOIN sys.dm_db_index_physical_stats(DB_ID(N'AdventureWorks2008'), OBJECT_ID(N'Person.Address'), NULL, NULL, 'DETAILED') AS s ON i.index_id = s.index_id WHERE i.OBJECT_ID = OBJECT_ID(N'Person.Address')
Running the query, you get the results in Figure 4-38.
For this index, you can see that the page compression was able move the index from 99 pages down to 27, of which 26 were compressed. The row type compression in this instance didn't make much of a difference; in fact, the overhead associated with compression increased the number of pages it took to define the index.
To see the compression in action, run the following query:
SELECT a.City,a.PostalCode FROM Person.Address AS a WHERE a.City = 'Newton' AND a.PostalCode = 'V2M1N7'
The optimizer chose, on my system, to use the IX_Comp_Page_Test
index. Even if I forced it to use the IX_Test
index, thusly,
SELECT a.City,a.PostalCode FROM Person.Address AS a WITH(INDEX = IX_Test) WHERE a.City = 'Newton' AND a.PostalCode = 'V2M1N7'
the performance was identical. So although one index is taking up radically less room on approximately one quarter as many pages, it's done at no cost in performance.
Compression has a series of impacts on other processes within SQL Server, so further understanding of the possible impacts as well as the possible benefits should be explored thoroughly prior to implementation.
Clean up the indexes after you finish testing:
DROP INDEX Person.Address.IX_Test; DROP INDEX Person.Address.IX_Comp_Test; DROP INDEX Person.Address IX_Comp_Page_Test;
As special data types and storage mechanisms are introduced to SQL Server by Microsoft, methods for indexing these special storage types are also developed. Explaining all the details possible for each of these special index types is outside the scope of the book. In the following sections, I introduce the basic concepts of each index type in order to facilitate the possibility of their use in tuning your queries.
You can store large amounts of text in SQL Server by using the MAX
value in the VARCHAR
, NVARCHAR
, CHAR
, and NCHAR
fields. A normal clustered or nonclustered index against these large fields would be unsupportable because a single value can far exceed the page size within an index. So, a different mechanism of indexing text is to use the Full-Text Engine, which must be running to work with full-text indexes. You can also build a full-text index on VARBINARY
data.
You need to have one column on the table that is unique. The best candidates for performance are integers: INT
or BIGINT
This column is then used along with the word to identify which row within the table it belongs to, as well as its location within the field. SQL Server allows for incremental changes, either change tracking or time based, to the full-text indexes as well as complete rebuilds.
For more details, check out Pro Full-Text Search in SQL Server 2008 by Hilary Cotter and Michael Coles (Apress, 2008).
Introduced in SQL Server 2008 is the ability to store spatial data. This data can be either a geometry type or the very complex geographical type, literally identifying a point on the earth. To say the least, indexing this type of data is complicated. SQL Server stores these indexes in a flat B-tree, similar to regular indexes, except that it is also a hierarchy of four grids linked together. Each of the grids can be given a density of low, medium, or high, outlining how big each grid is. From there it begins to get truly complicated. Suffice to say, there are mechanisms to support indexing of the spatial data types so that different types of queries, such as finding when one object is within the boundaries or near another object, can benefit from performance increases inherent in indexing.
A spatial index can be created only against a column of type geometry or geography. It has to be on a base table, it must have no indexed views, and the table must have a primary key. You can create up to 249 spatial indexes on any given column on a table. Different indexes are used to define different types of index behavior. More information is available in the book Beginning Spatial with SQL Server 2008 by Alastair Aitchison (Apress, 2009).
Introduced as a data type in SQL Server 2005, XML can be stored not as text but as well-formed XML data within SQL Server. This data can be queries using the XQuery language as supported by SQL Server. To enhance the performance capabilities, a special set of indexes has been defined. An XML column can have one primary and several secondary indexes. The primary XML shreds the properties, attributes, and elements of the XML data and stores it as an internal table. There must be a primary key on the table, and that primary key must be clustered in order to create an XML index. After the XML index is created, the secondary indexes can be created. These indexes have types Path
, Value
, and Property
depending on how you query the XML. For more details, check out Pro SQL Server 2008 XML by Michael Coles (Apress, 2008).
Other index properties can affect performance, positively and negatively. A few of these behaviors are explored here.
SQL Server supports creating a composite index with a different sort order for the different columns of the index. Suppose you want an index with the first column sorted in ascending order and the second column sorted in descending order. You could achieve this as follows:
CREATE NONCLUSTERED INDEX i1 ON t1(c1 ASC, c2 DESC)
You can create an index on a computed column, as long as the expression defined for the computed column meets certain restrictions, such as that it references columns only from the table containing the computed column and it is deterministic.
SQL Server allows you to create an index on columns with the BIT
data type. The ability to create an index on a BIT
data type column by itself is not a big advantage since such a column can have only two unique values. As mentioned previously, columns with such low selectivity (number of unique values) are not usually good candidates for indexing. However, this feature comes into its own when you consider covering indexes. Because covering indexes require including all the columns in the index, the ability to add the BIT
data type column to an index allows covering indexes to include such a column, if required.
The CREATE INDEX
operation is integrated into the query processor. The optimizer can use existing index(es) to reduce scan cost and sort while creating an index.
Take, for example, the Person.Address
table. A nonclustered index exists on a number of columns: AddressLine1
, AddressLine2
, City
, StateProvinceId
, and PostalCode
If you needed to run queries against the City
column with the existing index, you'll get a scan of that index. Now create a new index like this:
CREATE INDEX IX_Test ON Person.Address(City)
You can see in Figure 4-39 that, instead of scanning the table, the optimizer chose to scan the index in order to create the new index because the column needed for the new index was contained within the other nonclustered index.
SQL Server supports parallel plans for a CREATE INDEX
statement, as supported in other SQL queries. On a multiprocessor machine, index creation won't be restricted to a single processor but will benefit from the multiple processors. You can control the number of processors to be used in a CREATE INDEX
statement with the max degree of parallelism
configuration parameter of SQL Server. The default value for this parameter is 0, as you can see by executing the sp_configure
stored procedure:
EXEC sp_configure 'max degree of parallelism'
The default value of 0 means that SQL Server can use all the available CPUs in the system for the parallel execution of a T-SQL statement. On a system with four processors, the maximum degree of parallelism can be set to 2 by executing sp_configure:
EXEC sp_configure 'max degree of parallelism', 2 RECONFIGURE WITH OVERRIDE
This allows SQL Server to use up to two CPUs for the parallel execution of a T-SQL statement. This configuration setting takes effect immediately, without a server restart.
The query hint MAXDOP
can be used for the CREATE INDEX
statement. Also, be aware that the parallel CREATE INDEX
feature is available only in SQL Server 2005 and 2008 Enterprise Editions.
The default creation of an index is done as an offline operation. This means that exclusive locks are placed on the table, restricting user access while the index is created. It is possible to create the indexes as an online operation. This allows users to continue to access the data while the index is being created. This comes at the cost of increasing the amount of time and resources it takes to create the index. Online index operations are available only in SQL Server 2005 and 2008 Enterprise Editions.
A simple approach to indexing is to use the Database Engine Tuning Advisor tool provided by SQL Server. This tool is a usage-based tool that looks at a particular workload and works with the query optimizer to determine the costs associated with various index combinations. Based on the tool's analysis, you can add or drop indexes as appropriate.
I will cover the Database Engine Tuning Advisor tool in more depth in Chapter 5.
In this chapter, you learned that indexing is an effective method for reducing the number of logical reads and disk I/O for a query. Although an index may add overhead to action queries, even action queries such as UPDATE
and DELETE
can benefit from an index.
To decide the index key columns for a particular query, evaluate the WHERE
clause and the join criteria of the query. Factors such as column selectivity, width, data type, and column order are important in deciding the columns in an index key. Since an index is mainly useful in retrieving a small number of rows, the selectivity of an indexed column should be very high. It is important to note that nonclustered indexes contain the value of a clustered index key as their row locator, because this behavior greatly influences the selection of an index type.
For better performance, try to cover a query fully using a covering index. Since SQL Server can benefit from multiple indexes, use the index intersection and index join techniques, and consider having multiple narrow indexes instead of one very wide index. When working with special data types, apply the indexes that work with those special data types in order to help performance.
In the next chapter, you will learn more about the Database Engine Tuning Advisor, the SQL Server-provided tool that can help you determine the correct indexes in a database for a given SQL workload.
3.15.168.73