Chapter 7. Statistics Analysis

By now, you should have a good understanding of the importance of indexes. It is equally important for the optimizer to have the necessary statistics on the data distribution so that it can choose indexes effectively. In SQL Server, this information is maintained in the form of statistics on the index key.

In this chapter, you'll learn the importance of statistics in query optimization. Specifically, I cover the following topics:

  • The role of statistics in query optimization

  • The importance of statistics on columns with indexes

  • The importance of statistics on nonindexed columns used in join and filter criteria

  • Analysis of single-column and multicolumn statistics, including the computation of selectivity of a column for indexing

  • Statistics maintenance

  • Effective evaluation of statistics used in a query execution

The Role of Statistics in Query Optimization

SQL Server's query optimizer is a cost-based optimizer; it decides on the best data access mechanism and join strategy by identifying the selectivity, how unique the data is, and which columns are used in filtering the data (meaning via the WHERE or JOIN clause). Statistics exist with an index, but they also exist on columns without an index that are used as part of a predicate. As you learned in Chapter 4, you should use a nonclustered index to retrieve a small result set, whereas for data in related sets, a clustered index works better. With a large result set, going to the clustered index or table directly is usually more beneficial.

Up-to-date information on data distribution in the columns referenced as predicates helps the optimizer determine the query strategy to use. In SQL Server, this information is maintained in the form of statistics, which are essential for the cost-based optimizer to create an effective query execution plan. Through the statistics, the optimizer can make reasonably accurate estimates about how long it will take to return a result set or an intermediate result set and therefore determine the most effective operations to use. As long as you ensure that the default statistical settings for the database are set, the optimizer will be able to do its best to determine effective processing strategies dynamically. Also, as a safety measure while troubleshooting performance, you should ensure that the automatic statistics maintenance routine is doing its job as desired. Where necessary, you may even have to take manual control over the creation and/or maintenance of statistics. (I cover this in the "Manual Maintenance" section, and I cover the precise nature of the functions and shape of statistics in the "Analyzing Statistics" section.) In the following section, I show you why statistics are important to indexed columns and nonindexed columns functioning as predicates.

Statistics on an Indexed Column

The usefulness of an index is fully dependent on the statistics of the indexed columns; without statistics, SQL Server's cost-based query optimizer can't decide upon the most effective way of using an index. To meet this requirement, SQL Server automatically creates the statistics of an index key whenever the index is created. It isn't possible to turn this feature off.

As data changes, the data-retrieval mechanism required to keep the cost of a query low may also change. For example, if a table has only one matching row for a certain column value, then it makes sense to retrieve the matching rows from the table by going through the nonclustered index on the column. But if the data in the table changes so that a large number of rows are added with the same column value, then using the nonclustered index no longer makes sense. To be able to have SQL Server decide this change in processing strategy as the data changes over time, it is vital to have up-to-date statistics.

SQL Server can keep the statistics on an index updated as the contents of the indexed column are modified. By default, this feature is turned on and is configurable through the Properties

Statistics on an Indexed Column
  • When a table with no rows gets a row

  • When a table has fewer than 500 rows and is increased by 500 or more rows

  • When a table has more than 500 rows and is increased by 500 rows + 20 percent of the number of rows

This built-in intelligence keeps the CPU utilization by each process very low. It's also possible to update the statistics asynchronously. This means when a query would normally cause statistics to be updated, instead that query proceeds with the old statistics, and the statistics are updated offline. This can speed up the response time of some queries, such as when the database is large or when you have a short timeout period.

You can manually disable (or enable) the auto update statistics and the auto update statistics asynchronously features by using the ALTER DATABASE command. By default, the auto update statistics feature is enabled, and it is strongly recommended that you keep it enabled. The auto update statistics asynchronously feature is disabled by default. Turn this feature on only if you've determined it will help with timeouts on your database.

Note

I explain ALTER DATABASE later in this chapter in the "Manual Maintenance" section.

Benefits of Updated Statistics

The benefits of performing an auto update usually outweigh its cost on the system resources.

To more directly control the behavior of the data, instead of using the tables in AdventureWorks, for this set of examples you will create one manually. Specifically, create a test table (create_t1.sql in the download) with only three rows and a nonclustered index:

IF (SELECT  OBJECT_ID('t1')
   ) IS NOT NULL
    DROP TABLE dbo.t1 ;
GO
CREATE TABLE dbo.t1 (c1 INT, c2 INT IDENTITY) ;
SELECT TOP 1500
        IDENTITY( INT,1,1 ) AS n
INTO    #Nums
FROM    Master.dbo.SysColumns sc1
       ,Master.dbo.SysColumns sc2;

INSERT  INTO dbo.t1 (c1)
        SELECT  n
        FROM    #Nums

DROP TABLE #Nums
CREATE NONCLUSTERED INDEX i1 ON dbo.t1 (c1) ;

If you execute a SELECT statement with a very selective filter criterion on the indexed column to retrieve only one row, as shown in the following line of code, then the optimizer uses a nonclustered index seek, as shown in the execution plan in Figure 7-1:

SELECT * FROM t1 WHERE c1 = 2 --Retrieve 1 row
Execution plan for a very small result set

Figure 7.1. Execution plan for a very small result set

To understand the effect of small data modifications on a statistics update, create a trace using Profiler. In the trace, add the event Auto Stats, which captures statistics update and create events, and add SQL:BatchCompleted with a filter on the TextData column. The filter should look like Not Like SET% when you are done. Add only one row to the table:

INSERT INTO t1 (c1) VALUES(2)

When you reexecute the preceding SELECT statement, you get the same execution plan as shown in Figure 7-1. Figure 7-2 shows the trace events generated by the query.

Trace output on the addition of a small number of rows

Figure 7.2. Trace output on the addition of a small number of rows

The trace output doesn't contain any SQL activity representing a statistics update because the number of changes fell below the threshold where any table that has more than 500 rows must have an increase of 500 rows plus 20 percent of the number of rows.

To understand the effect of large data modification on statistics update, add 1,500 rows to the table (add_rows.sql in the download):

SELECT TOP 1500
        IDENTITY( INT,1,1 ) AS n
INTO    #Nums
FROM    Master.dbo.SysColumns sc1
       ,Master.dbo.SysColumns sc2;
    INSERT INTO dbo.t1 (
        c1
    ) SELECT 2
    FROM #Nums;
DROP TABLE #Nums;

Now, if you reexecute the SELECT statement, like so:

SELECT * FROM dbo.t1 WHERE c1 = 2;

a large result set (1,502 rows out of 3,001 rows) will be retrieved. Since a large result set is requested, scanning the base table directly is preferable to going through the nonclustered index to the base table 1,502 times. Accessing the base table directly will prevent the overhead cost of bookmark lookups associated with the nonclustered index. This is represented in the resultant execution plan (see Figure 7-3).

Execution plan for a large result set

Figure 7.3. Execution plan for a large result set

Figure 7-4 shows the resultant Profiler trace output.

Trace output on the addition of a large number of rows

Figure 7.4. Trace output on the addition of a large number of rows

The Profiler trace output includes an Auto Stats event since the threshold was exceeded by the large-scale update this time. These SQL activities consume some extra CPU cycles. However, by doing this, the optimizer determines a better data-processing strategy and keeps the overall cost of the query low.

Drawbacks of Outdated Statistics

As explained in the preceding section, the auto update statistics feature allows the optimizer to decide on an efficient processing strategy for a query as the data changes. If the statistics become outdated, however, then the processing strategies decided on by the optimizer may not be applicable for the current data set and thereby will degrade performance.

To understand the detrimental effect of having outdated statistics, follow these steps:

  1. Re-create the preceding test table with 1,500 rows only and the corresponding nonclustered index.

  2. Prevent SQL Server from updating statistics automatically as the data changes. To do so, disable the auto update statistics feature by executing the following SQL statement:

    ALTER DATABASE AdventureWorks2008 SET AUTO_UPDATE_STATISTICS OFF
  3. Add 1,500 rows to the table as before.

Now, reexecute the SELECT statement to understand the effect of the outdated statistics on the query optimizer. The query is repeated here for clarity:

SELECT * FROM dbo.t1 WHERE c1 = 2

Figure 7-5 and Figure 7-6 show the resultant execution plan and the Profiler trace output for this query, respectively.

Execution plan with AUTO_UPDATE_STATISTICS OFF

Figure 7.5. Execution plan with AUTO_UPDATE_STATISTICS OFF

Trace output with AUTO_UPDATE_STATISTICS OFF

Figure 7.6. Trace output with AUTO_UPDATE_STATISTICS OFF

With the auto update statistics feature switched off, the query optimizer has selected a different execution plan from the one it selected with this feature on. Based on the outdated statistics, which have only one row for the filter criterion (c1 = 2), the optimizer decided to use a nonclustered index seek. The optimizer couldn't make its decision based on the current data distribution in the column. For performance reasons, it would have been better to hit the base table directly instead of going through the nonclustered index, since a large result set (1,501 rows out of 3,000 rows) is requested.

You can see that turning off the auto update statistics feature has a negative effect on performance by comparing the cost of this query with and without updated statistics. Table 7-1 shows the difference in the cost of this query.

Table 7.1. Cost of the Query with and Without Updated Statistics

Statistics Update Status

Figure

Cost (SQL:Batch Completed Event)

 
  

CPU (ms)

Number of Reads

Updated

Figure 7-4

0

34

Not updated

Figure 7-6

15

1509

The number of logical reads and the CPU utilization is significantly higher when the statistics are out-of-date even though the data returned is nearly identical and the query was precisely the same. Therefore, it is recommended that you keep the auto update statistics feature on. The benefits of keeping statistics updated outweigh the costs of performing the update. Before you leave this section, turn AUTO_UPDATE_STATISTICS back on (although you can also use sp_autostats):

ALTER DATABASE AdventureWorks2008 SET AUTO_UPDATE_STATISTICS ON

Statistics on a Nonindexed Column

Sometimes you may have columns in join or filter criteria without any index. Even for such nonindexed columns, the query optimizer is more likely to make the best choice if it knows the data distribution (or statistics) of those columns.

In addition to statistics on indexes, SQL Server can build statistics on columns with no indexes. The information on data distribution, or the likelihood of a particular value occurring in a nonindexed column, can help the query optimizer determine an optimal processing strategy. This benefits the query optimizer even if it can't use an index to actually locate the values. SQL Server automatically builds statistics on nonindexed columns if it deems this information valuable in creating a better plan, usually when the columns are used in a predicate. By default, this feature is turned on, and it's configurable through the Properties

Statistics on a Nonindexed Column

In general, you should not disable the automatic creation of statistics on nonindexed columns. One of the scenarios in which you may consider disabling this feature is while executing a series of ad hoc SQL activities that you will not execute again. In such a case, you must decide whether you want to pay the cost of automatic statistics creation to get a better plan in this one case and affect the performance of other SQL Server activities. It is worthwhile noting that SQL Server eventually removes statistics when it realizes that they have not been used for a while. So, in general, you should keep this feature on and not be concerned about it.

Benefits of Statistics on a Nonindexed Column

To understand the benefit of having statistics on a column with no index, create two test tables with disproportionate data distributions, as shown in the following code (create_t1_t2.sql in the download). Both tables contain 10,001 rows. Table t1 contains only one row for a value of the second column (t1_c2) equal to 1, and the remaining 10,000 rows contain this column value as 2. Table t2 contains exactly the opposite data distribution.

--Create first table with 10001 rows
IF(SELECT OBJECT_ID('dbo.t1')) IS NOT NULL
  DROP TABLE dbo.t1;
GO
CREATE TABLE dbo.t1(t1_c1 INT IDENTITY, t1_c2 INT);
INSERT INTO dbo.t1 (t1_c2) VALUES (1);
SELECT TOP 119026
        IDENTITY( INT,1,1 ) AS n
INTO    #Nums
FROM    Master.dbo.SysColumns sc1
       ,Master.dbo.SysColumns sc2;
    INSERT INTO dbo.t1 (
        t1_c2
    ) SELECT 2
    FROM #Nums
GO
CREATE CLUSTERED INDEX i1 ON dbo.t1(t1_c1)

--Create second table with 10001 rows,
--  but opposite data distribution
IF(SELECT OBJECT_ID('dbo.t2')) IS NOT NULL
  DROP TABLE dbo.t2;
GO
CREATE TABLE dbo.t2(t2_c1 INT IDENTITY, t2_c2 INT);
INSERT INTO dbo.t2(t2_c2) VALUES (2);
    INSERT INTO dbo.t2 (
        t2_c2
    ) SELECT 1
    FROM #Nums;
DROP TABLE #Nums;
GO
CREATE CLUSTERED INDEX i1 ON dbo.t2(t2_c1);

Table 7-2 illustrates how the tables will look.

Table 7.2. Sample Tables

 

Table t1

Table t2

Column

t1_c1

t1_c2

t2_c1

t2_c2

Row1

1

1

1

2

Row2

2

2

2

1

RowN

N

2

N

1

Row10001

10001

2

10001

1

To understand the importance of statistics on a nonindexed column, use the default setting for the auto create statistics feature. By default, this feature is on. You can verify this using the DATABASEPROPERTYEX function (although you can also query the sys.databases view):

SELECT DATABASEPROPERTYEX('AdventureWorks2008', 'IsAutoCreateStatistics')

Note

You can find a detailed description of configuring the auto create statistics feature later in this chapter.

Use the following SELECT statement (nonindexed_select.sql in the download) to access a large result set from table t1 and a small result set from table t2. Table t1 has 10,000 rows for the column value of t1_c2 = 2, and table t2 has 1 row for t2_c2 = 2. Note that these columns used in the join and filter criteria have no index on either table.

SELECT  t1.t1_c2
       ,t2.t2_c2
FROM   dbo. t1
        JOIN dbo.t2
        ON t1.t1_c2 = t2.t2_c2
WHERE   t1.t1_c2 = 2;

Figure 7-7 shows the execution plan for this query.

Execution plan with AUTO_CREATE_STATISTICS ON

Figure 7.7. Execution plan with AUTO_CREATE_STATISTICS ON

Figure 7-8 shows the Profiler trace output with all completed events and the Auto Stats event for this query. You can use this to evaluate some of the added costs for a given query.

Tip

To keep the Profiler trace output compact, I used a prefilter on TextData Not like SET%. This filters the SET statements used to show the graphical execution plan.

The Profiler trace output shown in Figure 7-8 includes two Auto Stats events creating statistics on the nonindexed columns referred to in the JOIN and WHERE clauses, t2_c2 and t1_c2. This activity consumes a few extra CPU cycles (since none could be detected), but by consuming these extra CPU cycles, the optimizer decides upon a better processing strategy for keeping the overall cost of the query low.

Trace output with AUTO_CREATE_STATISTICS ON

Figure 7.8. Trace output with AUTO_CREATE_STATISTICS ON

To verify the statistics automatically created by SQL Server on the nonindexed columns of each table, run this SELECT statement against the sys.stats table:

SELECT  *
FROM    sys.stats
WHERE   object_id = OBJECT_ID('t1')

Figure 7-9 shows the automatic statistics created for table t1.

Automatic statistics for table t1

Figure 7.9. Automatic statistics for table t1

To verify how a different result set size from the two tables influences the decision of the query optimizer, modify the filter criteria of nonindexed_select.sql to access an opposite result set size from the two tables (small from t1 and large from t2). Instead of filtering on t1.t1_c2 = 2, change it to filter on 1:

SELECT  t1.t1_c2
       ,t2.t2_c2
FROM  dbo. t1
        JOIN dbo.t2
        ON t1.t1_c2 = t2.t2_c2
WHERE   t1.t1_c2 = 1

Figure 7-10 shows the resultant execution plan, and Figure 7-11 shows the Profiler trace output of this query.

Execution plan for a different result set

Figure 7.10. Execution plan for a different result set

Trace output for a different result set

Figure 7.11. Trace output for a different result set

The resultant Profiler trace output doesn't perform any additional SQL activities to manage statistics. The statistics on the nonindexed columns (t1.t1_c2 and t2.t2_c2) had already been created during the previous execution of the query.

For effective cost optimization, in each case the query optimizer selected different processing strategies, depending upon the statistics on the nonindexed columns (t1.t1_c2 and t2.t2_c2). You can see this from the last two execution plans. In the first, table t1 is the outer table for the nested loop join, whereas in the latest one, table t2 is the outer table. By having statistics on the nonindexed columns (t1.t1_c2 and t2.t2_c2), the query optimizer can create a cost-effective plan suitable for each case.

An even better solution would be to have an index on the column. This would not only create the statistics on the column but also allow fast data retrieval through an Index Seek operation, while retrieving a small result set. However, in the case of a database application with queries referring to nonindexed columns in the WHERE clause, keeping the auto create statistics feature on still allows the optimizer to determine the best processing strategy for the existing data distribution in the column.

If you need to know which column or columns might be covered by a given statistic, you need to look into the sys.stats_columns system table. You can query it in the same way as you did the sys.stats table:

SELECT  *
FROM    sys.stats_columns
WHERE   object_id = OBJECT_ID('t1')

This will show the column or columns being referenced by the automatically created statistics. This will help you if you decide you need to create an index to replace the statistics, because you will need to know which columns to create the statistics on.

Drawback of Missing Statistics on a Nonindexed Column

To understand the detrimental effect of not having statistics on nonindexed columns, drop the statistics automatically created by SQL Server and prevent SQL Server from automatically creating statistics on columns with no index by following these steps:

  1. Drop the automatic statistics created on column t1.t1_c2 through the Manage Statistics dialog box as shown in the section "Benefits of Statistics on a Nonindexed Column," or use the following SQL command:

    DROP STATISTICS [t1].StatisticsName
  2. Similarly, drop the corresponding statistics on column t2.t2_c2.

  3. Disable the auto create statistics feature by deselecting the Auto Create Statistics check box for the corresponding database or by executing the following SQL command:

    ALTER DATABASE AdventureWorks2008 SET AUTO_CREATE_STATISTICS OFF

Now reexecute the SELECT statement nonindexed_select.sql:

SELECT  t1.t1_c2
       ,t2.t2_c2
FROM   dbo. t1
        JOIN dbo.t2
        ON t1.t1_c2 = t2.t2_c2
WHERE   t1.t1_c2 = 2;

Figure 7-12 and Figure 7-13 show the resultant execution plan and Profiler trace output, respectively.

Execution plan with AUTO_CREATE_STATISTICS OFF

Figure 7.12. Execution plan with AUTO_CREATE_STATISTICS OFF

Trace output with AUTO_CREATE_STATISTICS OFF

Figure 7.13. Trace output with AUTO_CREATE_STATISTICS OFF

With the auto create statistics feature off, the query optimizer selected a different execution plan compared to the one it selected with the auto create statistics feature on. On not finding statistics on the relevant columns, the optimizer chose the first table (t1) in the FROM clause as the outer table of the nested loop join operation. The optimizer couldn't make its decision based on the actual data distribution in the column. Not only that, but the optimizer and the query engine determined that this query passed the threshold for parallelism, making this a parallel execution (those are the little arrows on the operators, marking them as parallel (more on parallel execution plans in Chapter 9). For example, if you modify the query to reference table t2 as the first table in the FROM clause:

SELECT  t1.t1_c2
       ,t2.t2_c2
FROM   dbo. t2
        JOIN dbo.t1
        ON t1.t1_c2 = t2.t2_c2
WHERE   t1.t1_c2 = 2;

then the optimizer selects table t2 as the outer table of the nested loop join operation. Figure 7-14 shows the execution plan.

Execution plan with AUTO_CREATE_STATISTICS OFF (a variation)

Figure 7.14. Execution plan with AUTO_CREATE_STATISTICS OFF (a variation)

You can see that turning off the auto create statistics feature has a negative effect on performance by comparing the cost of this query with and without statistics on a nonindexed column. Table 7-3 shows the difference in the cost of this query.

Table 7.3. Cost Comparison of a Query with and Without Statistics on a Nonindexed Column

 

Statistics on Nonindexed Column

Figure

Cost (SQL:Batch Completed Event)

  

Duration (ms)

Number of Reads

With statistics

Figure 7-11

159

48

Without statistics

Figure 7-13

303

20,313

The number of logical reads and the CPU utilization are very high with no statistics on the nonindexed columns. Without these statistics, the optimizer can't create a cost-effective plan.

A query execution plan highlights the missing statistics by placing an exclamation point on the operator that would have used the statistics. You can see this in the clustered index scan operators in the previous execution plan (Figure 7-14), as well as in the detailed description in the Warnings section for a node in a graphical execution plan, as shown in Figure 7-15 for table t2.

Missing statistics indication in a graphical plan

Figure 7.15. Missing statistics indication in a graphical plan

The XML execution plan provides the missing statistics information under the Warnings column, as shown in Figure 7-16. Remember that you can obtain the XML execution plan by right-clicking within the graphical execution plan and selecting Show Execution Plan XML or by using SET STATISTICS XML in the query window.

Missing statistics indication in an XML plan

Figure 7.16. Missing statistics indication in an XML plan

Note

In a database application, there is always the possibility of queries using columns with no indexes. Therefore, for performance reasons, leaving the auto create statistics feature of SQL Server databases on is recommended.

Analyzing Statistics

Statistics are collections of information stored as histograms. A histogram is a statistical construct that shows how often data falls into varying categories. The histogram stored by SQL Server consists of a sampling of data distribution for a column or an index key (or the first column of a multicolumn index key) of up to 200 rows. The information on the range of index key values between two consecutive samples is called a step. These steps consist of varying size intervals between the 200 values stored. A step provides the following information:

  • The top value of a given step (RANGE_HI_KEY).

  • The number of rows equal to RANGE_HI_KEY (EQ_ROWS).

  • The range of rows between the previous top value and the current top value, without counting either of these samples (RANGE_ROWS).

  • The number of distinct rows in the range (DISTINCT_RANGE_ROWS). If all values in the range are unique, then RANGE_ROWS equals DISTINCT_RANGE_ROWS.

  • The average number of rows equal to a key value within a range (AVG_RANGE_ROWS).

The value of EQ_ROWS for an index key value (RANGE_HI_KEY) helps the optimizer decide how (and whether) to use the index when the indexed column is referred to in a WHERE clause. Because the optimizer can perform a SEEK or SCAN operation to retrieve rows from a table, the optimizer can decide which operation to perform based on the number of matching rows (EQ_ROWS) for the index key value.

To understand how the optimizer's data-retrieval strategy depends on the number of matching rows, create a test table (create_t3.sql in the download) with different data distributions on an indexed column:

IF (SELECT  OBJECT_ID('dbo.t1')
   ) IS NOT NULL
    DROP TABLE dbo.t1 ;
GO
CREATE TABLE dbo.t1 (c1 INT, c2 INT IDENTITY) ;
INSERT  INTO dbo.t1 (c1)
VALUES  (1) ;
SELECT TOP 119026
        IDENTITY( INT,1,1 ) AS n
INTO    #Nums ;
INSERT  INTO dbo.t1 (c1)
        SELECT  2
        FROM    #Nums ;
CREATE NONCLUSTERED INDEX i1 ON dbo.t1 (c1) ;

When the preceding nonclustered index is created, SQL Server automatically creates statistics on the index key. You can obtain statistics for this nonclustered index key (i1) by executing the DBCC SHOW_STATISTICS command:

DBCC SHOW_STATISTICS(t1, i1)

Figure 7-17 shows the statistics output.

Statistics on index i1

Figure 7.17. Statistics on index i1

Now, to understand how effectively the optimizer decides upon different data-retrieval strategies based on statistics, execute the following two queries requesting different numbers of rows:

SELECT * FROM dbo.t1 WHERE c1 = 1 --Retrieve 1 row;
SELECT * FROM dbo.t1 WHERE c1 = 2 --Retrieve 10000 rows;

Figure 7-18 shows execution plans of these queries.

Execution plans of small and large result set queries

Figure 7.18. Execution plans of small and large result set queries

From the statistics, the optimizer can find the number of rows affected by the preceding two queries. Understanding that there is only one row to be retrieved for the first query, the optimizer chose an Index Seek operation, followed by the necessary RID Lookup to retrieve the data not stored with the clustered index. For the second query, the optimizer knows that a large number of rows (10,000 rows) will be affected and therefore avoided the index to attempt to improve performance. (Chapter 4 explains indexing strategies in detail.)

Besides the information on steps, other useful information in the statistics includes the following:

  • The time statistics were last updated

  • The number of rows in the table

  • The average index key length

  • The number of rows sampled for the histogram

  • Densities for combinations of columns

Information on the time of the last update can help you decide whether you should manually update the statistics. The average key length represents the average size of the data in the index key column(s). It helps you understand the width of the index key, which is an important measure in determining the effectiveness of the index. As explained in Chapter 4, a wide index is usually costly to maintain and requires more disk space and memory pages.

Density

When creating an execution plan, the query optimizer analyzes the statistics of the columns used in the filter and JOIN clauses. A filter criterion with high selectivity limits the number of rows from a table to a small result set and helps the optimizer keep the query cost low. A column with a unique index will have a very high selectivity, since it can limit the number of matching rows to one.

On the other hand, a filter criterion with low selectivity will return a large result set from the table. A filter criterion with very low selectivity makes a nonclustered index on the column ineffective. Navigating through a nonclustered index to the base table for a large result set is usually costlier than scanning the base table (or clustered index) directly because of the cost overhead of bookmark lookups associated with the nonclustered index. You can observe this behavior in the execution plan in Figure 7-18.

Statistics track the selectivity of a column in the form of a density ratio. A column with high selectivity (or uniqueness) will have low density. A column with low density (that is, high selectivity) is suitable for a nonclustered index, because it helps the optimizer retrieve a small number of rows very fast. This is also the principal on which filtered indexes operate since the filter's goal is to increase the selectivity, or density, of the index.

Density can be expressed as follows:

Density = 1 / Number of distinct values for a column

Density will always come out as a number somewhere between 0 and 1. The lower the column density, the more suitable it is for use in a nonclustered index. You can perform your own calculations to determine the density of columns within your own indexes and statistics. For example, to calculate the density of column c1 from the test table built by create_t3.sql, use the following (results in Figure 7-19):

SELECT 1.0/COUNT(DISTINCT c1) FROM t1
Results of density calculation for column c1

Figure 7.19. Results of density calculation for column c1

You can see this as actual data in the All density column in the output from DBCC SHOW_STATISTICS. This high-density value for the column makes it a less suitable candidate for an index, even a filtered index. However, the statistics of the index key values maintained in the steps help the query optimizer use the index for the predicate c1 = 1, as shown in the previous execution plan.

Statistics on a Multicolumn Index

In the case of an index with one column, statistics consist of a histogram and a density value for that column. Statistics for a composite index with multiple columns consist of one histogram for the first column only and multiple density values. This is one reason why it's wise to put the more selective column, the one with the lowest density, first when building a compound index or compound statistics. The density values include the density for the first column and for each prefix combination of the index key columns. Multiple density values help the optimizer find the selectivity of the composite index when multiple columns are referred to by predicates in the WHERE and JOIN clauses. Although the first column can help determine the histogram, the final density of the column itself would be the same regardless of column order.

To better understand the density values maintained for a multicolumn index, you can modify the nonclustered index used earlier to include two columns:

CREATE NONCLUSTERED INDEX i1 ON dbo.t1(c1,c2) WITH DROP_EXISTING

Figure 7-20 shows the resultant statistics provided by DBCC SHOW_STATISTICS.

Statistics on the multicolumn index i1

Figure 7.20. Statistics on the multicolumn index i1

As you can see, there are two density values under the All density column:

  • The density of the first column

  • The density of the (first + second) columns

For a multicolumn index with three columns, the statistics for the index would also contain the density value of the (first + second + third) columns. The statistics won't contain a density value for any other combination of columns. Therefore, this index (i1) won't be very useful for filtering rows only on the second column (c2), because the density value of the second column (c2) alone isn't maintained in the statistics.

You can compute the second density value (0.190269999) shown in Figure 7-19 through the following steps. This is the number of distinct values for a column combination of (c1, c2):

SELECT 1.0/COUNT(*)
    FROM (SELECT DISTINCT c1, c2 FROM dbo.t1) DistinctRows

Statistics on a Filtered Index

The purpose of a filtered index is to change the data that makes up the index and therefore change the density and histogram to make the index more performant. Instead of a test table, this example will use the AdventureWorks2008 database. Create an index on the Sales. PurchaseOrderHeader table on the PurchaseOrderNumber column:

CREATE INDEX IX_Test ON Sales.SalesOrderHeader (PurchaseOrderNumber)

Figure 7-21 shows the header and the density of the output from DBCC SHOW_STATISTICS run against this new index:

DBCC SHOW_STATISTICS('Sales.SalesOrderHeader',IX_Test)
Statistics header of an unfiltered index

Figure 7.21. Statistics header of an unfiltered index

If the same index is re-created to deal with values of the column that are not null, it would look something like this:

CREATE INDEX IX_Test ON Sales.SalesOrderHeader (PurchaseOrderNumber)
WHERE PurchaseOrderNumber IS NOT NULL
WITH DROP_EXISTING

And now, in Figure 7-22, take a look at the statistics information.

Statistics header for a filtered index

Figure 7.22. Statistics header for a filtered index

First you can see that the number of rows that compose the statistics have radically dropped in the filtered index because there is a filter in place. Notice also that the average key length has increased since you're no longer dealing with zero-length strings. A filter expression has been defined rather than the NULL value visible in Figure 7-21. But the unfiltered rows of both sets of data are the same.

The density measurements are very interesting. Notice that the density is close to the same for both values, but the filtered density is slightly lower, meaning fewer unique values. This is because the filtered data, while marginally less selective, is actually more accurate, eliminating all the empty values that won't contribute to a search. And the density of the second value, which represents the clustered index pointer, is identical with the value of the density of the PurchaseOrderNumber alone because each represents the same amount of unique data. The density of the additional clustered index in the previous column is a much smaller number because of all the unique values of SalesOrderId that are not included in the filtered data because of the elimination of the null values.

One other option open to you is to create filtered statistics. This allows you to create even more fine-tuned histograms on partitioned tables. This is necessary because statistics are not automatically created on partitioned tables and you can't create your own using CREATE STATISTICS. You can create filtered indexes by partition and get statistics or create filtered statistics specifically by partition.

Before going on, clean the indexes created, if any:

DROP INDEX Sales.SalesOrderHeader.IX_Test;

Statistics Maintenance

SQL Server allows a user to manually override the maintenance of statistics in an individual database. The four main configurations controlling automatic statistics maintenance behavior of SQL Server are as follows:

  • New statistics on columns with no index (auto create statistics)

  • Updating existing statistics (auto update statistics)

  • The degree of sampling used to collect statistics

  • Asynchronous updating of existing statistics (auto update statistics async)

You can control the preceding configurations at the levels of a database (all indexes and statistics on all tables) or on a case-by-case basis on individual indexes or statistics. The auto create statistics setting is applicable for nonindexed columns only, because SQL Server always creates statistics for an index key when the index is created. The auto update statistics setting, and the asynchronous version, is applicable for statistics on both indexes and WHERE clause columns with no index.

Automatic Maintenance

By default, SQL Server automatically takes care of statistics. Both the auto create statistics and auto update statistics settings are on by default. These two features together are referred to as autostats. As explained previously, it is usually better to keep these settings on. The auto update statistics async setting is off by default.

Auto Create Statistics

The auto create statistics feature automatically creates statistics on nonindexed columns when referred to in the WHERE clause of a query. For example, when this SELECT statement is run against the Sales.SalesOrderHeader table on a column with no index, statistics for the column are created:

SELECT  *
FROM    Sales.SalesOrderHeader AS soh
WHERE   ShipToAddressID = 29692

Then the auto create statistics feature (make sure it is turned back on if you have turned it off) automatically creates statistics on column ShipToAddressID. You can see this in the Profiler trace output in Figure 7-23.

Trace output with AUTO_CREATE_STATISTICS ON

Figure 7.23. Trace output with AUTO_CREATE_STATISTICS ON

Auto Update Statistics

The auto update statistics feature automatically updates existing statistics on the indexes and columns of a permanent table when the table is referred to in a query, provided the statistics have been marked as out-of-date. The types of changes are action statements, such as INSERT, UPDATE, and DELETE. The threshold for the number of changes depends on the number of rows in the table, as shown in Table 7-4.

Table 7.4. Update Statistics Threshold for Number of Changes

Number of Rows

Threshold for Number of Changes

0

> 1 insert

< 500

> 500 changes

> 500

500 + 20% of cardinality changes

In SQL Server, cardinality is counted as the number of rows in the table.

Using an internal threshold reduces the frequency of the automatic update of statistics, and using column changes rather than rows changes further reduces the frequency. For example, consider the following table (create_t4.sql in the download):

IF(SELECT OBJECT_ID('dbo.t1')) IS NOT NULL
    DROP TABLE dbo.t1;
CREATE TABLE dbo.t1(c1 INT);
CREATE INDEX ix1 ON dbo.t1(c1);
INSERT INTO dbo.t1 (
    c1
) VALUES (  0 ) ;

After the nonclustered index is created, a single row is added to the table. This outdates the existing statistics on the nonclustered index. If the following SELECT statement is executed with a reference to the indexed column in the WHERE clause, like so:

SELECT * FROM dbo.t1 WHERE c1 = 0

then the auto update statistics feature automatically updates statistics on the nonclustered index, as shown in the Profiler trace output in Figure 7-24.

Trace output with AUTO_UPDATE_STATISTICS ON

Figure 7.24. Trace output with AUTO_UPDATE_STATISTICS ON

Once the statistics are updated, the change-tracking mechanisms for the corresponding tables are set to 0. This way, SQL Server keeps track of the number of changes to the tables and manages the frequency of automatic updates of statistics.

Auto Update Statistics Asynchronously

If auto update statistics asynchronously is set to on, the basic behavior of statistics in SQL Server isn't changed radically. When a set of statistics is marked as out-of-date and a query is then run against those statistics, the statistics update does not interrupt the query, as normally happens. Instead, the query finishes execution using the older set of statistics. Once the query completes, the statistics are updated. The reason this may be attractive is that when statistics are updated, query plans in the procedure cache are removed, and the query being run must be recompiled. So, rather than make a query wait for both the update of the statistics and a recompile of the procedure, the query completes its run. The next time the same query is called, it will have updated statistics waiting for it, and it will have to recompile only.

Although this functionality does make recompiles somewhat faster, it can also cause queries that could benefit from updated statistics and a new execution plan to work with the old execution plan. Careful testing is required before turning this functionality on to ensure it doesn't cause more harm than good.

Note

If you are attempting to update statistics asynchronously, you must also have AUTO_UPDATE_STATISTICS set to ON.

Manual Maintenance

The following are situations in which you need to interfere with the automatic maintenance of statistics:

  • When experimenting with statistics: Just a friendly suggestion: please spare your production servers from experiments such as the ones you are doing in this book.

  • After upgrading from a previous version to SQL Server 2008: Since the statistics maintenance of SQL Server 2008 has been upgraded, you should manually update the statistics of the complete database immediately after the upgrade instead of waiting for SQL Server to update it over time with the help of automatic statistics.

  • While executing a series of ad hoc SQL activities that you won't execute again: In such cases, you must decide whether you want to pay the cost of automatic statistics maintenance to get a better plan in that one case and affect the performance of other SQL Server activities. So, in general, you don't need to be concerned with such one-timers.

  • When you come upon an issue with the automatic statistics maintenance and the only workaround for the time being is to keep the automatic statistics maintenance feature off: Even in these cases you can turn the feature off for the specific database table that faces the problem instead of disabling it for the complete database.

  • While analyzing the performance of a query, you realize that the statistics are missing for a few of the database objects referred to by the query: This can be evaluated from the graphical and XML execution plans, as explained earlier in the chapter.

  • While analyzing the effectiveness of statistics, you realize that they are inaccurate: This can be determined when poor execution plans are being created from what should be good sets of indexes.

SQL Server allows a user to control many of its automatic statistics maintenance features. You can enable (or disable) the automatic statistics creation and update features by using the auto create statistics and auto update statistics settings, respectively, and then you can get your hands dirty.

Manage Statistics Settings

You can control the auto create statistics setting at a database level. To disable this setting, use the ALTER DATABASE command:

ALTER DATABASE AdventureWorks2008 SET AUTO_CREATE_STATISTICS OFF

You can control the auto update statistics setting at different levels of a database, including all indexes and statistics on a table, or at the individual index or statistics level. To disable auto update statistics at the database level, use the ALTER DATABASE command:

ALTER DATABASE AdventureWorks2008 SET AUTO_UPDATE_STATISTICS OFF

Disabling this setting at the database level overrides individual settings at lower levels.

Auto update statistics asynchronously requires that the auto update statistics be on first. Then you can enable the asynchronous update:

ALTER DATABASE AdventureWorks2008 SET AUTO_UPDATE_STATISTICS_ASYNC ON

To configure auto update statistics for all indexes and statistics on a table in the current database, use the sp_autostats system stored procedure:

USE AdventureWorks2008
EXEC sp_autostats 'HumanResources.Department', 'OFF'

You can also use the same stored procedure to configure this setting for individual indexes or statistics. To disable this setting for the AK_Department_Name index on AdventureWorks2008.HumanResources.Department, execute the following statements:

USE AdventureWorks2008
EXEC sp_autostats 'HumanResources.Department', 'OFF', AK_Department_Name

You can also use the UPDATE STATISTICS command's WITH NORECOMPUTE option to disable this setting for all or individual indexes and statistics on a table in the current database. The sp_createstats stored procedure also has the NORECOMPUTE option. The NORECOMPUTE option will not disable automatic update of statistics directly, but it will prevent them, which is almost the same.

Avoid disabling the automatic statistics features, unless you have confirmed through testing that this brings a performance benefit. If the automatic statistics features are disabled, then you should manually identify and create missing statistics on the columns that are not indexed and then keep the existing statistics up-to-date.

Reset the automatic maintenance of the index so that it is on where it has been turned off:

EXEC sp_autostats 'HumanResources.Department', 'ON'
EXEC sp_autostats 'HumanResources.Department', 'ON', AK_Department_Name

Generate Statistics

To create statistics manually, use one of the following options:

  • CREATE STATISTICS: You can use this option to create statistics on single or multiple columns of a table or an indexed view. Unlike the CREATE INDEX command, CREATE STATISTICS uses sampling by default.

  • sp_createstats: Use this stored procedure to create single-column statistics for all eligible columns for all user tables in the current database. This includes all columns except computed columns; columns with the NTEXT, TEXT, GEOMETRY, GEOGRAPHY, or IMAGE data type; sparse columns; and columns that already have statistics or are the first column of an index.

Similarly, to update statistics manually, use one of the following options:

  • UPDATE STATISTICS: You can use this option to update the statistics of individual or all index keys and nonindexed columns of a table or an indexed view.

  • sp_updatestats: Use this stored procedure to update statistics of all user tables in the current database.

You may find that allowing the automatic updating of statistics is not quite adequate for your system. Scheduling UPDATE STATISTICS for the database during off-hours is an acceptable way to deal with this issue. UPDATE STATISTICS is the preferred mechanism because it offers a greater degree of flexibility and control. It's possible, because of the types of data inserted, that the sampling method for gathering the statistics, used because it's faster, may not gather the appropriate data. In these cases, you can force a FULLSCAN so that all the data is used to update the statistics just like what happens when the statistics are initially created. This can be a very costly operation, so it's best to be very selective about which indexes receive this treatment and when it is run.

Note

In general, you should always use the default settings for automatic statistics. Consider modifying these settings only after identifying that the default settings appear to detract from performance.

Statistics Maintenance Status

You can verify the current settings for the autostats feature using the following:

  • DATABASEPROPERTYEX

  • sp_autostats

Status of Auto Create Statistics

You can verify the current setting for auto create statistics by running a query against the sys.databases system table:

SELECT is_auto_create_stats_on
FROM sys.databases
WHERE [name] = 'AdventureWorks2008'

A return value of 1 means enabled, and a value of 0 means disabled.

You can also verify the status of this feature using the sp_autostats system stored procedure, as shown in the following code. Supplying any table name to the stored procedure will provide the configuration value of auto create statistics for the current database under the Output section of the global statistics settings:

USE AdventureWorks2008
EXEC sp_autostats 'HumanResources.Department'

Figure 7-25 shows an excerpt of the preceding sp_autostats statement's output.

sp_autostats output

Figure 7.25. sp_autostats output

A return value of ON means enabled, and a value of OFF means disabled.

This stored procedure is more useful when verifying the status of auto update statistics, as explained later in this chapter.

Status of Auto Update Statistics

You can verify the current setting for auto update statistics, and auto update statistics asynchronously, in a similar manner to auto create statistics. Here's how to do it using the function DATABASEPROPERTYEX:

SELECT DATABASEPROPERTYEX('AdventureWorks2008', 'IsAutoUpdateStatistics')

Here's how to do it using sp_autostats:

USE AdventureWorks2008
EXEC sp_autostats 'Sales.SalesOrderDetail'

Analyzing the Effectiveness of Statistics for a Query

For performance reasons, it is extremely important to maintain proper statistics on your database objects. Issues with statistics are uncommon. However, you still need to keep your eyes open to the possibility of problems with statistics while analyzing the performance of a query. If an issue with statistics does arise, then it can really take you for a ride. In fact, checking that the statistics are up-to-date at the beginning of a query-tuning session eliminates an easily fixed problem. In this section, you'll see what you can do should you find statistics to be missing or out-of-date.

While analyzing an execution plan for a query, look for the following points to ensure a cost-effective processing strategy:

  • Indexes are available on the columns referred to in the filter and join criteria.

  • In the case of a missing index, statistics should be available on the columns with no index. It is preferable to have the index itself.

  • Since outdated statistics are of no use and can even be misleading, it is important that the estimates used by the optimizer from the statistics are up-to-date.

You analyzed the use of a proper index in Chapter 4. In this section, you will analyze the effectiveness of statistics for a query.

Resolving a Missing Statistics Issue

To see how to identify and resolve a missing statistics issue, consider the following example. To more directly control the data, I'll use a test table instead of one of the AdventureWorks2008 tables. First, disable both auto create statistics and auto update statistics using the ALTER DATABASE command:

ALTER DATABASE AdventureWorks2008 SET AUTO_CREATE_STATISTICS OFF;
ALTER DATABASE AdventureWorks2008 SET AUTO_UPDATE_STATISTICS OFF;

Create a test table with a large number of rows and a nonclustered index (create_t6.sql in the download):

IF EXISTS ( SELECT  *
            FROM    sys.objects
            WHERE   object_id = OBJECT_ID(N'[dbo].[t1]')
                    AND type IN (N'U') )
    DROP TABLE [dbo].[t1]
GO

CREATE TABLE dbo.t1 (c1 INT, c2 INT, c3 CHAR(50)) ;
INSERT  INTO dbo.t1 (c1, c2, c3)
VALUES  (51, 1, 'c3') ;
INSERT  INTO dbo.t1 (c1, c2, c3)
VALUES  (52, 1, 'c3') ;
CREATE NONCLUSTERED INDEX i1 ON dbo.t1 (c1, c2) ;
SELECT TOP 10000
        IDENTITY( INT,1,1 ) AS n
INTO    #Nums
FROM    Master.dbo.SysColumns sc1
       ,Master.dbo.SysColumns sc2;

INSERT  INTO dbo.t1 (c1, c2, c3)
        SELECT  n % 50
               ,n
               ,'c3'
        FROM    #Nums;
DROP TABLE #Nums;

Since the index is created on (c1, c2), the statistics on the index contain a histogram for the first column, c1, and density values for the prefixed column combinations (c1 and c1 + c2). There are no histograms or density values for column c2.

To understand how to identify missing statistics on a column with no index, execute the following SELECT statement. Since the auto create statistics feature is off, the optimizer won't be able to find the data distribution for the column c2 used in the WHERE clause. You can see this in the execution plan:

SELECT * FROM dbo.t1 WHERE t1.c2 = 1;

If you right-click the execution plan, you can take a look at the XML data behind it. As shown in Figure 7-26, the XML execution plan indicates missing statistics for a particular execution step under its Warnings element. This shows that the statistics on column t1.c2 are missing.

Missing statistics indication in an XML plan

Figure 7.26. Missing statistics indication in an XML plan

The information on missing statistics is also provided by the graphical execution plan, as shown in Figure 7-27.

Missing statistics indication in a graphical plan

Figure 7.27. Missing statistics indication in a graphical plan

The graphical execution plan contains a node with the yellow exclamation point. This indicates some problem with the data-retrieval mechanism (usually missing statistics). You can obtain a detailed description of the error by moving your mouse over the corresponding node of the execution plan to retrieve the tool tip, as shown in Figure 7-28.

Tool tip of a graphical plan's node

Figure 7.28. Tool tip of a graphical plan's node

Figure 7-28 shows that the statistics for the column are missing. This may prevent the optimizer from selecting the best processing strategy. The current cost of this query as shown by SET STATISTICS IO and SET STATISTICS TIME is as follows:

Table 't1'. Scan count 1, logical reads 84
SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 44 ms.

To resolve this missing statistics issue, you can create the statistics on column t1.c2 by using the CREATE STATISTICS statement:

CREATE STATISTICS s1 ON t1(c2);

Before rerunning the procedure, be sure to clean out the procedure cache because this query will benefit from simple parameterization:

DBCC FREEPROCCACHE();

Figure 7-29 shows the resultant execution plan with statistics created on column c2.

Table 't1'. Scan count 1, logical reads 43
SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 15 ms.
Execution plan with statistics in place

Figure 7.29. Execution plan with statistics in place

The query optimizer uses statistics on a noninitial column in a composite index to determine whether scanning the leaf level of the composite index to obtain the bookmarks will be a more efficient processing strategy than scanning the whole table. In this case, creating statistics on column c2 allows the optimizer to determine that instead of scanning the base table, it will be less costly to scan the composite index on (c1, c2) and bookmark lookup to the base table for the few matching rows. Consequently, the number of logical reads has decreased from 87 to 42, but the elapsed time has decreased only slightly.

Resolving an Outdated Statistics Issue

Sometimes outdated or incorrect statistics can be more damaging than missing statistics. Based on old statistics or a partial scan of changed data, the optimizer may decide upon a particular indexing strategy, which may be highly inappropriate for the current data distribution. Unfortunately, the execution plans don't show the same glaring warnings for outdated or incorrect statistics as they do for missing statistics.

To identify outdated statistics, you should examine how close the optimizer's estimation of the number of rows affected is to the actual number of rows affected.

The following example shows you how to identify and resolve an outdated statistics issue. Figure 7-30 shows the statistics on the nonclustered index key on column c1 provided by DBCC SHOW_STATISTICS.

DBCC SHOW_STATISTICS(t1, i1);

These results say that the density value for column c1 is 0.5. Now consider the following SELECT statement:

SELECT * FROM dbo.t1 WHERE c1 = 51;

Since the total number of rows in the table is currently 10,002, the number of matching rows for the filter criteria c1 = 51 can be estimated to be 5,001 (= 0.5 × 10,002). This estimated number of rows (5,001) is way off the actual number of matching rows for this column value. The table actually contains only one row for c1 = 51.

Statistics on index i1

Figure 7.30. Statistics on index i1

You can get the information on both the estimated and actual number of rows from the actual execution plan. An estimated plan refers to and uses the statistics only, not the actual data. This means it can be wildly different from the real data, as you're seeing now. The actual execution plan, on the other hand, has both the estimated and actual number of rows available.

Executing the query results in this execution plan (Figure 7-31) and performance:

Table 't1'. Scan count 1, logical reads 84
SQL Server Execution Times:
   CPU time = 16 ms,  elapsed time = 2 ms.
Execution plan with outdated statistics

Figure 7.31. Execution plan with outdated statistics

To see the estimated and actual rows, you can view the tool tip by hovering over the Table Scan operator (Figure 7-32).

Tool tip showing row count discrepancy

Figure 7.32. Tool tip showing row count discrepancy

From the estimated rows value vs. the actual rows value, it's clear that the optimizer made an incorrect estimation based on out-of-date statistics. If the difference between the estimated rows and actual rows is more than a factor of 10, then it's quite possible that the processing strategy chosen may not be very cost effective for the current data distribution. An inaccurate estimation may misguide the optimizer in deciding the processing strategy.

To help the optimizer make an accurate estimation, you should update the statistics on the nonclustered index key on column c1 (alternatively, of course, you can just leave the auto update statistics feature on):

UPDATE STATISTICS t1 i1;

If you run the query again, you'll get the following statistics, and the resultant output is as shown in Figure 7-33:

Table 't1'. Scan count 1, logical reads 3
SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.
Actual and estimated number of rows with up-to-date statistics

Figure 7.33. Actual and estimated number of rows with up-to-date statistics

The optimizer accurately estimated the number of rows using updated statistics and consequently was able to come up with a plan. Since the estimated number of rows is 1, it makes sense to retrieve the row through the nonclustered index on c1 instead of scanning the base table.

Updated, accurate statistics on the index key column help the optimizer come to a better decision on the processing strategy and thereby reduce the number of logical reads from 84 to 3 and reduce the execution time from 16 ms to ~0 ms (there is a ˜4 ms lag time).

Before continuing, turn the statistics back on for the database:

ALTER DATABASE AdventureWorks2008 SET AUTO_CREATE_STATISTICS ON;
ALTER DATABASE AdventureWorks2008 SET AUTO_UPDATE_STATISTICS ON;

Recommendations

Throughout this chapter, I covered various recommendations for statistics. For easy reference, I've consolidated and expanded upon these recommendations in the sections that follow.

Backward Compatibility of Statistics

Statistical information in SQL Server 2008 is different from that in previous versions of SQL Server. However, SQL Server 2008 transfers the statistics during upgrade and, by default, automatically updates these statistics over time. For the best performance, however, manually update the statistics immediately after an upgrade.

Auto Create Statistics

This feature should usually be left on. With the default setting, during the creation of an execution plan, SQL Server determines whether statistics on a nonindexed column will be useful. If this is deemed beneficial, SQL Server creates statistics on the nonindexed column. However, if you plan to create statistics on nonindexed columns manually, then you have to identify exactly for which nonindexed columns statistics will be beneficial.

Auto Update Statistics

This feature should usually be left on, allowing SQL Server to decide on the appropriate execution plan as the data distribution changes over time. Usually the performance benefit provided by this feature outweighs the cost overhead. You will seldom need to interfere with the automatic maintenance of statistics, and such requirements are usually identified while troubleshooting or analyzing performance. To ensure that you aren't facing surprises from the automatic statistics features, it's important to analyze the effectiveness of statistics while diagnosing SQL Server issues.

Unfortunately, if you come across an issue with the auto update statistics feature and have to turn it off, make sure to create a SQL Server job to update the statistics and schedule it to run at regular intervals. For performance reasons, ensure that the SQL job is scheduled to run during off-peak hours.

You can create a SQL Server job to update the statistics from SQL Server Management Studio by following these simple steps:

  1. Select ServerName

    Auto Update Statistics
  2. On the General page of the New Job dialog box, enter the job name and other details, as shown in Figure 7-34.

    Entering new job information

    Figure 7.34. Entering new job information

  3. Choose the Steps page, click New, and enter the SQL command for the user database, as shown in Figure 7-35. I used sp_updatestats here instead of UPDATE STATISTICS because it's a shortcut. I could have run UPDATE STATISTICS against each table in the database a different way, especially if I was interested in taking advantage of all the control offered by UPDATE STATISTICS:

    EXEC sp_msforeachtable 'UPDATE STATISTICS ? ALL'
    Entering the SQL command for the user database

    Figure 7.35. Entering the SQL command for the user database

  4. Return to the New Job dialog box by clicking the OK button.

  5. On the Schedules page of the New Job dialog box, click New Schedule, and enter an appropriate schedule to run the SQL Server job, as shown in Figure 7-36.

    Scheduling the SQL Server job

    Figure 7.36. Scheduling the SQL Server job

  6. Return to the New Job dialog box by clicking the OK button.

  7. Once you've entered all the information, click OK in the New Job dialog box to create the SQL Server job.

  8. Ensure that SQL Server Agent is running so that the SQL Server job is run automatically at the set schedule.

Automatic Update Statistics Asynchronously

Letting statistics update at the beginning of a query, which is the default behavior, will be just fine in most cases. In the very rare circumstances where the statistics update or the execution plan recompiles resulting from that update are very expensive (more expensive than the cost of out-of-date statistics), then you can turn on the asynchronous update of statistics. Just understand that it may mean that procedures that would benefit from more up-to-date statistics will suffer until the next time they are run. Don't forget—you do need automatic update of statistics enabled in order to enable the asynchronous updates.

Amount of Sampling to Collect Statistics

It is generally recommended that you use the default sampling rate. This rate is decided by an efficient algorithm based on the data size and number of modifications. Although the default sampling rate turns out to be best in most cases, if for a particular query you find that the statistics are not very accurate, then you can manually update them with FULLSCAN.

If this is required repeatedly, then you can add a SQL Server job to take care of it. For performance reasons, ensure that the SQL job is scheduled to run during off-peak hours. To identify cases in which the default sampling rate doesn't turn out to be the best, analyze the statistics effectiveness for costly queries while troubleshooting the database performance. Remember that FULLSCAN is expensive, so you should run it only on those tables or indexes that you've determined will really benefit from it.

Summary

As discussed in this chapter, SQL Server's cost-based optimizer requires accurate statistics on columns used in filter and join criteria to determine an efficient processing strategy. Statistics on an index key are always created during the creation of the index, and by default, SQL Server also keeps the statistics on indexed and nonindexed columns updated as the data changes. This enables it to determine the best processing strategies applicable to the current data distribution.

Even though you can disable both the auto create statistics and auto update statistics features, it is recommended that you leave these features on, since their benefit to the optimizer is almost always more than their overhead cost. For a costly query, analyze the statistics to ensure that the automatic statistics maintenance lives up to its promise. The best news is that you can rest easy with a little vigilance, since automatic statistics do their job well most of the time. If manual statistics maintenance procedures are used, then you can use SQL Server jobs to automate these procedures.

Even with proper indexes and statistics in place, a heavily fragmented database will incur an increased data-retrieval cost. In the next chapter, you will see how fragmentation in an index can affect query performance, and you'll learn how to analyze and resolve fragmentation.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.226.104.202