Benefits of clustered indexes

SQL Server stores a table as a heap or as a balanced tree (B-tree). If you create a clustered index, a table is stored as a B-tree. As a general best practice, you should store every table with a clustered index because storing a table as a B-tree has many advantages, as listed here:

  • You can control table fragmentation with the ALTER INDEX command using the REBUILD or REORGANIZE option.
  • A clustered index is useful for range queries because the data is logically sorted on the key.
  • You can move a table to another filegroup by recreating the clustered index on a different filegroup. You do not have to drop the table as you would to move a heap.
  • A clustering key is a part of all nonclustered indexes. If a table is stored as a heap, then the row identifier is stored in nonclustered indexes instead. A short integer-clustering key is shorter than a row identifier, thus making nonclustered indexes more efficient.
  • You cannot refer to a row identifier in queries, but clustering keys are often part of queries. This raises the probability for covered queries. Covered queries are queries that read all data from one or more nonclustered indexes without going to the base table. This means that there are fewer reads and less disk IO.

Clustered indexes are particularly efficient when the clustering key is short. Creating a clustering index with a long key makes all nonclustered indexes less efficient. In addition, the clustering key should be unique. If it is not unique, SQL Server makes it unique by adding a 4-byte sequential number called a uniquifier to duplicate keys. The uniquifier becomes a part of the clustering key, which is duplicated in every nonclustered index. This makes keys longer and all indexes less efficient. Clustering keys can be useful if they are ever-increasing. With ever-increasing keys, minimally logged bulk inserts are possible even if a table already contains data, as long as the table does not have additional nonclustered indexes.

Data warehouse surrogate keys are often ideal for clustered indexes. Because you are the one who defines them, you can define them as efficiently as possible. Use integers with auto-numbering options. The primary key constraint creates a clustered index by default. In addition, clustered indexes can be very useful for partial scans. Remember that analytical queries typically involve a lot of data and, therefore, don't use seeks a lot. However, instead of scanning the whole table, you can find the first value with a seek and then perform a partial scan until you reach the last value needed for the query result. Many times, analytical queries use date filters; therefore, a clustering key over a date column might be ideal for such queries.

You need to decide whether to optimize your tables for data load or for querying. However, with partitioning, you can get both—efficient data load without a clustered key on an ever-increasing column, and more efficient queries with partial scans. In order to show the efficiency of partial scans, let's first create a new table organized as a heap with the following query:

SELECT 1 * 1000000 + f.[Sale Key] AS SaleKey, 
  cu.[Customer Key] AS CustomerKey, cu.Customer, 
  ci.[City Key] AS CityKey, ci.City,  
  f.[Delivery Date Key] AS DateKey, 
  s.[Stock Item Key] AS StockItemKey, s.[Stock Item] AS Product,  
  f.Quantity, f.[Total Excluding Tax] AS TotalAmount, f.Profit 
INTO dbo.FactTest 
FROM Fact.Sale AS f 
  INNER JOIN Dimension.Customer AS cu 
    ON f.[Customer Key] = cu.[Customer Key] 
  INNER JOIN Dimension.City AS ci 
    ON f.[City Key] = ci.[City Key] 
  INNER JOIN Dimension.[Stock Item] AS s 
    ON f.[Stock Item Key] = s.[Stock Item Key] 
  INNER JOIN Dimension.Date AS d 
    ON f.[Delivery Date Key] = d.Date; 

Now you can turn STATISTICS IO on to show the number of logical reads in the following two queries:

SET STATISTICS IO ON; 
-- All rows 
SELECT * 
FROM dbo.FactTest; 
-- Date range 
SELECT * 
FROM dbo.FactTest 
WHERE DateKey BETWEEN '20130201' AND '20130331'; 
SET STATISTICS IO OFF; 

SQL Server used a Table Scan operator to execute both queries. For both of them, even though the second one used a filter on the delivery date column, SQL Server performed 5,893 logical IOs.

Note that your results for the logical IOs might vary slightly for every query in this chapter. However, you should be able to notice which query is more efficient and which is less.

Now let's create a clustered index in the delivery date column:

CREATE CLUSTERED INDEX CL_FactTest_DateKey 
ON dbo.FactTest(DateKey); 
GO 

If you execute the same two queries, you get around 6,091 reads with the Clustered Index Scan operator for the first query, and 253 logical reads for the second query, with the Clustered Index Seek operator, which finds the first value needed for the query and performs a partial scan afterwards.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.191.95.74