© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
G. Fritchey, SQL Server 2022 Query Performance Tuning, https://doi.org/10.1007/978-1-4842-8891-7_14

14. Query Design Analysis

Grant Fritchey, Grafton, MA, USA

While a healthy amount of performance is determined by your hardware, your cloud service tier, server settings, indexes, and other data structures, the single most important aspect of SQL Server performance is your queries. If your T-SQL is problematic, in many cases, none of the rest can save your performance. There are a number of common coding issues, referred to as code smells, that can lead to bad performance. You may even have queries attempting to work on your data in a row-by-row fashion (what Jeff Moden calls Row By Agonizing Row, abbreviated to RBAR). Focusing on fixing your T-SQL can be the single most important thing you do to enhance the performance of your databases.

In this chapter, I cover the following topics:
  • Common code smells in T-SQL

  • Query designs to ensure effective index use

  • Appropriate use of optimizer hints

  • How database constraints affect performance

Query Design Recommendations

T-SQL is a very powerful language, and there is frequently more than one way to write a query and still get the exact same results. However, some queries are simply going to perform better than others. Certain query structures interfere with the optimizer finding the best execution plan. There are also query mechanisms that are more resource intensive than others. We’ll cover that in the next chapter. In this section, I want to discuss the following recommendations:
  • Keep your result sets as small as possible.

  • Use indexes effectively.

  • Use optimizer hints sparingly.

  • Maintain and enforce referential integrity.

While all of my recommendations are tested, situationally they may not work for your queries or your environment. You should always make it a habit to test and validate code changes as you make them. Use the information put forward so far within the book, from capturing query metrics to reading execution plans, in order to understand how your own queries are performing.

Keep Your Result Sets Small

There’s a classic saying, probably as old as computing:

Move only the data you need and only when you need to move it.

While this was originally stated back when networks could be easily overwhelmed by relatively benign data movement, the concept is still very applicable. By choosing to select only the columns you need, you can easily reduce the overhead in your system. Further, by limiting the rows being moved, you can take advantage of indexes to assist your performance. Follow these guidelines:
  • Limit the columns in your SELECT list.

  • Filter your data through a WHERE clause.

It’s very common to hear someone say that the business requirement is to return 100,000 rows for a report. Yet if you drill down on this requirement, you may find that they simply need an aggregate, or some other subset of the data. Humans simply do not process 100,000 rows on their own. Be sure you’re only moving the data you need.

Limit the Columns in Your SELECT List

Strictly speaking, using SELECT * to retrieve every column is not a massive performance problem. However, retrieving all columns means that you won’t be able to take advantage of covering indexes, indexes that provide all the columns needed for a given query. Listing 14-1 shows an example query.
SELECT st.Name,
       st.TerritoryID
FROM Sales.SalesTerritory AS st
WHERE st.Name = 'Australia';
Listing 14-1

Query with a limited column list

The Sales.SalesTerritory table has a nonclustered index on the Name column. The clustered index on the table is on the TerritoryID column. Since the nonclustered index will use the clustered index key, you have a covering index for the query. Executing it, you get the metrics and execution plan shown in Figure 14-1:
Reads: 2
Duration: 190mcs


Figure 14-1

An execution plan showing a covering index in action

If I modify the query to return all columns on the table, we’ll no longer see the nonclustered index used. Instead, we’ll see the metrics and execution plan shown in Figure 14-2:
Reads: 4
Duration: 1.1ms


Figure 14-2

The query no longer has a covering index

I’m no longer getting the benefit of a covering index. The extra work required to both perform a Key Lookup to get the columns from the clustered index and then the Nested Loops to join the two result sets together ends up doubling the number of reads, from 2 to 4. The query duration on average went up more than five times.

In short, only choose the columns you need at the moment.

Filter Your Data Through a WHERE Clause

A WHERE clause is required before a query that looks up data can see any benefit from an index. This includes UPDATE and DELETE statements when they're finding the data being modified, INSERT statements when a referential integrity check occurs, and, of course, all SELECT statements. Further, as explained in Chapter 8, the selectivity of the column, or columns, referenced in the WHERE, ON, and HAVING clauses determines how that index is used. As much as possible, your queries should be filtering data through these clauses. Where possible, you also want to use the most selective column.

The majority of applications are going to be working off of limited result sets. While you may need to perform data movement involving all, or just a very large subset, of the data, generally speaking, you should be seeing mostly small data sets being retrieved. The exception of course is analytical queries, but you can better support queries of that type through columnstore indexes. In the event that you are in a situation where you regularly have to move extremely large result sets, you may need to look to external processes and hardware to improve performance.
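As a quick sketch of the pattern (this assumes the standard AdventureWorks index IX_SalesOrderHeader_CustomerID is still in place), a selective WHERE clause on an indexed column is what allows a seek instead of a scan:

SELECT soh.SalesOrderID,
       soh.OrderDate,
       soh.TotalDue
FROM Sales.SalesOrderHeader AS soh
WHERE soh.CustomerID = 29825;

The same principle applies to the WHERE clause of an UPDATE or DELETE that has to locate the rows being modified.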

Use Indexes Effectively

If a query is badly structured, it may not be able to take advantage of indexes. Since indexes are the most effective tool when it comes to performance tuning, it behooves you to ensure that your queries are structured such that they can use the indexes effectively. To improve the use of indexes, I’d suggest following these guidelines:
  • Use effective search conditions.

  • Avoid operations on columns in WHERE, ON, and HAVING clauses.

  • Use care in creating custom scalar UDFs.

I’ll break down these guidelines in the following sections.

Use Effective Search Conditions

Search conditions within a WHERE, ON, or HAVING clause can use a very large number of logical operations. Some of those operations are highly effective in allowing the query to work with indexes and statistics. Other operations are not as effective and can actively prevent the efficient use of indexes. Traditionally, we refer to the effective operations as being “Search ARGument ABLE” or sargable for short.

Note

While the use of sargable conditions is in the HAVING, ON, and WHERE clauses, rather than saying all three every time, I’m going to simply reference the WHERE clause. Assume I mean all three when I reference just the one.

There are a few search conditions, introduced in more recent versions of SQL Server, for which these rules don’t exactly apply. For example, SOME/ANY and ALL are dependent on the subquery that defines them. If you use a non-sargable condition in that subquery, then you’ll have issues.
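As a rough sketch (my own illustration, not one of the chapter's test queries), the sargability of an ANY condition comes entirely from the predicate inside the subquery that feeds it:

SELECT soh.SalesOrderID
FROM Sales.SalesOrderHeader AS soh
WHERE soh.SalesOrderID = ANY (SELECT sod.SalesOrderID
                              FROM Sales.SalesOrderDetail AS sod
                              WHERE sod.OrderQty >= 20);

Swap that inner condition for an exclusion such as OrderQty <> 1 and the subquery falls into the non-sargable bucket described next.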

Table 14-1 shows both the sargable and non-sargable conditions within SQL Server. The sargable search conditions allow the optimizer to use an index on the column(s) in the WHERE clause. Generally, you’re likely to see a seek to a row, or set of rows, in the index when using those conditions.
Table 14-1

Common sargable and non-sargable search conditions

Sargable: Inclusion conditions =, >, >=, <, <=, and BETWEEN, and some LIKE conditions such as LIKE '<literal>%'

Non-sargable: Exclusion conditions <>, !=, !>, !<, NOT EXISTS, NOT IN, NOT LIKE, and some LIKE conditions such as LIKE '%<literal>' or LIKE '%<literal>%'

The non-sargable conditions listed will prevent good index use by the optimizer. Instead of a seek, you're much more likely to see scans when using these conditions. This is especially true of the logical NOT operators such as <>, !=, and NOT LIKE. Using these will always result in a scan of all rows to identify those that match.

Where you can, implement workarounds for these non-sargable search conditions to improve performance. In some cases, it may be possible to rewrite the logic of a query to use inclusion instead of exclusion operations. You will have to experiment with different mechanisms in order to determine which are going to work in a given situation. Also, don’t forget that additional filtering could limit the scans necessary when multiple columns from one table are in a WHERE clause. Testing is your friend when you have to deal with these types of situations.
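As a hypothetical sketch of rewriting exclusion as inclusion (this assumes the Status column only ever holds the values 1 through 5, an assumption you would have to verify against your own data before relying on it):

-- Exclusion: non-sargable
SELECT soh.SalesOrderID
FROM Sales.SalesOrderHeader AS soh
WHERE soh.Status <> 5;

-- Inclusion: logically the same result under the assumption above
SELECT soh.SalesOrderID
FROM Sales.SalesOrderHeader AS soh
WHERE soh.Status IN (1, 2, 3, 4);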

BETWEEN vs. IN/OR

Listing 14-2 is an example of a query using an IN condition.
SELECT sod.CarrierTrackingNumber,
       sod.OrderQty
FROM Sales.SalesOrderDetail AS sod
WHERE sod.SalesOrderID IN ( 51825, 51826, 51827, 51828 );
Listing 14-2

A query using an IN condition

Logically, this query could be written using an OR condition as shown in Listing 14-3.
SELECT sod.CarrierTrackingNumber,
       sod.OrderQty
FROM Sales.SalesOrderDetail AS sod
WHERE sod.SalesOrderID = 51825
      OR sod.SalesOrderID = 51826
      OR sod.SalesOrderID = 51827
      OR sod.SalesOrderID = 51828;
Listing 14-3

A query using an OR condition

Since the values we’re searching for in this case are consecutive integers, another way to logically write this query is shown in Listing 14-4.
SELECT sod.CarrierTrackingNumber,
       sod.OrderQty
FROM Sales.SalesOrderDetail AS sod
WHERE sod.SalesOrderID
BETWEEN 51825 AND 51828;
Listing 14-4

A query using a BETWEEN condition

Logically, all three queries are the same. If you look at the result sets, they are all the same. If I capture all three execution plans and put them side by side, as shown in Figure 14-3, they are even visually identical.


Figure 14-3

Three visually identical execution plans

Even though plans may be visually the same, if we look at the performance metrics, we can see differences. Here are the metrics for the three queries in order:
  • Using IN

Duration: 362mcs
Reads: 18
  • Using OR

Duration: 336mcs
Reads: 18
  • Using BETWEEN

Duration: 342mcs
Reads: 6
While the OR version was slightly faster than the BETWEEN, by 6 microseconds, it used three times as many reads, 18 compared to 6. The key to understanding this is in how the optimizer chooses to deal with these queries. While they are logically the same, they are physically different, and the optimizer has choices. If we compare the IN execution plan to the BETWEEN plan using the plan comparison capabilities in SSMS (as explained in Chapter 4), we can quickly identify differences, especially this one in the Clustered Index Seek operator in Figure 14-4.


Figure 14-4

Differences between two execution plans

The only real difference is in the Seek Predicate values. Here is the predicate for the OR query in Listing 14-3:
[1] Seek Keys[1]: Prefix: [AdventureWorks].[Sales].[SalesOrderDetail].SalesOrderID = Scalar Operator((51825)), [2] Seek Keys[1]: Prefix: [AdventureWorks].[Sales].[SalesOrderDetail].SalesOrderID = Scalar Operator((51826)), [3] Seek Keys[1]: Prefix: [AdventureWorks].[Sales].[SalesOrderDetail].SalesOrderID = Scalar Operator((51827)), [4] Seek Keys[1]: Prefix: [AdventureWorks].[Sales].[SalesOrderDetail].SalesOrderID = Scalar Operator((51828))
This is the predicate for the BETWEEN query in Listing 14-4:
Seek Keys[1]: Start: [AdventureWorks].[Sales].[SalesOrderDetail].SalesOrderID >= Scalar Operator([@1]), End: [AdventureWorks].[Sales].[SalesOrderDetail].SalesOrderID <= Scalar Operator([@2])

First, the optimizer has chosen to apply simple parameterization to this query as you can see in the @1 and @2 values within the second predicate. Also, instead of a BETWEEN operation, >= and <= have been substituted. In the end, this approach results in far fewer reads. On a production system under load, this will doubtless lead to superior performance.

If we also take a look at STATISTICS IO for the two queries, we get a better sense of what’s happening:
  • For the OR query:

Table "SalesOrderDetail". Scan count 4, logical reads 18
  • For the BETWEEN query:

Table "SalesOrderDetail". Scan count 1, logical reads 6

As you can see, there are four scans to satisfy the OR condition, while a single scan satisfies the BETWEEN condition.

This is not to say that you will always see better performance under all circumstances between IN, OR, and BETWEEN. However, as you can tell, while three queries may share logical similarities, choosing different mechanisms can arrive at superior performance. Testing, and then using the tools to evaluate the queries, is the key. Different operators can, and will, use the indexes in different ways.

This is also an example where STATISTICS IO can give us additional information beyond what’s available to us within Extended Events. You won’t always need to use it, but keep that tool in mind when you need to drill down to better understand why you may be seeing differences in behavior between two queries.
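If you want to try it yourself, here's a minimal sketch of capturing those numbers (run it in SSMS; the scan and read counts appear on the Messages tab):

SET STATISTICS IO ON;
SELECT sod.CarrierTrackingNumber,
       sod.OrderQty
FROM Sales.SalesOrderDetail AS sod
WHERE sod.SalesOrderID
BETWEEN 51825 AND 51828;
SET STATISTICS IO OFF;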

LIKE Condition

Searches against text within the database are extremely common. Where possible, you should avoid using wildcards at the leading edge of a LIKE condition. With a leading wildcard, the optimizer can't use the histogram of the statistics to identify potential ranges of matching rows, so it must scan the entire index to look for matching values. It's also a good idea to provide as many leading characters as you can in a LIKE condition in order to help the optimizer identify ranges. The optimizer also makes internal changes to queries when using the LIKE condition. Take Listing 14-5 as an example.
SELECT C.CurrencyCode
FROM Sales.Currency AS C
WHERE C.Name LIKE 'Ice%';
Listing 14-5

Using the LIKE condition

The optimizer changes this predicate and we can see it in the execution plan:
Seek Keys[1]: Start: [AdventureWorks].[Sales].[Currency].Name >= Scalar Operator(N'Ice'), End: [AdventureWorks].[Sales].[Currency].Name < Scalar Operator(N'IcF')

The change is from a LIKE condition to >= and < conditions. You could rewrite the query to use those conditions yourself, and if you did, the performance, including the reads, would be exactly the same. In short, providing leading characters in a LIKE condition means the optimizer can optimize the search using indexes on the table.
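A rough sketch of that manual rewrite might look like the following. I'm assuming a typical case-insensitive collation where N'IcF' is the first value past any string beginning with 'Ice'; under a different collation the end point could change:

SELECT C.CurrencyCode
FROM Sales.Currency AS C
WHERE C.Name >= N'Ice'
      AND C.Name < N'IcF';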

!< Condition vs. >= Condition

Even though both !< and >= are logically equivalent and you will see an identical result set in a query using them, the optimizer is going to implement execution plans differently using these two conditions. The key is in the equality operator for >=. That gives the optimizer a solid starting point for using an index. The other operation, !<, has no starting point, resulting in the entire index, or the entire table, having to be scanned.

However, as I’ve already shown you in multiple examples within the book, the optimizer is capable of some modifications to the T-SQL code you’ve submitted. Here, we have a pair of queries that are using the conditions just discussed (Listing 14-6).
SELECT poh.TotalDue,
       poh.Freight
FROM Purchasing.PurchaseOrderHeader AS poh
WHERE poh.PurchaseOrderID >= 2975;
SELECT poh.TotalDue,
       poh.Freight
FROM Purchasing.PurchaseOrderHeader AS poh
WHERE poh.PurchaseOrderID !< 2975;
Listing 14-6

Comparing !< and >= conditions

Capturing the execution plans for both queries, they look visually identical in Figure 14-5.


Figure 14-5

Identical execution plans

In fact, if you compare the plans, they are identical. It's the same execution plan, used for both queries. You can see what happened if you look at the T-SQL stored with the plan. First, the optimizer changed the !< to >=. Second, it used simple parameterization, which leads to plan reuse. This is an example of how, when code is logically equivalent, the optimizer can help the performance of your query.

Avoid Operations on Columns

Calculations such as mathematical operations or functions like DATEPART when used against columns prevent good statistics and index use. Listing 14-7 shows a calculation against a column in the WHERE clause.
SELECT poh.EmployeeID,
       poh.OrderDate
FROM Purchasing.PurchaseOrderHeader AS poh
WHERE poh.PurchaseOrderID * 2 = 3400;
Listing 14-7

Calculation on a column

The execution plan is shown in Figure 14-6.


Figure 14-6

An index scan caused by a calculation

There is an index on the PurchaseOrderID column, so the optimizer scanned that instead of scanning the clustered index, because the nonclustered index is smaller. However, that scan results in the need to pull the remaining columns from the clustered index through the Key Lookup operation and a join.

The query metrics are as follows:
Reads: 11
Duration: 2.51ms

In Listing 14-8, I move the calculation off the column to the hard-coded value.

SELECT poh.EmployeeID,
       poh.OrderDate
FROM Purchasing.PurchaseOrderHeader AS poh
WHERE poh.PurchaseOrderID = 3400 / 2;
Listing 14-8

Changing the query to not calculate on the column

Making this modification results in the execution plan and runtime metrics shown in Figure 14-7:
Reads: 2
Duration: 143mcs


Figure 14-7

A seek with the calculation removed

There’s not much to explain. Because of the calculation, the optimizer must scan the data. The optimizer tries to help you by scanning a smaller index and then looking up the remaining data, but as you can see, simply moving the calculation off the column makes all the difference.

This works for functions too. A classic problem arises when using DATETIME data types when you only need dates. An even worse issue is when people store date and time information as strings, in VARCHAR or CHAR columns, because you can store it formatted. To set up an example, I’m going to create an index on the Sales.SalesOrderHeader table as shown in Listing 14-9.
IF EXISTS
(
    SELECT *
    FROM sys.indexes
    WHERE OBJECT_ID = OBJECT_ID(N'[Sales].[SalesOrderHeader]')
          AND NAME = N'IndexTest'
)
    DROP INDEX IndexTest ON Sales.SalesOrderHeader;
GO
CREATE INDEX IndexTest ON Sales.SalesOrderHeader (OrderDate);
Listing 14-9

Creating an index on a DATETIME column

I’m going to query the data in the SalesOrderHeader table in order to retrieve sales that take place in the year 2008, in the month of April, as shown in Listing 14-10.
SELECT soh.SalesOrderID,
       soh.OrderDate
FROM Sales.SalesOrderHeader AS soh
    JOIN Sales.SalesOrderDetail AS sod
        ON soh.SalesOrderID = sod.SalesOrderID
WHERE DATEPART(yy, soh.OrderDate) = 2008
      AND DATEPART(mm, soh.OrderDate) = 4;
Listing 14-10

Querying for parts of dates

Because we’re using DATEPART on the column, we’re going to see poor index use in the execution plan in Figure 14-8 and the following metrics:
Reads: 73
Duration: 8.89ms


Figure 14-8

An index scan caused by the DATEPART function

The thing is I can treat the dates, as dates, to arrive at the same result set by changing the logic of the query in Listing 14-11.
SELECT soh.SalesOrderID,
       soh.OrderDate
FROM Sales.SalesOrderHeader AS soh
    JOIN Sales.SalesOrderDetail AS sod
        ON soh.SalesOrderID = sod.SalesOrderID
WHERE soh.OrderDate >= '2008-04-01'
      AND soh.OrderDate < '2008-05-01';
Listing 14-11

Modifying the date logic

This version of the query has the following metrics and execution plan in Figure 14-9:
Reads: 2
Duration: 198mcs


Figure 14-9

A seek occurs when dates are treated appropriately

It’s not even close. In both examples, changing the logic to avoid running operations against columns results in radical performance enhancements.

Listing 14-12 removes the index I created.
DROP INDEX IndexTest ON Sales.SalesOrderHeader;
Listing 14-12

Cleaning up after the test

Custom Scalar UDF

Scalar functions are an attractive means of code reuse, especially if you need only a single value. However, while you can use them for data retrieval, that's not an ideal use of scalar UDFs, and you can see poor performance as a result. In order to see this in action, Listing 14-13 creates a scalar function.
CREATE OR ALTER FUNCTION dbo.ProductStandardCost
(
    @ProductID INT
)
RETURNS MONEY
AS
BEGIN
    DECLARE @Cost MONEY;
    SELECT TOP 1
           @Cost = pch.StandardCost
    FROM Production.ProductCostHistory AS pch
    WHERE pch.ProductID = @ProductID
    ORDER BY pch.StartDate DESC;
    IF @Cost IS NULL
        SET @Cost = 0;
    RETURN @Cost;
END;
Listing 14-13

Scalar function to retrieve product costs

Listing 14-14 then uses the function in a simple query.
SELECT p.Name,
       dbo.ProductStandardCost(p.ProductID)
FROM Production.Product AS p
WHERE p.ProductNumber LIKE 'HL%';
Listing 14-14

Consuming the scalar function

Running the query results in the runtime metrics and execution plan shown in Figure 14-10:
Reads: 14
Duration: 277mcs


Figure 14-10

Execution plan with a scalar function

This is a hard-to-read execution plan, so we’ll break it down into its component parts. Older versions of SQL Server won’t show this level of detail within the execution plan but instead will hide the scalar function within a single Compute Scalar operator. Capturing an Estimated Plan will show all the information you see here.

We’ll start with the section that shows how the scalar function is performing its data retrieval. Figure 14-11 shows the detail of that part of the plan.


Figure 14-11

Data retrieval inside the scalar function

The function is able to use a Clustered Index Seek to retrieve the data. You can then see that it performs the necessary steps to prep the information, using a Top operator to support the TOP 1 part of the query and a Stream Aggregate to define the value returned. The Compute Scalar operators just ensure that the data types are correct.

The next section of the plan (Figure 14-12), at the bottom, shows how the optimizer deals with the IF statement in the function.


Figure 14-12

IF statement within the execution plan

The Constant Scan operators are used to evaluate whether or not the @Cost value is NULL. The join operation ensures that a value, either the @Cost value or the value of 0, gets returned. This is then joined to the output of the data retrieval within the scalar function.

At the top of the plan is the data retrieval for the query from Listing 14-14, shown in Figure 14-13.


Figure 14-13

Satisfying the rest of the query

The Index Seek filters the data from the WHERE clause, and then the Name column is retrieved using the Key Lookup. This data is then joined and then joined again with the data from the scalar function.

The query is running fairly fast at 277 microseconds, but you can rewrite this query to avoid using the scalar function, seen here in Listing 14-15.
SELECT p.NAME,
       pc.StandardCost
FROM Production.Product AS p
    CROSS APPLY
(
    SELECT TOP 1
           pch.StandardCost
    FROM Production.ProductCostHistory AS pch
    WHERE pch.ProductID = p.ProductID
    ORDER BY pch.StartDate DESC
) AS pc
WHERE p.ProductNumber LIKE 'HL%';
Listing 14-15

Replacing the scalar function

This query runs in about 239 microseconds, a small improvement over the scalar function. Figure 14-14 shows the execution plan for this query.


Figure 14-14

Simplified execution plan after eliminating the scalar function

If you compare this plan to the plan shown previously, you can see most of the same functionality in play. Yet because there are fewer join and other operations, the speed is improved.

Minimize Optimizer Hints

The query optimizer within SQL Server is one of the most amazing pieces of software of which I’m aware. The way it uses your data structures, statistics, and its algorithms to make the queries you write run fast is incredible. However, the optimizer will not always get things perfectly correct, every time. Rarely, it may need some assistance, and you can take some aspects of control away from the optimizer in these cases through the use of what are called hints.

More often than not, the optimizer will give you a good enough plan for the query in question. However, like in the case of parameter sniffing (discussed in Chapter 13), the optimizer’s choices can be less than optimal. As I showed, one way to deal with parameter sniffing gone bad is through the use of a query hint, such as OPTIMIZE FOR. It’s extremely important to understand that a hint is not a suggestion nearly as much as it’s a command for the optimizer to behave in a certain way. The majority of the time, I reserve hints to the very last option when attempting to improve the performance of a query. However, understanding what hints are and how they can be used when you need them is also important.
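For reference, here's a minimal sketch of the OPTIMIZE FOR syntax. The local variable and the value 29825 are placeholders I've chosen purely for illustration; Chapter 13 covers when and why you would actually reach for this:

DECLARE @CustomerID INT = 11000;
SELECT soh.SalesOrderID,
       soh.OrderDate
FROM Sales.SalesOrderHeader AS soh
WHERE soh.CustomerID = @CustomerID
OPTION (OPTIMIZE FOR (@CustomerID = 29825));

The optimizer compiles the plan as if @CustomerID held 29825, regardless of the value actually supplied at runtime.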

I don’t have room in the book to cover all possible hints. I have already covered some in earlier chapters, and I’ll cover more in later chapters. Here, I want to demonstrate two specific hints: JOIN hint and INDEX hint.

JOIN Hint

In Chapter 2, I explained how the optimizer dynamically determines a cost-effective JOIN strategy between two data sets. This strategy combines the data structures, constraints, and statistics to arrive at the appropriate type of join. Table 14-2 summarizes the types of JOIN operations within SQL Server.
Table 14-2

JOIN types supported by SQL Server

Nested Loops
  Index on joining columns: inner table a must, outer table preferable
  Usual size of joining tables: small
  Sorted data requirement: optional

Merge
  Index on joining columns: both tables a must
  Usual size of joining tables: large
  Sorted data requirement: yes

Hash
  Index on joining columns: inner table not indexed
  Usual size of joining tables: any
  Sorted data requirement: no

Adaptive
  Index on joining columns: uses either Hash or Loops, so those requirements apply
  Usual size of joining tables: generally very large
  Sorted data requirement: depends on the join type

There is no hint for the Adaptive join since the optimizer alone determines when it will get used and then how it will get used based on the data, all covered in Chapter 8. Table 14-3 shows the possible JOIN hints.
Table 14-3

JOIN hints

Nested Loops: LOOP
Merge: MERGE
Hash: HASH
(no corresponding JOIN type): REMOTE

Note

There are four hints but only three JOIN types because the REMOTE hint is used only when one of the tables in a JOIN is a remote table on a different server.

Listing 14-16 shows a query with a number of joins between different tables.
SELECT s.Name AS StoreName,
       p.LastName + ', ' + p.FirstName
FROM Sales.Store AS s
    JOIN Sales.SalesPerson AS sp
        ON s.SalesPersonID = sp.BusinessEntityID
    JOIN HumanResources.Employee AS E
        ON sp.BusinessEntityID = E.BusinessEntityID
    JOIN Person.Person AS p
        ON E.BusinessEntityID = p.BusinessEntityID;
Listing 14-16

Joining multiple tables in a query

This query results in the following performance metrics and execution plan (Figure 14-15):
Reads: 2,364
Duration: 6.22ms


Figure 14-15

Execution plan for a query without any hints

The optimizer has made several choices based on the query and the objects. In this case, it's using a Hash Match for the first join and Nested Loops for the second. For simple queries of this type, with a small data set, a Nested Loops join makes sense. The optimizer saw no way for indexes to help with the other tables, so it ended up using a HASH to bring the data together. It's possible that, since the number of rows coming from the Sales.SalesPerson table is relatively small, a loop join might be better. We can modify the code to test it out in Listing 14-17. Executing the query results in the plan you see in Figure 14-16.
SELECT s.Name AS StoreName,
       p.LastName + ', ' + p.FirstName
FROM Sales.Store AS s
    JOIN Sales.SalesPerson AS sp
        ON s.SalesPersonID = sp.BusinessEntityID
    JOIN HumanResources.Employee AS E
        ON sp.BusinessEntityID = E.BusinessEntityID
    JOIN Person.Person AS p
        ON E.BusinessEntityID = p.BusinessEntityID
OPTION (LOOP JOIN);
Listing 14-17

Adding a JOIN hint


Figure 14-16

Plan after forcing LOOPS joins

The plan shape is roughly the same, but the Hash Match has been replaced by Nested Loops, per the instruction. One point worth mentioning: in Figure 14-15, the Index Scan operation against the Sales.SalesPerson table estimated that 17 rows would be needed, and that's how many it returned. If you look at Figure 14-16, the Clustered Index Scan has hit all 701 rows in the table. This results in the following performance metrics:
Reads: 3,737
Duration: 8.18ms
That’s about 30% slower and almost 50% more reads. In short, we’ve received exactly what we asked for, all Nested Loops joins, but at the cost of worse performance. That’s pretty bad, but it could be worse. In this case, we’ve told the optimizer to use Nested Loops for any joins it deems necessary. However, we could change things and only specify the join for the one table here in Listing 14-18.
SELECT s.Name AS StoreName,
       p.LastName + ', ' + p.FirstName
FROM Sales.Store AS s
    INNER LOOP JOIN Sales.SalesPerson AS sp
        ON s.SalesPersonID = sp.BusinessEntityID
    JOIN HumanResources.Employee AS E
        ON sp.BusinessEntityID = E.BusinessEntityID
    JOIN Person.Person AS p
        ON E.BusinessEntityID = p.BusinessEntityID;
Listing 14-18

Forcing just one join to behave a certain way

This results in the execution plan visible in Figure 14-17.


Figure 14-17

Forcing a single join to change may have consequences

The execution plans in Figures 14-15 and 14-16 showed three different references to pulling data from tables or indexes. Figure 14-17 now has four. In short, by forcing the join through the query hint as we did, we took away the optimizer's ability to put this query through the simplification process. Thanks to the enforced referential constraints, the optimizer had been able to eliminate a table from the execution plan; it has now been forced to put that table back because of the query hint. The performance metrics reflect this change, of course:
Reads: 3,746
Duration: 8.89ms
I haven’t improved performance at all through these experiments. While there will absolutely be exceptions, in general, letting the optimizer choose how best to join together tables results in superior performance. The use of hints can
  • Cause elimination of simplification

  • Prevent auto-parameterization

  • Prevent the optimizer from dynamically deciding the join order

While hints can sometimes improve performance, it’s absolutely not a guarantee. Only use them after very thorough testing.

INDEX Hints

Because the optimizer is dependent on statistics, it is possible for an index that might help a query to get overlooked. However, the vast majority of the time, the optimizer is right. Still, people regularly try to outsmart the optimizer. Take our earlier example with the calculation against the column from Listing 14-7. There’s an index on the table, PK_PurchaseOrderHeader_PurchaseOrderID, that, without the calculation, would surely have been used to retrieve the data. Listing 14-19 shows how to implement an index hint.
SELECT poh.EmployeeID,
       poh.OrderDate
FROM Purchasing.PurchaseOrderHeader AS poh WITH (INDEX(PK_PurchaseOrderHeader_PurchaseOrderID))
WHERE poh.PurchaseOrderID * 2 = 3400;
Listing 14-19

Forcing index choice through a query hint

Figure 14-18 shows the resulting execution plan, and these are the runtime metrics:
Reads: 44
Duration: 682mcs


Figure 14-18

A scan from an index hint

This is a mixed result. The reads went from 11 to 44. However, the execution time went from 2.5ms to 682mcs, a very substantial reduction. It's still not as fast as simply removing the calculation from the column, which ran in 143mcs with only 2 reads. Fixing the code is absolutely the superior choice by every measure. However, the INDEX hint did help the original query. This kind of experimentation can sometimes result in wins.

Using Domain and Referential Integrity

While the primary purpose of domain and referential integrity is to ensure good, clean data, the optimizer can also take advantage of these constraints. Knowing that a foreign key is enforced can change the row estimates and improve the choices made by the optimizer.

We’ll explore three examples here:
  • The NOT NULL constraint

  • A check constraint

  • Declarative referential integrity (DRI)

NOT NULL Constraint

The NOT NULL constraint ensures that a given column will never allow NULL values, helping ensure domain integrity of the data there. SQL Server will enforce this constraint in real time as data is manipulated within the table. The optimizer uses the information that there will be no NULL values to help improve the speed of the queries against the column.
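A minimal sketch of declaring the constraint (a hypothetical table, not one from AdventureWorks) looks like this:

CREATE TABLE dbo.PersonTest
(
    PersonTestID INT NOT NULL PRIMARY KEY,
    FirstName NVARCHAR(50) NOT NULL, -- the optimizer knows no NULL handling is needed
    MiddleName NVARCHAR(50) NULL     -- NULL values have to be accounted for
);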

Listing 14-20 contains two queries against similarly sized data sets and columns.
SELECT p.FirstName
FROM Person.Person AS p
WHERE p.FirstName < 'B'
      OR p.FirstName >= 'C';
SELECT p.MiddleName
FROM Person.Person AS p
WHERE p.MiddleName < 'B'
      OR p.MiddleName >= 'C';
Listing 14-20

Querying the Person.Person table

These two queries have visually identical execution plans, as you can see in Figure 14-19.


Figure 14-19

Two visually identical execution plans

The key difference is visible in the plan. The first query is returning 18,967 rows out of an estimated 18,942. The second is returning 11,134 out of 11,372. That’s about the only real difference. Because neither the FirstName nor the MiddleName column has an index to support these queries, a scan of the IX_Person_LastName_FirstName_MiddleName index is used.

Further, the second column, MiddleName, includes NULL values, which, by definition, don't satisfy either comparison in the WHERE clause, so those rows are left out. If they need to be included, the query has to ask for them explicitly. Listing 14-21 shows the adjusted query.
SELECT p.MiddleName
FROM Person.Person AS p
WHERE p.MiddleName < 'B'
      OR p.MiddleName >= 'C'
      OR p.MiddleName IS NULL;
Listing 14-21

Fixing the logic of the second query

In addition to fixing the logic of the query, I’m also going to take the Missing Index recommendations from the optimizer and create two indexes here in Listing 14-22.
CREATE INDEX TestIndex1 ON Person.Person (MiddleName);
CREATE INDEX TestIndex2 ON Person.Person (FirstName);
Listing 14-22

Adding indexes to make the queries faster

I’m now going to go back and rerun the queries, which results in the execution plans seen in Figure 14-20.


Figure 14-20

Indexes and IS NULL changed the execution plans

The first query against the FirstName column used the new index to simply retrieve the data. While the second query was able to use the new index, because of the IS NULL operation, you can see that the plan has greatly expanded in scope in order to satisfy that criterion. Three values are created using the three Constant Scan and Compute Scalar operators, one for each criterion in the WHERE clause. These are then sorted and merged and used in a join to arrive at the filtered data set.

An interesting point worth noting is that the first execution plan has a higher estimated cost. This is meaningless since the queries and data sets are different. However, there is some accuracy here as the first query runs in about 14.5ms while the second runs in 9.6ms.

This is a place where experimenting with a filtered index to deal with the NULL values may result in a performance enhancement. In this case, it actually doesn't: there are more reads and slightly slower performance when making TestIndex1 a filtered index. One other point worth bringing up about NULL values is that you can use Sparse columns. They help reduce the overhead and space associated with storing NULL values, which is handy for analytical data with a lot of NULLs. However, they come with a performance overhead, so they generally are not considered a way to enhance query behaviors.
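For reference, a sketch of the filtered-index variation I tested might look like the following; I'm assuming the filter excludes the NULL values, which is the common pattern for this kind of experiment:

CREATE INDEX TestIndex1
ON Person.Person (MiddleName)
WHERE MiddleName IS NOT NULL
WITH (DROP_EXISTING = ON);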

Before going on, I’m dropping the test indexes I created (Listing 14-23).
DROP INDEX TestIndex1 ON Person.Person;
DROP INDEX TestIndex2 ON Person.Person;
Listing 14-23

Dropping the test indexes

User-Defined Constraints

Constraints are another way to help ensure that your data integrity is high. However, the optimizer can also use constraints to help you with performance. For example, Listing 14-24 shows a constraint from the AdventureWorks database.
ALTER TABLE Sales.SalesOrderDetail WITH CHECK
ADD CONSTRAINT CK_SalesOrderDetail_UnitPrice
    CHECK ((UnitPrice >= (0.00)));
Listing 14-24

Ensuring that the UnitPrice is greater than or equal to zero

If I run the query from Listing 14-25, we know it will return no rows.
SELECT soh.OrderDate,
       soh.ShipDate,
       sod.OrderQty,
       sod.UnitPrice,
       p.Name AS ProductName
FROM Sales.SalesOrderHeader AS soh
    JOIN Sales.SalesOrderDetail AS sod
        ON sod.SalesOrderID = soh.SalesOrderID
    JOIN Production.Product AS p
        ON p.ProductID = sod.ProductID
WHERE p.Name = 'Water Bottle - 30 oz.'
      AND sod.UnitPrice < $0.0;
Listing 14-25

No rows returned from this query

However, that’s not the interesting part. The interesting part is how the optimizer chooses to satisfy this query here in Figure 14-21.


Figure 14-21

An execution plan with no data access at all

The Constant Scan operator is simply a placeholder that the optimizer will use in execution plans to build data sets against. However, in this case, there is only the Constant Scan operator. So what’s happening?

The optimizer recognizes that this query cannot possibly return any rows at all. However, it still has to return a result set, even if it’s just going to be empty. So it uses the Constant Scan to create the empty result set. You can see it in the Output List property in Figure 14-22.


Figure 14-22

The Output List to produce an empty result set

If the optimizer didn’t make this choice, since there is no index in support of the preceding query, you’d get a Clustered Index Scan, reading all the data in the table, but returning zero rows. The optimizer saved us from a nasty performance hit because it can read the constraints.

Note

The optimizer can only use a check constraint if it was created using the WITH CHECK option. Otherwise, the constraint is untrusted because the data hasn’t been validated.
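If a constraint was created or re-enabled with NOCHECK, a sketch of re-validating it so it becomes trusted again looks like this (the repeated CHECK is not a typo; the first belongs to WITH CHECK and the second to CHECK CONSTRAINT):

ALTER TABLE Sales.SalesOrderDetail WITH CHECK
CHECK CONSTRAINT CK_SalesOrderDetail_UnitPrice;

You can confirm the state by querying the is_not_trusted column in sys.check_constraints.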

Declarative Referential Integrity

Declarative referential integrity (DRI) is the most used mechanism within SQL Server to ensure data cleanliness between a parent and a child table. DRI ensures that rows in the child table exist only when the corresponding row exists in the parent table. The only exception is when the child table has NULL values in the column, or columns, that refer back to the parent. In SQL Server, DRI is defined through the use of a FOREIGN KEY constraint on the child table that matches a PRIMARY KEY or UNIQUE INDEX on the parent.

When DRI is established between two tables and the foreign key columns of the child table are set to NOT NULL, the optimizer is assured that for every row in the child table, a corresponding row exists in the parent table. The optimizer can use this knowledge to improve performance because accessing the parent table isn’t necessary to verify the existence of a row for the corresponding child row.

To see this in action, I’m going to run Listing 14-26 to remove the foreign key constraints between two tables: Person.Address and Person.StateProvince.
IF EXISTS
(
    SELECT *
    FROM sys.foreign_keys
    WHERE OBJECT_ID = OBJECT_ID(N'[Person].[FK_Address_StateProvince_StateProvinceID]')
          AND parent_object_id = OBJECT_ID(N'[Person].[Address]')
)
    ALTER TABLE Person.Address
    DROP CONSTRAINT FK_Address_StateProvince_StateProvinceID;
Listing 14-26

Removing foreign key constraints

With that out of the way, I have two queries in Listing 14-27.
SELECT A.AddressID,
       sp.StateProvinceID
FROM Person.Address AS A
    JOIN Person.StateProvince AS sp
        ON A.StateProvinceID = sp.StateProvinceID
WHERE A.AddressID = 27234;
--NOTE, Address.StateProvinceID
SELECT A.AddressID,
       A.StateProvinceID
FROM Person.Address AS A
    JOIN Person.StateProvince AS sp
        ON A.StateProvinceID = sp.StateProvinceID
WHERE A.AddressID = 27234;
Listing 14-27

Two nearly identical queries

The only difference between these two queries is that the second one, as noted, is using Address.StateProvinceID instead of StateProvince.StateProvinceID in the SELECT statement. With the foreign key constraint gone, these two queries generate the execution plans, shown in the Compare ShowPlan view, in Figure 14-23.


Figure 14-23

Two identical execution plans

When using Compare ShowPlan, common parts of the structure are highlighted. As you can see, these are identical. I’m now going to use Listing 14-28 to replace the foreign key from earlier.
ALTER TABLE Person.Address WITH CHECK
ADD CONSTRAINT FK_Address_StateProvince_StateProvinceID
    FOREIGN KEY (StateProvinceID)
    REFERENCES Person.StateProvince (StateProvinceID);
Listing 14-28

Creating a foreign key constraint on the Address table

Running the queries from Listing 14-27 again, I get the execution plans shown in Figure 14-24:


Figure 14-24

The execution plans are now different because of the foreign key

Clearly, the second query now has a new execution plan. When the foreign key wasn't present, the optimizer didn't know whether rows in the parent and the child table, StateProvince and Address, matched, so it had to validate that by using a join operation, Nested Loops in this case. However, with the foreign key put back in place, specifically using the WITH CHECK option to validate the data, the optimizer could simplify the execution plan and skip the JOIN operation entirely, arriving at superior performance.

Summary

As important as your structures are, poor decisions in your T-SQL code can hurt performance just as much as if you had never taken the time to create appropriate indexes and maintain your statistics. The optimizer is extremely efficient at working out how best to satisfy your queries, but your queries need to be designed to work with the optimizer. While query hints are available and can be useful, thoroughly testing them before using them is a great practice. Finally, make sure you're taking advantage of constraints and referential integrity. It's important not only to make sure that queries are using your structures but also that they avoid excessive resource use.

The next chapter will be focused on reducing resource usage within queries.
