Saying is one thing and doing is another.
This chapter covers two classes of tuning techniques for controlling execution plans: universal techniques that work independently of your choice of database vendor, and techniques that are database-specific. Database-specific techniques are covered well by your own vendor’s documentation, so you might know them well already. In general, you need both types of techniques to get precisely the execution plan you want. Each of the vendor-specific sections stands alone, even repeating material from the other sections when applicable. Therefore, you can skip the vendor sections that you don’t need.
Much ado has been made over controlling execution plans, sometimes with elaborate tools. This chapter focuses on the simplest ways to control plans, with a strong emphasis on getting the types of plans you will need to optimize real-world SQL. I have found that when you know which execution plan you want in advance, getting it is easy and requires only simple tools.
This section describes a number of database-independent techniques you can use to control execution plans. The techniques are good for the following purposes:
Enabling use of the index you want
Preventing use of the wrong indexes
Enabling the join order you want
Preventing join orders you do not want
Choosing the order to execute outer queries and subqueries
Providing the cost-based optimizer with good data
Fooling the cost-based optimizer with bad data
These vendor-independent techniques often offer an alternative method to achieve ends you could also achieve with vendor-specific methods. When you have a choice, the vendor-specific methods are usually cleaner. However, some problems are solvable only by these universal techniques, which offer solutions that can sometimes work on SQL that is intended to run on multiple vendor databases.
To enable efficient use of an index, you need a reasonably selective condition on the leading column (or only column) of that index. The condition must also be expressed in a way that enables the database to establish a reasonably narrow index range for the index values. The ideal form this takes is:
SomeAlias.Leading_Indexed_Column=<Expression>
In less ideal cases, the comparison is with some range of values, using BETWEEN, LIKE, <, >, <=, or >=. These range comparisons also potentially enable use of the index, but the index range is likely to be larger and the resulting query therefore slower. If the index range is too large, the optimizer might conclude that the index is not worth using and choose another path to the data. When you combine equalities and range conditions for multicolumn indexes, you should prefer indexes that lead with the columns that have equality conditions and finish with columns that have range conditions. Note that the left side of the comparison simply names the column, with no function around the column, and no expression (such as addition) using the column. Use of a function, a type conversion, or an arithmetic expression on the side with the indexed column will generally disable use of that index.
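For example, wrapping the indexed column in a function defeats the index, while an equivalent bare-column range condition preserves it. (A sketch: the Orders table, the date values, and the Oracle-style TRUNC( ) and TO_DATE( ) functions are illustrative assumptions.)

```sql
-- Likely disables an index on Order_Date: the indexed column is
-- wrapped in a function.
SELECT ...
FROM Orders O
WHERE TRUNC(O.Order_Date) = TO_DATE('2003-06-15', 'YYYY-MM-DD')

-- Preserves index access: the indexed column stands alone on the
-- left side, compared with a narrow range of expressions.
SELECT ...
FROM Orders O
WHERE O.Order_Date >= TO_DATE('2003-06-15', 'YYYY-MM-DD')
  AND O.Order_Date <  TO_DATE('2003-06-16', 'YYYY-MM-DD')
```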
Type conversions are a particularly subtle way that SQL sometimes disables use of an index. DB2 returns an error if you compare two expressions with incompatible types. SQL Server prefers to perform the implicit conversion on the side of the comparison that does not disable the index. Oracle implicitly converts character-type expressions to the type of the other side, even when this disables index use. For example, consider this expression:
P.Phone_Number=5551212
If Phone_Number were a character-type column, this would likely evaluate internally on Oracle and SQL Server as:

On Oracle:     TO_NUMBER(P.Phone_Number)=5551212
On SQL Server: P.Phone_Number=CAST(5551212 AS VARCHAR)
SQL Server preserves indexed access to the column. On Oracle, the implicit use of TO_NUMBER( ) disables use of the index just as surely as if you made the expression explicit. (The only real difference is that the problem is harder to find in the implicit form.) The same problem can plague index use for joins, as well as for single-table conditions. For example, consider the join:

P.Phone_Number=C.Contact_Number

If Contact_Number were a number type and Phone_Number were a character type, the implicit conversion on Oracle would prevent an index-driven nested-loops join from C to P. A join in the other direction would be unhindered.
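When you cannot fix the column types, you can at least choose which side's index the conversion disables by making the conversion explicit yourself. In this sketch (the table names are invented for illustration), the explicit TO_CHAR( ) on the number-type side leaves the character-type column bare, so an index on P.Phone_Number can still drive a nested-loops join from C to P:

```sql
-- Explicit conversion on the number-type side; the indexed
-- character-type column P.Phone_Number remains bare.
SELECT ...
FROM Contacts C, Phones P
WHERE P.Phone_Number = TO_CHAR(C.Contact_Number)
```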
The expression opposite the indexed column reference can be arbitrarily complex. However, it must not reference columns in the same alias with the indexed column. For example, consider the condition:
P.Phone_Number=P.Area_Code||'5551212'
The database cannot drive into the index on P.Phone_Number with this condition, because the database must reach alias P before it can evaluate the expression on the right side. This chicken-and-egg problem prevents identifying (with the index) the subset of the table that meets this condition until after the database examines the whole table.
The final way that SQL often disables index use is with conditions combined with OR. For example, consider the query:
SELECT ... FROM Order_Details D, ...
WHERE ...
  AND (D.Order_ID=:1 OR :1 IS NULL)
  AND ...
In this example, the database can reach Order_Details through an index on Order_ID if the bind variable :1 happens to be nonnull. But if :1 is bound to a null value, there is no restriction at all on Order_ID and thus no use for that index. Since the database cannot tell what :1 will be bound to when it parses the SQL and prepares the plan, it will find no good opportunity to use the index. In this case, the solution is to create a two-part plan, with each part optimized for one of the cases:
SELECT ... FROM Order_Details D, ...
WHERE ...
  AND D.Order_ID=:1
  AND :1 IS NOT NULL
  AND ...
UNION ALL
SELECT ... FROM Order_Details D, ...
WHERE ...
  AND :1 IS NULL
  AND ...
When you view the execution plan for this query, it shows both indexed access through the index on Order_Details(Order_ID) and full-table-scan access to Order_Details. This might appear to be the worst of both worlds, but you are saved by the conditions:

AND :1 IS NOT NULL
...
AND :1 IS NULL

These conditions make no reference at all to any data in the database, so the database can and does evaluate them before it even begins reading data for that half of the combined statement. Therefore, it never actually executes the full table scan when :1 is not null, and it never actually executes an indexed read (or any other part of the execution plan for the first half of the query) when :1 is null. This amounts to a method to branch your execution plan depending on conditions on the bind variables, the variables that determine which data is available to drive the query. The only catch is that you must ensure that the conditions on the bind variables are mutually exclusive, so that exactly one of the branches actually returns data. For example, if you have another bind variable to provide Customer_Name, you might put together a query like this:
SELECT ... FROM Order_Details D, Customers C, ...
WHERE ...
  AND D.Order_ID=:1
  AND :1 IS NOT NULL
  AND (C.Customer_Name=:2 OR :2 IS NULL)
  AND ...
UNION ALL
SELECT ... FROM Order_Details D, Customers C, ...
WHERE ...
  AND :1 IS NULL
  AND :2 IS NOT NULL
  AND C.Customer_Name=:2
  AND ...
UNION ALL
SELECT ... FROM Order_Details D, Customers C, ...
WHERE ...
  AND :1 IS NULL
  AND :2 IS NULL
  AND ...
This could support a three-part plan, in which the database would:

Drive into Order_Details on the index on Order_ID (your first choice), when possible.

Otherwise, drive into Customers on the index on Customer_Name (your second choice) when it has no Order_ID specified but has a customer name.

Otherwise, just get all the rows, probably beginning with a full table scan, when it has no selective conditions at all.

In any case, the conditions on the bind variables in the three parts are contrived to be mutually exclusive:
AND :1 IS NOT NULL
...
AND :1 IS NULL AND :2 IS NOT NULL
...
AND :1 IS NULL AND :2 IS NULL
Join expressions are usually simple, usually between consistent types, and usually between numerical IDs. Conditions on the driving table are usually simple and compatible with index use. A more frequent problem than enabling use of the right index is preventing use of the wrong indexes. In many queries, there are multiple single-table conditions that are capable of reaching multiple indexes, but you want to use only a specific one of those indexes. Join conditions are usually expressed to allow index-driven joins in either direction, although only one of the possible join directions turns out to be optimal. Occasionally, you’ll prefer to disable use of an index on a join altogether, to force a hash or sort-merge join.
To disable use of an index, create the simplest possible expression around the indexed column reference. For example, you should prevent use of an index on Status_Code for the unselective condition on closed orders, as the number of closed orders will eclipse the number of open orders as you do more and more business:

O.Status_Code='CL'

Since Status_Code is a character-type column, a simple expression to disable index use without changing the results would simply concatenate an empty string to the end of Status_Code:
On Oracle and DB2: O.Status_Code||''='CL'
On SQL Server:     O.Status_Code+''='CL'
For number-type columns, you can add 0:

O.Region_ID+0=137
All databases have some sort of function that evaluates to the first argument when the argument is not null and otherwise returns the second argument. On Oracle, the function is NVL( ). On SQL Server and DB2, it is COALESCE( ). If both arguments are the same column, the function always returns the same result as the bare column, regardless of the column type. Therefore, this makes a handy recipe to deactivate index use regardless of column type:

On Oracle:             NVL(O.Order_Date,O.Order_Date)=<Value>
On DB2 and SQL Server: COALESCE(O.Order_Date,O.Order_Date)=<Value>
In a join condition, a join that disables an indexed path to O.Region_ID (but not to R.Region_ID) could look like this:
O.Region_ID+0=R.Region_ID
Using the type-independent approach, this same join would look like this:
NVL(O.Region_ID,O.Region_ID)=R.Region_ID
Apart from unintentionally disabled indexes, there are two issues that sometimes disable desired join orders:
Outer joins
Missing redundant join conditions
Consider an outer join query, in Oracle-style notation:
SELECT ... FROM Employees E, Locations L WHERE E.Location_ID=L.Location_ID(+)
or in the newer, universal notation:
SELECT ... FROM Employees E LEFT OUTER JOIN Locations L ON E.Location_ID=L.Location_ID
This query requests employee records with their matching locations, when employees have locations; otherwise, null location data is used for employees that have no matching locations. Based on the request, it is clear that the query cannot effectively drive from Locations to Employees, since even employees without locations are needed. Consider a case in which this query is just a template to which an application adds conditions that depend on search criteria provided by an end user. If the end user wants to see employees for a particular location, the application might create this query:
SELECT ... FROM Employees E
  LEFT OUTER JOIN Locations L ON E.Location_ID=L.Location_ID
WHERE L.Description='Headquarters'
In the outer case of the join from Employees to Locations, L.Description will be assigned a generated value of null, and the condition on L.Description will be false. Only the inner case of the join will return rows that might meet the restriction on L.Description, so now it makes perfect sense to drive the query in the other join order, from Locations to Employees. However, the existence of the outer join often prevents automated optimizers from allowing this reversed order on outer joins, so you need to make the join explicitly an inner join to get the reversed join direction:
SELECT ... FROM Employees E
  INNER JOIN Locations L ON E.Location_ID=L.Location_ID
WHERE L.Description='Headquarters'
Normally, between any number of tables, the join count is the number of tables minus one. For example, between three tables, you expect to find two joins. Occasionally, a query permits an extra, redundant join. For example, if you have an Addresses table that contains all addresses significant to the company, it might have a one-to-zero or one-to-one relationship with the earlier Locations table, which contains only locations owned by the company and which references Addresses through a matching primary key. In this case, you might find a query like the following:
SELECT ... FROM Employees E, Locations L, Addresses A
WHERE E.Location_ID=L.Location_ID
  AND E.Location_ID=A.Address_ID
  AND A.ZIP_Code=95628
By transitivity (if a=b and b=c, then a=c), you can deduce that the condition L.Location_ID=A.Address_ID must be true for all rows this query would return. However, that condition is not explicit in the query, and not all databases will deduce it and fill it in if it is left out. The best plan, in this case, will likely begin with all addresses within that ZIP Code and immediately join to Locations to discard all addresses except the one or two that correspond to company locations, before joining to Employees. Since that join order requires the missing join condition to support an indexed path from Addresses to Locations, you should make the missing join condition explicit:
SELECT ... FROM Employees E, Locations L, Addresses A
WHERE E.Location_ID=L.Location_ID
  AND E.Location_ID=A.Address_ID
  AND L.Location_ID=A.Address_ID
  AND A.ZIP_Code=95628
Since you do not want to follow the join from Addresses to Employees directly, you could also remove, if necessary, the redundant join condition E.Location_ID=A.Address_ID, to discourage that unwanted join operation.
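Putting both adjustments together, a version of the query that keeps only the joins you want followed would look like this (a sketch, using the same tables and ZIP Code as the example above):

```sql
-- The Locations-to-Addresses join is now explicit, and the direct
-- Employees-to-Addresses join has been removed.
SELECT ...
FROM Employees E, Locations L, Addresses A
WHERE E.Location_ID=L.Location_ID
  AND L.Location_ID=A.Address_ID
  AND A.ZIP_Code=95628
```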
Forcing joins in the direction you want, using the earlier techniques for preventing use of the wrong indexes, will prevent many undesired join orders. But what do you do when you want the database to follow a particular join direction eventually, just not too early in the execution plan? You cannot afford to disable an index, because you must use that index eventually, just not too early. Consider the following two joins, in which you want to start the query with reads of T1 and then join to T2 before joining to T3:
... AND T1.Key2_ID=T2.Key2_ID AND T1.Key3_ID=T3.Key3_ID ...
Here, you want to follow nested loops into both T2 and T3, following indexes on the keys mentioned and reaching T2 before reaching T3. To postpone the join you want to happen later, make it depend (or at least appear to depend) on data from the join that must happen earlier. Here is a solution:
... AND T1.Key2_ID=T2.Key2_ID AND T1.Key3_ID+0*T2.Key2_ID=T3.Key3_ID ...
You and I know that the second version is logically equivalent to the first. However, the database just finds an expression on the left side of the second join that depends on both T1 and T2 (not recognizing that no value from T2 can change the result), so it won't try to perform the join to T3 until after T2.
If necessary, you can string together joins like this to completely constrain a join order. For each join after the first, add a logically irrelevant component, referencing one of the columns added in the preceding join, to the join expression. For example, if you want to reach tables T1 through T5 in numerical order, you can use the following. Notice that the join condition for the T3 table uses the expression 0*T2.Key2_ID to force the join to T2 to occur first. Likewise, the join condition for the T4 table uses 0*T3.Key3_ID to force T3 to be joined first.
... AND T1.Key2_ID=T2.Key2_ID
AND T1.Key3_ID+0*T2.Key2_ID=T3.Key3_ID
AND T1.Key4_ID+0*T3.Key3_ID=T4.Key4_ID
AND T1.Key5_ID+0*T4.Key4_ID=T5.Key5_ID ...
I’ll apply this method to a concrete example. Consider the following SQL, adapted from Chapter 3:
SELECT E.First_Name, E.Last_Name, E.Salary, LE.Description,
       M.First_Name, M.Last_Name, LM.Description
FROM Locations LE, Locations LM, Employees M, Employees E
WHERE E.Last_Name = 'Johnson'
  AND E.Manager_ID=M.Employee_ID
  AND E.Location_ID=LE.Location_ID
  AND M.Location_ID=LM.Location_ID
  AND LE.Description='Dallas'
Assume that you have an execution plan that drives from the index on the employee's last name, but you find that the join to the employee's location (alias LE) to discard employees at locations other than Dallas is unfortunately happening last, after the other joins (to M and LM). You should join to LE immediately from E, to minimize the number of rows you need to join to the other two tables. Starting from E, the join to LM is not immediately possible, so if you prevent the join to M before LE, you should get the join order you want. Here's how:
SELECT E.First_Name, E.Last_Name, E.Salary, LE.Description,
       M.First_Name, M.Last_Name, LM.Description
FROM Locations LE, Locations LM, Employees M, Employees E
WHERE E.Last_Name = 'Johnson'
  AND E.Manager_ID+0*LE.Location_ID=M.Employee_ID
  AND E.Location_ID=LE.Location_ID
  AND M.Location_ID=LM.Location_ID
  AND LE.Description='Dallas'
The key here is that I've made the join to M dependent on the value from LE. The expression 0*LE.Location_ID forces the optimizer to join to LE before M. Because of the multiply-by-zero, the added expression has no effect on the results returned by the query.
Most queries with subqueries can logically drive from either the outer query or the subquery. Depending on the selectivity of the subquery condition, either choice can be best. The choice generally arises for queries with EXISTS or IN conditions. You can always convert between an EXISTS condition on a correlated subquery and the equivalent IN condition on a noncorrelated subquery. For example, you can convert this:
SELECT ... FROM Departments D
WHERE EXISTS (SELECT NULL FROM Employees E
              WHERE E.Department_ID=D.Department_ID)
to this:
SELECT ... FROM Departments D WHERE D.Department_ID IN (SELECT E.Department_ID FROM Employees E)
The first form implies that the database drives from the outer query to the subquery. For each row returned by the outer query, the database executes the join in the subquery. The second form implies that you begin with the list of distinct departments that have employees, as found in the noncorrelated subquery, and drive from that list into the matching list of such departments in the outer query. Sometimes, the database itself follows this implied join order, although some databases can make the conversion internally if their optimizer finds that the alternate order is better. To make your own SQL more readable and to make it work well regardless of whether your database can convert the forms internally, use the form that implies the order you want. To force that order even when the database could make the conversion, use the same join-direction-forcing technique used in Section 4.1.4.
Thus, an EXISTS condition that forces the outer query to execute first would look like this:
SELECT ... FROM Departments D
WHERE EXISTS (SELECT NULL FROM Employees E
              WHERE E.Department_ID=D.Department_ID+0)
For the contrary order, an IN condition that forces the implied driving order from the subquery to the outer query would look like this:
SELECT ... FROM Departments D WHERE D.Department_ID IN (SELECT E.Department_ID+0 FROM Employees E)
This latter order would be a bad idea, unless you found a strange case in which you had more departments than employees!
You can have several subqueries in which the database either must drive from the outer query to the subquery (such as NOT EXISTS subqueries) or should drive in that order. Such a case implies a choice of the order of execution of the subqueries. You also have choices about whether to execute subqueries after completing the outer query, at the first opportunity (as soon as the correlation join is possible), or at some point between these extremes.

The first tactic for controlling the order of subquery execution is simply to list the subqueries in order in the WHERE clause (i.e., the subquery to be executed first should be listed first). This is one of the few times when WHERE-clause order seems to matter.
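For example, if you want a NOT EXISTS subquery to execute before an EXISTS subquery, list it first. (A sketch: the Shipments table and the Backordered_Flag column are invented for illustration.)

```sql
SELECT ...
FROM Orders O
WHERE O.Status_Code='OP'
  -- Listed first, so it should execute first:
  AND NOT EXISTS (SELECT NULL FROM Order_Details OD
                  WHERE OD.Order_ID=O.Order_ID
                    AND OD.Backordered_Flag='Y')
  -- Listed second, so it should execute second:
  AND EXISTS (SELECT NULL FROM Shipments S
              WHERE S.Order_ID=O.Order_ID)
```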
Rarely, the database will execute a subquery sooner than you would like. The same tactic for postponing joins (described in Section 4.1.4) works for correlation joins, the joins in subqueries that correlate the subqueries to the outer queries. For example, consider this query:
SELECT ... FROM Orders O, Customers C, Regions R
WHERE O.Status_Code='OP'
  AND O.Customer_ID=C.Customer_ID
  AND C.Customer_Type_Code='GOV'
  AND C.Region_ID=R.Region_ID
  AND EXISTS (SELECT NULL FROM Order_Details OD
              WHERE O.Order_ID=OD.Order_ID
                AND OD.Shipped_Flag='Y')
For this query, you might find that the subquery runs as soon as you reach the driving Orders table, but you might wish to perform the join to Customers first, to discard nongovernmental orders, before you take the expense of the subquery execution. In this case, this would be the transformation to postpone the correlation join:
SELECT ... FROM Orders O, Customers C, Regions R
WHERE O.Status_Code='OP'
  AND O.Customer_ID=C.Customer_ID
  AND C.Customer_Type_Code='GOV'
  AND C.Region_ID=R.Region_ID
  AND EXISTS (SELECT NULL FROM Order_Details OD
              WHERE O.Order_ID+0*C.Customer_ID=OD.Order_ID
                AND OD.Shipped_Flag='Y')
Notice the addition of +0*C.Customer_ID to the subquery's WHERE clause. This ensures that the join to Customers occurs first, before the subquery executes.
On any cost-based optimizer (that is, for any query except one running on the Oracle rule-based optimizer, since only Oracle has a rule-based optimizer), the second most common source of poor execution plans (after missing indexes) is missing statistics on the tables, columns, and indexes involved in the query. In all, cost-based optimizers do a fairly good job of finding the best plan without help when they have good information to begin with. However, when they are missing information—for example, because a table or index has been rebuilt without regenerating statistics for that object—they tend to make terrible assumptions.
If you are running on any database except Oracle, or if you are on Oracle’s cost-based optimizer (as is most common and as Oracle recommends) and not forcing the rule-based optimizer, the first thing you should try if you are not getting the execution plan you want is to regenerate statistics on every table and index relevant to the query. Standard statistics will usually suffice to get reasonable execution plans.
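The exact commands vary by vendor; these are typical forms (the schema and table names are placeholders):

```sql
-- Oracle (from SQL*Plus):
EXEC DBMS_STATS.GATHER_TABLE_STATS('APPOWNER', 'ORDERS', CASCADE => TRUE)

-- SQL Server:
UPDATE STATISTICS Orders

-- DB2 (from the command line processor):
RUNSTATS ON TABLE AppOwner.Orders AND INDEXES ALL
```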
Cost-based optimizers usually assume that data is uniformly distributed. For example, if the optimizer statistics show a table of 1,000,000 rows with 50,000 distinct values for some indexed foreign key, the database will optimize on the assumption that every value of that key will match exactly 20 rows. For most indexed columns, like foreign keys, this assumption of a uniform data distribution works well. However, some columns have highly skewed distributions, such as status, code, or type columns, or foreign keys to status or type tables. For example, consider this query:
SELECT ... FROM Orders WHERE Status_Code = 'OP'
There might be only three or four values of Status_Code across a 1,000,000-row Orders table, but if 'OP' means this is an open order, not yet fulfilled or cancelled, this condition is far more selective than the optimizer would expect based solely on the number of distinct values. If the column had an index, the optimizer might never use that index if it knew only the small number of distinct indexed values. However, on some databases, you can generate added statistics that let the database know not only the number of distinct values but also the distribution of those values, a necessary step when you have such highly skewed distributions.
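For example, on Oracle you can ask for a histogram on the skewed column, and on SQL Server you can create column statistics, which include a distribution histogram. (These are typical forms; the owner, table, and statistics names are placeholders.)

```sql
-- Oracle: gather a histogram (up to 254 buckets) on Status_Code.
EXEC DBMS_STATS.GATHER_TABLE_STATS('APPOWNER', 'ORDERS', METHOD_OPT => 'FOR COLUMNS Status_Code SIZE 254')

-- SQL Server: create column statistics, including a histogram.
CREATE STATISTICS Stats_Orders_Status ON Orders(Status_Code)
```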
This last technique is dangerous, and I recommend it only as a last resort. Sometimes, you want to simulate a large database on a small, development database. If you can extrapolate (or, better, measure from an actual database) statistics that apply to a large database, you can manually modify the data-dictionary tables that store those statistics for the optimizer, to fool the optimizer into thinking it is working with a large database. The small database will have statistics that show large tables with many distinct values on most indexes. This is a handy way to see execution plans that will apply to production volumes when you have only a test database with toy data volumes. For such toy-sized databases, there is no great risk to this approach. On production databases, the optimizer will occasionally make better choices if it has the wrong data, usually if it has data that exaggerates the selectivity of desired indexes or that exaggerates the size of a table when a full table scan is undesirable.
Imagine reversing the logic the optimizer follows: ask “What would I need to believe about the tables and indexes of this query to find an alternative plan (the alternative that you, the human optimizer, want) much more attractive?” It is not hard to fool the optimizer into doing what you want rather than what it would choose on its own, if you lie to it about the statistics. However, on production systems, this is dangerous in several ways:
As soon as anyone regenerates statistics for the tables or indexes, the optimizer will revert to the original error, unless the manual statistics-tweak is reapplied. You will have to rigorously control statistics generation to prevent this.
As soon as the database optimizer improves—with the next release, perhaps—it is denied the chance to exploit those improvements with correct data.
Most importantly, every other query against the tables and indexes with false statistics is at risk and will potentially be harmed, just to help the one query you wanted to tune when you fudged the statistics.
I have never needed to play this card to get an adequately optimized plan on Oracle, SQL Server, or DB2, and I recommend you avoid it if possible.
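If you nevertheless resort to this on Oracle, the DBMS_STATS package provides setter procedures for the purpose. A minimal sketch (the owner, object names, and numbers are invented):

```sql
-- Pretend Orders holds 10,000,000 rows in 200,000 blocks.
EXEC DBMS_STATS.SET_TABLE_STATS('APPOWNER', 'ORDERS', NUMROWS => 10000000, NUMBLKS => 200000)

-- Pretend this foreign-key index covers 10,000,000 rows with
-- 50,000 distinct keys.
EXEC DBMS_STATS.SET_INDEX_STATS('APPOWNER', 'ORDER_CUSTOMER_IX', NUMROWS => 10000000, NUMDIST => 50000)
```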
Oracle currently offers two completely different optimizers, the rule-based optimizer (RBO) and the cost-based optimizer (CBO), and the methods for tuning on each differ.
The RBO is Oracle’s original automated optimizer, back from the days of Oracle Version 6 and earlier. By rule-based, Oracle means that the optimizer uses only fixed properties of the tables, indexes, and SQL to guess an optimum execution plan from a set of simple rules of thumb (or heuristics) built into the automated optimizer. The RBO uses no data about the sizes of the tables or indexes, or about the distribution of data within those objects. It does use data on the fixed properties of the indexes: whether they are unique, which columns they cover, in which order, and how well those match up with the most selective-looking filter conditions and joins in the SQL. As tables grow and data distributions change, the RBO should go right on delivering the same plan indefinitely, as long as you don’t alter the indexes (for example, from unique to nonunique) or change the table structure (for example, from an ordinary table to a partitioned table). However, at some future time, perhaps even in Oracle Database 10g, Oracle will drop all support for the rule-based optimizer, and cost-based optimization will become your only choice.
Since Oracle7, the RBO has been even more stable than before, because Oracle chose to freeze the RBO code beginning with Oracle7, except for rare, slight changes necessary to deliver functionally correct (as opposed to necessarily optimum) results. Therefore, an execution plan that is correct on the RBO today will likely stay unchanged until Oracle drops the RBO altogether. This is appealing from the perspective of stability, although the dark side of this stability is that the execution plans never get any better either.
Execution plans on the RBO never change to adapt to changing data distributions, and this is often cited as an argument to switch to the CBO. However, in my own experience, data-distribution change is the least of the reasons for cost-based optimization. In over 10 years, I have yet to find a single case in which it was important to use different execution plans for different real-world data distributions with the same SQL.
I have seen many cases in which one plan is not perfectly optimal for all real-world data distributions, but in all these cases, one robust plan exists that is at least nearly optimal across the board.
Another argument cited in favor of the CBO is that it can deliver parallel execution plans, plans that can bring multiple processors to bear on the SQL statement at once. I have not found this to be a compelling argument, since I have yet to find a real-world case in which the optimum SQL, with the optimum database design, required parallel execution for adequate performance. I expect some such cases exist in data-warehousing environments, which are not where most of my experience lies, I admit, but almost all cases in which parallel execution plans appear to shine are really covering up some mistake in database design, indexing, or application design, compensating for design deficiencies with horsepower. That, by itself, would not be such a bad thing; extra horsepower might be cheaper than fixing the application. However, parallel plans are usually in service of large batch processes, competing heavily for resources with online processes that are more critical to end users. Therefore, parallel plans often rob needed resources from other processes that are more critical.
These are the strongest arguments against using the RBO:
It will become unavailable in some future release, perhaps during Oracle Database 10g, and you will not be able to use an older release forever.
The CBO keeps getting better, while the RBO is stuck with all the old problems it has ever had.
The CBO has a huge inherent advantage in the information available to it to calculate the best plan.
The RBO cannot take advantage of features created since the CBO appeared in Oracle7, and in most cases the RBO will simply push queries that involve newer object types, such as bit-mapped indexes, off to the CBO. (See the following section, Section 4.2.1, for details about which features the RBO cannot handle.)
That said, the RBO does a surprisingly good job; its heuristics are well designed to get along with the tiny amount of information that the RBO uses to guess the best plan. In Chapter 6, I will describe properties of what I call a robust execution plan, one that behaves well across a wide range of data distributions. The RBO almost always delivers a robust plan when the necessary indexes are available and when the developer has not prevented use of an index with some index-disabling expression, as discussed earlier in this chapter. Given the right indexes, you can almost always get the best robust plan on either optimizer, with manual tuning. With automated tuning, the biggest advantage of the CBO is that it is more resourceful when dealing with imperfect indexing and nonoptimally written SQL; more often, it delivers at least an adequate plan in these cases, without manual tuning. When more than one robust plan is possible, the CBO is also more likely to find the best robust plan, while the RBO will pick one without knowing relative costs, unless you manually tune the SQL.
It is unrealistic to optimize Oracle queries simultaneously for both the rule-based and the cost-based optimizers. Therefore, you should understand the factors that lead Oracle to choose which optimizer it applies, so that you can control those factors and get the optimizer you choose.
The RBO cannot handle certain object types and object properties that did not yet exist when Oracle froze the RBO code. However, rather than simply have its code error out, Oracle modified the RBO code just enough to let it recognize the cases it cannot handle and to have it pass those cases on to the CBO. Thus, even if you think you have set up your system for rule-based optimization, the following circumstances will absolutely force cost-based optimization:
Bit-mapped indexes on any column of a table referenced in the SQL, even if those indexes are on columns the SQL does not touch.
Function-based indexes in a table referenced in the SQL, if such an index is on an expression the SQL references.
Partitioned tables touched by the SQL.
Tables or indexes configured with a parallel degree. The optimizer interprets these as a command to find parallel execution plans, which the RBO does not know how to do. As with bit-mapped indexes, indexes configured with a parallel degree disable use of the RBO on a table referenced by your SQL, even if the parallel-degree index is on columns the SQL does not touch.
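Before assuming rule-based optimization is in play, it can help to check the data dictionary for these CBO-forcing properties. The following sketch checks a hypothetical Orders table; the dictionary views and columns are standard Oracle, but the table name is only an example:

```sql
-- Bit-mapped indexes or indexes with a parallel degree force the CBO.
SELECT INDEX_NAME, INDEX_TYPE, DEGREE
FROM ALL_INDEXES
WHERE TABLE_NAME = 'ORDERS';

-- Partitioned tables and tables with a parallel degree also force the CBO.
SELECT PARTITIONED, DEGREE
FROM ALL_TABLES
WHERE TABLE_NAME = 'ORDERS';
```

Any row showing INDEX_TYPE of 'BITMAP', PARTITIONED of 'YES', or a DEGREE greater than 1 indicates that queries touching the table will be pushed to the CBO.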
If the tables and indexes involved in your SQL do not prevent using the RBO, Oracle chooses between the RBO and the CBO as follows:
If any SELECT keyword in the SQL (even in a subquery or a view definition) is followed by any valid hint other than /*+ RULE */ or /*+ CHOOSE */, Oracle will use the CBO.
Otherwise, if any SELECT keyword in the SQL (even in a subquery or a view definition) is followed by /*+ CHOOSE */ and there are any statistics on any table or index referenced by the SQL, Oracle will choose the CBO.
Otherwise, if any SELECT keyword in the SQL (even in a subquery or a view definition) is followed by /*+ RULE */, Oracle will choose the RBO.
Otherwise, if the optimizer_mode parameter is set at the session level (by ALTER SESSION SET OPTIMIZER_MODE=<Your_Choice>;), Oracle will choose according to that session-level parameter.
Otherwise, if the optimizer_mode parameter is set for the database instance, in the init.ora file, Oracle will choose according to that instance-level parameter.
Otherwise, Oracle will choose according to the ultimate default optimizer_mode parameter, CHOOSE.
In the last three steps of this decision cascade, Oracle chooses according to an optimizer_mode parameter, which you or your DBA sets. These are the four possible parameter values and how they affect the choice:
RULE
Oracle uses rule-based optimization.
ALL_ROWS
Oracle uses cost-based optimization with the goal of minimizing the cost of the whole query. This default version of cost-based optimization sometimes results in nonrobust plans (plans that use join methods other than nested-loops), with risks described in Chapter 6. However, the optimizer chooses these plans only when it calculates that they are faster than the best robust plans.
FIRST_ROWS
Oracle uses cost-based optimization with the goal of minimizing the cost of reaching the first rows from the query. In practice, this tends to favor robust, nested-loops plans similar to those plans the rule-based optimizer favors, but built with much more knowledge of the data distributions and probable execution costs. The FIRST_ROWS optimization level creates the same effect as the OPTIMIZE FOR 1 ROW hint on DB2 and the OPTION(FAST 1) hint on SQL Server.
CHOOSE
Oracle uses cost-based optimization, as for the ALL_ROWS goal, unless no table or index involved in the query has optimization statistics available, in which case Oracle uses rule-based optimization.
Here’s a quick way to check the instance-level parameter for optimizer_mode:
SELECT VALUE FROM V$PARAMETER WHERE NAME = 'optimizer_mode';
When you have an execution plan in PLAN_TABLE, a quick way to see whether it is cost-based is to run the following query:
SELECT POSITION FROM PLAN_TABLE WHERE ID=0;
This returns the cost of the entire execution plan, in arbitrary units, when the plan is cost-based. When cost is not null, you have a cost-based plan.
Most of the methods for controlling rule-based execution plans are the universal techniques of controlling plans, covered in the first section of this chapter. The primary Oracle-specific method of tuning under a rule-based default optimizer_mode is simply to switch modes to cost-based optimization, usually with a hint such as /*+ FIRST_ROWS */. In other words, you can always control a plan via hints, and hints (with the exception of the /*+ RULE */ hint) in a statement cause Oracle to use the CBO for that statement.
However, if you prefer not to use cost-based optimization, thus precluding the use of hints, one RBO-specific technique remains: in the FROM clause, list tables and their aliases in exactly the reverse order from the join order you want. This usually gives enough control of the join order, without using the techniques described earlier in Section 4.1.4. In particular, eligible, unique joins toward primary keys happen in the reverse of the order they are listed in the FROM clause, without changing the join conditions. For example, consider this query:
SELECT /*+ RULE */ E.First_Name, E.Last_Name, E.Salary, LE.Description,
       M.First_Name, M.Last_Name, LM.Description
FROM Locations LM, Employees M, Locations LE, Employees E
WHERE E.Last_Name = 'Johnson'
  AND E.Manager_ID=M.Employee_ID
  AND E.Location_ID=LE.Location_ID
  AND M.Location_ID=LM.Location_ID
  AND LE.Description='Dallas';
Unlike the earlier version of this query in Chapter 3, which had the wrong order in the FROM clause, you now get the correct join order. In this correct execution plan, E joins to LE before joining to M or LM, as shown by the plan output:
SQL>@ex
PLAN
----------------------------------------------------------------------
SELECT STATEMENT
NESTED LOOPS
NESTED LOOPS
NESTED LOOPS
TABLE ACCESS BY INDEX ROWID 4*EMPLOYEES
INDEX RANGE SCAN EMPLOYEE_LAST_NAME
TABLE ACCESS BY INDEX ROWID 3*LOCATIONS
INDEX UNIQUE SCAN LOCATION_PKEY
TABLE ACCESS BY INDEX ROWID 2*EMPLOYEES
INDEX UNIQUE SCAN EMPLOYEE_PKEY
TABLE ACCESS BY INDEX ROWID 1*LOCATIONS
INDEX UNIQUE SCAN LOCATION_PKEY
When the RBO otherwise has no preference based on the conditions and indexes, it joins tables by working from right to left in the FROM clause. However, this method offers only limited control by itself, because the RBO follows its other rules of thumb before considering the join order in the FROM clause. For example, the RBO always chooses to perform unique indexed reads and joins before doing indexed range scans, when it can.
There are two main parts involved in tuning on the Oracle CBO:
Providing the optimizer with good statistics about the tables and indexes, so it can calculate the costs of alternatives accurately. This is effectively a prerequisite to any manual tuning on a CBO.
Adding hints to queries that the CBO fails to optimize well even with complete statistics about the tables and indexes that the queries reference.
Proving that a little knowledge is a dangerous thing, cost-based optimizers often do a terrible job if they do not have statistics on all the tables and indexes involved in the query. It is therefore imperative to maintain statistics on tables and indexes reliably, including regenerating statistics whenever table volumes change much or tables or indexes are rebuilt. It is safest to regenerate statistics periodically, during times that load is relatively quiet, such as nightly or at least weekly. The best way to generate and update statistics is with Oracle’s DBMS_STATS package, documented at length in Oracle8i Supplied PL/SQL Packages Reference and Oracle9i Supplied PL/SQL Packages and Types Reference. Here is a simple example of using DBMS_STATS to generate statistics for a whole schema, Appl_Prod, sampling 10% of the data in the larger tables and cascading statistics collection to the indexes:
BEGIN
  DBMS_STATS.GATHER_SCHEMA_STATS ('Appl_Prod', 10, CASCADE => TRUE);
END;
/
Often, queries include conditions on highly skewed distributions, such as conditions on special types, codes, or flags, when these columns have only a few values. Normally, the CBO evaluates selectivity of a condition based on the assumption that all nonnull values of a column are equally selective. This assumption generally works well for foreign and primary keys that join business entities, but it is inaccurate when the columns have permanent special meanings and certain meanings apply much more rarely than others.
For example, in an Orders table, you might have a Status_Code column with three possible values: 'CL' for closed (i.e., fulfilled) orders, 'CA' for cancelled orders, and 'OP' for open orders. Most orders, by far, would be fulfilled, once the application has been running for a few months. A steady, significant fraction of orders would end up cancelled, so that value would also eventually point to a large list of orders. However, as long as the business keeps up with incoming orders, the number of open orders would remain moderate and steady, even as data accumulates for years. Quite early, a condition specifying Status_Code='OP' would be selective enough to justify indexed access, if you had an index with that leading column, and it is important to enable the optimizer to realize this fact, preferably without a lot of manual tuning. Enabling the CBO to recognize when a column is selective requires two things:
The SQL must mention the specific selective value, rather than use a bind variable, prior to Oracle 9i Database. Use of bind variables is commonly attractive, since it makes SQL more general and easier to share between processes. However, this need to hardcode especially selective values is the exception to that rule. If you use Status_Code=:1 instead of Status_Code='OP', prior to Oracle 9i, you will deny the CBO potential knowledge of the selectivity of the condition at parse time, when it does not yet know whether the bind variable :1 will be assigned a common or a rare Status_Code. Fortunately, in these cases, the usual reason to prefer using bind variables does not generally apply; since these special codes have special business meanings, it is unlikely that the SQL ever requires substituting a different value than the single selective value.
In Oracle 9i, Oracle introduced bind variable peeking, wherein Oracle checks the first value assigned to each bind variable (when a query sees its first hard parse) when choosing an execution plan. This eliminates the need to specify fixed values in place of bind variables, as long as all the values to be bound are of similar selectivity. However, if the bind variable will be bound sometimes to selective values and sometimes to nonselective values, you still must hardcode the values to obtain different plans in the two cases.
You need to provide the CBO with special statistics that quantify how rare the uncommon code, type, or status values are, so it can know which values are highly selective.
Oracle stores special statistics on distribution when you request them, based on sorting the rows for a column and arranging the sorted list into a specified number of buckets that each contain the same number of rows. Since Oracle already knows that the range each bucket holds has the same number of rows, Oracle needs to know only the value-range endpoints in each bucket. In the current example, with 20 buckets, the first bucket might hold the range 'CA' to 'CA', and the second bucket might hold the range 'CA' to 'CL'. The next 17 buckets would hold the most common range, 'CL' to 'CL'. The last bucket would hold the range 'CL' to 'OP', which includes the rarest value. From this, Oracle can deduce that the selectivity of the column is 5-10% for the value 'CA', 85-95% for the value 'CL', and 0-5% for the value 'OP'. Since you want the optimizer to know more closely how selective the 'OP' value is, you would choose more buckets than this, perhaps the maximum of 254. (Oracle compresses the bucket information when so few values apply, so the large number of buckets should be inexpensive.) To create 254 buckets for the example case, in the schema owned by Appl_Prod, use this:
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS ('Appl_Prod', 'Orders',
    METHOD_OPT => 'FOR COLUMNS SIZE 254 Status_Code');
END;
/
Generate the histogram statistics after you generate the general table statistics, because table-statistics generation deletes earlier histogram statistics.
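To confirm that the histogram was actually recorded, you can query the data dictionary. This is a sketch; the ALL_TAB_HISTOGRAMS view and its columns are standard Oracle, while the owner, table, and column names follow the running example:

```sql
-- Each row is one histogram bucket endpoint for the column.
SELECT COLUMN_NAME, ENDPOINT_NUMBER, ENDPOINT_VALUE
FROM ALL_TAB_HISTOGRAMS
WHERE OWNER = 'APPL_PROD'
  AND TABLE_NAME = 'ORDERS'
  AND COLUMN_NAME = 'STATUS_CODE'
ORDER BY ENDPOINT_NUMBER;
```

If only two rows come back (the column minimum and maximum), the column has no real histogram and the CBO is still assuming a uniform distribution.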
Oracle uses hints for manual control of cost-based optimization. Syntactically, these hints take the form of comments, like /*+ <Hint_String> */, immediately following the SELECT keyword. Oracle recognizes that this syntax encloses a hint, not a comment, by the + at the beginning and by the location of the hint, which must immediately follow SELECT. However, since these are comments from the point of view of standard SQL syntax, they do not interfere with parsing the SQL if the SQL is also to be executed on non-Oracle databases.
Oracle hints don’t help get a fast execution plan on non-Oracle databases either, but unfortunately, it is not currently possible to share manually tuned SQL on multiple vendor databases and have the manual tuning work uniformly well on them all.
Each hint directly affects only the SELECT block that has the comment. Thus, to control the order of joins and index choices within a subquery, place the hint after the SELECT keyword that begins the subquery. But to affect the outer-query order of joins and index choices, place a hint immediately after the outer-query SELECT.
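The placement rule can be sketched as follows; the table and index names here are hypothetical, chosen only to show one hint on the outer query and a separate hint on the subquery:

```sql
SELECT /*+ ORDERED */           -- affects only the outer query block
       O.Order_ID
FROM Orders O, Customers C
WHERE O.Customer_ID = C.Customer_ID
  AND EXISTS
      (SELECT /*+ INDEX(OD Order_Detail_Order_ID) */  -- affects only the subquery
              NULL
       FROM Order_Details OD
       WHERE OD.Order_ID = O.Order_ID);
```

A hint placed on the outer SELECT has no effect on index choices inside the subquery, and vice versa.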
There are two basic extremes involved in tuning with hints:
Use as little direction as possible to get the execution plan you want, or at least to get close enough to the plan you want for reasonable performance. This approach reasons that the CBO has more information than you have and should be left free to adapt to changing data distributions and to take advantage of improvements in Oracle with future releases. By leaving the CBO the maximum degrees of freedom, you maximize its power to optimize well for you in the future. However, until you try, you won’t know how much direction the CBO will need if it did not get the plan right in the first place, so this approach is likely to be iterative, involving the addition of one hint at a time until the CBO delivers a good plan.
If you did not get the plan you wanted from the CBO automatically, assume the CBO has made bad assumptions that will propagate to distort all of its calculations. Therefore, leave it with little freedom, specifying essentially the whole plan you want.
If you are confident in your chosen execution plan, as you should be if you apply the methods I describe later in this book, there is little reason to hold back from fully specifying that plan. I have yet to find a case where a well-chosen, robust execution plan needed to evolve to handle new data distributions or new database features. On the other hand, it is easy for SQL with a partially restricting set of hints to go wrong, especially if some table or index loses its statistics. When the CBO chooses incorrectly, the error that made the CBO choose incorrectly will likely propagate over the entire plan. For example, consider this query:
SELECT E.First_Name, E.Last_Name, E.Salary, LE.Description,
       M.First_Name, M.Last_Name, LM.Description
FROM Locations LM, Employees M, Locations LE, Employees E
WHERE E.Hire_Date > :1
  AND E.Manager_ID=M.Employee_ID
  AND E.Location_ID=LE.Location_ID
  AND M.Location_ID=LM.Location_ID
At parse time, when the optimizer does its work, it cannot know that the bind variable :1 will likely be set to a value in the current week, so it makes a conservative assumption about the selectivity of that condition on Hire_Date. Having made that assumption, it might not only forego using an index on Hire_Date (depending on the data distribution), but it might also further calculate that it will hit most of the rows of all the joined tables as well, and the CBO might choose full table scans with hash joins on them. Even if you instruct the CBO to use the index on Hire_Date, it still retains its initial assumption that the driving condition is unselective, and will likely retain its poor choices for the other joins and table-access methods. This is really no flaw in the optimizer; it cannot know what the application developer knows about the likely values to be assigned to the bind variable. However, the consequence is that, if you need to be any more specific than just specifying ALL_ROWS or FIRST_ROWS, chances are relatively high that the optimizer will need help across the board, to correct for some incorrect assumption somewhere.
ALL_ROWS and FIRST_ROWS hints are a safe way to begin optimization. If you are using the rule-based optimizer, you can safely try out these cost-based approaches with a hint, even before you do the work of finding the best execution plan. If the result is already fast enough, you might save yourself any further work. If your optimization is already cost-based, under either the ALL_ROWS or FIRST_ROWS mode, try the other one. If an optimizer_mode hint alone solves your problem, the optimizer is making reasonable assumptions and you can trust it.
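Such a first experiment is as simple as placing the mode hint alone after the outer SELECT. This sketch reuses the Employees example; the plan you actually get will depend on your data and statistics:

```sql
-- Try FIRST_ROWS alone before resorting to more specific hints.
SELECT /*+ FIRST_ROWS */ E.First_Name, E.Last_Name, E.Salary
FROM Employees E
WHERE E.Last_Name = 'Johnson';
```

If the resulting plan is good, stop there; only when the mode hint alone fails should you begin specifying access methods and join orders.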
These are the main hints to control table-access methods:
INDEX(<Alias_Name> <Index_Name>)
This directs Oracle, when possible, to access the alias <Alias_Name> using the index named <Index_Name>. Repeat this hint for each index/alias combination you need to control.

FULL(<Alias_Name>)
This directs Oracle, when possible, to access the alias <Alias_Name> using a full table scan. Repeat this hint for each full table scan you require.

INDEX_DESC(<Alias_Name> <Index_Name>)
This directs Oracle, when possible, to access the alias <Alias_Name> using the index named <Index_Name>, reaching the rows in descending order (the reverse of the normal index-sorted order). Repeat this hint for each index/alias combination you need to control, although it is unlikely you will need it more than once in a query.
The INDEX and FULL hints are common and easy to use. The INDEX_DESC hint is useful only rarely, but it is occasionally vital. For example, if you want to know all about the last employee hired in April, you might use this query:
SELECT *
FROM Employees E
WHERE Hire_Date>=TO_DATE('2003-04-01','YYYY-MM-DD')
  AND Hire_Date< TO_DATE('2003-05-01','YYYY-MM-DD')
ORDER BY Hire_Date DESC
You’ll find the most recently hired employee you want at the top of the list of rows returned by this query. To avoid reading all the data for other employees hired in April, you might think to add a condition AND ROWNUM=1 to the query. However, this sometimes will not yield the desired result, because (depending on the data) Oracle will sometimes apply that condition before performing the descending sort. If Oracle uses a full table scan, it will return the first employee hired in April it finds in the table, likely the least recently hired. If it uses a simple index range scan on an index on Hire_Date, it will begin, as range scans generally do by default, at the low end of the index range, returning the first employee hired in April. However, the INDEX_DESC hint, with the index Employee_Hire_Date on the Hire_Date column, neatly solves the problem, returning the desired row with just a single logical I/O to the table:
SELECT /*+ INDEX_DESC(E Employee_Hire_Date) */ *
FROM Employees E
WHERE Hire_Date>=TO_DATE('2003-04-01','YYYY-MM-DD')
  AND Hire_Date< TO_DATE('2003-05-01','YYYY-MM-DD')
  AND ROWNUM=1
Note that I removed the explicit ORDER BY clause, since it gives the false impression that it has an effect, given the condition on ROWNUM.
The preceding example might strike you as risky code, apt to break functionally, for example, if someone drops or renames the index used. It is risky, and I recommend it only if the value of the performance improvement exceeds the cost of the risk of incorrect results. This is a clear case for SQL syntax that allows such top-n queries that take full advantage of the best indexed path. With current syntax, I haven’t found a solution that is both optimal and functionally safe.
There are several other table-access hints that I have not described in this section, but I have never found them necessary.
These are the main hints to control the order of execution for joins and subqueries:
ORDERED
This hint, unlike the others, usually requires that you alter the body of the SQL (or at least the FROM clause) to get the plan you want, since the hint refers to the FROM-clause order. Notice that the desired FROM-clause order will be precisely the opposite of the best FROM-clause order you would choose for rule-based optimization. That’s because the RBO works from right to left, whereas this hint causes the CBO to work through the FROM clause from left to right.

LEADING(<Alias_Name>)
In the absence of an ORDERED hint, this selects the driving table, the first table in the join order. Although this gives less control over the join order than the ORDERED hint, it does not require modifying the FROM clause. Often, getting just the driving table correct is all you need to get at least close to the performance of the optimal plan. Later choices in the join order tend to matter less and will likely be well chosen by the optimizer, without your help.
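For instance, to force the CBO to drive from the Employees alias E without rearranging the FROM clause, a sketch based on the running example looks like this:

```sql
-- LEADING(E) fixes only the driving table; the optimizer chooses the rest
-- of the join order itself.
SELECT /*+ LEADING(E) */ E.First_Name, E.Last_Name, LE.Description
FROM Locations LE, Employees E
WHERE E.Last_Name = 'Johnson'
  AND E.Location_ID = LE.Location_ID;
```

Compare this with the ORDERED hint, which would require listing E first in the FROM clause and would then dictate the complete join order.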
PUSH_SUBQ
This hint instructs the optimizer to perform correlated subqueries at the first opportunity, as soon as the outer query reaches the join columns needed to evaluate them. Oracle’s CBO normally performs correlated subqueries only after completing all the joins in the outer query.
The ORDERED and LEADING hints are common and straightforward to use. The PUSH_SUBQ hint is occasionally useful.
When it comes to subqueries, Oracle offers hint-based control only at the two extremes: executing subqueries as early or as late as possible. However, you can gain full control of when subqueries execute if you combine the PUSH_SUBQ hint with the earlier methods of postponing correlated joins. For example, consider the earlier query:
SELECT ...
FROM Orders O, Customers C, Regions R
WHERE O.Status_Code='OP'
  AND O.Customer_ID=C.Customer_ID
  AND C.Customer_Type_Code='GOV'
  AND C.Region_ID=R.Region_ID
  AND EXISTS (SELECT NULL
              FROM Order_Details OD
              WHERE O.Order_ID+0*C.Customer_ID=OD.Order_ID
                AND OD.Shipped_Flag='Y')
Without a hint, Oracle would execute the EXISTS check after joining all three outer-query tables. The point of the expression O.Order_ID+0*C.Customer_ID was to delay the EXISTS check until after the join to C, but not after the join to R. However, without any hint, all EXISTS conditions are automatically delayed until after all outer-query joins. To force the EXISTS condition to execute between the joins to C and R, use both the hint and the correlating-join-postponing expression:
SELECT /*+ PUSH_SUBQ */ ...
FROM Orders O, Customers C, Regions R
WHERE O.Status_Code='OP'
  AND O.Customer_ID=C.Customer_ID
  AND C.Customer_Type_Code='GOV'
  AND C.Region_ID=R.Region_ID
  AND EXISTS (SELECT NULL
              FROM Order_Details OD
              WHERE O.Order_ID+0*C.Customer_ID=OD.Order_ID
                AND OD.Shipped_Flag='Y')
Now, the PUSH_SUBQ hint causes Oracle to execute the EXISTS condition as early as possible, and the expression O.Order_ID+0*C.Customer_ID ensures that “as early as possible” doesn’t come until after the join to C.
These are the main hints to control the join methods:
USE_NL(<List_Of_Aliases>)
This directs Oracle, when possible, to join the tables indicated in the alias list by using nested loops. The alias list is without commas—for example, USE_NL(T1 T2 T3).

USE_HASH(<List_Of_Aliases>)
This directs Oracle, when possible, to join to the tables indicated in the alias list by using hash joins. The alias list is without commas—for example, USE_HASH(T1 T2 T3).
Here’s an example to illustrate the most frequently useful hints, to yield complete control of an execution plan. I’ll force the join order, the access method to every table, and the join method to every table. Consider the earlier example tuned for the RBO, shown at the end of Section 4.2.2. To fully force the same plan, but substitute a hash join for the first nested-loops join, with the employee locations read through the index on Description, use this query:
SELECT /*+ ORDERED USE_NL(M LM) USE_HASH(LE)
           INDEX(E Employee_Last_Name) INDEX(LE Location_Description)
           INDEX(M Employee_Pkey) INDEX(LM Location_Pkey) */
       E.First_Name, E.Last_Name, E.Salary, LE.Description,
       M.First_Name, M.Last_Name, LM.Description
FROM Employees E, Locations LE, Employees M, Locations LM
WHERE E.Last_Name = 'Johnson'
  AND E.Manager_ID=M.Employee_ID
  AND E.Location_ID=LE.Location_ID
  AND M.Location_ID=LM.Location_ID
  AND LE.Description='Dallas'
This results in the execution plan shown here:
SQL>@ex
PLAN
----------------------------------------------------------------------
SELECT STATEMENT
NESTED LOOPS
NESTED LOOPS
HASH JOIN
TABLE ACCESS BY INDEX ROWID 1*EMPLOYEES
INDEX RANGE SCAN EMPLOYEE_LAST_NAME
TABLE ACCESS BY INDEX ROWID 2*LOCATIONS
INDEX RANGE SCAN LOCATION_DESCRIPTION
TABLE ACCESS BY INDEX ROWID 3*EMPLOYEES
INDEX UNIQUE SCAN EMPLOYEE_PKEY
TABLE ACCESS BY INDEX ROWID 4*LOCATIONS
INDEX UNIQUE SCAN LOCATION_PKEY
DB2 offers relatively sparse vendor-specific tools to control execution plans, so the methods used to tune on DB2 are comparatively indirect. There are three main steps involved in tuning on DB2:
Provide the optimizer with good statistics about the tables and indexes, so it can calculate the costs of alternatives accurately.
Choose the optimization level that DB2 applies to your query.
Modify the query to prevent execution plans that you do not want, mainly using the methods described earlier in Section 4.1.
Proving that a little knowledge is a dangerous thing, cost-based optimizers often do a terrible job if they do not have statistics on all the tables and indexes involved in a query. It is therefore imperative to maintain statistics on tables and indexes reliably; this includes regenerating statistics anytime table volumes change much or anytime tables or indexes are rebuilt. It is safest to regenerate statistics periodically, during times when load is relatively quiet, nightly or at least weekly. From the Unix prompt, edit a file called runstats_schema.sql to contain the following commands, replacing <Schema_Name> with the name of the schema that contains the objects you wish to run statistics on:
-- File called runstats_schema.sql
SELECT 'RUNSTATS ON TABLE <Schema_Name>.' || TABNAME || ' AND INDEXES ALL;'
FROM SYSCAT.TABLES
WHERE TABSCHEMA = '<Schema_Name>';
To use this script, log into db2, escape to the shell prompt with quit;, and run the following two commands from the Unix shell:

db2 +p -t < runstats_schema.sql > tmp_runstats.sql
grep RUNSTATS tmp_runstats.sql | db2 +p -t > tmp_anal.out
These commands can be scheduled to run automatically. Check tmp_anal.out in case any of the analyses fail.
Often, queries include conditions on highly skewed distributions, such as conditions on special types, codes, or flags, when these columns have only a few values. Normally, the CBO evaluates selectivity of a condition based on the assumption that all nonnull values of a column are equally selective. This assumption generally works well for foreign and primary keys that join business entities, but it breaks down when the columns have permanent special meanings and certain meanings apply much more rarely than others.
For example, in an Orders table, you might have a Status_Code column with three possible values: 'CL' for closed (i.e., fulfilled) orders, 'CA' for cancelled orders, and 'OP' for open orders. Most orders, by far, would be fulfilled; so, once the application has been running for a few months, you’d expect 'CL' to point to a large and steadily increasing number of orders. A steady, significant fraction of orders would end up cancelled, so 'CA' would also eventually point to a large list of orders. However, as long as the business keeps up with incoming orders, the number of open orders would remain moderate and steady, even as data accumulates for years. Quite early, a condition that specified Status_Code='OP' would be selective enough to justify indexed access, if you had an index with that leading column, and it is important to enable the optimizer to realize this fact, preferably without a lot of manual tuning. This requires two things:
The SQL must mention the specific selective value, rather than use a bind variable. Use of bind variables is commonly attractive, since it makes SQL more general and easier to share between processes. However, this need to hardcode especially selective values is the exception to that rule. If you use Status_Code= ? instead of Status_Code='OP', you will deny the CBO potential knowledge of the selectivity of the condition at parse time, when it does not yet know whether the bind variable ? will be assigned to a common or a rare Status_Code. Fortunately, in these cases, the usual reason to prefer using bind variables does not generally apply; since these special codes have special business meanings, it is unlikely that the SQL will ever require substituting a different value than the single selective value.
You need to provide the CBO with special statistics that quantify how rare the uncommon code, type, or status values are, so it can know which values are highly selective.
DB2 stores special statistics on distribution, when you request them. To create distribution statistics for the example case, given an index named Order_Stts_Code and the schema owned by ApplProd, use the following command:

RUNSTATS ON TABLE ApplProd.Orders
  WITH DISTRIBUTION FOR INDEX ApplProd.Order_Stts_Code;
Anytime you have a column with a skewed distribution and an index that you wish to use when your condition on the column has a high degree of selectivity, be sure to create distribution statistics in the manner shown here.
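You can verify that the distribution statistics are in place by querying the DB2 catalog. This is a sketch; the SYSCAT.COLDIST view and its columns are standard DB2, while the schema, table, and column names follow the running example:

```sql
-- TYPE 'F' rows are frequency statistics (most common values and counts);
-- TYPE 'Q' rows are quantile statistics.
SELECT COLNAME, TYPE, SEQNO, COLVALUE, VALCOUNT
FROM SYSCAT.COLDIST
WHERE TABSCHEMA = 'APPLPROD'
  AND TABNAME = 'ORDERS'
  AND COLNAME = 'STATUS_CODE';
```

If no rows come back for the column, the optimizer is still treating every nonnull value as equally selective.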
DB2 offers multiple optimization levels. An optimization level is basically a ceiling on how clever the optimizer attempts to be when it considers the range of possible execution plans. At optimization level 0, DB2 chooses the lowest-cost plan within a subset of the plans it considers at level 1; at level 1, it considers just a subset of the plans it considers at level 2; and so on. Nominally, the highest optimization level should always yield the best plan, because it chooses the lowest-cost plan from the widest possible range of alternatives. However, the plans enabled by the higher optimization levels tend to be less robust and often prove disappointing. In spite of the optimizer’s calculations to the contrary, these less robust plans often run longer than the best robust plan that the lower-level optimization sees. Higher levels of optimization can also take longer to parse, since the optimizer has additional degrees of freedom to explore. Ideally, you parse every statement at the lowest level that is capable of finding the best execution plan for a given query.
DB2 offers seven levels of optimization: 0, 1, 2, 3, 5, 7, and 9.[1] Level 5 is normally the default, although database administration can override this default choice. I have never needed levels of optimization higher than 5; levels 7 and 9 appear mainly to enable relatively exotic query transformations that are rarely useful. However, I have frequently found excellent results with the lowest level of optimization, level 0, when level 5 produced a poor plan. Before executing a query (or checking an execution plan), set level 0 with the following SQL statement:
SET CURRENT QUERY OPTIMIZATION 0;
When you wish to return to level 5 for other queries that require it, use the same syntax, replacing 0 with 5. If you find a poor plan at level 5, I recommend trying level 0 after first verifying correct statistics on the tables and indexes involved. Level 0 frequently yields just the sort of robust plans that usually work best for real-world applications.
Most manual tuning on DB2 uses the SQL changes described earlier in Section 4.1. However, one particular manual technique deserves special mention, because it proves useful more often on DB2 than on Oracle and SQL Server. DB2 stores index records even for null values of indexed columns, and it appears to treat null like just another indexed value.
When DB2 lacks special statistics on distribution (see Section 4.3.1), DB2 estimates the selectivity of Indexed_Column IS NULL to be just as high as Indexed_Column = 198487573 or any other nonnull value. Therefore, older DB2 versions often choose to drive to selective-looking IS NULL conditions on indexed columns. Occasionally, this works out fine. However, in my experience, IS NULL conditions are rarely anywhere near as selective as the average individual nonnull value, and indexed access driven by IS NULL conditions is almost always a mistake.
Therefore, when you find an IS NULL condition on an indexed column in a DB2 query, you often should prevent use of the index. The simplest equivalent condition that prevents index use is COALESCE(Indexed_Column, Indexed_Column) IS NULL. This version is perfectly equivalent to the original condition Indexed_Column IS NULL, but the COALESCE( ) function prevents index use.
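For example, on a hypothetical Orders table with an indexed Ship_Date column (both names invented for illustration), the rewrite looks like this:

```sql
-- Original form: DB2 may drive to the index on Ship_Date, treating
-- IS NULL as though it were as selective as any single nonnull value.
SELECT Order_ID
  FROM Orders
 WHERE Ship_Date IS NULL;

-- Rewritten form: logically identical, but the COALESCE() expression
-- disqualifies the index, forcing DB2 to choose another driving path.
SELECT Order_ID
  FROM Orders
 WHERE COALESCE(Ship_Date, Ship_Date) IS NULL;
```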
In addition to tuning techniques that can apply to any database, there are three useful techniques specific to DB2 that I describe in the following sections.
One sometimes useful technique is simply to list inner joins first in your FROM clause. This appears never to hurt, and on older versions of DB2 I have seen this simple technique produce greatly improved execution plans.
Older versions of DB2 can take minutes to parse
queries with more than about 12 outer joins, and even then they
might fail with errors. Fortunately, there is a workaround for this
problem, using the following template for the SQL. The workaround
uses DB2’s nested-tables syntax, in which an outer query contains
another query inside a FROM
clause that is treated like a single table for purposes of the outer
query:
SELECT ...
FROM (SELECT ...
      FROM (SELECT ... FROM <all inner joins and ten outer joins>
            WHERE <Conditions pertinent to this innermost nested table>) T1
           LEFT OUTER JOIN <Joins for the 11th through 20th outer join>
      WHERE <Conditions, if any, pertinent to this outermost nested table>) T2
     LEFT OUTER JOIN <The rest of the outer joins (at most 10)>
WHERE <Conditions, if any, pertinent to the outer query>
This template applies to a query with 21-30 outer-joined
tables. With 11-20 outer-joined tables, you need only a single
nested table. With more than 30 outer-joined tables, you need even
deeper levels of nesting. In this syntax, DB2 effectively creates
nested views on the fly, as defined by the queries inside parentheses in the FROM
clauses. For purposes of handling
outer joins, DB2 handles each of these smaller queries
independently, sidestepping the problem of too many outer joins in a
single query.
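As a sketch of the template in action, consider a hypothetical query with 12 outer joins (all table and column names invented): the innermost nested table absorbs the inner joins and the first 10 outer joins, and the outer query adds the remaining two.

```sql
SELECT T1.Order_ID, T1.Customer_Name, N11.Note_Text, N12.Description
FROM (SELECT O.Order_ID, C.Customer_Name, O.Detail_ID, O.Code_ID
      FROM Orders O
           INNER JOIN Customers C ON O.Customer_ID = C.Customer_ID
           LEFT OUTER JOIN ... -- outer joins 1 through 10 go here
      WHERE O.Order_Date > CURRENT DATE - 30 DAYS) T1
     LEFT OUTER JOIN Order_Notes N11
            ON T1.Detail_ID = N11.Detail_ID
     LEFT OUTER JOIN Code_Translations N12
            ON T1.Code_ID = N12.Code_ID
```

DB2 optimizes the query inside parentheses on its own, so the parse never sees more than 10 outer joins at once.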
At my former employer, TenFold, we found this technique so useful that we enhanced the EnterpriseTenFold product to automatically generate this extraordinarily complex SQL when required. Admittedly, it is not an easy solution for manually written SQL, but it still might be the only technique that works if you run into slow or failed parses of many-way outer joins on DB2.
Normally, DB2 calculates the cost of executing the entire query and chooses the plan it expects will run the fastest end to end. However, especially for online queries, you often care only about the first few rows and prefer to optimize to get the first rows soonest.
The technique to read the first rows fast, usually following nested loops, is to add the clause OPTIMIZE FOR <n> ROWS (or OPTIMIZE FOR 1 ROW), where <n> is the number of rows you actually need to see fast out of the larger rowset that the query might theoretically return. This clause goes at the very end of the query and instructs DB2 to optimize the cost of returning just those first <n> rows, without regard to the cost of the rest of the query execution. If you actually know how many rows you want and trust the optimizer to calculate the best plan, you can choose <n> on that basis. If you want to force a robust, nested-loops plan as strongly as possible, just use OPTIMIZE FOR 1 ROW.
In practice, this technique tends to dictate nested-loops joins, because they avoid reading whole rowsets before even beginning a join. However, it is possible for an explicit ORDER BY clause to defeat any attempt to reach the first rows fast. The ORDER BY clause usually requires a sort following the complete query, usually postponing return of the first row regardless of the execution plan. You can leave out a sort condition if you want to force nested-loops joins by this technique, performing the sort in your application if necessary. The OPTIMIZE FOR 1 ROW hint is the equivalent of the FIRST_ROWS hint on Oracle and the OPTION(FAST 1) hint on SQL Server.
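For instance, on a hypothetical open-orders query (table and column names invented), the clause simply follows the rest of the statement:

```sql
SELECT O.Order_ID, O.Order_Date, C.Customer_Name
  FROM Orders O
       INNER JOIN Customers C ON O.Customer_ID = C.Customer_ID
 WHERE O.Status_Code = 'OP'
OPTIMIZE FOR 1 ROW
```

Note the deliberate absence of an ORDER BY clause, which would otherwise tend to force a sort and postpone the return of the first row.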
Techniques to force precisely chosen execution plans on DB2 are sparse, in contrast to the extraordinary detail that DB2 reveals about the execution plan you already have and why DB2 chose it. However, in fairness, I should mention that the available techniques, in combination with DB2’s fairly good optimizer, have proven sufficient in my own experience.
There are three main steps involved in tuning on SQL Server:
Provide the optimizer with good statistics about the tables and indexes, so it can calculate the costs of alternatives accurately.
Modify the query to prevent execution plans that you do not want, mainly using methods specific to SQL Server.
Force a simple execution plan with FORCEPLAN when necessary.
Proving that a little knowledge is a dangerous thing, cost-based
optimizers often do a terrible job if they do not have statistics on
all the tables and indexes involved in the query. It is therefore
imperative to maintain statistics on tables and indexes reliably; this
includes regenerating statistics anytime table volumes change much or
anytime tables or indexes are rebuilt. It is safest to regenerate
statistics periodically, during times when load is relatively quiet,
nightly or at least weekly. Run the following from Query Analyzer, then cut and paste the resulting UPDATE STATISTICS commands into the query window and run them as well:
-- file called updateall.sql
-- update your whole database
SELECT 'UPDATE STATISTICS ', name
  FROM sysobjects
 WHERE type = 'U'
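The SELECT generates one UPDATE STATISTICS command per user table. Assuming a database containing three hypothetical tables (names invented for illustration), the result you paste back and run would look like this:

```sql
UPDATE STATISTICS Employees
UPDATE STATISTICS Locations
UPDATE STATISTICS Orders
```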
Often, queries include conditions on highly skewed distributions, such as conditions on special types, codes, or flags, when these columns have only a few values. SQL Server automatically maintains statistics on the distribution of indexed column values, enabling SQL Server to estimate selectivities automatically, even when indexed columns have skewed distributions.
Occasionally, it is useful to help SQL Server estimate the selectivity of a condition with a skewed distribution even when the distribution applies to a nonindexed column. In such a case, you need to specially request data on that column. For example, to request a statistics group named EFlag on the nonindexed column Exempt_Flag of the Employees table, run:
CREATE STATISTICS EFlag on Employees(Exempt_Flag)
As an example of a case in which such skewed distributions might apply, consider an Orders table in which you have a Status_Code column with three possible values: 'CL' for closed (i.e., fulfilled) orders, 'CA' for cancelled orders, and 'OP' for open orders. Most orders, by far, would be fulfilled once the application has been running for a few months, causing a steady rise in 'CL' values. A steady, significant fraction of orders would end up cancelled, so the value 'CA' would also eventually point to a large list of orders. However, as long as the business keeps up with incoming orders, the number of open orders would remain moderate and steady, even as data accumulates for years. Quite early, a condition that specified Status_Code='OP' would be selective enough to prefer joining to the corresponding table early, even if Status_Code is not indexed, and it is important to enable the optimizer to realize this fact, preferably without a lot of manual tuning. This requires that your SQL actually mention the specific value that applies to the condition, rather than use a generic stored procedure that only fills in the value of the constant after the parse, at execution time.
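To make the contrast concrete, here is a sketch (column list abbreviated, names carried over from the Orders example) of the two forms:

```sql
-- Literal in the SQL: with a statistics group on Status_Code, the
-- optimizer can see that 'OP' matches only a small fraction of rows.
SELECT Order_ID FROM Orders WHERE Status_Code = 'OP'

-- Generic parameter filled in at execution time: the optimizer must
-- assume average selectivity across all values of Status_Code,
-- underestimating how selective the open-order condition really is.
SELECT Order_ID FROM Orders WHERE Status_Code = ?
```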
You should usually tune SQL Server with hints. Hints generally go either in the FROM clause, when they apply to a specific table access, or in the SQL Server OPTION( ) clause at the very end of a query. These are the most useful hints:
WITH (INDEX(<Index_Name>))
Immediately following a table alias in a FROM clause, this hint instructs SQL Server to use the specified index to access that table alias. The older alternative syntax INDEX=<Index_Name> is also supported, but it might be dropped in the future, so I don't recommend it. Even more obsolete and dangerous is the still-supported method of naming the internal object ID that corresponds to the desired index. Naming the index you want with the ID is horribly unreliable, because the index will get a new ID if anyone ever drops it and recreates it, or if the application moves to a new SQL Server database.
WITH (INDEX(0))
Immediately following a table alias in a FROM clause, this hint instructs SQL Server to use a full table scan to reach that table alias.
WITH (NOLOCK)
Immediately following a table alias in a FROM clause, this hint instructs SQL Server to read the specified table alias without requiring read locks or otherwise enforcing a consistent read. Read locks on SQL Server can create a bottleneck when combined with heavy update activity on a table. This hint avoids such a bottleneck, potentially at the cost of a consistent view of the data as of a single moment in time.
LOOP and HASH
These two different hints can each immediately precede the JOIN keyword in the FROM clause, instructing SQL Server to perform the specified join with the specified join method. These hints require the new-style join syntax with the JOIN keyword in the FROM clause. The presence of even a single hint of this type also forces all joins to take place in the same order the aliases are listed in the FROM clause.
OPTION(LOOP JOIN)
This hint goes at the end of a query and forces all joins to follow nested loops.
OPTION(FORCE ORDER)
This hint goes at the end of a query and forces all joins to take place in the same order the aliases are listed in the FROM clause.
OPTION(FAST 1)
This hint simply instructs SQL Server to attempt to reach the first returned rows as quickly as possible, which generally favors a nested-loops execution plan. Its effect is much like the OPTION(LOOP JOIN) hint, although in theory SQL Server might recognize that no execution plan could reach the first rows quickly in a query with an explicit ORDER BY, nullifying any effect from OPTION(FAST 1). The OPTION(FAST 1) hint is the equivalent of the FIRST_ROWS hint on Oracle and the OPTIMIZE FOR 1 ROW hint on DB2.
These hints can be combined. You can place multiple hints within a single WITH clause, separating them with commas, for example, WITH (INDEX(Employee_First_Name), NOLOCK). Multiple hints in a single OPTION clause also are separated by commas, for example, OPTION(LOOP JOIN, FORCE ORDER). Together, these hints give full control of the join order, the join methods, and the table access methods.
I’ll demonstrate tuning with hints on a couple of queries. If you choose a robust all-nested-loops plan that drives from the employee last name to the other tables in optimum order, using the primary keys to reach the other tables, this query’s hints force the desired plan:
SELECT E.First_Name, E.Last_Name, E.Salary, LE.Description,
       M.First_Name, M.Last_Name, LM.Description
FROM Employees E WITH (INDEX(Employee_Last_Name))
     INNER JOIN Locations LE WITH (INDEX(Location_PKey))
             ON E.Location_ID=LE.Location_ID
     INNER JOIN Employees M WITH (INDEX(Employee_PKey))
             ON E.Manager_ID=M.Employee_ID
     INNER JOIN Locations LM WITH (INDEX(Location_PKey))
             ON M.Location_ID=LM.Location_ID
WHERE E.Last_Name = 'Johnson'
  AND LE.Description='Dallas'
OPTION(LOOP JOIN, FORCE ORDER)
SET SHOWPLAN_TEXT ON (as described in Chapter 3) generates the following results when you run this query from SQL Server Query Analyzer:
StmtText
-----------------------------------------------------------------
|--Bookmark Lookup(...(...[Locations] AS [LM]))
     |--Nested Loops(Inner Join)
          |--Bookmark Lookup(...(...[Employees] AS [M]))
          |    |--Nested Loops(Inner Join)
          |         |--Filter(WHERE:([LE].[Description]='Dallas'))
          |         |    |--Bookmark Lookup(...(...[Locations] AS [LE]))
          |         |         |--Nested Loops(Inner Join)
          |         |              |--Bookmark Lookup(...(...[Employees] AS [E]))
          |         |              |    |--Index Seek(...(...[Employees].[Employee_Last_Name] AS [E]), SEEK:([E].[Last_Name]='Johnson') ORDERED)
          |         |              |--Index Seek(...(...[Locations].[Location_PKey] AS [LE]), SEEK:([LE].[Location_ID]=[E].[Location_ID]) ORDERED)
          |         |--Index Seek(...(...[Employees].[Employee_PKey] AS [M]), SEEK:([M].[Employee_ID]=[E].[Manager_ID]) ORDERED)
          |--Index Seek(...(...[Locations].[Location_PKey] AS [LM]), SEEK:([LM].[Location_ID]=[M].[Location_ID]) ORDERED)

(12 row(s) affected)
If you don’t want to specify all nested loops, you might need the join HASH and LOOP hints, as shown in the following alternative to the last query:
SELECT E.First_Name, E.Last_Name, E.Salary, LE.Description,
       M.First_Name, M.Last_Name, LM.Description
FROM Employees E WITH (INDEX(Employee_Last_Name))
     INNER HASH JOIN Locations LE WITH (INDEX(Location_Description))
             ON E.Location_ID=LE.Location_ID
     INNER LOOP JOIN Employees M WITH (INDEX(Employee_PKey))
             ON E.Manager_ID=M.Employee_ID
     INNER LOOP JOIN Locations LM WITH (INDEX(Location_PKey))
             ON M.Location_ID=LM.Location_ID
WHERE E.Last_Name = 'Johnson'
  AND LE.Description='Dallas'
The preceding query delivers the following execution plan, triggered by SET SHOWPLAN_TEXT ON:
StmtText
-------------------------------------------------------------
|--Bookmark Lookup(...(...[Locations] AS [LM]))
     |--Nested Loops(Inner Join)
          |--Bookmark Lookup(...(...[Employees] AS [M]))
          |    |--Nested Loops(Inner Join)
          |         |--Hash Match(Inner Join...([E].[Location_ID])=([LE].[Location_ID])...)
          |         |    |--Bookmark Lookup(...(...[Employees] AS [E]))
          |         |    |    |--Index Seek(...(...[Employees].[Employee_Last_Name] AS [E]), SEEK:([E].[Last_Name]='Johnson') ORDERED)
          |         |    |--Bookmark Lookup(...(...[Locations] AS [LE]))
          |         |         |--Index Seek(...(...[Locations].[Location_Description] AS [LE]), SEEK:([LE].[Description]='Dallas') ORDERED)
          |         |--Index Seek(...(...[Employees].[Employee_PKey] AS [M]), SEEK:([M].[Employee_ID]=[E].[Manager_ID]) ORDERED)
          |--Index Seek(...(...[Locations].[Location_PKey] AS [LM]), SEEK:([LM].[Location_ID]=[M].[Location_ID]) ORDERED)

(11 row(s) affected)
There are two basic extremes involved in tuning with hints such as those in this example:
Use as little direction as possible to get the execution plan you want, or at least to get close enough to the plan you want for reasonable performance. This approach reasons that SQL Server has more information than you have, and it should be left free to adapt to changing data distributions and take advantage of improvements in SQL Server in future releases. By leaving SQL Server the maximum degrees of freedom, you maximize its power to optimize well for you in the future. However, until you try, you won't know how much direction SQL Server will need if it does not get the plan right in the first place, so this approach is likely to be iterative, adding one hint at a time until SQL Server delivers a good plan.
If you did not get the plan you wanted from SQL Server automatically, assume the database has made bad assumptions that will propagate to distort all of its calculations. Therefore, leave it with very little freedom, specifying essentially the whole plan you want.
If you are confident in your chosen execution plan, as you
should be if you apply the methods I describe later in this book,
there is little reason to hold back from fully specifying that plan. I
have yet to find a case in which a well-chosen, robust execution plan
needed to evolve to handle new data distributions or new database
features. On the other hand, it is easy for SQL with a partially
restricting set of hints to go wrong, especially if some table or
index loses its statistics. When SQL Server chooses incorrectly, the
error that made the database choose incorrectly is likely to propagate
over the entire plan. However, the OPTION(FAST 1)
hint is the sort of
instruction that can be useful even when SQL Server has perfect
information, simply specifying that the time to reach the first row is
more important than the time to reach the last row.
An older method to tune on both Microsoft SQL Server and
Sybase is the FORCEPLAN
option. You
execute this option with a standalone SQL statement:
SET FORCEPLAN ON
This option affects all SQL for that connection until you execute this statement:
SET FORCEPLAN OFF
When FORCEPLAN is ON, the database is instructed to perform only the simplest optimization on the SQL it sees. It generally uses nested-loops execution plans that drive through indexes and join tables in the same order you list them in the FROM clause. When this is the sort of plan you want, SET FORCEPLAN can be ideal, not only forcing the plan simply, but even saving parse time that would otherwise be wasted considering a much wider range of plans, especially for joins of many tables. It is a blunt-edged sword, so to speak, so only use it when you know the FROM-clause join order is the correct join order and you want nested loops.
[1] Levels 4, 6, and 8 are not available, presumably for historical reasons, although I have never found these reasons documented.