In Chapter 20, “Using Indexes,” we showed that the presence of an index can improve the execution time of certain statements. The question remains, though, whether the optimizer can always develop the best processing strategy for all statements. Unfortunately, the answer is that it cannot. Some statements are written in such a way that the optimizer is in no position to develop the fastest processing strategy. This occurs principally when WHERE
clause conditions are too complex or when the optimizer is taken along a “false trail.” In addition, even when indexes are available, the optimizer sometimes chooses a sequential processing strategy for those statements.
Practice has shown that a certain number of general forms of SQL statements are not easily optimized and give rise to long processing times. By reformulating such statements, you can give the optimizer a better chance of developing an optimal processing strategy. In this chapter, we provide a number of guidelines for formulating “faster” statements. In other words, we are giving the optimizer a “helping hand.”
In view of the size of the tables in the sample database, almost every SQL statement is fast. The result is that the guidelines in this chapter will not improve the execution time of the statements. However, you can fill the PLAYERS
table with many thousands of rows, and that way you can test whether the guidelines apply. For this purpose, you should execute the same two statements that we have used in Section 20.9.3, in Chapter 20, to show the impact of indexes.
All optimizers are not the same. Big differences in quality exist between the optimizers of the various products. One optimizer can devise a better processing strategy for a larger number of statements than another. The guidelines we present do not apply to all SQL statements and to all situations. We advise you to examine them in the context of your product. We also advise you to look for additional guidelines applicable to your product.
In most cases, SQL will not use an index if the condition in a WHERE
clause contains the OR
operator. These statements can be rewritten in two ways. In certain circumstances, we can replace the condition with one containing an IN
operator, or we can replace the complete statement with two SELECT
statements linked with UNION
.
Example 29.1. Get the names and initials of players 6, 83, and 44.
SELECT NAME, INITIALS FROM PLAYERS WHERE PLAYERNO = 6 OR PLAYERNO = 83 OR PLAYERNO = 44
SQL will not use the index on the PLAYERNO
column, although we assume that such an index has been defined. However, we can replace the condition in the SELECT
statement simply by an IN
operator. Then, SQL will probably use the index.
SELECT NAME, INITIALS FROM PLAYERS WHERE PLAYERNO IN (6, 83, 44)
For UPDATE
and DELETE
statements, the same applies.
Example 29.2. Get the players who joined the club in 1980, plus the players who live in Stratford.
SELECT * FROM PLAYERS WHERE JOINED = 1980 OR TOWN = 'Stratford'
In this situation, SQL will develop a sequential processing strategy regardless of the presence of indexes on the TOWN
and JOINED
columns. However, we cannot replace the condition with an IN
operator as in the previous example. Instead, we can replace the entire statement with two SELECT
statements combined with UNION
:
SELECT * FROM PLAYERS WHERE JOINED = 1980 UNION SELECT * FROM PLAYERS WHERE TOWN = 'Stratford'
In this situation, it is not possible to replace UPDATE
and DELETE
statements with a UNION
. In such a case, two separate statements are required.
Example 29.3. Update the penalty amount to $150 for all penalties that are equal to $100 or that were incurred on December 1, 1980.
UPDATE PENALTIES SET AMOUNT = 150 WHERE AMOUNT = 100 OR PAYMENT_DATE = '1980-12-01'
Another formulation is:
UPDATE PENALTIES SET AMOUNT = 150 WHERE AMOUNT = 100
and
UPDATE PENALTIES SET AMOUNT = 150 WHERE PAYMENT_DATE = '1980-12-01'
Let us return to the example with the SELECT
statement. With UNION
, SQL automatically executes a DISTINCT
and all duplicate rows are removed. However, there are no duplicate rows in this example because the SELECT
clause includes the primary key of the PLAYERS
table.
If the original SELECT
statement had looked like the following one (no primary key column in the SELECT
clause), an alternative formulation with UNION
would not have been possible. The reason is that the following statement could produce duplicate rows, whereas the version with the UNION
operator removes duplicate rows from the result. The two formulations would give different results.
SELECT NAME FROM PLAYERS WHERE JOINED = 1980 OR TOWN = 'Stratford'
If the original statement had contained DISTINCT
, the alternative would have been possible.
In the previous section, we recommended using the UNION
operator. We do not mean, however, that UNION
should be used whether it is relevant or not. This operator must also be used with care.
Example 29.4. Get, for each match, the match number and the difference between the number of sets won and lost.
SELECT MATCHNO, WON - LOST FROM MATCHES WHERE WON >= LOST UNION SELECT MATCHNO, LOST - WON FROM MATCHES WHERE WON < LOST
Odds are, during the processing of this statement, SQL will browse the entire MATCHES
table twice. This can be prevented by using an ABS
function in the SELECT
statement:
SELECT MATCHNO, ABS(WON - LOST) FROM MATCHES
SQL will browse the MATCHES
table only once and execute the calculation for each row. The expression in the SELECT
statement is somewhat more complex than the one used in the previous statement, but the extra processing time caused by this is easily compensated for by the gain in processing time caused by browsing the table only once.
If the condition in a WHERE
clause contains the NOT
operator, SQL will generally not use an index. Replace a NOT
operator, if possible, with a comparison operator.
When an index is defined on a column that occurs in a calculation or scalar function, that index will not be used.
Example 29.7. Find the players who joined the club ten years before 1990.
SELECT * FROM PLAYERS WHERE JOINED + 10 = 1990
On the left of the comparison equal to operator, there is an expression that contains both a column name and a literal. To the right of the same operator is another literal. The index on the JOINED
column will not be used. A faster execution could be expected with the following formulation:
SELECT * FROM PLAYERS WHERE JOINED = 1980
Now, the expression to the left of the comparison operator contains only one column name. In other words, the column has been isolated.
If you look in the condition of a WHERE
clause for values in a particular range using the AND
operator, SQL will generally not use an index. We can replace such a condition with a BETWEEN
operator.
Example 29.8. Find the player numbers of the players born in the period from January 1, 1962, to December 31, 1965.
SELECT PLAYERNO FROM PLAYERS WHERE BIRTH_DATE >= '1962-01-01' AND BIRTH_DATE <= '1965-12-31'
An index on the BIRTH_DATE
column will not be used here. The index will be used if we adjust the condition as follows:
SELECT PLAYERNO FROM PLAYERS WHERE BIRTH_DATE BETWEEN '1962-01-01' AND '1965-12-31'
In some cases, when an index is defined on a column used with the LIKE
operator in a WHERE
clause condition, the index will not be considered. If the mask in the LIKE
operator begins with a percentage sign or an underscore character, the index cannot be used.
Example 29.9. Find the players whose names end with the letter n.
SELECT * FROM PLAYERS WHERE NAME LIKE '%n'
The index will not be used, and, unfortunately, there is no alternative solution for this example.
Sometimes, joins can be accelerated easily by adding an extra condition to the WHERE
clause, which does not change the end result.
Example 29.10. Get the payment number and name of the player for all penalties incurred for player 44.
SELECT PAYMENTNO, NAME FROM PENALTIES AS PEN, PLAYERS AS P WHERE PEN.PLAYERNO = P.PLAYERNO AND PEN.PLAYERNO = 44
Sometimes, SQL can develop a more efficient processing strategy if the condition is extended with a redundant condition, as shown earlier. Obviously, the result of the statement does not change.
SELECT PAYMENTNO, NAME FROM PENALTIES AS PEN, PLAYERS AS P WHERE PEN.PLAYERNO = P.PLAYERNO AND PEN.PLAYERNO = 44 AND P.PLAYERNO = 44
In a SELECT
statement, conditions can be specified in two places, in the WHERE
and the HAVING
clauses. Always try to place as many conditions as possible in the WHERE
clause and as few as possible in the HAVING
clause. The main reason is that indexes are not used for conditions specified in the HAVING
clause.
Example 29.11. Find, for each player with a number higher than 40
, the player number and the number of penalties incurred.
SELECT PLAYERNO, COUNT(*) FROM PENALTIES GROUP BY PLAYERNO HAVING PLAYERNO >= 40
The condition in the HAVING
clause can also be specified in the WHERE
clause. This makes the HAVING
clause completely superfluous:
SELECT PLAYERNO, COUNT(*) FROM PENALTIES WHERE PLAYERNO >= 40 GROUP BY PLAYERNO
The SELECT
clause of a main query formulates which data is to be presented. Avoid the use of unnecessary columns because it can affect the processing speed in a negative way.
You are allowed to specify multiple expressions in the SELECT
clause of a subquery if that subquery is linked to the main query with the EXISTS
operator. However, the end result of the SELECT
statement is not affected by the expressions specified. Therefore, the advice is to formulate only one expression consisting of one literal in the SELECT
clause.
Specifying DISTINCT
in the SELECT
clause leads to the removal of duplicate rows from a result. This can have a negative effect on the processing time. Therefore, avoid the use of DISTINCT
when it is not required or even redundant. In Section 9.4, in Chapter 9, “SELECT
Statement: SELECT
Clause and Aggregation Functions,” we described when DISTINCT
is superfluous. DISTINCT
is not necessary in subqueries.
Example 29.13. Find, for each match, the match number and the name of the player.
SELECT DISTINCT MATCHNO, NAME FROM MATCHES, PLAYERS WHERE MATCHES.PLAYERNO = PLAYERS.PLAYERNO
DISTINCT
is unnecessary here because the SELECT
clause contains the primary key of the MATCHES
table, as well as a condition on the primary key of the PLAYERS
table.
In Chapter 13, “Combining Table Expressions,” we discussed the ALL
option for the set operators UNION
, INTERSECT
, and EXCEPT
. Adding ALL
to these operators has the effect that duplicate rows are not removed from the result. The ALL
option has a function that is comparable to ALL
in the SELECT
clause; see Section 9.4, in Chapter 9. If ALL
is not specified, all rows have to be sorted to be able to remove duplicate rows (sorting takes places behind the scenes). In other words, the guidelines given in the previous section also apply to the ALL
option: If possible, use ALL
in conjunction with the set operators.
Example 29.14. Find the names and initials of the players who live in Stratford and Douglas.
SELECT NAME, INITIALS FROM PLAYERS WHERE TOWN = 'Stratford' UNION ALL SELECT NAME, INITIALS FROM PLAYERS WHERE TOWN = 'Douglas'
Explanation. Because of the presence of the keyword ALL
, SQL will not perform a sort to remove possible duplicate rows. Luckily, this result will never return duplicate rows because each player lives in only one town. So, in this example, a sort would always be performed unnecessarily, thus wasting performance.
The outer join was a late addition to SQL. The result is that many statements still do not make use of it. The UNION
operator is used many times to simulate an outer join. Here is an example.
Example 29.15. Find, for each player, the player number, name, and penalties incurred by him or her; order the result by player number.
This question used to be solved with the following construct:
SELECT PLAYERS.PLAYERNO, NAME, AMOUNT FROM PLAYERS, PENALTIES WHERE PLAYERS.PLAYERNO = PENALTIES.PLAYERNO UNION SELECT PLAYERNO, NAME, NULL FROM PLAYERS WHERE PLAYERNO NOT IN (SELECT PLAYERNO FROM PENALTIES) ORDER BY 1
However, this is a complex statement for SQL. Such statements seldom have a fast processing time. For example, the PLAYERS
table is accessed twice, once in each select block. Avoid this type of formulation and use the new formulation, in which the outer join is formulated explicitly:
SELECT PLAYERNO, NAME, AMOUNT FROM PLAYERS LEFT OUTER JOIN PENALTIES USING (PLAYERNO) ORDER BY 1
SQL automatically performs data type conversions. The following condition, for example, is correct even if the numeric PLAYERNO
column is compared with an alphanumeric literal:
WHERE PLAYERNO = '44'
Converting data types adversely affects the processing speed, obviously. If this type of conversion is not really required, try to avoid it.
When you formulate joins, it is possible that the sequence of the tables in the FROM
clause can affect the processing speed. The rule is: Specify the largest table last in the FROM
clause. Thus, the following FROM
clause:
FROM PLAYERS, TEAMS
would be better if replaced by the following because the PLAYERS
table is the larger table:
FROM TEAMS, PLAYERS
Many optimizers will not use an index when processing conditions with the ALL
operator. Replace an ALL
operator, if possible, with one of the aggregation functions: MIN
or MAX
.
Example 29.16. Get the player numbers, names, and dates of birth of the oldest players. (We already used this example in Section 8.12.)
SELECT PLAYERNO, NAME, BIRTH_DATE FROM PLAYERS WHERE BIRTH_DATE <= ALL (SELECT BIRTH_DATE FROM PLAYERS)
We can replace the ALL
operator here with the MIN
function.
SELECT PLAYERNO, NAME, BIRTH_DATE FROM PLAYERS WHERE BIRTH_DATE = (SELECT MIN(BIRTH_DATE) FROM PLAYERS)
The same reasoning applies to the ANY
operator.
Example 29.17. Find the player numbers, names, and dates of birth of the players who are not among the oldest players.
SELECT PLAYERNO, NAME, BIRTH_DATE FROM PLAYERS WHERE BIRTH_DATE > ANY (SELECT BIRTH_DATE FROM PLAYERS)
We can also replace the ANY
operator with the MIN
function in this example.
SELECT PLAYERNO, NAME, BIRTH_DATE FROM PLAYERS WHERE BIRTH_DATE > (SELECT MIN(BIRTH_DATE) FROM PLAYERS)
The difference between a “fast” and a “slow” statement depends on the number of rows in the tables: the more rows, the bigger the difference. The number of rows in the sample database is small. However, in Section 20.8, in Chapter 20, the
|
This chapter clearly shows that the current optimizers are not optimal yet. In some cases, the optimizer cannot determine the most efficient processing strategy. This can lead to poor processing times. This applies not only to database servers with SQL as their database language, but also to any system that has to determine the processing strategy itself.
A lot of research is being carried out to improve optimizers. Experience shows that each new version of an SQL product is faster than its predecessor. This trend will continue in the years to come. One day, optimizers will always find better strategies than most human programmers. E. F. Codd, founder of the relational model, put it as follows [CODE82]:
If suitable fast access paths are supported, there is no reason why a high-level language such as SQL . . . should result in less efficient runtime code . . . than a lower level language . . . .
3.16.130.201