Chapter 4
IN THIS CHAPTER
The UNION
statement
The INTERSECT
statement
The EXCEPT
statement
The JOIN
statements
In Chapter 3 of this minibook, I show you how, by using nested queries, data can be drawn from multiple tables to answer a question that involves different ideas. Another way to collect information from multiple tables is to use the relational operators UNION
, INTERSECT
, EXCEPT
, and JOIN
. SQL’s UNION
, INTERSECT
, and EXCEPT
operators are modeled after the union, intersect, and except operators of relational algebra. Each one performs a very specific combining operation on the data in two or more tables. The JOIN
operator, on the other hand, is considerably more flexible. A number of different joins exist, and each performs a somewhat different operation. Depending on what you want in terms of information retrieved from multiple tables, one or another of the joins or the other relational operators is likely to give it to you. In this chapter, I show you each of SQL’s relational operators, cover how it works, and discuss what you can use it for.
The UNION
operator is the SQL implementation of the union operator used in relational algebra. SQL’s UNION
operator enables you to draw information from two or more tables that have the same structure. Same structure means
When these criteria are met, the tables are union-compatible. The union of two tables returns all the rows that appear in either table and eliminates duplicates.
Suppose you have created a database for a business named Acme Systems that sells and installs computer products. Acme has two warehouses that stock the products, one in Fort Deposit, Alabama, and the other in East Kingston, New Hampshire. It contains two union-compatible tables, named DEPOSIT and KINGSTON. Both tables have two columns, and corresponding columns are of the same type. In fact, corresponding columns have identical column names (although this condition isn’t required for union compatibility).
DEPOSIT lists the names and quantity in stock of products in the Fort Deposit warehouse. KINGSTON lists the same information about the East Kingston warehouse. The UNION
of the two tables gives you a virtual result table containing all the rows in the first table plus all the rows in the second table. For this example, I put just a few rows in each table to illustrate the operation:
SELECT * FROM DEPOSIT ;
ProductName QuantityInStock
----------- ---------------
185_Express 12
505_Express 5
510_Express 6
520_Express 2
550_Express 3
SELECT * FROM KINGSTON ;
ProductName QuantityInStock
----------- ---------------
185_Express 15
505_Express 7
510_Express 6
520_Express 2
550_Express 1
SELECT * FROM DEPOSIT
UNION
SELECT * FROM KINGSTON ;
ProductName QuantityInStock
----------- ---------------
185_Express 12
185_Express 15
505_Express 5
505_Express 7
510_Express 6
520_Express 2
550_Express 3
550_Express 1
The UNION DISTINCT
operator functions identically to the UNION
operator without the DISTINCT
keyword. In both cases, duplicate rows are eliminated from the result set. In this example, because both warehouses had the same number of 510_Express and 520_Express products, those rows in both tables were exact duplicates, only one of which was returned.
This example shows how UNION works, but it isn’t very practical. In most cases, I imagine Acme’s manager would not care which products were stocked in exactly the same numbers at both warehouses, and thus partially removed from the result set. All the information is present, but the user must be savvy enough to realize that the total number of units of 510_Express is actually 12 rather than 6, and the total number of units of 520_Express is 4 rather than 2.
As mentioned previously, the UNION
operation normally eliminates any duplicate rows that result from its operation, which is the desired result most of the time. Sometimes, however, you may want to preserve duplicate rows. On those occasions, use UNION ALL
.
The following code shows you what UNION ALL
produces when it’s used with the DEPOSIT and KINGSTON tables:
SELECT * FROM DEPOSIT
UNION ALL
SELECT * FROM KINGSTON ;
ProductName QuantityInStock
----------- ---------------
185_Express 12
505_Express 5
510_Express 6
520_Express 2
550_Express 3
185_Express 15
505_Express 7
510_Express 6
520_Express 2
550_Express 1
You can sometimes form the union of two tables even if they are not union-compatible. If the columns you want in your result set are present and compatible in both source tables, you can perform a UNION CORRESPONDING
operation. Only the specified columns are considered, and they are the only columns displayed in the result set.
Suppose ACME Systems opens a third warehouse, in Jefferson, Maine. A new table named JEFFERSON is added to the database, which includes Product and QuantityInStock columns (as the DEPOSIT and KINGSTON tables do), but also has an additional column named QuantityOnHold. A UNION
or UNION ALL
of JEFFERSON with either DEPOSIT or KINGSTON would not return any rows because there is not a complete match between all the columns of JEFFERSON and all the columns of the other two tables. However, you can still add the JEFFERSON data to that of either DEPOSIT or KINGSTON by specifying only the columns in JEFFERSON that correspond with the columns in the other table. Here’s a sample query:
SELECT *
FROM JEFFERSON
UNION CORRESPONDING BY
(ProductName, QuantityInStock)
SELECT *
FROM KINGSTON ;
The result table holds the products and the quantities in stock at both warehouses. As with the simple UNION
, duplicates are eliminated. Thus, if the Jefferson warehouse happens to have the same quantity of a particular product that the Kingston warehouse has, the UNION CORRESPONDING
operation loses one of those rows. To avoid this problem, use UNION ALL CORRESPONDING
.
The UNION
operation produces a result table containing all rows that appear in at least one of the source tables. If you want only rows that appear in all the source tables, you can use the INTERSECT
operation, which is the SQL implementation of relational algebra’s intersect operation. I illustrate INTERSECT
by returning to the Acme Systems warehouse table:
SELECT * FROM DEPOSIT ;
ProductName QuantityInStock
----------- ---------------
185_Express 12
505_Express 5
510_Express 6
520_Express 2
550_Express 3
SELECT * FROM KINGSTON ;
ProductName QuantityInStock
----------- ---------------
185_Express 15
505_Express 7
510_Express 6
520_Express 2
550_Express 1
Only rows that appear in all source tables show up in the INTERSECT
operation’s result table:
SELECT *
FROM DEPOSIT
INTERSECT
SELECT *
FROM KINGSTON;
ProductName QuantityInStock
----------- ---------------
510_Express 6
520_Express 2
The result table shows that the Fort Deposit and East Kingston warehouses both have exactly the same number of 510_Express and 520_Express products in stock, a fact of dubious value. Note that, as was the case with UNION
, INTERSECT DISTINCT
produces the same result as the INTERSECT
operator used alone. In this example, only one of the identical rows displaying each of two products is returned.
The ALL
and CORRESPONDING
keywords function in an INTERSECT
operation the same way they do in a UNION
operation. If you use ALL
, duplicates are retained in the result table. If you use CORRESPONDING
, the intersected tables need not be union-compatible, although the corresponding columns need to have matching types and lengths.
Consider another example: A municipality keeps track of the phones carried by police officers, firefighters, parking enforcement officers, and other city employees. A database table called PHONES contains data on all phones in active use. Another table named OUT, with an identical structure, contains data on all phones that have been taken out of service. No cellphone should ever exist in both tables. With an INTERSECT
operation, you can test to see whether such an unwanted duplication has occurred:
SELECT *
FROM PHONES
INTERSECT CORRESPONDING BY (PhoneID)
SELECT *
FROM OUT ;
If the result table contains any rows, you know you have a problem. You should investigate any PhoneID entries that appear in the result table. The corresponding phone is either active or out of service; it can’t be both. After you detect the problem, you can perform a DELETE
operation on one of the two tables to restore database integrity.
The UNION
operation acts on two source tables and returns all rows that appear in either table. The INTERSECT
operation returns all rows that appear in both the first and the second table. In contrast, the EXCEPT
(or EXCEPT DISTINCT
) operation returns all rows that appear in the first table but that do not also appear in the second table.
Returning to the municipal phone database example, say that a group of phones that had been declared out of service and returned to the vendor for repairs have now been fixed and placed back into service. The PHONES table was updated to reflect the returned phones, but the returned phones were not removed from the OUT table as they should have been. You can display the PhoneID numbers of the phones in the OUT table, with the reactivated ones eliminated, using an EXCEPT
operation:
SELECT *
FROM OUT
EXCEPT CORRESPONDING BY (PhoneID)
SELECT *
FROM PHONES;
This query returns all the rows in the OUT table whose PhoneID is not also present in the PHONES table. These are the phones still out of service.
The UNION
, INTERSECT
, and EXCEPT
operators are valuable in multitable databases in which the tables are union-compatible. In many cases, however, you want to draw data from multiple tables that have very little in common. JOIN
s are powerful relational operators that combine data from multiple tables into a single result table. The source tables may have little (or even nothing) in common with each other.
SQL supports a number of types of JOIN
s. The best one to choose in a given situation depends on the result you’re trying to achieve.
Any multitable query is a type of JOIN
. The source tables are joined in the sense that the result table includes information taken from all the source tables. The simplest JOIN
is a two-table SELECT
that has no WHERE
clause qualifiers. Every row of the first table is joined to every row of the second table. The result table is referred to as the Cartesian product of the two source tables — the direct product of the two sets. (The less fancy name for the same thing is cross join.) The number of rows in the result table is equal to the number of rows in the first source table multiplied by the number of rows in the second source table.
For example, imagine that you’re the personnel manager for a company, and that part of your job is to maintain employee records. Most employee data, such as home address and telephone number, is not particularly sensitive. But some data, such as current salary, should be available only to authorized personnel. To maintain security of the sensitive information, you’d probably keep it in a separate table that is password protected. Consider the following pair of tables:
EMPLOYEE COMPENSATION
-------- ------------
EmpID Employ
FName Salary
LName Bonus
City
Phone
Fill the tables with some sample data:
EmpID FName LName City Phone
----- ----- ----- ---- -----
1 Jenny Smith Orange 555-1001
2 Bill Jones Newark 555-3221
3 Val Brown Nutley 555-6905
4 Justin Time Passaic 555-8908
Employ Salary Bonus
------ ------ -----
1 63000 10000
2 48000 2000
3 54000 5000
4 52000 7000
Create a virtual result table with the following query:
SELECT *
FROM EMPLOYEE, COMPENSATION ;
which can also be written
SELECT *
FROM EMPLOYEE CROSS JOIN COMPENSATION ;
Both of the above formulations do exactly the same thing. This query produces
EmpID FName LName City Phone Employ Salary Bonus
----- ----- ----- ---- ----- ------ ------ -----
1 Jenny Smith Orange 555-1001 1 63000 10000
1 Jenny Smith Orange 555-1001 2 48000 2000
1 Jenny Smith Orange 555-1001 3 54000 5000
1 Jenny Smith Orange 555-1001 4 52000 7000
2 Bill Jones Newark 555-3221 1 63000 10000
2 Bill Jones Newark 555-3221 2 48000 2000
2 Bill Jones Newark 555-3221 3 54000 5000
2 Bill Jones Newark 555-3221 4 52000 7000
3 Val Brown Nutley 555-6905 1 63000 10000
3 Val Brown Nutley 555-6905 2 48000 2000
3 Val Brown Nutley 555-6905 3 54000 5000
3 Val Brown Nutley 555-6905 4 52000 7000
4 Justin Time Passaic 555-8908 1 63000 10000
4 Justin Time Passaic 555-8908 2 48000 2000
4 Justin Time Passaic 555-8908 3 54000 5000
4 Justin Time Passaic 555-8908 4 52000 7000
The result table, which is the Cartesian product of the EMPLOYEE and COMPENSATION tables, contains considerable redundancy. Furthermore, it doesn’t make much sense. It combines every row of EMPLOYEE with every row of COMPENSATION. The only rows that convey meaningful information are those in which the EmpID number that came from EMPLOYEE matches the Employ number that came from COMPENSATION. In those rows, an employee’s name and address are associated with that same employee’s compensation.
When you’re trying to get useful information out of a multitable database, the Cartesian product produced by a cross join is almost never what you want, but it’s almost always the first step toward what you want. By applying constraints to the JOIN
with a WHERE
clause, you can filter out the unwanted rows. The most common JOIN
that uses the WHERE
clause filter is the equi-join.
An equi-join is a cross join with the addition of a WHERE
clause containing a condition specifying that the value in one column in the first table must be equal to the value of a corresponding column in the second table. Applying an equi-join to the example tables from the previous section brings a more meaningful result:
SELECT *
FROM EMPLOYEE, COMPENSATION
WHERE EMPLOYEE.EmpID = COMPENSATION.Employ ;
This produces the following:
EmpID FName LName City Phone Employ Salary Bonus
----- ------ ----- ---- ----- ------ ------ -----
1 Jenny Smith Orange 555-1001 1 63000 10000
2 Bill Jones Newark 555-3221 2 48000 2000
3 Val Brown Nutley 555-6905 3 54000 5000
4 Justin Time Passaic 555-8908 4 52000 7000
In this result table, the salaries and bonuses on the right apply to the employees named on the left. The table still has some redundancy because the EmpID column duplicates the Employ column. You can fix this problem by specifying in your query which columns you want selected from the COMPENSATION table:
SELECT EMPLOYEE.*,COMPENSATION.Salary,COMPENSATION.Bonus
FROM EMPLOYEE, COMPENSATION
WHERE EMPLOYEE.EmpID = COMPENSATION.Employ ;
This produces the following result:
EmpID FName LName City Phone Salary Bonus
----- ----- ----- ---- ----- ------ -----
1 Jenny Smith Orange 555-1001 63000 10000
2 Bill Jones Newark 555-3221 48000 2000
3 Val Brown Nutley 555-6905 54000 5000
4 Justin Time Passaic 555-8908 52000 7000
This table tells you what you want to know, but doesn’t burden you with any extraneous data. The query is somewhat tedious to write, however. To avoid ambiguity, it makes good sense to qualify the column names with the names of the tables they came from. However, writing those table names repeatedly can be tiresome.
You can cut down on the amount of typing by using aliases (or correlation names). An alias is a short name that stands for a table name. If you use aliases in recasting the preceding query, it comes out like this:
SELECT E.*, C.Salary, C.Bonus
FROM EMPLOYEE E, COMPENSATION C
WHERE E.EmpID = C.Employ ;
In this example, E is the alias for EMPLOYEE, and C is the alias for COMPENSATION. The alias is local to the statement it’s in. After you declare an alias (in the FROM
clause), you must use it throughout the statement. You can’t use both the alias and the long form of the table name.
Mixing the long form of table names with aliases creates confusion. Consider the following example, which is confusing:
SELECT T1.C, T2.C
FROM T1 T2, T2 T1
WHERE T1.C > T2.C ;
In this example, the alias for T1 is T2, and the alias for T2 is T1. Admittedly, this isn’t a smart selection of aliases, but it isn’t forbidden by the rules. If you mix aliases with long-form table names, you can’t tell which table is which.
The preceding example with aliases is equivalent to the following SELECT
with no aliases:
SELECT T2.C, T1.C
FROM T1, T2
WHERE T2.C > T1.C ;
SQL enables you to join more than two tables. The maximum number varies from one implementation to another. The syntax is analogous to the two-table case:
SELECT E.*, C.Salary, C.Bonus, Y.TotalSales
FROM EMPLOYEE E, COMPENSATION C, YTD_SALES Y
WHERE E.EmpID = C.Employ
AND C.Employ = Y.EmpNo ;
This statement performs an equi-join on three tables, pulling data from corresponding rows of each one to produce a result table that shows the salespeople’s names, the amount of sales they are responsible for, and their compensation. The sales manager can quickly see whether compensation is in line with production.
The natural join is a special case of an equi-join. In the WHERE
clause of an equi-join, a column from one source table is compared with a column of a second source table for equality. The two columns must be the same type and length and must have the same name. In fact, in a natural join, all columns in one table that have the same names, types, and lengths as corresponding columns in the second table are compared for equality.
Imagine that the COMPENSATION table from the preceding example has columns EmpID, Salary, and Bonus rather than Employ, Salary, and Bonus. In that case, you can perform a natural join of the COMPENSATION table with the EMPLOYEE table. The traditional JOIN
syntax looks like this:
SELECT E.*, C.Salary, C.Bonus
FROM EMPLOYEE E, COMPENSATION C
WHERE E.EmpID = C.EmpID ;
This query is a natural join. An alternate syntax for the same operation is the following:
SELECT E.*, C.Salary, C.Bonus
FROM EMPLOYEE E NATURAL JOIN COMPENSATION C ;
A condition join is like an equi-join, except the condition being tested doesn’t have to be equality (although it can be). It can be any well-formed predicate. If the condition is satisfied, the corresponding row becomes part of the result table. The syntax is a little different from what you have seen so far, in that the condition is contained in an ON
clause rather than a WHERE
clause.
Suppose Acme Systems wants to know which products the Fort Deposit warehouse has in larger numbers than does the East Kingston warehouse. This question is a job for a condition join:
SELECT *
FROM DEPOSIT JOIN KINGSTON
ON DEPOSIT.QuantityInStock > KINGSTON.QuantityInStock ;
Within the predicate of a condition join, ON
syntax is used in place of WHERE
syntax.
The column-name join is like a natural join, but it’s more flexible. In a natural join, all the source table columns that have the same name are compared with each other for equality. With the column-name join, you select which same-name columns to compare. You can choose them all if you want, making the column-name join effectively a natural join. Or you may choose fewer than all same-name columns. In this way, you have a great degree of control over which cross product rows qualify to be placed into your result table.
Suppose you are Acme Systems, and you have shipped the exact same number of products to the East Kingston warehouse that you have shipped to the Fort Deposit warehouse. So far, nothing has been sold, so the number of products in inventory in East Kingston should match the number in Fort Deposit. If there are mismatches, it means that something is wrong. Either some products were never delivered to the warehouse, or they were misplaced or stolen after they arrived. With a simple query, you can retrieve the inventory levels at the two warehouses.
SELECT * FROM DEPOSIT ;
ProductName QuantityInStock
----------- ---------------
185_Express 12
505_Express 5
510_Express 6
520_Express 2
550_Express 3
SELECT * FROM KINGSTON ;
ProductName QuantityInStock
----------- ---------------
185_Express 15
505_Express 7
510_Express 6
520_Express 2
550_Express 1
For such small tables, it is fairly easy to see which rows don’t match. However, for a table with thousands of rows, it’s not so easy. You can use a column-name join to see whether any discrepancies exist. I show only two columns of the DEPOSIT and KINGSTON tables, to make it easy to see how the various relational operators work on them. In any real application, such tables would have additional columns, and the contents of those additional columns would not necessarily match. With a column-name join, the join operation considers only the columns specified.
SELECT *
FROM DEPOSIT JOIN KINGSTON
USING (ProductName, QuantityInStock) ;
Note the USING
keyword, which tells the DBMS which columns to use.
The result table shows only the rows for which the number of products in stock at Fort Deposit equals the number of products in stock at East Kingston:
ProductName QuantityInStock ProductName QuantityInStock
----------- --------------- ----------- ---------------
510_Express 6 510_Express 6
520_Express 2 520_Express 2
Wow! Only two products match. There is a definite “shrinkage” problem at one or both warehouses. Acme needs to get a handle on security.
By now, you’re probably getting the idea that joins are pretty esoteric and that it takes an uncommon level of spiritual discernment to deal with them adequately. You may have even heard of the mysterious inner join and speculated that it probably represents the core or essence of relational operations. Well, ha! The joke is on you: There’s nothing mysterious about inner joins. In fact, all the joins covered so far in this chapter are inner joins. I could have formulated the column-name join in the last example as an inner join by using the following syntax:
SELECT *
FROM DEPOSIT INNER JOIN KINGSTON
USING (ProductName, QuantityInStock) ;
The result is the same.
The inner join is so named to distinguish it from the outer join. An inner join discards all rows from the result table that don’t have corresponding rows in both source tables. An outer join preserves unmatched rows. That’s the difference. Nothing metaphysical about it.
When you’re joining two tables, the first one (call it the one on the left) may have rows that don’t have matching counterparts in the second table (the one on the right). Conversely, the table on the right may have rows that don’t have matching counterparts in the table on the left. If you perform an inner join on those tables, all the unmatched rows are excluded from the output. Outer joins, however, don’t exclude the unmatched rows. Outer joins come in three types: the left outer join, the right outer join, and the full outer join.
In a query that includes a join, the left table is the one that precedes the keyword JOIN
, and the right table is the one that follows it. The left outer join preserves unmatched rows from the left table but discards unmatched rows from the right table.
To understand outer joins, consider a corporate database that maintains records of the company’s employees, departments, and locations. Tables 4-1, 4-2, and 4-3 contain the database’s sample data.
TABLE 4-1 LOCATION
LocationID |
CITY |
1 |
Boston |
3 |
Tampa |
5 |
Chicago |
TABLE 4-2 DEPT
DeptID |
LocationID |
NAME |
21 |
1 |
Sales |
24 |
1 |
Admin |
27 |
5 |
Repair |
29 |
5 |
Stock |
TABLE 4-3 EMPLOYEE
EmpID |
DeptID |
NAME |
61 |
24 |
Kirk |
63 |
27 |
McCoy |
Now suppose that you want to see all the data for all employees, including department and location. You get this with an equi-join:
SELECT *
FROM LOCATION L, DEPT D, EMPLOYEE E
WHERE L.LocationID = D.LocationID
AND D.DeptID = E.DeptID ;
This statement produces the following result:
1 Boston 24 1 Admin 61 24 Kirk
5 Chicago 27 5 Repair 63 27 McCoy
This results table gives all the data for all the employees, including their location and department. The equi-join works because every employee has a location and a department.
Suppose now that you want the data on the locations, with the related department and employee data. This is a different problem because a location without any associated departments may exist. To get what you want, you have to use an outer join, as in the following example:
SELECT *
FROM LOCATION L LEFT OUTER JOIN DEPT D
ON (L.LocationID = D.LocationID)
LEFT OUTER JOIN EMPLOYEE E
ON (D.DeptID = E.DeptID);
This join pulls data from three tables. First, the LOCATION table is joined to the DEPT table. The resulting table is then joined to the EMPLOYEE table. Rows from the table on the left of the LEFT OUTER JOIN
operator that have no corresponding row in the table on the right are included in the result. Thus, in the first join, all locations are included, even if no department associated with them exists. In the second join, all departments are included, even if no employee associated with them exists. The result is as follows:
1 Boston 24 1 Admin 61 24 Kirk
5 Chicago 27 5 Repair 63 27 McCoy
3 Tampa NULL NULL NULL NULL NULL NULL
5 Chicago 29 5 Stock NULL NULL NULL
1 Boston 21 1 Sales NULL NULL NULL
The first two rows are the same as the two result rows in the previous example. The third row (3 Tampa) has nulls in the department and employee columns because no departments are defined for Tampa and no employees are stationed there. (Perhaps Tampa is a brand new location and has not yet been staffed.) The fourth and fifth rows (5 Chicago and 1 Boston) contain data about the Stock and the Sales departments, but the employee columns for these rows contain nulls because these two departments have no employees. This outer join tells you everything that the equi-join told you plus the following:
The rows returned in the preceding example aren’t guaranteed to be in the order you want. The order may vary from one implementation to the next. To make sure that the rows returned are in the order you want, add an ORDER BY
clause to your SELECT
statement, like this:
SELECT *
FROM LOCATION L LEFT OUTER JOIN DEPT D
ON (L.LocationID = D.LocationID)
LEFT OUTER JOIN EMPLOYEE E
ON (D.DeptID = E.DeptID)
ORDER BY L.LocationID, D.DeptID, E.EmpID;
I’m sure you have figured out by now how the right outer join behaves. It preserves unmatched rows from the right table but discards unmatched rows from the left table. You can use it on the same tables and get the same result by reversing the order in which you present tables to the join:
SELECT *
FROM EMPLOYEE E RIGHT OUTER JOIN DEPT D
ON (D.DeptID = E.DeptID)
RIGHT OUTER JOIN LOCATION L
ON (L.LocationID = D.LocationID) ;
In this formulation, the first join produces a table that contains all departments, whether they have an associated employee or not. The second join produces a table that contains all locations, whether they have an associated department or not.
The full outer join combines the functions of the left outer join and the right outer join. It retains the unmatched rows from both the left and the right tables. Consider the most general case of the company database used in the preceding examples. It could have
To show all locations, departments, and employees, regardless of whether they have corresponding rows in the other tables, use a full outer join in the following form:
SELECT *
FROM LOCATION L FULL OUTER JOIN DEPT D
ON (L.LocationID = D.LocationID)
FULL OUTER JOIN EMPLOYEE E
ON (D.DeptID = E.DeptID) ;
The function of the ON
and WHERE
clauses in the various types of joins is potentially confusing. These facts may help you keep things straight:
ON
clause is part of the inner, left, right, and full joins. The cross join and UNION
join don’t have an ON
clause because neither of them does any filtering of the data.ON
clause in an inner join is logically equivalent to a WHERE
clause; the same condition could be specified either in the ON
clause or a WHERE
clause.ON
clauses in outer joins (left, right, and full joins) are different from WHERE
clauses. The WHERE
clause simply filters the rows returned by the FROM
clause. Rows rejected by the filter are not included in the result. The ON
clause in an outer join first filters the rows of a cross product and then includes the rejected rows, extended with nulls.The performance of queries that include joins depends, to a large extent, on which columns are indexed, and whether the index is clustering or not. A table can have only one clustering index, where data items that are near each other logically, such as 'Smith'
and 'Smithson'
, are also near each other physically on disk. Using a clustering index to sequentially step through a table speeds up hard disk retrievals and thus maximizes performance.
A clustering index works well with multipoint queries, which look for equality in nonunique columns. This is similar to looking up names in a telephone book. All the Smiths are listed together on consecutive pages. Most or all of them are located on the same hard disk cylinder. You can access multiple Smiths with a single disk seek operation. A nonclustering index, on the other hand, would not have this advantage. Each record typically requires a new disk seek, greatly slowing down operation. Furthermore, you probably have to touch every index to be sure you have not missed one. This is analogous to searching the greater Los Angeles telephone book for every instance of Area Code 626. Most of the numbers are in Area 213, but there will be instances of 626 sprinkled throughout the book.
Consider the following sample query:
SELECT Employee.FirstName, Employee.LastName, Student.Major
FROM Employee, Students
WHERE Employee.IDNum = Student.IDNum ;
This query returns the first and last names and the majors of university employees who are also students. How long it takes to run the query depends on how the tables are indexed. If Employee has a clustering index on IDNum, records searched are on consecutive pages. If Employee and Student both have clustering indexes on IDNum, the DBMS will likely use a merge join, which reads both tables in sorted order, minimizing the number of disk accesses needed. Such clustering often eliminates the need for a costly ORDER BY
clause because the records are already sorted in the desired order.
The one disadvantage of clustered indexes is that they can become “tired” after a number of updates have been performed, causing the generation of overflow pages, which require additional disk seeks. Rebuilding the index corrects this problem. By tired, I mean less helpful. Every time you add or delete a record, the index loses some of its advantage. A deleted record must be skipped over, and added records must be put on an overflow page, which will usually require a couple of extra disk seeks.
Some modern DBMS products perform automatic clustered index maintenance, meaning they rebuild clustered indexes without having to be told to do so. If you have such a product, then the disadvantage that I just noted goes away.
18.221.53.5