Chapter 26

Advanced SELECT Statements

Abstract

This is a collection of common SQL programming idioms for queries. We show how correlated subqueries are used and discuss the ANSI/ISO Standard infixed join operators. The chapter covers replacing missing values in columns and rows. There is a detailed Stable Marriages solution.

Keywords

“Data Smear”

Aggregate functions

Algol

Best-fit greedy algorithm

Block structured languages

Box Packing

Combinatorial explosions

Comparisons

Correlation

CROSS JOIN

CUBE

Daniel Morgan

DEFAULT

Derived table

Donald E. Knuth

Duplicate rows

Dwain Camps

Edsger Dijkstra

Frédéric Brouard

FULL OUTER JOIN

GOTO statement

Greedy algorithm

GROUPING < column name >

GROUPING SET

Gusfierld Dan Irving

INNER JOIN

Interpolation

JOINs

LEAD and LAG

LEFT JOIN

Mixed Data

NATURAL JOIN,

OUTER JOIN

Permutation

Preserved table

PRIMARY KEY

Procedural programming

RIGHT JOIN

Robert W. Stable

ROLLUP

SELECT Statements

Smood

SQL-99

Stable marriages

Structured Programming

Summation versus addition

TRUE

UNKNOWN

FALSE

UNIQUE

Unpreserved table

USING clause

VALUES

Wirth Nicklaus

In Chapter 22, we took a look at the basic SELECT statement. Now we need to look at other ways to nest subqueries and build the working table in the FROM clause with JOINs and other syntax. This is a collection of solutions to problems for which you might not think about using SQL. I also give a bit history.

26.1 Correlated Subqueries

One of the classics of Software Engineering is a short paper by the late Edsger Dijkstra entitled Go To Statement Considered Harmful (communications of the ACM, Vol. 11, No. 3, March 1968, pp. 147–148). In this paper, he argued for dropping the GOTO statement from programming languages in favor of what we now called Structured Programming.

One of his observations was that programs that used only BEGIN-END blocks, WHILE loops and IF-THEN-ELSE statements were easier to read and maintain. Programs that jumped around via GOTO statements were harder to follow because the execution path could have arrived at a statement label from anywhere in the code. Algol, the first of the blocked structured programming languages, had all of those control structures but still kept the GOTO—it was considered a fundamental part of programming. Before Dijkstra, nobody had really understood the power of limiting the scope of variables and control structures in procedural code. The basic idea of a scope is that a block of code can only reference variables that are declared in the block. Failing to find a local variable, the containing blocks are then inspected from the inside out.

We have the same concept in SQL queries. A correlated subquery is a subquery that references columns in the tables of its containing query. The general structure might look like:

SELECT T1.x, T1.y, T1.z
FROM TableOne AS T1
 WHERE T1.x
 = (SELECT T2.x, T2.y, T2.z
 FROM TableTwo AS T2
WHERE T2.y
 = (SELECT foobar
 FROM TableThree AS T3
WHERE T3.x = T1.x
 AND T3.y = T2.y
 AND T3.z = 42));

Look at the innermost query. The predicate (T3.z = 42) is local to this query. The predicate (T3.y = T2.y) works because this query is in the scope of the query with T2 in the FROM clause. Likewise, The predicate (T3.x = T1.x) works because this query is in the scope of the query with T1 in the FROM clause. If I had not qualified the table names in the innermost WHERE clause, the predicate (T3.x = x) would refer to the most local x, which gives us (T3.x = T3.x), which is always TRUE or UNKNOWN. That is absurd.

But a predicate like (T3.z = floob) might reference table T1, T2, or T3, depending on which ones has the nearest column floob. Which table would be determined by working outward. This is why it is important to qualify column names.

The tables can reference the same table under a different correlation name. Consider a query to find all the students who are younger than the oldest student of their gender:

SELECT S1.stud_nbr, S1.stud_name, S1.sex_code, S1.stud_age
FROM Students AS S1
 WHERE stud_age
 < (SELECT MAX(stud_age)
FROM Students AS S2
 WHERE S1.sex_code = S2.sex_code);

1. Let’s work it out in detail. The fiction in SQL is that we create local tables S1 and S2 which happen to have the same data and structure as the Students table. A copy of the Students table is made for each correlation name, S1 and S2. Obviously, this is not how it is implemented in a real SQL compiler. Following the same steps we used in simple SELECT Statements, expand the outer query.

stud_nbrstud_namesex_codestud_age
1Smith116
2Smyth217
3Smoot216
4Adams217
5Jones116
6Celko117
7Vennor216
8Murray118

t0010

2. When you get to the WHERE clause, and find the innermost query, you will see that you need to get data from the containing query. The model of execution says that each outer row has the subquery executed on it in parallel with the other rows. Assume we are working on student (1, ‘Smith’), who is male. The query in effect becomes:
SELECT 1, 'Smith', 1, 16
FROM Students AS S1
 WHERE 16 < (SELECT MAX(stud_age)
 FROM Students AS S2
WHERE 1 = S2.sex_code);
As an aside, the search conditions (1 = S2.sex_code) and (S2.sex_code = 1) are equivalent. The choice is largely a matter of the programmer’s culture—do you read from left to right or right to left?

3. The subquery can now be calculated for male students; the maximum stud_age is 18. When we expand this out for all the other rows, this will give us the effect of this set of parallel queries.
SELECT 1, 'Smith', 1, 16 FROM Students AS S1 WHERE 16 < 18;
SELECT 2, 'Smyth', 2, 17 FROM Students AS S1 WHERE 17 < 17;
SELECT 3, 'Smoot', 2, 16 FROM Students AS S1 WHERE 16 < 17;
SELECT 4, 'Adams', 2, 17 FROM Students AS S1 WHERE 17 < 17;
SELECT 5, 'Jones', 1, 16 FROM Students AS S1 WHERE 16 < 18;
SELECT 6, 'Celko', 1, 17 FROM Students AS S1 WHERE 17 < 18;
SELECT 7, 'Vennor', 2, 16 FROM Students AS S1 WHERE 16 < 17;
SELECT 8, 'Murray', 1, 18 FROM Students AS S1 WHERE 18 < 18;

4. These same steps have been done for each row in the containing query. The model is that all of the subqueries are resolved at once. With cheaper and cheaper parallel hardware, this might be true some day, but no implementation really does it that way currently. The usual approach in real SQL compilers is to build procedural loops in the database engine that scan through both tables. What table is in what loop is decided by the optimizer. The final results are

stud_nbrstud_namesex_codestud_age
1Smith116
3Smoot216
5Jones116
6Celko117
7Vennor216

t0015

Again, no real product works this way, but it has to produce the same results as this process.

There is no limit to the depth of nesting of correlated subqueries in theory. In practice, it is probably a good heuristic to keep the nesting under five levels.

These examples used scalar subqueries, but you can also use correlated subqueries that return a collection of tables. For example, to find all of Professor Celko’s students, we might use this query.

SELECT S1.stud_nbr, S1.stud_name, S1.sex_code, S1.stud_age
FROM Students AS S1
 WHERE S1.stud_nbr
 IN (SELECT T.stud_nbr
FROM Teachers AS T1
 WHERE S1.stud_nbr = T1.stud_nbr
 AND T1.teacher_name = 'Celko'),

Another problem is that many SQL programmers do not fully understand the rules for scope of derived table names. If an infixed join is given a derived table name, then all of the table names inside it are hidden from containing expressions. For example, this will fail

SELECT a, b, c -- wrong!
 FROM (Foo
 INNER JOIN
 Bar
 ON Foo.y >= Bar.x) AS Foobar (x, y)
INNER JOIN
Flub
ON Foo.y <= Flub.z;

Because the table name Foo is not available to the second INNER JOIN. However, this will work:

SELECT a, b, c
 FROM (Foo
 INNER JOIN
 Bar
 ON Foo.y >= Bar.x) AS Foobar (x, y)
INNER JOIN
Flub
ON Foobar.y <= Flub.z;

If you start nesting lots of derived table expressions, you can force an order of execution in the query. It is not a generally a good idea to try to outguess the optimizer.

So far, I have shown fully qualified column names. It is a good programming practice, but it is not required. Assume that Foo and Bar both have a column named w. These statements will produce an ambiguous name error:

SELECT a, b, c
FROM Foo
 INNER JOIN
 Bar ON y >= x
 INNER JOIN
 Flub ON y <= w;
SELECT a, b, c
FROM Foo, Bar, Flub
 WHERE y BETWEEN x AND w

But this statement will work from inside the parentheses first and then does the outermost INNER JOIN last.

SELECT a, b, c
 FROM Foo
 INNER JOIN
 (Bar
INNER JOIN
Flub ON y <= w)
 ON y >= x;

If Bar did not have a column named w, then the parser would go to the next containing expression, find Foo.w, and use it.

26.2 Infixed INNER JOINs

SQL-92 added new syntax for various JOINs using infixed operators in the FROM clause. The JOIN operators are quite general and flexible, allowing you to do things in a single statement that you could not do in the older notation. The basic syntax is

< joined table > ::=
 < cross join > | < qualified join > | (< joined table >)
< cross join > ::= < table reference > CROSS JOIN < table reference >
< qualified join > ::=
 < table reference > [NATURAL] [< join type >] JOIN
 < table reference > [< join specification >]
< join specification > ::= < join condition > | < named columns join >
< join condition > ::= ON < search condition >
< named columns join > ::= USING (< join column list >)
< join type > ::= INNER | < outer join type > [OUTER] | UNION
< outer join type > ::= LEFT | RIGHT | FULL
< join column list > ::= < column name list >
< table reference > ::=
 < table name > [[AS] < correlation name>[(< derived column list >)]]
 | < derived table >
[AS] < correlation name > [(< derived column list >)]
 | < joined table >
< derived table > ::= < table subquery >
< column name list > ::=
 < column name > [{ < comma > < column name > }. . .]

An INNER JOIN is done by forming the CROSS JOIN and then removing the rows that do not meet the JOIN search conditions given in the ON clause, just like we did with the original FROM.. WHERE syntax. The ON clause can be as elaborate as you want to make it, as long as it refers to tables and columns within its scope. If a < qualified join > is used without a < join type >, INNER is implicit. But it is good documentation to spell out all of the JOIN operators.

However, in the real world, most INNER JOINs are done using equality tests on columns with the same names in different tables, rather than on elaborate search conditions. Equi-JOINs are so common that Standard SQL has two shorthand ways of specifying them. The USING (c1, c2, …, cn) clause takes the column names in the list and replaces them with the clause ON ((T1.c1, T1.c2, …, T1.cn) = (T2.c1, T2.c2, …, T2.cn)). Likewise, the NATURAL option is shorthand for a USING() clause that is a list of all the column names that are common to both tables. If NATURAL is specified, a JOIN specification cannot be given; it is already there.

A strong warning: do not use NATURAL JOIN in production code. Any change to the column names or addition new columns will change the join at run time. This is also why you do not use “SELECT *” in production code. But the NATURAL JOIN is more dangerous. As Daniel Morgan pointed out a NATURAL JOIN between two tables with a vague generic column name like “comments” for unrelated data elements can give you a meaningless join containing megabytes of text.

The same sort of warning applies to the USING clause. Neither of these options is widely implemented or used. If you find out that “product_id,” “product_nbr,” and “upc” were all used for the same data element in your schema, you should do a global change to make sure that one data element has one and only one name.

There was a myth among ACCESS programmers that the ON clause can contain only JOIN conditions and the WHERE can contain only search conditions. This is not true, and the difference in the position of the search conditions is not important. The product generated code in that format because this was the execution plan used by the simple compiler.

Having said this, separating the conditions this way can have some advantages for documentation. It becomes easy to remove the WHERE clause and have a candidate for a VIEW. But there are trade-offs.

26.3 OUTER JOINs

OUTER JOINs used to be done with proprietary vendor syntax. Today, the use of the Standard OUTER JOIN is universal. An OUTER JOIN is a JOIN that preserves all the rows in one or both tables, even when they do not have matching rows in the second table. The unmatched columns in the unpreserved table (this is the correct term) are filled with NULLs to complete the join and return rows with the right structure.

Let’s take a real-world situation. I have a table of orders and a table of suppliers that I wish to JOIN for a report to tell us how much business we did with each supplier. With a traditional inner join, the query would be this:

SELECT S.sup_id, S.sup_name, O.order_nbr, O.order_amt
FROM Suppliers AS S, Orders AS O
 WHERE S.sup_id = O.sup_id;

or with the infixed syntax

SELECT S.sup_id, S.sup_name, O.order_nbr, O.order_amt
FROM Suppliers AS S –- preserved table
 INNER JOIN
 Orders AS O
 ON S.sup_id = O.sup_id;

Some supplier totals include credits for returned merchandise, and our total business with them works out to zero dollars. Other suppliers never got an order from us at all, so we did zero dollars worth of business with them, too. But the first case will show up in the query result and be passed on to the report, whereas the second case will disappear in the INNER JOIN.

If we had used an OUTER JOIN, preserving the Suppliers table, we would have all the suppliers in the results. When a supplier with no orders was found in the Orders table, the order_nbr and order_amt columns would be given a NULL value in the result row.

SELECT S.sup_id, S.sup_name, O.order_nbr, O.order_amt
FROM Suppliers AS S
 OUTER LEFT JOIN
 Orders AS O
 ON S.sup_id = O.sup_id;

26.3.1 A Bit of History

Before the SQL-99 Standard, there was no Standard OUTER JOIN syntax, so you had to construct it by hand with a messy UNION in products like very early versions of DB2 from IBM like this:

SELECT sup_id, sup_name, order_amt -- regular INNER JOIN
FROM Suppliers, Orders
 WHERE Suppliers.sup_id = Orders.sup_id
UNION ALL
SELECT sup_id, sup_name, CAST(NULL AS INTEGER) -- preserved rows
FROM Suppliers
 WHERE NOT EXISTS
 (SELECT *
FROM Orders
WHERE Suppliers.sup_id = Orders.sup_id);

You have to use a NULL with the correct data type to make the UNION work, hence the CAST() functions. Some products are smart enough that just NULL by itself will be given the correct data type, but this is portable and safer.

The other alternative is to insert a constant of some sort to give a more meaningful result. This is easy in the case of a CHARACTER column, where a message like ‘{{NONE}}’ can be quickly understood. It is much harder in the case of a numeric column, where we could have a balance with a supplier that is positive, zero, or negative because of returns and credits. There really is a difference between a vendor that we did not use and a vendor whose returns and credits canceled out its orders.

In the second edition of this book, I described the proprietary OUTER JOIN extensions in detail. Today, they are gone and replaced by the Standard syntax. The vendor extensions were all different in syntax or semantics or both. Since they are mercifully gone, I am not going to tell you about them in this edition.

The name LEFT OUTER JOIN comes from the fact that the preserved table is on the left side of the operator. Likewise, a RIGHT OUTER JOIN would have the preserved table on the right-hand side and the FULL OUTER JOIN preserves both tables.

Here is how OUTER JOINs work in Standard SQL. Assume you are given:

Table1
ab
1w
2x
3y
4z

t0020

Table2
ac
1r
2s
3t

t0025

and the OUTER JOIN expression

Table1
LEFT OUTER JOIN
Table2
ON Table1.a = Table2.a -- JOIN condition
 AND Table2.c = 't'; −− single table filter condition

We call Table1 the “preserved table” and Table2 the “unpreserved table” in the query. What I am going to give you is a little different, but equivalent to the ANSI/ISO standards.

(1) We build the CROSS JOIN of the two tables. Scan each row in the result set.

(2) If all search conditions in the ON clause test TRUE for that row, then you keep it.

(3) If the predicate tests FALSE or UNKNOWN for that row, then keep the columns from the preserved table, convert all the columns from the unpreserved table to NULLs and remove the duplicates. So let us execute this by hand:

Let entity = passed the join search condition

Let entity = passed the filter search condition

Step One and two:

Table1 CROSS JOIN Table2
abacNotes
1w1rentity
1w2s
1w3tentity
2x1r
2x2sentity
2x3tentity
3y1r
3y2s
3y3tentity entity
4z1r
4z2s
4z3tentity

t0030

Step three:

Table1 LEFT OUTER JOIN Table2
abacNotes
3y3tentity row(s)from step two
1wNULLNULLSets of duplicates
1wNULLNULL
1wNULLNULL
2xNULLNULL
2xNULLNULL
2xNULLNULL
3yNULLNULLentity removed in step two
3yNULLNULLentity removed in step two
4zNULLNULL
4zNULLNULL
4zNULLNULL

t0035

the final results:

Table1 LEFT OUTER JOIN Table2
abac
1wNULLNULL
2xNULLNULL
3y3t
4zNULLNULL

t0040

The basic rule is that every row in the preserved table is represented in the results in at least one result row.

Consider the two famous Chris Date tables from his “Suppliers and Parts” database used in his textbooks.

SupParts
sup_idpart_nbrpart_qty
S1P1100
S1P2250
S2P1100
S2P2250

t0045

Suppliers
sup_id
S1
S2
S3

If you write the OUTER JOIN with only the join predicate in the ON clause, like this:

SELECT Suppliers.sup_id, SupParts.part_nbr, SupParts.part_qty
FROM Suppliers
 LEFT OUTER JOIN
 SupParts
 ON Supplier.sup_id = SupParts.sup_id
 WHERE part_qty < 200;

You get:

sup_idpart_nbrpart_qty
S1P1100
S2P1100

but if we put the search predicate in the ON clause, we get this result.

SELECT Suppliers.sup_id, SupParts.part_nbr, SupParts.part_qty
 FROM Suppliers
 LEFT OUTER JOIN
 SupParts
 ON Supplier.sup_id = SupParts.sup_id
AND part_qty < 200;

You get:

sup_idpart_nbrpart_qty
S1P1100
S2P1100
S3NULLNULL

Another problem was that you could not show the same table as preserved and unpreserved in the proprietary syntax options, but it is easy in Standard SQL. For example, to find the students who have taken Math 101 and might have taken Math 102:

SELECT C1.stud_nbr, C1.math_course, C2.math_course
FROM (SELECT * FROM Courses WHERE math_course = 'Math 101') AS C1
 LEFT OUTER JOIN
 (SELECT * FROM Courses WHERE math_course = 'Math 102') AS C2
 ON C1.stud_nbr = C2.stud_nbr;

A third problem is that the order of execution matters with a chain of OUTER JOINs. That is to say, ((T1 OUTER JOIN T2) OUTER JOIN T3) does not produce the same results as (T1 OUTER JOIN (T2 OUTER JOIN T3)).

26.3.2 NULLs and OUTER JOINs

The NULLs that are generated by the OUTER JOIN can occur in columns derived from source table columns that have been declared to be NOT NULL. Even if you tried to avoid all the problems with NULLs by making every column in every table of your schema NOT NULL, they could still occur in OUTER JOIN and OLAP function results. However, a table can have NULLs and still be used in an OUTER JOIN. Consider different JOINs on the following two tables, which have NULLs in the common column:

T1
ax
1r
2v
3NULL

t0065

T2
bx
7r
8s
9NULL

t0070

A natural INNER JOIN on column x can only match those values that are equal to each other. But NULLs do not match to anything, even to other NULLs. Thus, there is one row in the result, on the value r in column x in both tables.

Now, do a LEFT OUTER JOIN on the tables, which will preserve table T1, and you get

T1 LEFT OUTER JOIN T2 ON T1.x = T2.x
aT1.xbT2.x
1r7r
2vNULLNULL
3NULLNULLNULL

t0075

Again, there are no surprises. The original INNER JOIN row is still in the results. The other two rows of T1 that were not in the equi-JOIN do show up in the results, and the columns derived from table T2 are filled with NULLs. The RIGHT OUTER JOIN would also behave the same way. The problems start with the FULL OUTER JOIN, which looks like this:

T1 FULL OUTER JOIN T2 ON (T1.x = T2.x)
aT1.xbT2.x
1r7r
2vNULLNULL
3NULLNULLNULL
NULLNULL8s
NULLNULL9NULL

t0080

The way this result is constructed is worth explaining in detail.

First do an INNER JOIN on T1 and T2, using the ON clause condition, and put those rows (if any) in the results. Then all rows in T1 that could not be joined are padded out with NULLs in the columns derived from T2 and inserted into the results. Finally, take the rows in T2 that could not be joined, pad them out with NULLs, and insert them into the results. The bad news is that the original tables cannot be reconstructed from an OUTER JOIN. Look at the results of the FULL OUTER JOIN, which we will call R1, and SELECT the first columns from it:

SELECT T1.a, T1.x FROM R1
ax
1r
2v
3NULL
NULLNULL
NULLNULL

t0085

The created NULLs remain and could not be differentiated from the original NULLs. But you cannot throw out those duplicate rows because they may be in the original table T1. There is now a function, GROUPING (< column name >), used with the CUBE, ROLLUP, and GROUPING SET() options which returns a 1 for original NULLs or data and 0 for created NULLs. Your vendor may allow this function to be used with the OUTER JOINs.

26.3.3 NATURAL Versus Searched OUTER JOINs

It is worth mentioning in passing that Standard SQL has a NATURAL LEFT OUTER JOIN, but it is not implemented in most versions of SQL.

A NATURAL JOIN has only one copy of the common column pairs in its result. The searched OUTER JOIN has both of the original columns, with their table-qualified names. The NATURAL JOIN has to have a correlation name for the result table to identify the shared columns. We can build a NATURAL LEFT OUTER JOIN by using the COALESCE() function to combine the common column pairs into a single column and put the results into a VIEW where the columns can be properly named, thus:

CREATE VIEW NLOJ12 (x, a, b)
AS SELECT COALESCE(T1.x, T2.x), T1.a, T2.b
 FROM T1 LEFT OUTER JOIN T2 ON T1.x = T2.x;
NLOJ12
xab
r17
v2NULL
NULL3NULL

t0090

Unlike the NATURAL JOINs, the searched OUTER JOIN does not have to use a simple one-column equality as the JOIN search condition. The search condition can have several search conditions, use other comparisons, and so forth. For example,

T1 LEFT OUTER JOIN T2 ON (T1.x < T2.x)
aT1.xbT2.x
1r8s
2vNULLNULL
3NULLNULLNULL

t0095

as compared to

T1 LEFT OUTER JOIN T2 ON (T1.x > T2.x)
aT1.xbT2.x
1rNULLNULL
2v7r
2v8s
3NULLNULLNULL

t0100

26.3.4 Self OUTER JOINs

There is no rule that forbids an OUTER JOIN on the same table. In fact, this kind of self-join is a good trick for “flattening” a normalized table into a horizontal report. To illustrate the method, start with a skeleton table defined as

CREATE TABLE Credits
(student_nbr INTEGER NOT NULL,
 course_name CHAR(8) NOT NULL,
 PRIMARY KEY (student_nbr, course_name));

This table represents student ids and a course name for each class they have taken. However, our rules say that students cannot get credit for ‘CS-102’ until they have taken the prerequisite ‘CS-101’ course; they cannot get credit for ‘CS-103’ until they have taken the prerequisite ‘CS-102’ course, and so forth. Let’s first load the table with some sample values.

Notice that student #10 has both courses, student #20 has only the first of the series, and student #3 jumped ahead of sequence and therefore cannot get credit for his ‘CS-102’ course until he goes back and takes ‘CS-101’ as a prerequisite.

Credits
student_nbrcourse_name
1CS-101
1CS-102
2CS-101
3CS-102

t0105

What we want is basically a histogram (bar chart) for each student, showing how far he or she has gone in his or her degree programs. Assume that we are only looking at two courses; the result of the desired query might look like this (NULL is used to represent a missing value):

 (1, 'CS-101', 'CS-102')
 (2, 'CS-101', NULL)

Clearly, this will need a self-JOIN, since the last two columns come from the same table, Credits. You have to give correlation names to both uses of the Credits table in the OUTER JOIN operator when you construct a self OUTER JOIN, just as you would with any other SELF-JOIN, thus:

SELECT student_nbr, C1.course_name, C2.course_name
FROM Credits AS C1
 LEFT OUTER JOIN
 Credits AS C2
 ON C1.stud_nbr_nbr = C2.stud_nbr_nbr
AND C1.course_name = 'CS-101'
AND C2.course_name = 'CS-102';

26.3.5 Two or More OUTER JOINs

Some relational purists feel that every operator should have an inverse, and therefore they do not like the OUTER JOIN. Others feel that the created NULLs are fundamentally different from the explicit NULLs in a base table and should have a special token. SQL uses its general-purpose NULLs and leaves things at that. Getting away from theory, you will also find that vendors have often done strange things with the ways their products work.

A major problem is that OUTER JOIN operators do not have the same properties as INNER JOIN operators. The order in which FULL OUTER JOINs are executed will change the results (a mathematician would say that they are not associative). To show some of the problems that can come up when you have more than two tables, let us use three very simple two column tables. Notice that some of the column values match and some do not match, but the three tables have all possible pairs of column names in them.

 CREATE TABLE T1 (a INTEGER NOT NULL, b INTEGER NOT NULL);
 INSERT INTO T1 VALUES (1, 2);
 CREATE TABLE T2 (a INTEGER NOT NULL, c INTEGER NOT NULL);
 INSERT INTO T2 VALUES (1, 3);
 CREATE TABLE T3 (b INTEGER NOT NULL, c INTEGER NOT NULL);
 INSERT INTO T3 VALUES (2, 100);

Now let’s try some of the possible orderings of the three tables in a chain of LEFT OUTER JOINS. The problem is that a table can be preserved or unpreserved in the immediate JOIN and in the opposite state in the containing JOIN.

SELECT T1.a, T1.b, T3.c
FROM ((T1 NATURAL LEFT OUTER JOIN T2)
 NATURAL LEFT OUTER JOIN T3);
Result
abc
12NULL
SELECT T1.a, T1.b, T3.c
FROM ((T1 NATURAL LEFT OUTER JOIN T3)
 NATURAL LEFT OUTER JOIN T2);
Result
abc
12100
SELECT T1.a, T1.b, T3.c
FROM ((T1 NATURAL LEFT OUTER JOIN T3)
 NATURAL LEFT OUTER JOIN T2);
Result
abc
NULLNULLNULL

Even worse, the choice of column, in the SELECT list can change the output. Instead of displaying T3.c, use T2.c and you will get:

SELECT T1.a, T1.b, T2.c
FROM ((T2 NATURAL LEFT OUTER JOIN T3)
 NATURAL LEFT OUTER JOIN T1);
Result
abc
NULLNULL3

t0125

The compiler should give you error messages about ambiguous column names.

26.3.6 OUTER JOINs and Aggregate Functions

At the start of this chapter, we had a table of orders and a table of suppliers, which were to be used to build a report to tell us how much business we did with each supplier. The query that will do this is

SELECT Suppliers.sup_id, sup_name, SUM(order_amt)
FROM Suppliers
 LEFT OUTER JOIN
 Orders
 ON Suppliers.sup_id = Orders.sup_id
 GROUP BY sup_id, sup_name;

Some suppliers’ totals include credits for returned merchandise, such that our total business with them worked out to zero dollars. Each supplier with which we did no business will have a NULL in its order_amt column in the OUTER JOIN. The usual rules for aggregate functions with NULL values apply, so these suppliers will also show a zero total amount. It is also possible to use a function inside an aggregate function, so you could write SUM(COALESCE(T1.x, T2.x)) for the common column pairs.

If you need to tell the difference between a true sum of zero and the result of a NULL in an OUTER JOIN, use the MIN() or MAX() function on the questionable column. These functions both return a NULL result for a NULL input, so an expression inside the MAX() function could be used to print the message MAX(COALESCE(order_amt, ‘No Orders’)), for example.

Likewise, these functions could be used in a HAVING clause, but that would defeat the purpose of an OUTER JOIN.

26.3.7 FULL OUTER JOIN

The FULL OUTER JOIN is a mix of the LEFT and RIGHT OUTER JOINs, with preserved rows constructed from both tables. The statement takes two tables and puts them in one result table. Again, this is easier to explain with an example than with a formal definition.

T1
ax
1r
2v
3NULL

t0130

T2
bx
7r
8s
9NULL

t0135

T1 FULL OUTER JOIN T2 ON (T1.x = T2.x)
aT1.xbT2.xnotes
1r7r-- T1 INNER JOIN T2
2vNULLNULL-- preserved from T1
3NULLNULLNULL-- preserved from T1
NULLNULL8s-- preserved from T2
NULLNULL9NULL-- preserved from T2

t0140

26.4 UNION JOIN Operators

There is also a UNION JOIN in Standard SQL which returns the results of a FULL OUTER JOIN without the rows that were in the INNER JOIN of the two tables. The only SQL product which has implemented it as of 2014 is MariaDB (www.mariadb.com) and it is part of the SAS statistical system (www.sas.com) in the PROC SQL options.

T1 UNION JOIN T2 ON(T1.x = T2.x)
aT1.xbT2.xNotes
2vNULLNULL-- preserved from T1
3NULLNULLNULL-- preserved from T1
NULLNULL8s-- preserved from T2
NULLNULL9NULL-- preserved from T2

t0145

As an example of this, you might want to combine the medical records of male and female patients into one table with this query.

SELECT *
FROM (SELECT 'male', prostate_flg FROM Males)
OUTER UNION
 (SELECT 'female', pregnancy_flg FROM Females);

to get a result table like this

Result
maleprostate_flgfemalepregnancy_flg
'male''no'NULLNULL
'male''no'NULLNULL
'male''yes'NULLNULL
'male''yes'NULLNULL
NULLNULL'female''no'
NULLNULL'female''no'
NULLNULL'female''yes'
NULLNULL'female''yes'

t0150

Frédéric Brouard came up with a nice trick for writing a similar join. That is, a join on one table, say a basic table of student data, with either a table of data particular to domestic students or another table of data particular to foreign students, based on the value of a parameter. This is one way to handle sub-types and supper-types in SQL. This differs from a true UNION JOIN in that it has to have a “root” table to use for the outer JOINs.

CREATE TABLE Students
(student_nbr INTEGER NOT NULL PRIMARY KEY,
 student_type CHAR(1) NOT NULL DEFAULT 'D'
CHECK (student_type IN ('D', 2, ..))
 . . .);
CREATE TABLE Domestic_Students
(student_nbr INTEGER NOT NULL PRIMARY KEY,
 REFERENCES Students(student_nbr),
 . . .);
CREATE TABLE Foreign_Students
(student_nbr INTEGER NOT NULL PRIMARY KEY,
 REFERENCES Students(student_nbr),
 . . .);
SELECT Students.*, Domestic_Students.*, Foreign_Students.*
FROM Students
 LEFT OUTER JOIN
 Domestic_Students
 ON CASE Students.stud_type
WHEN 'D' THEN 1 ELSE NULL END
= 1
LEFT OUTER JOIN
Foreign_Students
ON CASE Student.stud_type WHEN 2 THEN 1 ELSE NULL END
 = 1;

26.5 Scalar SELECT Expressions

A SELECT expression that returns a single row with a single value can be used where a scalar expression can be used. If the result of the scalar query is empty, it is converted to a NULL. This will sometimes, but not always, let you write an OUTER JOIN as a query within the SELECT clause; thus, this query will work only if each supplier has one or zero orders:

 SELECT sup_id, sup_name, order_nbr,
 (SELECT order_amt
FROM Orders
 WHERE Suppliers.sup_id = Orders.sup_id)
 AS order_amt
 FROM Suppliers;

However, I could write

 SELECT sup_id, sup_name,
(SELECT COUNT(*)
FROM Orders
 WHERE Suppliers.sup_id = Orders.sup_id)
 FROM Suppliers;

instead of writing

SELECT sup_id, sup_name, COUNT(*)
FROM Suppliers
 LEFT OUTER JOIN
 Orders
 ON Suppliers.sup_id = Orders.sup_id
 GROUP BY sup_id, sup_name;

26.6 Old Versus New JOIN Syntax

The infixed OUTER JOIN syntax was meant to replace several different vendor options which all had different syntax and semantics. It was absolutely needed. The INNER JOIN and OUTER JOIN operators are universal now. They are binary operators and programmers are used to binary operators—add, subtract, multiply, and divide are all binary operators. E-R diagrams use lines between tables to show a relational schema.

But this leads to a linear approach to problem solving that might not be such a good thing in SQL. Consider this statement which would have been written in the traditional syntax as:

SELECT a, b, c
 FROM Foo, Bar, Flub
 WHERE Foo.y BETWEEN Bar.x AND Flub.z;

With the infixed syntax, I can write this same statement in any of several ways. For example:

SELECT a, b, c
FROM (Foo
 INNER JOIN
 Bar
 ON Foo.y >= Bar.x)
INNER JOIN
Flub
ON Foo.y <= Flub.z;

or

SELECT a, b, c
 FROM Foo
 INNER JOIN
 (Bar
INNER JOIN
Flub
ON Foo.y <= Flub.z)
 ON Foo.y >= Bar.x;

I leave it to the reader to find all the permutations, with or without the parentheses.

Humans tend to see things that are close together as a unit or as having a relationship. It is a law of visual psychology and typesetting called the Law of Proximity. The extra reserved words in the infixed notation tend to work against proximity; you have to look in many places to find the parts of a.

The infixed notation invites a programmer to add one table at a time to the chain of JOINs. First I built and tested the Foo-Bar join and when I was happy with the results, I added Flub. Step-wise program refinement was one of the mantras of structured programming.

But look at the code; can you see that there is a BETWEEN relationship among the three tables? It is not easy, is it? In effect, you see only pairs of tables and not the whole problem. SQL is an “all-at-once” set-oriented language, not a “step-wise” language. This is much like the conceptual difference between addition with a simple binary + operator and the generalized n-ary summation operator with a Σ.

I am against infixed JOINs? No, but it is a bit more complicated than it first appears and if there are some OUTER JOINs in the mix, things can be very complicated. Just be careful with the new toys, kids.

26.7 Constrained Joins

We can relate two tables together based on quantities in each of them. These problems take the form of pairing items in one set with items in another. The extra restriction is that the set of pairs has constraints at the level of the result which a row-by-row join does not. Here the values are identifiers and cannot be repeated in the results.

Let us assume we have two tables, X and Y. Some possible situations are:

1. A row in X matches one and only one row in Y. There can be one matching function that applies to one set, or each set can have its own matching function.
An example of one matching function is an optimization with constraints. For example, you are filling an egg carton with a set of colored eggs given rules about how the colors can be arranged. A lot of logic puzzles use this model.
The classic example of two matching functions is the Stable Marriages problem where the men rank the women they want to marry and where the women rank the men they want to marry.

2. A row in X matches one or more rows in Y: knapsack or bin packing problems where one bin holds one or more items and we try to optimize the arrangement.
In all cases, there can be a unique answer or several answers or no valid answer at all. Let’s give some examples and code for them.

26.7.1 Inventory and Orders

The simplest example is filling customer orders from the inventories that we have at various stores. To make life easier, assume that we have only one product, process orders in increasing customer_id order (this could be temporal order as well), and draw from store inventory by increasing store_id (this could be nearest store).

CREATE TABLE Inventory
(store_id INTEGER NOT NULL PRIMARY KEY,
 item_qty INTEGER NOT NULL CHECK (item_qty >=0));
INSERT INTO Inventory (store_id, item_qty)
VALUES (10, 2), (20, 3), (30, 2);
CREATE TABLE Orders
(customer_id CHAR(5) NOT NULL PRIMARY KEY,
 item_qty INTEGER NOT NULL CHECK (item_qty > 0));
INSERT INTO Orders (customer_id, item_qty)
VALUES ('Bill', 4), ('Fred', 2);

What we want to do is fill Bill’s order for 4 units by taking 2 units from store #10, and 2 units from store #20. Next, we process Fred’s order with the 1 unit left in store #10 and 1 unit from store #3.

SELECT I.store_id, O.customer_id,
 (CASE WHEN O.end_running_qty <= I.end_running_qty
 THEN O.end_running_qty
 ELSE I.end_running_qty END
 - CASE WHEN O.start_running_qty >=I.start_running_qty
 THEN O.start_running_qty
 ELSE I.start_running_qty END)
 AS items_consumed_tally
 FROM (SELECT I1.store_id,
 SUM(I2.item_qty) - I1.item_qty,
 SUM(I2.item_qty)
 FROM Inventory AS I1, Inventory AS I2
 WHERE I2.store_id <= I1.store_id
 GROUP BY I1.store_id, I1.item_qty)
AS I (store_id, start_running_qty, end_running_qty)
 INNER JOIN
 (SELECT O1.customer_id,
 SUM(O2.item_qty) - O1.item_qty,
 SUM(O2.item_qty) AS end_running_qty
 FROM Orders AS O1, Orders AS O2
 WHERE O2.customer_id <= O1.customer_id
GROUP BY O1.customer_id, O1.item_qty)
 AS O(customer_id, start_running_qty, end_running_qty)
 ON O.start_running_qty < I.end_running_qty
 AND O.end_running_qty > I.start_running_qty;
store_idcustomer_iditems_consumed_tally
10Bill2
20Bill2
20Fred1
30Fred1

This can also be done with ordinal functions.

26.7.2 Stable Marriages

This is a classic programming problem from procedural language classes. The set up is fairly simple; you have a set of potential husbands and an equal sized set of potential wives. We want to pair them up into stable marriages.

What is a stable marriage? In 25 words or less, a marriage in which neither partner can do better. You have a set of n men and a set of n women. All the men have a preference scale for all the women that ranks them from 1 to n without gaps or ties. The women have the same ranking system for the men. The goal is to pair off the men and women into n marriages such that there is no pair in your final arrangement where Mr. X and Ms. Y are matched to each other when they both would rather be matched to someone else.

For example, let’s assume the husbands are (‘Joe Celko,’ ‘Brad Pitt’) and the wives are (‘Jackie Celko,’ ‘Angelina Jolie’). If Jackie got matched to Mr. Pitt, she would be quite happy. And I would enjoy Ms. Jolie’s company. However, Mr. Pitt and Ms. Jolie can both do better than us. Once they are paired up, they will stay that way, leaving Jackie and me still wed.

The classic Stable Marriage algorithms usually are based on backtracking. These algorithms try a combination of couples and then attempt to fix any unhappy matches. When the algorithm hits on a situation where nobody can improve their situation, they stop and give an answer.

Two important things to know about this problem: (1) there is always a solution and (2) there is often more than one solution. Remember that a stable marriage is not always a happy marriage. In fact, in this problem, while there is always at least one arrangement of stable marriages in any set, you most often find many different pairings that produce a set of stable marriages. Each set of marriages will tend to maximize either the happiness of the men or the women.

CREATE TABLE Husbands
(man CHAR(2) NOT NULL,
 woman CHAR(2) NOT NULL,
 PRIMARY KEY (man, woman),
 ranking INTEGER NOT NULL);
CREATE TABLE Wives
(woman CHAR(2) NOT NULL,
 man CHAR(2) NOT NULL,
 PRIMARY KEY (woman, man),
 ranking INTEGER NOT NULL);
CREATE TABLE Wife_Perms
(perm INTEGER NOT NULL PRIMARY KEY,
 wife_name CHAR(2) NOT NULL);
-- The men's preferences
INSERT INTO Husbands -- husband #10
VALUES ('h1', 'w1', 5), ('h1', 'w2', 2),
('h1', 'w3', 6), ('h1', 'w4', 8),
('h1', 'w5', 4), ('h1', 'w6', 3),
('h1', 'w7', 1), ('h1', 'w8', 7);
INSERT INTO Husbands -- husband #20
VALUES ('h2', 'w1', 6), ('h2', 'w2', 3),
('h2', 'w3', 2), ('h2', 'w4', 1),
('h2', 'w5', 8), ('h2', 'w6', 4),
('h2', 'w7', 7), ('h2', 'w8', 5);
INSERT INTO Husbands -- husband #3
VALUES ('h3', 'w1', 4), ('h3', 'w2', 2),
('h3', 'w3', 1), ('h3', 'w4', 3),
('h3', 'w5', 6), ('h3', 'w6', 8),
('h3', 'w7', 7), ('h3', 'w8', 5);
INSERT INTO Husbands -- husband #4
VALUES ('h4', 'w1', 8), ('h4', 'w2', 4),
('h4', 'w3', 1), ('h4', 'w4', 3),
('h4', 'w5', 5), ('h4', 'w6', 6),
('h4', 'w7', 7), ('h4', 'w8', 2);
INSERT INTO Husbands -- husband #5
VALUES ('h5', 'w1', 6), ('h5', 'w2', 8),
('h5', 'w3', 2), ('h5', 'w4', 3),
('h5', 'w5', 4), ('h5', 'w6', 5),
('h5', 'w7', 7), ('h5', 'w8', 1);
INSERT INTO Husbands -- husband #6
VALUES ('h6', 'w1', 7), ('h6', 'w2', 4),
('h6', 'w3', 6), ('h6', 'w4', 5),
('h6', 'w5', 3), ('h6', 'w6', 8),
('h6', 'w7', 2), ('h6', 'w8', 1);
INSERT INTO Husbands -- husband #7
VALUES ('h7', 'w1', 5), ('h7', 'w2', 1),
('h7', 'w3', 4), ('h7', 'w4', 2),
('h7', 'w5', 7), ('h7', 'w6', 3),
('h7', 'w7', 6), ('h7', 'w8', 8);
INSERT INTO Husbands -- husband #8
VALUES ('h8', 'w1', 2), ('h8', 'w2', 4),
('h8', 'w3', 7), ('h8', 'w4', 3),
('h8', 'w5', 6), ('h8', 'w6', 1),
('h8', 'w7', 5), ('h8', 'w8', 8);
-- The women's preferences
INSERT INTO Wives -- wife #1
VALUES ('w1', 'h1', 6), ('w1', 'h2', 3),
 ('w1', 'h3', 7), ('w1', 'h4', 1),
('w1', 'h5', 4), ('w1', 'h6', 2),
 ('w1', 'h7', 8), ('w1', 'h8', 5);
INSERT INTO Wives -- wife #2
VALUES ('w2', 'h1', 4), ('w2', 'h2', 8),
 ('w2', 'h3', 3), ('w2', 'h4', 7),
 ('w2', 'h5', 2), ('w2', 'h6', 5),
 ('w2', 'h7', 6), ('w2', 'h8', 1);
INSERT INTO Wives -- wife #3
VALUES ('w3', 'h1', 3), ('w3', 'h2', 4),
 ('w3', 'h3', 5), ('w3', 'h4', 6),
 ('w3', 'h5', 8), ('w3', 'h6', 1),
 ('w3', 'h7', 7), ('w3', 'h8', 2);
INSERT INTO Wives -- wife #4
VALUES ('w4', 'h1', 8), ('w4', 'h2', 2),
 ('w4', 'h3', 1), ('w4', 'h4', 3),
 ('w4', 'h5', 7), ('w4', 'h6', 5),
 ('w4', 'h7', 4), ('w4', 'h8', 6);
INSERT INTO Wives -- wife #5
VALUES ('w5', 'h1', 3), ('w5', 'h2', 7),
 ('w5', 'h3', 2), ('w5', 'h4', 4),
 ('w5', 'h5', 5), ('w5', 'h6', 1),
 ('w5', 'h7', 6), ('w5', 'h8', 8);
INSERT INTO Wives -- wife #6
VALUES ('w6', 'h1', 2), ('w6', 'h2', 1),
 ('w6', 'h3', 3), ('w6', 'h4', 6),
 ('w6', 'h5', 8), ('w6', 'h6', 7),
 ('w6', 'h7', 5), ('w6', 'h8', 4);
INSERT INTO Wives -- wife #7
VALUES ('w7', 'h1', 6), ('w7', 'h2', 4),
 ('w7', 'h3', 1), ('w7', 'h4', 5),
 ('w7', 'h5', 2), ('w7', 'h6', 8),
 ('w7', 'h7', 3), ('w7', 'h8', 7);
INSERT INTO Wives -- wife #8
VALUES ('w8', 'h1', 8), ('w8', 'h2', 2),
 ('w8', 'h3', 7), ('w8', 'h4', 4),
 ('w8', 'h5', 5), ('w8', 'h6', 6),
 ('w8', 'h7', 1), ('w8', 'h8', 3);
-- This auxiliary table helps us create all permutations of the wives.
INSERT INTO Wife_Perms
VALUES (1, 'w1'), (2, 'w2'), (4, 'w3'), (8, 'w4'),
(16, 'w5'), (32, 'w6'), (64, 'w7'), (128, 'w8'),

The query builds all permutation of wives and then filters them for blocking pairs in an elaborate NOT EXISTS() predicate.

SELECT A.wife_name AS h1, B.wife_name AS h2,
C.wife_name AS h3, D.wife_name AS h4,
E.wife_name AS h5, F.wife_name AS h6,
G.wife_name AS h7, H.wife_name AS h8
 FROM Wife_Perms AS A, Wife_Perms AS B,
Wife_Perms AS C, Wife_Perms AS D,
Wife_Perms AS E, Wife_Perms AS F,
Wife_Perms AS G, Wife_Perms AS H
 WHERE A.perm + B.perm + C.perm + D.perm
 + E.perm + F.perm + G.perm + H.perm = 255
AND NOT EXISTS
 (SELECT *
 FROM Husbands AS W, Husbands AS X, Wives AS Y, Wives AS Z
WHERE W.man = X.man
AND W.ranking > X.ranking
AND X.woman = Y.woman
AND Y.woman = Z.woman
AND Y.ranking > Z.ranking
AND Z.man = W.man
AND W.man||W.woman
 IN ('h1'||A.wife_name, 'h2'||B.wife_name,
'h3'||C.wife_name, 'h4'||D.wife_name,
'h5'||E.wife_name, 'h6'||F.wife_name,
h7'||G.wife_name, 'h8'||H.wife_name)
 AND Y.man||Y.woman
IN ('h1'||A.wife_name, 'h2'||B.wife_name,
 'h3'||C.wife_name, 'h4'||D.wife_name,
 'h5'||E.wife_name, 'h6'||F.wife_name,
 'h7'||G.wife_name, 'h8'|| H.wife_name))

The results look like this:

h1h2h3h4h5h6h7h8
w3w6w4w8w1w5w7w2
w3w6w4w1w7w5w8w2
w6w4w3w8w1w5w7w2
w6w3w4w8w1w5w7w2
w6w4w3w1w7w5w8w2
w6w3w4w1w7w5w8w2
w2w4w3w8w1w5w7w6
w2w4w3w1w7w5w8w6
w7w4w3w8w1w5w2w6
REFERENCES:
Gusfierld, Dan and Irving, Robert W. The Stable Marriage Problem: Structure & Algorithms. ISBN 0-262-07118-5.
Knuth, Donald E. CRM Proceedings & Lecture Notes, Vol #10, “Stable Marriage and Its Relation to Other Combinatorial Problems” ISBN 0-8218-0603-3.

t0160

This booklet, which reproduces seven expository lectures given by the author in November 1975, is a gentle introduction to the analysis of algorithms using the beautiful theory of stable marriages as a vehicle to explain the basic paradigms of that subject.

Wirth, Nicklaus; Algorithms + Data Structures = Programs. Section 3.6. ISBN 0-13-022418-9

This section gives an answer in Pascal and a short analysis of the algorithm. In particular, I used his data for my example. He gives several answers which give a varying “degrees of happiness” for husbands and wives.

26.7.3 Ball and Box Packing

This example was taken from the BeyondRelational Website SQL Challenge #22 in 2010 January. We have got some boxes and balls. Our job is to put balls into those boxes. But, wait a second. The balls should be filled into the boxes based on some rules and preferences configured by the user. Here are the rules.

1. A box can have only one ball

2. A ball can be placed only in one box

3. Number of balls and number of boxes will always be the same.

4. All boxes should be filled and all balls should be used

5. There will be a configuration table where the preferences of the user will be stored. The preference setting should be followed when putting a ball into a box.

CREATE TABLE Boxes
(box_nbr INTEGER NOT NULL PRIMARY KEY,
 box_name VARCHAR(20) NOT NULL);
INSERT INTO Boxes (box_nbr, box_name)
VALUES (1, 'Box 1'), (2, 'Box 2'), (3, 'Box 3'),
(4, 'Box 4'), (5, 'Box 5'), (6, 'Box 6'),
CREATE TABLE Balls
(ball_nbr INTEGER NOT NULL PRIMARY KEY,
 ball_name VARCHAR(20) NOT NULL);
INSERT INTO Balls (ball_name)
VALUES (1, 'Ball 1'), (2, 'Ball 2'), (3, 'Ball 3'),
(4, 'Ball 4'), (5, 'Ball 5'), (6, 'Ball 6'),
CREATE TABLE Preferences
(box_nbr INTEGER NOT NULL
REFERENCES Boxes (box_nbr),
 ball_nbr INTEGER NOT NULL
REFERENCES Balls (ball_nbr),
 PRIMARY KEY (box_nbr, ball_nbr));
INSERT INTO Preferences (box_nbr, ball_nbr)
VALUES (1, 1),
(2, 1), (2, 3),
(3, 2), (3, 3),
(4, 1), (4, 2), (4, 3), (4, 4), (4, 5), (4, 6),
(5, 4), (5, 5),
(6, 5);
Results
box_nameball_name
11
23
32
46
54
65

This answer is done in parts to expose the logic via CTEs. The Balls_In_Boxes CTE gives us all the possible arrangements of six balls in six boxes. This is passed to the Preferred_Balls_In_Boxes CTE to apply the preference rules but allow duplicate balls if two or more boxes want them. Finally, the main query makes sure that we keep only the rows with unique balls.

The use of the IN() predicates to assure that the row has no duplicate columns is easy to extend to any number of items, but a bit bulky to read. But it is remarkably fast in the SQL engines where we tested it.

WITH
Balls_In_Boxes (bx1, bx2, bx3, bx4, bx5, bx6)
AS
(SELECT B1.ball_nbr, B2.ball_nbr, B3.ball_nbr,
B4.ball_nbr, B5.ball_nbr, B6.ball_nbr
FROM Balls AS B1, Balls AS B2, Balls AS B3,
Balls AS B4, Balls AS B5, Balls AS B6
WHERE B1.ball_nbr NOT IN (B2.ball_nbr, B3.ball_nbr, B4.ball_nbr, B5.ball_nbr, B6.ball_nbr)
AND B2.ball_nbr NOT IN (B3.ball_nbr, B4.ball_nbr, B5.ball_nbr, B6.ball_nbr)
AND B3.ball_nbr NOT IN (B4.ball_nbr, B5.ball_nbr, B6.ball_nbr)
AND B4.ball_nbr NOT IN (B5.ball_nbr, B6.ball_nbr)
AND B5.ball_nbr NOT IN (B6.ball_nbr)),
Preferred_Balls_In_Boxes (bx1, bx2, bx3, bx4, bx5, bx6)
AS
(SELECT bx1, bx2, bx3, bx4, bx5, bx6
 FROM Balls_In_Boxes AS BB
 WHERE BB.bx1
IN (SELECT ball_nbr
FROM Preferences AS P
 WHERE box_nbr = 1)
AND BB.bx2
IN (SELECT ball_nbr
FROM Preferences AS P
 WHERE box_nbr = 2)
AND BB.bx3
IN (SELECT ball_nbr
FROM Preferences AS P
 WHERE box_nbr = 3)
AND BB.bx4
IN (SELECT ball_nbr
FROM Preferences AS P
 WHERE box_nbr = 4)
AND BB.bx5
IN (SELECT ball_nbr
FROM Preferences AS P
 WHERE box_nbr = 5)
AND BB.bx6
IN (SELECT ball_nbr
FROM Preferences AS P
 WHERE box_nbr = 6))
SELECT bx1, bx2, bx3, bx4, bx5, bx6
FROM Preferred_Balls_In_Boxes AS PBB1
 WHERE PBB1.bx NOT IN (PBB2.bx, PBB3.bx, PBB4.bx, PBB5.bx, PBB6.bx)
 AND PBB2.bx NOT IN (PBB3.bx, PBB4.bx, PBB5.bx, PBB6.bx)
 AND PBB3.bx NOT IN (PBB4.bx, PBB5.bx, PBB6.bx)
 AND PBB4.bx NOT IN (PBB5.bx, PBB6.bx)
 AND PBB5.bx NOT IN (PBB6.bx);

26.8 Dr. Codd's T-Join

In the Second Version of the relationnal model in 1990, Dr. E. F. Codd introduced a set of new theta operators, called T-operators, which were based on the idea of a best-fit or approximate equality (Codd, 1990). The algorithm for the operators is easier to understand with an example modified from Dr. Codd.

The problem is to assign the classes to the available classrooms. We want (class_size < room_size) to be true after the assignments are made. This will allow us a few empty seats in each room for late students. We can do this in one of two ways. The first way is to sort the tables in ascending order by classroom size and the number of students in a class. We start with the following tables and load them with the data that follows the DDL.

CREATE TABLE Rooms
(room_nbr CHAR(3) NOT NULL PRIMARY KEY,
 room_size INTEGER NOT NULL);
Classes
class_nbrclass_size
'c1'80
'c2'70
'c3'65
'c4'55
'c5'50
'c6'40

t0170

CREATE TABLE Classes
(class_nbr CHAR(3) NOT NULL PRIMARY KEY,
 class_size INTEGER NOT NULL);
Rooms
room_nbrroom_size
r170
r240
r350
r485
r530
r665
r755

t0175

The goal of the T-JOIN problem is to assign a class which is smaller than the classroom given it (class_size < room_size). Dr. Codd gives two approaches to the problem.

(1) Ascending Order Algorithm
Sort both tables into ascending order. Reading from the top of the Rooms table, match each class with the first room that will fit.

Classes
class_nbrclass_size
c640
c550
c455
c365
c270
c180

t0180

Rooms
room_nbrroom_size
r530
r240
r350
r755
r665
r170
r485

t0185

This gives us

Results
ClassesRooms
class_nbrclass_sizeroom_nbrroom_size
c270r485
c365r170
c455r665
c550r755
c640r350

t0190

(2) Descending Order Algorithm
Sort both tables into descending order. Reading from the top of the Classes table, match each class with the first room that will fit.

ClassesRooms
class_nbrclass_sizeroom_nbrroom_size
c180r485
c270r170
c365r665
c455r755
c550r350
c640r240
NULLNULLr530

t0195

Results
class_nbrclass_sizeroom_nbrroom_size
c180r485
c365r170
c455r665
c550r755
c640r3'50

t0200

Notice that the answers are different. Dr. Codd has never given a definition in relational algebra of the T-Join, so I proposed that we need one. Informally, for each class, we want the smallest room that will hold it, while maintaining the T-JOIN condition. Or for each room, we want the largest class that will fill it, while maintaining the T-JOIN condition. These can be two different things, so you must decide which table is the driver. But either way, I am advocating a “best fit” over Codd’s “first fit” approach.

Other theta conditions can be used in place of the “less than” shown here. If “less than or equal” is used, all the classes are assigned to a room in this case, but not in all cases. This is left to the reader as an exercise.

The first attempts in Standard SQL are versions of grouped by queries. They can, however, produce some rows that would be left out of the answers Dr. Codd was expecting. The first JOIN can be written as

SELECT class_nbr, class_size, MIN(room_size)
FROM Rooms, Classes
 WHERE Classes.class_size < Rooms.room_size
 GROUP BY class_nbr, class_size;

This will give a result table with the desired room sizes, but not the room numbers. You cannot put the other columns in the SELECT list, since it would conflict with the GROUP BY clause. But also note that the classroom with 85 seats (‘r4’) is used twice, once by class ‘c1’ and then by class ‘c2’:

class_sizeclass_sizeMIN(room_size)Notes
c18085entity room r4
c27085entity room r4
c36570
c45565
c55055
c64050

t0205

If you do a little arithmetic on the data, you find that we have 360 students and 395 seats, 6 classes and 7 rooms. Do you want to use the smallest number of rooms?

As it works out, the best fit of rooms to classes will leave the smallest room empty and pack the other rooms to capacity, thus:

SELECT class_nbr, class_size, MIN(room_size)
FROM Rooms, Classes
 WHERE Classes.class_size <= Rooms.room_size
 GROUP BY class_nbr, class_size;

26.8.1 A Procedural Approach

The place to start is with all the possible legal room assignments for a class. We have already seen this query:

SELECT R.room_nbr, C.class_nbr
FROM Rooms AS R, Classes AS C
 WHERE C.class_size <= R.room_size
 ORDER BY R.room_nbr, C.class_nbr;

At the extreme, if all the rooms and classes are the same size, then your have (n!) solutions. If all the rooms are different sizes, we can save ourselves combinatorial explosions, so let us agree that we will juggle the data to get that condition. This query will give us the pairs that are an exact fit, but we know that there will not be any ties for room size.

SELECT R.room_nbr, C.class_nbr
FROM Rooms AS R, Classes AS C
 WHERE C.class_size <= R.room_size
 GROUP BY R.room_nbr, C.class_nbr
HAVING MIN(R.room_size - C.class_size) = 0;
r1c2
r6c3
r7c4
r3c5
r2c6

This leaves us with class {c1} and rooms {r4, r5} yet to be used. We can then repeat a limited version of the basic Pairs query.

SELECT R.room_nbr, C.class_nbr
FROM Rooms AS R, Classes AS C
 WHERE C.class_size <= R.room_size
 AND R.room_nbr IN ('r4', 'r5')
 AND C.class_nbr IN ('c1'),
r4c1

We can now union these result sets and have an answer. I took some extra time to show the details to demonstrate how we can implement a best-fit, greedy algorithm. Let us get a bigger data set and work with it.

INSERT INTO Classes (class_nbr, class_size)
VALUES
('c01', 106), ('c02', 105), ('c03', 104), ('c04', 100), ('c05', 99),
('c06', 90), ('c07', 89), ('c08', 88), ('c09', 83), ('c10', 82),
('c11', 81), ('c12', 65), ('c13', 50), ('c14', 49), ('c15', 30),
('c16', 29), ('c17', 28), ('c18', 20), ('c19', 19);
INSERT INTO Rooms (room_nbr, room_size)
VALUES
('r01', 102), ('r02', 101), ('r03', 95), ('r04', 94), ('r05', 85),
('r06', 70), ('r07', 55), ('r08', 54), ('r09', 35), ('r10', 34),
('r11', 25), ('r12', 18);

To see how this will work, let’s add another column to the table of legal (class_nbr, room_nbr) pairs to see how well they fit.

WITH Pairs (room_nbr, class_nbr, fit)
AS
(SELECT R.room_nbr, C.class_nbr,
(R.room_size - C.class_size)
FROM Rooms AS R, Classes AS C
 WHERE C.class_size <= R.room_size)
 SELECT P1.room_nbr, P1.class_nbr, fit
 FROM Pairs AS P1
WHERE P1.fit
= (SELECT MIN(P2.fit)
FROM Pairs AS P2
 WHERE P1.room_nbr = P2.room_nbr);
r01c042
r02c041
r03c065
r04c064
r05c092
r06c125
r07c135
r08c134
r09c155
r10c154
r11c185

This time we did not have an exact fit, so let’s look for (fit = 1) into a working table and remove ‘r02’ and ‘c04’ from their tables. The second step is

r02c041

We can now remove the best fit rooms and classes and look the next set of remaining best fits.

WITH Pairs (room_nbr, class_nbr, fit)
AS
(SELECT R.room_nbr, C.class_nbr,
(R.room_size - C.class_size)
 FROM Rooms AS R, Classes AS C
 WHERE C.class_size <= R.room_size)
(SELECT P1.room_nbr, P1.class_nbr, fit
 FROM Pairs AS P1
WHERE P1.fit = 2);
r05c092

Here is the step for a fit of 3, 4, 5, and 6

r01c053
r04c064
r08c134
r10c154
r06c125
r11c185
r03c076
r07c146
r09c166

At this point, the original tables of rooms and classes cannot be paired. Notice that as we removed rooms and classes, the fit numbers changed. This is a characteristic of greedy algorithms; taking the “big bites” can leave sub-optimal leftovers on the plate.

26.9 Missing Values in Data

In 2014, Dwain Camps published an article on filling gaps in a sequence with the nearest prior value and then doing a linear distribution of the delta (changes). His choice of a name for the problem was the ‘Data Smear,’ and he found several solutions with window functions. Let’s begin with a table of fictional meter reading that is full of gaps.

CREATE TABLE Foo_Meters
(reading_nbr INTEGER NOT NULL PRIMARY KEY,
foo_reading INTEGER NOT NULL);
INSERT INTO Foo_Meters
VALUES
(1, 121), (2, NULL), (3, NULL),
(4, 312), (5, NULL), (6, NULL), (7, NULL),
(8, 123), (9, NULL), (10, NULL),
(11, 415), (12, NULL), (13, NULL), (14, NULL), (15, NULL),
(16, 200);

26.9.1 Last Known Value

The number of NULL (unknown) values between known values is not consistently the same. I arranged them on the same line to make it easy to see which values would be carried over which rows.

SELECT reading_nbr, foo_reading,
MAX(foo_reading)
OVER (PARTITION BY reading_cnt) AS prior_foo_reading
FROM (SELECT reading_nbr, foo_reading,
COUNT(foo_reading)
OVER (ORDER BY reading_nbr) AS reading_cnt
 FROM Foo_Meters)
AS prior_foo_reading;
reading_nbrfoo_readingprior_foo_reading
1121121
2NULL121
3NULL121
4312312
5NULL312
6NULL312
7NULL312
8123123
9NULL123
10NULL123
11415415
12NULL415
13NULL415
14NULL415
15NULL415
16200200

t0240

26.9.2 Sequence Missing Readings

Suppose we want something slightly different filling in the NULL values, for example, a sequence of the same PARTITION we constructed can be used in this case, and so can the ROW_NUMBER in the preceding example.

SELECT reading_nbr, foo_reading,
COALESCE(foo_reading,
ROW_NUMBER()
OVER (PARTITION BY c
ORDER BY reading_nbr) -1) AS s
 FROM (SELECT reading_nbr, foo_reading,
COUNT(foo_reading) OVER (ORDER BY reading_nbr) AS c
 FROM Foo_Meters) AS A;
Results:
reading_nbrfoo_readings
1121121
2NULL1
3NULL2
4312312
5NULL1
6NULL2
7NULL3
8123123
9NULL1
10NULL2
11415415
12NULL1
13NULL2
14NULL3
15NULL4
16200200

t0245

26.9.3 Smoothed Result

Suppose we want a bit smoother transition from our trailing value to our following value. This could be particularly appealing if we have many NULL rows between sparse data points. Let’s illustrate this by looking at the first four rows of the above results set.

reading_nbrfoo_readinglinear_reading
1121121
2NULL216.500000121+(1/3)*(312–121)=184.6667
3NULL216.500000121+(2/3)*(312–121)=248.3333
4312312

First let’s look at some intermediate results using ROW_NUMBER() and the COUNT window aggregate.

SELECT reading_nbr, foo_reading, s, m, x
 FROM (SELECT reading_nbr, foo_reading,
MAX(foo_reading)
 OVER (PARTITION BY c) AS s,
ROW_NUMBER()
 OVER (PARTITION BY c ORDER BY reading_nbr DESC) AS n,
ROW_NUMBER()
 OVER (PARTITION BY c ORDER BY reading_nbr) – 1 AS m,
1 + COUNT(CASE WHEN foo_reading IS NULL THEN 1 END)
 OVER (PARTITION BY c) AS x
 FROM (SELECT reading_nbr, foo_reading,
 COUNT(foo_reading) OVER (ORDER BY reading_nbr) AS c
 FROM Foo_Meters) AS A
) AS A;

Our intermediate results look pretty useful:

reading_nbrfoo_readingsmx
112112103
2NULL12113
3NULL12123
431231204
5NULL31214
6NULL31224
7NULL31234
812312303
9NULL12313
10NULL12323
1141541505
12NULL41515
13NULL41525
14NULL41535
15NULL41545
1620020001

t0255

You can see that all three of our calculated columns are what we need to produce the smoothing calculation shown prior. The first, s, is of course our original smear. The columns m and x, when m is divided by x, produce the multiplier we need. We have omitted the LEAD result, but in the end, we’ll use that also (so n is likewise still required).

Putting that all together, we get this.

SELECT reading_nbr, foo_reading,
CASE
WHEN foo_reading IS NOT NULL
THEN foo_reading
ELSE s + (1.0 * m / x)
 * (LEAD(foo_reading, n, s)
OVER (ORDER BY reading_nbr) - s)
 END AS computed_foo_reading
FROM (SELECT reading_nbr, foo_reading, (MAX(foo_reading) OVER (PARTITION BY c)) AS s,
 (ROW_NUMBER() OVER (PARTITION BY c ORDER BY reading_nbr DESC)) AS n,
 (ROW_NUMBER() OVER (PARTITION BY c ORDER BY reading_nbr) - 1) AS m,
 1 + COUNT(CASE WHEN foo_reading IS NULL
 THEN 1 ELSE NULL END)
 OVER (PARTITION BY c) AS x
FROM (SELECT reading_nbr, foo_reading,
(COUNT(foo_reading) OVER (ORDER BY reading_nbr)) AS c
FROM Foo_Meters) AS F1
) AS F2;

We have just done a simple linear interpolation in T-SQL!

reading_nbrfoo_readingcomputed_foo_reading
1121121.00
2NULL184.67
3NULL248.33
4312312.00
5NULL337.75
6NULL363.50
7NULL389.25
11415415.00
12NULL372.00
13NULL329.00
14NULL286.00
15NULL243.00
16200200.00

26.10 Missing and Mixed Data in Rows

Collecting data is not always easy. When you sample or stage data, you can wind up with rows that have missing data (usually shown with NULLs) or overlapping data. This data will be shown on one row. We will need business rules that will depend on the particulars of each problem. The rules we want in this example are:

1. If we have multiple starting dates, then use the earliest known date. If there is a NULL in the data, then use it since we are uncertain.

2. If we have multiple ending dates, then use the latest known date. If there is a NULL in the data, then use it since we are uncertain.

As an example, assume we are taking samples of some kind within a date range, but the data is messy and we have multiple rows for multiple samples we have to reduce to a single row. Here is the DDL and some sample data.

CREATE TABLE Samples
(sample_id INTEGER NOT NULL,
 first_date DATE,
 second_date DATE,
UNIQUE (sample_id, first_date, second_date),
 CHECK (first_date < second_date)
);

Note the use of UNIQUE versus PRIMARY KEY because of the NULL-able column. Many new SQL programmers do not know that UNIQUE does not imply NOT NULL, like PRIMARY KEY.

INSERT INTO Samples
VALUES
(1, '2012-02-24', '2012-02-27'),
(1, '2012-02-24', '2012-03-06'),
(2, '2012-02-24', '2012-03-06'),
(2, '2012-01-05', '2012-01-06'),
(3, NULL, '2012-03-07'),
(3, '2012-02-01', '2012-03-07'),
(4, '2012-02-01', NULL),
(4, '2012-01-01', '2012-02-10'),
(5, NULL, NULL);

Samples one and two have known values on their date ranges, so they are easy. But the other samples have missing dates. Normally, doing an aggregate will drop the NULLs instead of making them dominate. I am calling it “first_date” and “second_date” since we have not decided which start and end dates we will finally use. The names “candidate_start_date” and “candidate_end_date” are a bit stuffy. The decision is to use the latest ending date.

SELECT sample_id,
CASE WHEN COUNT(*) <> COUNT(first_date)
THEN NULL ELSE MIN(first_date) END AS first_date,
CASE WHEN COUNT(*) <> COUNT(second_date)
THEN NULL ELSE MAX(second_date) END
AS second_date
 FROM Samples
GROUP BY sample_id, second_date;

The use of CASE has to be at the same level of aggregation as the grouping for the query allows for this construct.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.147.49.252