In Chapter 22, we took a look at the basic SELECT statement. Now we need to look at other ways to nest subqueries and to build the working table in the FROM clause with JOINs and other syntax. This is a collection of solutions to problems for which you might not think of using SQL. I also give a bit of history.
One of the classics of software engineering is a short paper by the late Edsger Dijkstra entitled “Go To Statement Considered Harmful” (Communications of the ACM, Vol. 11, No. 3, March 1968, pp. 147–148). In this paper, he argued for dropping the GOTO statement from programming languages in favor of what we now call Structured Programming.
One of his observations was that programs that used only BEGIN-END blocks, WHILE loops, and IF-THEN-ELSE statements were easier to read and maintain. Programs that jumped around via GOTO statements were harder to follow, because the execution path could have arrived at a statement label from anywhere in the code. Algol, the first of the block-structured programming languages, had all of those control structures but still kept the GOTO; it was considered a fundamental part of programming. Before Dijkstra, nobody had really understood the power of limiting the scope of variables and control structures in procedural code. The basic idea of scope is that a block of code can reference only variables that are declared in that block. Failing to find a local variable, the containing blocks are then inspected from the inside out.
We have the same concept in SQL queries. A correlated subquery is a subquery that references columns in the tables of its containing query. The general structure might look like:
SELECT T1.x, T1.y, T1.z
  FROM TableOne AS T1
 WHERE T1.x = (SELECT T2.x
                 FROM TableTwo AS T2
                WHERE T2.y = (SELECT foobar
                                FROM TableThree AS T3
                               WHERE T3.x = T1.x
                                 AND T3.y = T2.y
                                 AND T3.z = 42));
Look at the innermost query. The predicate (T3.z = 42) is local to this query. The predicate (T3.y = T2.y) works because this query is in the scope of the query with T2 in the FROM clause. Likewise, the predicate (T3.x = T1.x) works because this query is in the scope of the query with T1 in the FROM clause. If I had not qualified the table names in the innermost WHERE clause, the predicate (T3.x = x) would refer to the most local x, which gives us (T3.x = T3.x), which is always TRUE or UNKNOWN. That is absurd.
But a predicate like (T3.z = floob) might reference table T1, T2, or T3, depending on which one has the nearest column named floob; the table is determined by working outward from the innermost scope. This is why it is important to qualify column names.
The queries can reference the same table under different correlation names. Consider a query to find all the students who are younger than the oldest student of their gender:
SELECT S1.stud_nbr, S1.stud_name, S1.sex_code, S1.stud_age
  FROM Students AS S1
 WHERE S1.stud_age < (SELECT MAX(stud_age)
                        FROM Students AS S2
                       WHERE S1.sex_code = S2.sex_code);
1. Let’s work it out in detail. The fiction in SQL is that we create local tables S1 and S2, which happen to have the same data and structure as the Students table. A copy of the Students table is made for each correlation name, S1 and S2. Obviously, this is not how it is implemented in a real SQL compiler. Following the same steps we used for simple SELECT statements, expand the outer query.
stud_nbr | stud_name | sex_code | stud_age
---------+-----------+----------+---------
       1 | Smith     |        1 |       16
       2 | Smyth     |        2 |       17
       3 | Smoot     |        2 |       16
       4 | Adams     |        2 |       17
       5 | Jones     |        1 |       16
       6 | Celko     |        1 |       17
       7 | Vennor    |        2 |       16
       8 | Murray    |        1 |       18
2. When you get to the WHERE clause, and find the innermost query, you will see that you need to get data from the containing query. The model of execution says that each outer row has the subquery executed on it in parallel with the other rows. Assume we are working on student (1, ‘Smith’), who is male. The query in effect becomes:
SELECT 1, 'Smith', 1, 16
FROM Students AS S1
WHERE 16 < (SELECT MAX(stud_age)
FROM Students AS S2
WHERE 1 = S2.sex_code);
As an aside, the search conditions (1 = S2.sex_code) and (S2.sex_code = 1) are equivalent. The choice is largely a matter of the programmer’s culture—do you read from left to right or right to left?
3. The subquery can now be calculated for male students; the maximum stud_age is 18. When we expand this out for all the other rows, this will give us the effect of this set of parallel queries.
SELECT 1, 'Smith', 1, 16 FROM Students AS S1 WHERE 16 < 18;
SELECT 2, 'Smyth', 2, 17 FROM Students AS S1 WHERE 17 < 17;
SELECT 3, 'Smoot', 2, 16 FROM Students AS S1 WHERE 16 < 17;
SELECT 4, 'Adams', 2, 17 FROM Students AS S1 WHERE 17 < 17;
SELECT 5, 'Jones', 1, 16 FROM Students AS S1 WHERE 16 < 18;
SELECT 6, 'Celko', 1, 17 FROM Students AS S1 WHERE 17 < 18;
SELECT 7, 'Vennor', 2, 16 FROM Students AS S1 WHERE 16 < 17;
SELECT 8, 'Murray', 1, 18 FROM Students AS S1 WHERE 18 < 18;
4. These same steps are done for each row in the containing query. The model is that all of the subqueries are resolved at once. With cheaper and cheaper parallel hardware, this might be true some day, but no implementation really does it that way currently. The usual approach in real SQL compilers is to build procedural loops in the database engine that scan through both tables; which table goes in which loop is decided by the optimizer. The final results are:
Again, no real product works this way, but it has to produce the same results as this process.
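The parallel-expansion model above can be checked directly. Here is a sketch using Python's standard-library sqlite3 module; the schema and rows are the chapter's sample Students table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Students
(stud_nbr INTEGER NOT NULL PRIMARY KEY,
 stud_name CHAR(10) NOT NULL,
 sex_code INTEGER NOT NULL,
 stud_age INTEGER NOT NULL);
INSERT INTO Students VALUES
 (1,'Smith',1,16),(2,'Smyth',2,17),(3,'Smoot',2,16),(4,'Adams',2,17),
 (5,'Jones',1,16),(6,'Celko',1,17),(7,'Vennor',2,16),(8,'Murray',1,18);
""")

rows = conn.execute("""
SELECT S1.stud_nbr, S1.stud_name, S1.sex_code, S1.stud_age
  FROM Students AS S1
 WHERE S1.stud_age < (SELECT MAX(stud_age)
                        FROM Students AS S2
                       WHERE S1.sex_code = S2.sex_code)
 ORDER BY S1.stud_nbr""").fetchall()

# Students 2, 4, and 8 are each the oldest of their gender and drop out;
# students 1, 3, 5, 6, and 7 survive the comparison.
print(rows)
```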
There is no limit to the depth of nesting of correlated subqueries in theory. In practice, it is probably a good heuristic to keep the nesting under five levels.
These examples used scalar subqueries, but you can also use correlated subqueries that return a result table. For example, to find all of Professor Celko’s students, we might use this query:
SELECT S1.stud_nbr, S1.stud_name, S1.sex_code, S1.stud_age
  FROM Students AS S1
 WHERE S1.stud_nbr IN (SELECT T1.stud_nbr
                         FROM Teachers AS T1
                        WHERE S1.stud_nbr = T1.stud_nbr
                          AND T1.teacher_name = 'Celko');
Another problem is that many SQL programmers do not fully understand the rules for scope of derived table names. If an infixed join is given a derived table name, then all of the table names inside it are hidden from containing expressions. For example, this will fail
SELECT a, b, c -- wrong!
  FROM (Foo INNER JOIN Bar
        ON Foo.y >= Bar.x) AS Foobar (x, y)
       INNER JOIN Flub ON Foo.y <= Flub.z;

This fails because the table name Foo is not available to the second INNER JOIN. However, this will work:
SELECT a, b, c
  FROM (Foo INNER JOIN Bar
        ON Foo.y >= Bar.x) AS Foobar (x, y)
       INNER JOIN Flub ON Foobar.y <= Flub.z;
If you start nesting lots of derived table expressions, you can force an order of execution on the query. It is generally not a good idea to try to outguess the optimizer.
So far, I have shown fully qualified column names. It is good programming practice, but it is not required. Assume that Foo and Bar both have a column named w. These statements will produce an ambiguous name error:
SELECT a, b, c
  FROM Foo
       INNER JOIN Bar ON y >= x
       INNER JOIN Flub ON y <= w;

SELECT a, b, c
  FROM Foo, Bar, Flub
 WHERE y BETWEEN x AND w;
But this statement will work; it resolves names inside the parentheses first and then does the outermost INNER JOIN last.
SELECT a, b, c
  FROM Foo INNER JOIN (Bar INNER JOIN Flub ON y <= w)
       ON y >= x;
If Bar did not have a column named w, then the parser would go to the next containing expression, find Foo.w, and use it.
SQL-92 added new syntax for various JOINs using infixed operators in the FROM clause. The JOIN operators are quite general and flexible, allowing you to do things in a single statement that you could not do in the older notation. The basic syntax is
< joined table > ::= < cross join > | < qualified join > | (< joined table >)

< cross join > ::= < table reference > CROSS JOIN < table reference >

< qualified join > ::=
    < table reference > [NATURAL] [< join type >] JOIN
    < table reference > [< join specification >]

< join specification > ::= < join condition > | < named columns join >

< join condition > ::= ON < search condition >

< named columns join > ::= USING (< join column list >)

< join type > ::= INNER | < outer join type > [OUTER] | UNION

< outer join type > ::= LEFT | RIGHT | FULL

< join column list > ::= < column name list >

< table reference > ::=
    < table name > [[AS] < correlation name > [(< derived column list >)]]
    | < derived table > [AS] < correlation name > [(< derived column list >)]
    | < joined table >

< derived table > ::= < table subquery >

< column name list > ::= < column name > [{< comma > < column name >}...]
An INNER JOIN is done by forming the CROSS JOIN and then removing the rows that do not meet the JOIN search conditions given in the ON clause, just like we did with the original FROM.. WHERE syntax. The ON clause can be as elaborate as you want to make it, as long as it refers to tables and columns within its scope. If a < qualified join > is used without a < join type >, INNER is implicit. But it is good documentation to spell out all of the JOIN operators.
However, in the real world, most INNER JOINs are done using equality tests on columns with the same names in different tables, rather than on elaborate search conditions. Equi-JOINs are so common that Standard SQL has two shorthand ways of specifying them. The USING (c1, c2, …, cn) clause takes the column names in the list and replaces them with the clause ON ((T1.c1, T1.c2, …, T1.cn) = (T2.c1, T2.c2, …, T2.cn)). Likewise, the NATURAL option is shorthand for a USING() clause that is a list of all the column names that are common to both tables. If NATURAL is specified, a JOIN specification cannot be given; it is already there.
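The shorthand equivalence can be sketched with Python's sqlite3 module, since SQLite implements the USING clause; the supplier and order rows here are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Suppliers (sup_id INTEGER, sup_name CHAR(10));
CREATE TABLE Orders (sup_id INTEGER, order_amt INTEGER);
INSERT INTO Suppliers VALUES (1, 'Acme'), (2, 'Best');
INSERT INTO Orders VALUES (1, 100), (1, 50), (2, 200);
""")

with_on = conn.execute("""
    SELECT S.sup_name, O.order_amt
      FROM Suppliers AS S
           INNER JOIN Orders AS O
           ON S.sup_id = O.sup_id
     ORDER BY 1, 2""").fetchall()

# USING (sup_id) is shorthand for the ON clause above.
with_using = conn.execute("""
    SELECT S.sup_name, O.order_amt
      FROM Suppliers AS S
           INNER JOIN Orders AS O
           USING (sup_id)
     ORDER BY 1, 2""").fetchall()

print(with_on == with_using)  # True
```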
A strong warning: do not use NATURAL JOIN in production code. Any change to the column names or the addition of new columns will change the join at run time. This is also why you do not use “SELECT *” in production code, but the NATURAL JOIN is more dangerous. As Daniel Morgan pointed out, a NATURAL JOIN between two tables with a vague, generic column name like “comments” for unrelated data elements can give you a meaningless join containing megabytes of text.
The same sort of warning applies to the USING clause. Neither of these options is widely implemented or used. If you find out that “product_id,” “product_nbr,” and “upc” were all used for the same data element in your schema, you should do a global change to make sure that one data element has one and only one name.
There was a myth among Access programmers that the ON clause can contain only JOIN conditions and the WHERE clause only search conditions. This is not true, and the difference in the position of the search conditions is not important for an INNER JOIN. The product generated code in that format because it matched the execution plan used by its simple compiler.
Having said this, separating the conditions this way can have some advantages for documentation. It becomes easy to remove the WHERE clause and have a candidate for a VIEW. But there are trade-offs.
OUTER JOINs used to be done with proprietary vendor syntax. Today, the use of the Standard OUTER JOIN is universal. An OUTER JOIN is a JOIN that preserves all the rows in one or both tables, even when they do not have matching rows in the second table. The unmatched columns in the unpreserved table (this is the correct term) are filled with NULLs to complete the join and return rows with the right structure.
Let’s take a real-world situation. I have a table of orders and a table of suppliers that I wish to JOIN for a report to tell us how much business we did with each supplier. With a traditional inner join, the query would be this:
SELECT S.sup_id, S.sup_name, O.order_nbr, O.order_amt FROM Suppliers AS S, Orders AS O WHERE S.sup_id = O.sup_id;
or with the infixed syntax
SELECT S.sup_id, S.sup_name, O.order_nbr, O.order_amt
  FROM Suppliers AS S -- preserved table
       INNER JOIN Orders AS O
       ON S.sup_id = O.sup_id;
Some supplier totals include credits for returned merchandise, and our total business with them works out to zero dollars. Other suppliers never got an order from us at all, so we did zero dollars worth of business with them, too. But the first case will show up in the query result and be passed on to the report, whereas the second case will disappear in the INNER JOIN.
If we had used an OUTER JOIN, preserving the Suppliers table, we would have all the suppliers in the results. When a supplier with no orders was found in the Orders table, the order_nbr and order_amt columns would be given a NULL value in the result row.
SELECT S.sup_id, S.sup_name, O.order_nbr, O.order_amt
  FROM Suppliers AS S
       LEFT OUTER JOIN Orders AS O
       ON S.sup_id = O.sup_id;
Before the SQL-92 Standard, there was no Standard OUTER JOIN syntax, so you had to construct it by hand with a messy UNION in products like the very early versions of IBM’s DB2, like this:
SELECT sup_id, sup_name, order_amt -- regular INNER JOIN
  FROM Suppliers, Orders
 WHERE Suppliers.sup_id = Orders.sup_id
UNION ALL
SELECT sup_id, sup_name, CAST(NULL AS INTEGER) -- preserved rows
  FROM Suppliers
 WHERE NOT EXISTS
       (SELECT *
          FROM Orders
         WHERE Suppliers.sup_id = Orders.sup_id);
You have to use a NULL with the correct data type to make the UNION work, hence the CAST() functions. Some products are smart enough that just NULL by itself will be given the correct data type, but this is portable and safer.
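Here is a sketch with Python's sqlite3 showing that the UNION ALL construction matches the Standard LEFT OUTER JOIN; the supplier and order rows are hypothetical, including one supplier whose returns cancel its orders and one with no orders at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Suppliers (sup_id INTEGER, sup_name CHAR(10));
CREATE TABLE Orders (sup_id INTEGER, order_nbr INTEGER, order_amt INTEGER);
INSERT INTO Suppliers VALUES (1,'Acme'),(2,'Best'),(3,'Cozy');
INSERT INTO Orders VALUES (1,901,100),(1,902,-100),(2,903,250);
""")

outer = conn.execute("""
 SELECT S.sup_id, S.sup_name, O.order_amt
   FROM Suppliers AS S
        LEFT OUTER JOIN Orders AS O
        ON S.sup_id = O.sup_id
  ORDER BY S.sup_id, O.order_amt""").fetchall()

union = conn.execute("""
 SELECT Suppliers.sup_id, sup_name, order_amt
   FROM Suppliers, Orders
  WHERE Suppliers.sup_id = Orders.sup_id
 UNION ALL
 SELECT sup_id, sup_name, CAST(NULL AS INTEGER)
   FROM Suppliers
  WHERE NOT EXISTS (SELECT * FROM Orders
                     WHERE Suppliers.sup_id = Orders.sup_id)
  ORDER BY 1, 3""").fetchall()

print(outer == union)  # True; Cozy shows up as (3, 'Cozy', None)
```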
The other alternative is to insert a constant of some sort to give a more meaningful result. This is easy in the case of a CHARACTER column, where a message like ‘{{NONE}}’ can be quickly understood. It is much harder in the case of a numeric column, where we could have a balance with a supplier that is positive, zero, or negative because of returns and credits. There really is a difference between a vendor that we did not use and a vendor whose returns and credits canceled out its orders.
In the second edition of this book, I described the proprietary OUTER JOIN extensions in detail. Today, they are gone and replaced by the Standard syntax. The vendor extensions were all different in syntax or semantics or both. Since they are mercifully gone, I am not going to tell you about them in this edition.
The name LEFT OUTER JOIN comes from the fact that the preserved table is on the left side of the operator. Likewise, a RIGHT OUTER JOIN would have the preserved table on the right-hand side and the FULL OUTER JOIN preserves both tables.
Here is how OUTER JOINs work in Standard SQL. Assume you are given:
and the OUTER JOIN expression
Table1 LEFT OUTER JOIN Table2
ON Table1.a = Table2.a  -- JOIN condition
   AND Table2.c = 't';  -- single-table filter condition
We call Table1 the “preserved table” and Table2 the “unpreserved table” in the query. What I am going to give you is a little different from, but equivalent to, the ANSI/ISO Standard’s definition.
(1) We build the CROSS JOIN of the two tables. Scan each row in the result set.
(2) If all search conditions in the ON clause test TRUE for that row, then you keep it.
(3) If the predicate tests FALSE or UNKNOWN for that row, then keep the columns from the preserved table, convert all the columns from the unpreserved table to NULLs and remove the duplicates. So let us execute this by hand:
Let [j] = passed the JOIN search condition
Let [f] = passed the filter search condition

Table1 LEFT OUTER JOIN Table2

 a | b | a    | c    | Notes
---+---+------+------+--------------------
 3 | y | 3    | t    | [j] [f]
 1 | w | NULL | NULL | sets of duplicates
 1 | w | NULL | NULL |
 1 | w | NULL | NULL |
 2 | x | NULL | NULL |
 2 | x | NULL | NULL |
 2 | x | NULL | NULL |
 3 | y | NULL | NULL | [j]
 3 | y | NULL | NULL | [j]
 4 | z | NULL | NULL |
 4 | z | NULL | NULL |
 4 | z | NULL | NULL |
The basic rule is that every row in the preserved table is represented in the results in at least one result row.
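The three construction steps can be written out as a small Python function. The sample rows below are assumptions reconstructed to match the worked example (the text does not show Table2's exact rows; here it is given three rows with a = 3, only one of which passes the filter c = 't').

```python
# Sketch of the textbook model of a LEFT OUTER JOIN.
table1 = [(1, 'w'), (2, 'x'), (3, 'y'), (4, 'z')]   # (a, b), preserved
table2 = [(3, 't'), (3, 'f'), (3, 'g')]             # (a, c), unpreserved

def left_outer_join(preserved, unpreserved, on_clause):
    result = []
    for p in preserved:
        # Steps 1-2: scan the cross join, keep rows where the ON clause is TRUE.
        matched = [p + u for u in unpreserved if on_clause(p, u)]
        # Step 3: NULL-pad the unpreserved side and collapse the duplicates,
        # leaving one padded row per unmatched preserved row.
        result.extend(matched if matched else [p + (None, None)])
    return result

rows = left_outer_join(table1, table2,
                       lambda t1, t2: t1[0] == t2[0] and t2[1] == 't')
print(rows)
# [(1, 'w', None, None), (2, 'x', None, None), (3, 'y', 3, 't'), (4, 'z', None, None)]
```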
Consider the two famous Chris Date tables from his “Suppliers and Parts” database used in his textbooks.
If you write the OUTER JOIN with only the join predicate in the ON clause, like this:
SELECT Suppliers.sup_id, SupParts.part_nbr, SupParts.part_qty
  FROM Suppliers
       LEFT OUTER JOIN SupParts
       ON Suppliers.sup_id = SupParts.sup_id
 WHERE part_qty < 200;
But if we put the filtering predicate in the ON clause instead, we get a different result:
SELECT Suppliers.sup_id, SupParts.part_nbr, SupParts.part_qty
  FROM Suppliers
       LEFT OUTER JOIN SupParts
       ON Suppliers.sup_id = SupParts.sup_id
      AND part_qty < 200;
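The difference is easy to demonstrate with Python's sqlite3. The SupParts rows here are hypothetical, with supplier 'S2' holding only quantities of 200 or more.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Suppliers (sup_id CHAR(2) PRIMARY KEY);
CREATE TABLE SupParts (sup_id CHAR(2), part_nbr CHAR(2), part_qty INTEGER);
INSERT INTO Suppliers VALUES ('S1'), ('S2');
INSERT INTO SupParts VALUES ('S1','P1',100), ('S1','P2',250), ('S2','P1',300);
""")

# Filter in the WHERE clause: the NULL-padded row for S2 is thrown away.
in_where = conn.execute("""
 SELECT Suppliers.sup_id, SupParts.part_nbr, SupParts.part_qty
   FROM Suppliers
        LEFT OUTER JOIN SupParts
        ON Suppliers.sup_id = SupParts.sup_id
  WHERE part_qty < 200
  ORDER BY Suppliers.sup_id""").fetchall()

# Filter in the ON clause: S2 is preserved with NULLs.
in_on = conn.execute("""
 SELECT Suppliers.sup_id, SupParts.part_nbr, SupParts.part_qty
   FROM Suppliers
        LEFT OUTER JOIN SupParts
        ON Suppliers.sup_id = SupParts.sup_id
       AND part_qty < 200
  ORDER BY Suppliers.sup_id""").fetchall()

print(in_where)  # S2 has vanished
print(in_on)     # S2 survives as ('S2', NULL, NULL)
```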
Another problem was that you could not show the same table as preserved and unpreserved in the proprietary syntax options, but it is easy in Standard SQL. For example, to find the students who have taken Math 101 and might have taken Math 102:
SELECT C1.stud_nbr, C1.math_course, C2.math_course
  FROM (SELECT * FROM Courses WHERE math_course = 'Math 101') AS C1
       LEFT OUTER JOIN
       (SELECT * FROM Courses WHERE math_course = 'Math 102') AS C2
       ON C1.stud_nbr = C2.stud_nbr;
A third problem is that the order of execution matters with a chain of OUTER JOINs. That is to say, ((T1 OUTER JOIN T2) OUTER JOIN T3) does not produce the same results as (T1 OUTER JOIN (T2 OUTER JOIN T3)).
The NULLs that are generated by the OUTER JOIN can occur in columns derived from source table columns that have been declared to be NOT NULL. Even if you tried to avoid all the problems with NULLs by making every column in every table of your schema NOT NULL, they could still occur in OUTER JOIN and OLAP function results. However, a table can have NULLs and still be used in an OUTER JOIN. Consider different JOINs on the following two tables, which have NULLs in the common column:
A natural INNER JOIN on column x can match only those values that are equal to each other. But NULLs do not match anything, not even other NULLs. Thus, there is one row in the result, matching on the value r in column x of both tables.
Now, do a LEFT OUTER JOIN on the tables, which will preserve table T1, and you get
Again, there are no surprises. The original INNER JOIN row is still in the results. The other two rows of T1 that were not in the equi-JOIN do show up in the results, and the columns derived from table T2 are filled with NULLs. The RIGHT OUTER JOIN would also behave the same way. The problems start with the FULL OUTER JOIN, which looks like this:
The way this result is constructed is worth explaining in detail.
First do an INNER JOIN on T1 and T2, using the ON clause condition, and put those rows (if any) in the results. Then all rows in T1 that could not be joined are padded out with NULLs in the columns derived from T2 and inserted into the results. Finally, take the rows in T2 that could not be joined, pad them out with NULLs, and insert them into the results. The bad news is that the original tables cannot be reconstructed from an OUTER JOIN. Look at the results of the FULL OUTER JOIN, which we will call R1, and SELECT the first columns from it:
The created NULLs remain and could not be differentiated from the original NULLs. But you cannot throw out those duplicate rows because they may be in the original table T1. There is now a function, GROUPING(< column name >), used with the CUBE, ROLLUP, and GROUPING SETS() options, which returns 1 for created NULLs and 0 for original NULLs or data. Your vendor may allow this function to be used with the OUTER JOINs.
It is worth mentioning in passing that Standard SQL has a NATURAL LEFT OUTER JOIN, but it is not implemented in most versions of SQL.
A NATURAL JOIN has only one copy of the common column pairs in its result. The searched OUTER JOIN has both of the original columns, with their table-qualified names. The NATURAL JOIN has to have a correlation name for the result table to identify the shared columns. We can build a NATURAL LEFT OUTER JOIN by using the COALESCE() function to combine the common column pairs into a single column and put the results into a VIEW where the columns can be properly named, thus:
CREATE VIEW NLOJ12 (x, a, b)
AS SELECT COALESCE(T1.x, T2.x), T1.a, T2.b
     FROM T1
          LEFT OUTER JOIN T2
          ON T1.x = T2.x;
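Here is the view run through Python's sqlite3 with small hypothetical T1 and T2 tables; the unmatched T1 row keeps its x value through COALESCE() instead of showing a NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE T1 (x CHAR(1), a INTEGER);
CREATE TABLE T2 (x CHAR(1), b INTEGER);
INSERT INTO T1 VALUES ('r', 1), ('s', 2);
INSERT INTO T2 VALUES ('r', 10), ('t', 30);

CREATE VIEW NLOJ12 (x, a, b)
AS SELECT COALESCE(T1.x, T2.x), T1.a, T2.b
     FROM T1
          LEFT OUTER JOIN T2
          ON T1.x = T2.x;
""")

rows = conn.execute("SELECT x, a, b FROM NLOJ12 ORDER BY x").fetchall()
print(rows)
# The 's' row is preserved with its own x and a NULL in b.
```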
Unlike the NATURAL JOINs, the searched OUTER JOIN does not have to use a simple one-column equality as the JOIN search condition. The search condition can have several search conditions, use other comparisons, and so forth. For example,
There is no rule that forbids an OUTER JOIN on the same table. In fact, this kind of self-join is a good trick for “flattening” a normalized table into a horizontal report. To illustrate the method, start with a skeleton table defined as
CREATE TABLE Credits
(student_nbr INTEGER NOT NULL,
 course_name CHAR(8) NOT NULL,
 PRIMARY KEY (student_nbr, course_name));
This table represents student ids and a course name for each class they have taken. However, our rules say that students cannot get credit for ‘CS-102’ until they have taken the prerequisite ‘CS-101’ course; they cannot get credit for ‘CS-103’ until they have taken the prerequisite ‘CS-102’ course, and so forth. Let’s first load the table with some sample values.
Notice that student #1 has both courses, student #2 has only the first of the series, and student #3 jumped ahead of sequence and therefore cannot get credit for his ‘CS-102’ course until he goes back and takes ‘CS-101’ as a prerequisite.
What we want is basically a histogram (bar chart) for each student, showing how far he or she has gone in his or her degree programs. Assume that we are only looking at two courses; the result of the desired query might look like this (NULL is used to represent a missing value):
(1, 'CS-101', 'CS-102') (2, 'CS-101', NULL)
Clearly, this will need a self-JOIN, since the last two columns come from the same table, Credits. You have to give correlation names to both uses of the Credits table in the OUTER JOIN operator when you construct a self OUTER JOIN, just as you would with any other SELF-JOIN, thus:
SELECT C1.student_nbr, C1.course_name, C2.course_name
  FROM Credits AS C1
       LEFT OUTER JOIN Credits AS C2
       ON C1.student_nbr = C2.student_nbr
          AND C1.course_name = 'CS-101'
          AND C2.course_name = 'CS-102';
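Here is a sketch of the flattening query with Python's sqlite3. The sample rows are assumed from the narrative (one student with both courses, one with only 'CS-101', one with only 'CS-102'). Note that a WHERE clause restricting C1 to the 'CS-101' rows is needed to get exactly the two-row histogram shown above, because the bare ON conditions would also preserve the 'CS-102'-only rows padded with NULLs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Credits
(student_nbr INTEGER NOT NULL,
 course_name CHAR(8) NOT NULL,
 PRIMARY KEY (student_nbr, course_name));
-- hypothetical sample: #1 has both courses, #2 only CS-101, #3 only CS-102
INSERT INTO Credits VALUES
 (1,'CS-101'), (1,'CS-102'), (2,'CS-101'), (3,'CS-102');
""")

rows = conn.execute("""
 SELECT C1.student_nbr, C1.course_name, C2.course_name
   FROM Credits AS C1
        LEFT OUTER JOIN Credits AS C2
        ON C1.student_nbr = C2.student_nbr
       AND C1.course_name = 'CS-101'
       AND C2.course_name = 'CS-102'
  WHERE C1.course_name = 'CS-101'
  ORDER BY C1.student_nbr""").fetchall()

print(rows)
# Student #3 is excluded: no CS-101 row, so no credit for CS-102.
```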
Some relational purists feel that every operator should have an inverse, and therefore they do not like the OUTER JOIN. Others feel that the created NULLs are fundamentally different from the explicit NULLs in a base table and should have a special token. SQL uses its general-purpose NULLs and leaves things at that. Getting away from theory, you will also find that vendors have often done strange things with the ways their products work.
A major problem is that OUTER JOIN operators do not have the same properties as INNER JOIN operators. The order in which FULL OUTER JOINs are executed will change the results (a mathematician would say that they are not associative). To show some of the problems that can come up when you have more than two tables, let us use three very simple two column tables. Notice that some of the column values match and some do not match, but the three tables have all possible pairs of column names in them.
CREATE TABLE T1 (a INTEGER NOT NULL, b INTEGER NOT NULL);
INSERT INTO T1 VALUES (1, 2);

CREATE TABLE T2 (a INTEGER NOT NULL, c INTEGER NOT NULL);
INSERT INTO T2 VALUES (1, 3);

CREATE TABLE T3 (b INTEGER NOT NULL, c INTEGER NOT NULL);
INSERT INTO T3 VALUES (2, 100);
Now let’s try some of the possible orderings of the three tables in a chain of LEFT OUTER JOINS. The problem is that a table can be preserved or unpreserved in the immediate JOIN and in the opposite state in the containing JOIN.
SELECT T1.a, T1.b, T3.c
  FROM ((T1 NATURAL LEFT OUTER JOIN T2)
        NATURAL LEFT OUTER JOIN T3);

Result
SELECT T1.a, T1.b, T3.c
  FROM ((T1 NATURAL LEFT OUTER JOIN T3)
        NATURAL LEFT OUTER JOIN T2);

Result
Even worse, the choice of columns in the SELECT list can change the output. Instead of displaying T3.c, use T2.c and you will get:
SELECT T1.a, T1.b, T2.c
  FROM ((T2 NATURAL LEFT OUTER JOIN T3)
        NATURAL LEFT OUTER JOIN T1);
The compiler should give you error messages about ambiguous column names.
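The non-associativity is easy to reproduce with Python's sqlite3, using the three one-row tables above. To keep the grouping unambiguous across products, the right-deep version is written here with a derived table instead of parenthesized infixed JOINs; the column aliases a2, c2, b3 are mine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE T1 (a INTEGER NOT NULL, b INTEGER NOT NULL);
INSERT INTO T1 VALUES (1, 2);
CREATE TABLE T2 (a INTEGER NOT NULL, c INTEGER NOT NULL);
INSERT INTO T2 VALUES (1, 3);
CREATE TABLE T3 (b INTEGER NOT NULL, c INTEGER NOT NULL);
INSERT INTO T3 VALUES (2, 100);
""")

# ((T1 LEFT JOIN T2) LEFT JOIN T3): T2's row survives the first join.
left_deep = conn.execute("""
    SELECT T1.a, T1.b, T2.a, T2.c
      FROM T1 LEFT OUTER JOIN T2 ON T1.a = T2.a
              LEFT OUTER JOIN T3 ON T1.b = T3.b AND T2.c = T3.c
""").fetchall()

# (T1 LEFT JOIN (T2 LEFT JOIN T3)): T3.b comes back NULL, the outer ON
# clause tests UNKNOWN, and T2's columns are NULLed away.
right_deep = conn.execute("""
    SELECT T1.a, T1.b, TT.a2, TT.c2
      FROM T1 LEFT OUTER JOIN
           (SELECT T2.a AS a2, T2.c AS c2, T3.b AS b3
              FROM T2 LEFT OUTER JOIN T3 ON T2.c = T3.c) AS TT
           ON T1.a = TT.a2 AND T1.b = TT.b3
""").fetchall()

print(left_deep)   # [(1, 2, 1, 3)]
print(right_deep)  # [(1, 2, None, None)]
```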
At the start of this chapter, we had a table of orders and a table of suppliers, which were to be used to build a report to tell us how much business we did with each supplier. The query that will do this is
SELECT Suppliers.sup_id, sup_name, SUM(order_amt)
  FROM Suppliers
       LEFT OUTER JOIN Orders
       ON Suppliers.sup_id = Orders.sup_id
 GROUP BY Suppliers.sup_id, sup_name;
Some suppliers’ totals include credits for returned merchandise, such that our total business with them worked out to zero dollars. Each supplier with which we did no business will have a NULL in its order_amt column in the OUTER JOIN. The usual rules for aggregate functions apply: the NULLs are dropped, so these suppliers will show a NULL total amount rather than a zero. It is also possible to use a function inside an aggregate function, so you could write SUM(COALESCE(T1.x, T2.x)) for the common column pairs.
If you need to tell the difference between a true sum of zero and the result of a NULL in an OUTER JOIN, use the MIN() or MAX() function on the questionable column. These functions return a NULL result for an empty or all-NULL input, so an expression such as COALESCE(CAST(MAX(order_amt) AS VARCHAR(10)), 'No Orders') could be used to print a message in its place.
Likewise, these functions could be used in a HAVING clause, but that would defeat the purpose of an OUTER JOIN.
The FULL OUTER JOIN is a mix of the LEFT and RIGHT OUTER JOINs, with preserved rows constructed from both tables. The statement takes two tables and puts them in one result table. Again, this is easier to explain with an example than with a formal definition.
There is also a UNION JOIN in Standard SQL which returns the results of a FULL OUTER JOIN without the rows that were in the INNER JOIN of the two tables. The only SQL product which has implemented it as of 2014 is MariaDB (www.mariadb.com) and it is part of the SAS statistical system (www.sas.com) in the PROC SQL options.
T1 UNION JOIN T2 ON (T1.x = T2.x)

 a    | T1.x | b    | T2.x | Notes
------+------+------+------+-------------------
 2    | v    | NULL | NULL | preserved from T1
 3    | NULL | NULL | NULL | preserved from T1
 NULL | NULL | 8    | s    | preserved from T2
 NULL | NULL | 9    | NULL | preserved from T2
As an example of this, you might want to combine the medical records of male and female patients into one table with this query.
SELECT *
  FROM (SELECT 'male', prostate_flg FROM Males)
OUTER UNION
       (SELECT 'female', pregnancy_flg FROM Females);
to get a result table like this
Result

 male   | prostate_flg | female   | pregnancy_flg
--------+--------------+----------+--------------
 'male' | 'no'         | NULL     | NULL
 'male' | 'no'         | NULL     | NULL
 'male' | 'yes'        | NULL     | NULL
 'male' | 'yes'        | NULL     | NULL
 NULL   | NULL         | 'female' | 'no'
 NULL   | NULL         | 'female' | 'no'
 NULL   | NULL         | 'female' | 'yes'
 NULL   | NULL         | 'female' | 'yes'
Frédéric Brouard came up with a nice trick for writing a similar join: a join of one table, say a basic table of student data, with either a table of data particular to domestic students or another table of data particular to foreign students, based on the value of a parameter. This is one way to handle sub-types and super-types in SQL. It differs from a true UNION JOIN in that it has to have a “root” table to use for the OUTER JOINs.
CREATE TABLE Students
(student_nbr INTEGER NOT NULL PRIMARY KEY,
 student_type CHAR(1) NOT NULL DEFAULT 'D'
   CHECK (student_type IN ('D', 'F')),
 . . .);

CREATE TABLE Domestic_Students
(student_nbr INTEGER NOT NULL PRIMARY KEY
   REFERENCES Students (student_nbr),
 . . .);

CREATE TABLE Foreign_Students
(student_nbr INTEGER NOT NULL PRIMARY KEY
   REFERENCES Students (student_nbr),
 . . .);

SELECT Students.*, Domestic_Students.*, Foreign_Students.*
  FROM Students
       LEFT OUTER JOIN Domestic_Students
       ON Students.student_nbr = Domestic_Students.student_nbr
      AND CASE Students.student_type
          WHEN 'D' THEN 1 ELSE NULL END = 1
       LEFT OUTER JOIN Foreign_Students
       ON Students.student_nbr = Foreign_Students.student_nbr
      AND CASE Students.student_type
          WHEN 'F' THEN 1 ELSE NULL END = 1;
A SELECT expression that returns a single row with a single value can be used where a scalar expression can be used. If the result of the scalar query is empty, it is converted to a NULL. This will sometimes, but not always, let you write an OUTER JOIN as a query within the SELECT clause; thus, this query will work only if each supplier has one or zero orders:
SELECT sup_id, sup_name,
       (SELECT order_nbr
          FROM Orders
         WHERE Suppliers.sup_id = Orders.sup_id) AS order_nbr,
       (SELECT order_amt
          FROM Orders
         WHERE Suppliers.sup_id = Orders.sup_id) AS order_amt
  FROM Suppliers;
However, I could write
SELECT sup_id, sup_name,
       (SELECT COUNT(*)
          FROM Orders
         WHERE Suppliers.sup_id = Orders.sup_id)
  FROM Suppliers;
instead of writing
SELECT Suppliers.sup_id, sup_name, COUNT(Orders.order_nbr)
  FROM Suppliers
       LEFT OUTER JOIN Orders
       ON Suppliers.sup_id = Orders.sup_id
 GROUP BY Suppliers.sup_id, sup_name;
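The two forms can be checked against each other with Python's sqlite3 and hypothetical data. Counting a column from Orders, rather than COUNT(*), is what makes the grouped OUTER JOIN report zero instead of one for a supplier with no orders, since COUNT(*) would count the NULL-padded row itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Suppliers (sup_id INTEGER PRIMARY KEY, sup_name CHAR(10));
CREATE TABLE Orders
 (order_nbr INTEGER PRIMARY KEY, sup_id INTEGER, order_amt INTEGER);
INSERT INTO Suppliers VALUES (1,'Acme'), (2,'Best'), (3,'Cozy');
INSERT INTO Orders VALUES (901,1,100), (902,1,-100), (903,2,250);
""")

scalar = conn.execute("""
 SELECT sup_id, sup_name,
        (SELECT COUNT(*)
           FROM Orders
          WHERE Suppliers.sup_id = Orders.sup_id)
   FROM Suppliers
  ORDER BY sup_id""").fetchall()

grouped = conn.execute("""
 SELECT Suppliers.sup_id, sup_name, COUNT(Orders.order_nbr)
   FROM Suppliers
        LEFT OUTER JOIN Orders
        ON Suppliers.sup_id = Orders.sup_id
  GROUP BY Suppliers.sup_id, sup_name
  ORDER BY Suppliers.sup_id""").fetchall()

print(scalar == grouped)  # True; Cozy counts as 0 in both forms
```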
The infixed OUTER JOIN syntax was meant to replace several different vendor options which all had different syntax and semantics. It was absolutely needed. The INNER JOIN and OUTER JOIN operators are universal now. They are binary operators and programmers are used to binary operators—add, subtract, multiply, and divide are all binary operators. E-R diagrams use lines between tables to show a relational schema.
But this leads to a linear approach to problem solving that might not be such a good thing in SQL. Consider this statement which would have been written in the traditional syntax as:
SELECT a, b, c FROM Foo, Bar, Flub WHERE Foo.y BETWEEN Bar.x AND Flub.z;
With the infixed syntax, I can write this same statement in any of several ways. For example:
SELECT a, b, c FROM (Foo INNER JOIN Bar ON Foo.y >= Bar.x) INNER JOIN Flub ON Foo.y <= Flub.z;
or
SELECT a, b, c FROM Foo INNER JOIN (Bar INNER JOIN Flub ON Foo.y <= Flub.z) ON Foo.y >= Bar.x;
I leave it to the reader to find all the permutations, with or without the parentheses.
Humans tend to see things that are close together as a unit or as having a relationship. It is a law of visual psychology and typesetting called the Law of Proximity. The extra reserved words in the infixed notation tend to work against proximity; you have to look in many places to find the parts of a predicate.
The infixed notation invites a programmer to add one table at a time to the chain of JOINs. First I built and tested the Foo-Bar join and when I was happy with the results, I added Flub. Step-wise program refinement was one of the mantras of structured programming.
But look at the code; can you see that there is a BETWEEN relationship among the three tables? It is not easy, is it? In effect, you see only pairs of tables and not the whole problem. SQL is an “all-at-once” set-oriented language, not a “step-wise” language. This is much like the conceptual difference between addition with a simple binary + operator and the generalized n-ary summation operator with a Σ.
Am I against infixed JOINs? No, but they are a bit more complicated than they first appear, and if there are some OUTER JOINs in the mix, things can get very complicated. Just be careful with the new toys, kids.
We can relate two tables together based on quantities in each of them. These problems take the form of pairing items in one set with items in another. The extra restriction is that the set of pairs has constraints at the level of the result which a row-by-row join does not. Here the values are identifiers and cannot be repeated in the results.
Let us assume we have two tables, X and Y. Some possible situations are:
1. A row in X matches one and only one row in Y. There can be one matching function that applies to one set, or each set can have its own matching function.
An example of one matching function is an optimization with constraints. For example, you are filling an egg carton with a set of colored eggs given rules about how the colors can be arranged. A lot of logic puzzles use this model.
The classic example of two matching functions is the Stable Marriages problem where the men rank the women they want to marry and where the women rank the men they want to marry.
2. A row in X matches one or more rows in Y: knapsack or bin packing problems where one bin holds one or more items and we try to optimize the arrangement.
In all cases, there can be a unique answer or several answers or no valid answer at all. Let’s give some examples and code for them.
The simplest example is filling customer orders from the inventories that we have at various stores. To make life easier, assume that we have only one product, process orders in increasing customer_id order (this could be temporal order as well), and draw from store inventory by increasing store_id (this could be nearest store).
CREATE TABLE Inventory
(store_id INTEGER NOT NULL PRIMARY KEY,
 item_qty INTEGER NOT NULL CHECK (item_qty >= 0));

INSERT INTO Inventory (store_id, item_qty)
VALUES (10, 2), (20, 3), (30, 2);

CREATE TABLE Orders
(customer_id CHAR(5) NOT NULL PRIMARY KEY,
 item_qty INTEGER NOT NULL CHECK (item_qty > 0));

INSERT INTO Orders (customer_id, item_qty)
VALUES ('Bill', 4), ('Fred', 2);
What we want to do is fill Bill’s order for 4 units by taking 2 units from store #10 and 2 units from store #20. Next, we process Fred’s order with the 1 unit left in store #20 and 1 unit from store #30.
SELECT I.store_id, O.customer_id,
       (CASE WHEN O.end_running_qty <= I.end_running_qty
             THEN O.end_running_qty ELSE I.end_running_qty END
      - CASE WHEN O.start_running_qty >= I.start_running_qty
             THEN O.start_running_qty ELSE I.start_running_qty END)
       AS items_consumed_tally
  FROM (SELECT I1.store_id,
               SUM(I2.item_qty) - I1.item_qty,
               SUM(I2.item_qty)
          FROM Inventory AS I1, Inventory AS I2
         WHERE I2.store_id <= I1.store_id
         GROUP BY I1.store_id, I1.item_qty)
       AS I (store_id, start_running_qty, end_running_qty)
       INNER JOIN
       (SELECT O1.customer_id,
               SUM(O2.item_qty) - O1.item_qty,
               SUM(O2.item_qty)
          FROM Orders AS O1, Orders AS O2
         WHERE O2.customer_id <= O1.customer_id
         GROUP BY O1.customer_id, O1.item_qty)
       AS O (customer_id, start_running_qty, end_running_qty)
       ON O.start_running_qty < I.end_running_qty
      AND O.end_running_qty > I.start_running_qty;
This can also be done with ordinal functions.
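For example, in a dialect with window functions (SQL:2003 and later), the running quantities can be computed directly instead of with the triangular self-joins. This is a sketch under that assumption, not tested against any particular product:

```sql
-- Running totals via SUM() OVER replace the self-joins in the previous query.
WITH I (store_id, start_running_qty, end_running_qty)
AS (SELECT store_id,
           SUM(item_qty) OVER (ORDER BY store_id
                ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) - item_qty,
           SUM(item_qty) OVER (ORDER BY store_id
                ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
      FROM Inventory),
     O (customer_id, start_running_qty, end_running_qty)
AS (SELECT customer_id,
           SUM(item_qty) OVER (ORDER BY customer_id
                ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) - item_qty,
           SUM(item_qty) OVER (ORDER BY customer_id
                ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
      FROM Orders)
SELECT I.store_id, O.customer_id,
       (CASE WHEN O.end_running_qty <= I.end_running_qty
             THEN O.end_running_qty ELSE I.end_running_qty END
      - CASE WHEN O.start_running_qty >= I.start_running_qty
             THEN O.start_running_qty ELSE I.start_running_qty END)
       AS items_consumed_tally
  FROM I INNER JOIN O
    ON O.start_running_qty < I.end_running_qty
   AND O.end_running_qty > I.start_running_qty;
```

The overlap test in the ON clause is the same as before; only the construction of the running ranges changes, from O(n²) self-joins to a single pass per table.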
This is a classic programming problem from procedural language classes. The setup is fairly simple: you have a set of potential husbands and an equal-sized set of potential wives. We want to pair them up into stable marriages.
What is a stable marriage? In 25 words or less, a marriage in which neither partner can do better. You have a set of n men and a set of n women. All the men have a preference scale for all the women that ranks them from 1 to n without gaps or ties. The women have the same ranking system for the men. The goal is to pair off the men and women into n marriages such that there is no pair in your final arrangement where Mr. X and Ms. Y are matched to each other when they both would rather be matched to someone else.
For example, let’s assume the husbands are (‘Joe Celko,’ ‘Brad Pitt’) and the wives are (‘Jackie Celko,’ ‘Angelina Jolie’). If Jackie got matched to Mr. Pitt, she would be quite happy. And I would enjoy Ms. Jolie’s company. However, Mr. Pitt and Ms. Jolie can both do better than us. Once they are paired up, they will stay that way, leaving Jackie and me still wed.
The classic Stable Marriage algorithms are usually based on backtracking. These algorithms try a combination of couples and then attempt to fix any unhappy matches. When the algorithm hits a situation where nobody can improve their situation, it stops and gives an answer.
Two important things to know about this problem: (1) there is always a solution and (2) there is often more than one solution. Remember that a stable marriage is not always a happy marriage. In fact, in this problem, while there is always at least one arrangement of stable marriages in any set, you most often find many different pairings that produce a set of stable marriages. Each set of marriages will tend to maximize either the happiness of the men or the women.
CREATE TABLE Husbands
(man CHAR(2) NOT NULL,
 woman CHAR(2) NOT NULL,
 PRIMARY KEY (man, woman),
 ranking INTEGER NOT NULL);

CREATE TABLE Wives
(woman CHAR(2) NOT NULL,
 man CHAR(2) NOT NULL,
 PRIMARY KEY (woman, man),
 ranking INTEGER NOT NULL);

CREATE TABLE Wife_Perms
(perm INTEGER NOT NULL PRIMARY KEY,
 wife_name CHAR(2) NOT NULL);

-- The men's preferences
INSERT INTO Husbands -- husband #1
VALUES ('h1', 'w1', 5), ('h1', 'w2', 2), ('h1', 'w3', 6), ('h1', 'w4', 8),
       ('h1', 'w5', 4), ('h1', 'w6', 3), ('h1', 'w7', 1), ('h1', 'w8', 7);
INSERT INTO Husbands -- husband #2
VALUES ('h2', 'w1', 6), ('h2', 'w2', 3), ('h2', 'w3', 2), ('h2', 'w4', 1),
       ('h2', 'w5', 8), ('h2', 'w6', 4), ('h2', 'w7', 7), ('h2', 'w8', 5);
INSERT INTO Husbands -- husband #3
VALUES ('h3', 'w1', 4), ('h3', 'w2', 2), ('h3', 'w3', 1), ('h3', 'w4', 3),
       ('h3', 'w5', 6), ('h3', 'w6', 8), ('h3', 'w7', 7), ('h3', 'w8', 5);
INSERT INTO Husbands -- husband #4
VALUES ('h4', 'w1', 8), ('h4', 'w2', 4), ('h4', 'w3', 1), ('h4', 'w4', 3),
       ('h4', 'w5', 5), ('h4', 'w6', 6), ('h4', 'w7', 7), ('h4', 'w8', 2);
INSERT INTO Husbands -- husband #5
VALUES ('h5', 'w1', 6), ('h5', 'w2', 8), ('h5', 'w3', 2), ('h5', 'w4', 3),
       ('h5', 'w5', 4), ('h5', 'w6', 5), ('h5', 'w7', 7), ('h5', 'w8', 1);
INSERT INTO Husbands -- husband #6
VALUES ('h6', 'w1', 7), ('h6', 'w2', 4), ('h6', 'w3', 6), ('h6', 'w4', 5),
       ('h6', 'w5', 3), ('h6', 'w6', 8), ('h6', 'w7', 2), ('h6', 'w8', 1);
INSERT INTO Husbands -- husband #7
VALUES ('h7', 'w1', 5), ('h7', 'w2', 1), ('h7', 'w3', 4), ('h7', 'w4', 2),
       ('h7', 'w5', 7), ('h7', 'w6', 3), ('h7', 'w7', 6), ('h7', 'w8', 8);
INSERT INTO Husbands -- husband #8
VALUES ('h8', 'w1', 2), ('h8', 'w2', 4), ('h8', 'w3', 7), ('h8', 'w4', 3),
       ('h8', 'w5', 6), ('h8', 'w6', 1), ('h8', 'w7', 5), ('h8', 'w8', 8);

-- The women's preferences
INSERT INTO Wives -- wife #1
VALUES ('w1', 'h1', 6), ('w1', 'h2', 3), ('w1', 'h3', 7), ('w1', 'h4', 1),
       ('w1', 'h5', 4), ('w1', 'h6', 2), ('w1', 'h7', 8), ('w1', 'h8', 5);
INSERT INTO Wives -- wife #2
VALUES ('w2', 'h1', 4), ('w2', 'h2', 8), ('w2', 'h3', 3), ('w2', 'h4', 7),
       ('w2', 'h5', 2), ('w2', 'h6', 5), ('w2', 'h7', 6), ('w2', 'h8', 1);
INSERT INTO Wives -- wife #3
VALUES ('w3', 'h1', 3), ('w3', 'h2', 4), ('w3', 'h3', 5), ('w3', 'h4', 6),
       ('w3', 'h5', 8), ('w3', 'h6', 1), ('w3', 'h7', 7), ('w3', 'h8', 2);
INSERT INTO Wives -- wife #4
VALUES ('w4', 'h1', 8), ('w4', 'h2', 2), ('w4', 'h3', 1), ('w4', 'h4', 3),
       ('w4', 'h5', 7), ('w4', 'h6', 5), ('w4', 'h7', 4), ('w4', 'h8', 6);
INSERT INTO Wives -- wife #5
VALUES ('w5', 'h1', 3), ('w5', 'h2', 7), ('w5', 'h3', 2), ('w5', 'h4', 4),
       ('w5', 'h5', 5), ('w5', 'h6', 1), ('w5', 'h7', 6), ('w5', 'h8', 8);
INSERT INTO Wives -- wife #6
VALUES ('w6', 'h1', 2), ('w6', 'h2', 1), ('w6', 'h3', 3), ('w6', 'h4', 6),
       ('w6', 'h5', 8), ('w6', 'h6', 7), ('w6', 'h7', 5), ('w6', 'h8', 4);
INSERT INTO Wives -- wife #7
VALUES ('w7', 'h1', 6), ('w7', 'h2', 4), ('w7', 'h3', 1), ('w7', 'h4', 5),
       ('w7', 'h5', 2), ('w7', 'h6', 8), ('w7', 'h7', 3), ('w7', 'h8', 7);
INSERT INTO Wives -- wife #8
VALUES ('w8', 'h1', 8), ('w8', 'h2', 2), ('w8', 'h3', 7), ('w8', 'h4', 4),
       ('w8', 'h5', 5), ('w8', 'h6', 6), ('w8', 'h7', 1), ('w8', 'h8', 3);

-- This auxiliary table helps us create all permutations of the wives.
-- The perm values are powers of two, so a set of eight distinct wives
-- is exactly a set whose perm values sum to 255.
INSERT INTO Wife_Perms
VALUES (1, 'w1'), (2, 'w2'), (4, 'w3'), (8, 'w4'),
       (16, 'w5'), (32, 'w6'), (64, 'w7'), (128, 'w8');
The query builds all permutations of the wives and then filters them for blocking pairs with an elaborate NOT EXISTS() predicate.
SELECT A.wife_name AS h1, B.wife_name AS h2, C.wife_name AS h3,
       D.wife_name AS h4, E.wife_name AS h5, F.wife_name AS h6,
       G.wife_name AS h7, H.wife_name AS h8
  FROM Wife_Perms AS A, Wife_Perms AS B, Wife_Perms AS C, Wife_Perms AS D,
       Wife_Perms AS E, Wife_Perms AS F, Wife_Perms AS G, Wife_Perms AS H
 WHERE A.perm + B.perm + C.perm + D.perm
     + E.perm + F.perm + G.perm + H.perm = 255
   AND NOT EXISTS
       (SELECT *
          FROM Husbands AS W, Husbands AS X, Wives AS Y, Wives AS Z
         WHERE W.man = X.man
           AND W.ranking > X.ranking
           AND X.woman = Y.woman
           AND Y.woman = Z.woman
           AND Y.ranking > Z.ranking
           AND Z.man = W.man
           AND W.man || W.woman
               IN ('h1' || A.wife_name, 'h2' || B.wife_name,
                   'h3' || C.wife_name, 'h4' || D.wife_name,
                   'h5' || E.wife_name, 'h6' || F.wife_name,
                   'h7' || G.wife_name, 'h8' || H.wife_name)
           AND Y.man || Y.woman
               IN ('h1' || A.wife_name, 'h2' || B.wife_name,
                   'h3' || C.wife_name, 'h4' || D.wife_name,
                   'h5' || E.wife_name, 'h6' || F.wife_name,
                   'h7' || G.wife_name, 'h8' || H.wife_name));
h1 | h2 | h3 | h4 | h5 | h6 | h7 | h8 |
w3 | w6 | w4 | w8 | w1 | w5 | w7 | w2 |
w3 | w6 | w4 | w1 | w7 | w5 | w8 | w2 |
w6 | w4 | w3 | w8 | w1 | w5 | w7 | w2 |
w6 | w3 | w4 | w8 | w1 | w5 | w7 | w2 |
w6 | w4 | w3 | w1 | w7 | w5 | w8 | w2 |
w6 | w3 | w4 | w1 | w7 | w5 | w8 | w2 |
w2 | w4 | w3 | w8 | w1 | w5 | w7 | w6 |
w2 | w4 | w3 | w1 | w7 | w5 | w8 | w6 |
w7 | w4 | w3 | w8 | w1 | w5 | w2 | w6 |
REFERENCES:
Gusfield, Dan and Irving, Robert W. The Stable Marriage Problem: Structure and Algorithms. ISBN 0-262-07118-5.
Knuth, Donald E. “Stable Marriage and Its Relation to Other Combinatorial Problems.” CRM Proceedings & Lecture Notes, Vol. 10. ISBN 0-8218-0603-3.
This booklet, which reproduces seven expository lectures given by the author in November 1975, is a gentle introduction to the analysis of algorithms using the beautiful theory of stable marriages as a vehicle to explain the basic paradigms of that subject.
Wirth, Niklaus. Algorithms + Data Structures = Programs. Section 3.6. ISBN 0-13-022418-9.
This section gives an answer in Pascal and a short analysis of the algorithm. In particular, I used his data for my example. He gives several answers with varying degrees of happiness for the husbands and wives.
This example was taken from the BeyondRelational Website SQL Challenge #22 in January 2010. We have some boxes and balls, and our job is to put the balls into the boxes. But wait a second: the balls should be placed in the boxes based on rules and preferences configured by the user. Here are the rules.
1. A box can hold only one ball.
2. A ball can be placed in only one box.
3. The number of balls and the number of boxes will always be the same.
4. All boxes should be filled and all balls should be used.
5. There will be a configuration table where the preferences of the user are stored. The preference settings should be followed when putting a ball into a box.
CREATE TABLE Boxes
(box_nbr INTEGER NOT NULL PRIMARY KEY,
 box_name VARCHAR(20) NOT NULL);

INSERT INTO Boxes (box_nbr, box_name)
VALUES (1, 'Box 1'), (2, 'Box 2'), (3, 'Box 3'),
       (4, 'Box 4'), (5, 'Box 5'), (6, 'Box 6');

CREATE TABLE Balls
(ball_nbr INTEGER NOT NULL PRIMARY KEY,
 ball_name VARCHAR(20) NOT NULL);

INSERT INTO Balls (ball_nbr, ball_name)
VALUES (1, 'Ball 1'), (2, 'Ball 2'), (3, 'Ball 3'),
       (4, 'Ball 4'), (5, 'Ball 5'), (6, 'Ball 6');

CREATE TABLE Preferences
(box_nbr INTEGER NOT NULL REFERENCES Boxes (box_nbr),
 ball_nbr INTEGER NOT NULL REFERENCES Balls (ball_nbr),
 PRIMARY KEY (box_nbr, ball_nbr));

INSERT INTO Preferences (box_nbr, ball_nbr)
VALUES (1, 1), (2, 1), (2, 3), (3, 2), (3, 3),
       (4, 1), (4, 2), (4, 3), (4, 4), (4, 5), (4, 6),
       (5, 4), (5, 5), (6, 5);
This answer is done in parts to expose the logic via CTEs. The Balls_In_Boxes CTE gives us all the possible arrangements of six balls in six boxes. This is passed to the Preferred_Balls_In_Boxes CTE to apply the preference rules but allow duplicate balls if two or more boxes want them. Finally, the main query makes sure that we keep only the rows with unique balls.
The use of the IN() predicates to assure that the row has no duplicate columns is easy to extend to any number of items, but a bit bulky to read. But it is remarkably fast in the SQL engines where we tested it.
WITH Balls_In_Boxes (bx1, bx2, bx3, bx4, bx5, bx6)
AS (SELECT B1.ball_nbr, B2.ball_nbr, B3.ball_nbr,
           B4.ball_nbr, B5.ball_nbr, B6.ball_nbr
      FROM Balls AS B1, Balls AS B2, Balls AS B3,
           Balls AS B4, Balls AS B5, Balls AS B6
     WHERE B1.ball_nbr NOT IN (B2.ball_nbr, B3.ball_nbr, B4.ball_nbr,
                               B5.ball_nbr, B6.ball_nbr)
       AND B2.ball_nbr NOT IN (B3.ball_nbr, B4.ball_nbr, B5.ball_nbr,
                               B6.ball_nbr)
       AND B3.ball_nbr NOT IN (B4.ball_nbr, B5.ball_nbr, B6.ball_nbr)
       AND B4.ball_nbr NOT IN (B5.ball_nbr, B6.ball_nbr)
       AND B5.ball_nbr NOT IN (B6.ball_nbr)),

     Preferred_Balls_In_Boxes (bx1, bx2, bx3, bx4, bx5, bx6)
AS (SELECT bx1, bx2, bx3, bx4, bx5, bx6
      FROM Balls_In_Boxes AS BB
     WHERE BB.bx1 IN (SELECT ball_nbr FROM Preferences WHERE box_nbr = 1)
       AND BB.bx2 IN (SELECT ball_nbr FROM Preferences WHERE box_nbr = 2)
       AND BB.bx3 IN (SELECT ball_nbr FROM Preferences WHERE box_nbr = 3)
       AND BB.bx4 IN (SELECT ball_nbr FROM Preferences WHERE box_nbr = 4)
       AND BB.bx5 IN (SELECT ball_nbr FROM Preferences WHERE box_nbr = 5)
       AND BB.bx6 IN (SELECT ball_nbr FROM Preferences WHERE box_nbr = 6))

SELECT bx1, bx2, bx3, bx4, bx5, bx6
  FROM Preferred_Balls_In_Boxes
 WHERE bx1 NOT IN (bx2, bx3, bx4, bx5, bx6)
   AND bx2 NOT IN (bx3, bx4, bx5, bx6)
   AND bx3 NOT IN (bx4, bx5, bx6)
   AND bx4 NOT IN (bx5, bx6)
   AND bx5 NOT IN (bx6);
In the second version of the relational model in 1990, Dr. E. F. Codd introduced a set of new theta operators, called T-operators, which were based on the idea of a best fit or approximate equality (Codd, 1990). The algorithm for the operators is easier to understand with an example modified from Dr. Codd.
The problem is to assign the classes to the available classrooms. We want (class_size < room_size) to be true after the assignments are made. This will allow us a few empty seats in each room for late students. We can do this in one of two ways. The first way is to sort the tables in ascending order by classroom size and the number of students in a class. We start with the following tables and load them with the data that follows the DDL.
CREATE TABLE Rooms (room_nbr CHAR(3) NOT NULL PRIMARY KEY, room_size INTEGER NOT NULL);
CREATE TABLE Classes (class_nbr CHAR(3) NOT NULL PRIMARY KEY, class_size INTEGER NOT NULL);
The goal of the T-JOIN problem is to assign a class which is smaller than the classroom given it (class_size < room_size). Dr. Codd gives two approaches to the problem.
(1) Ascending Order Algorithm
Sort both tables into ascending order. Reading from the top of the Rooms table, match each class with the first room that will fit.
(2) Descending Order Algorithm
Sort both tables into descending order. Reading from the top of the Classes table, match each class with the first room that will fit.
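In a dialect with ROW_NUMBER(), a rough sketch of the ascending-order pairing is to number both tables by size and join on position. This is only an approximation of Dr. Codd's scan, because a true first-fit reads sequentially and can skip rooms, which a positional join does not capture:

```sql
-- Approximate the ascending-order pairing by ranking both tables by size,
-- then keep only the positional pairs that satisfy the T-JOIN condition.
WITH C AS
 (SELECT class_nbr, class_size,
         ROW_NUMBER() OVER (ORDER BY class_size) AS pos
    FROM Classes),
R AS
 (SELECT room_nbr, room_size,
         ROW_NUMBER() OVER (ORDER BY room_size) AS pos
    FROM Rooms)
SELECT C.class_nbr, C.class_size, R.room_nbr, R.room_size
  FROM C INNER JOIN R
    ON C.pos = R.pos
 WHERE C.class_size < R.room_size; -- drop positional pairs that do not fit
```

The descending-order version is the same sketch with both ORDER BY clauses reversed.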
Classes | | Rooms | |
class_nbr | class_size | room_nbr | room_size |
c1 | 80 | r4 | 85 |
c2 | 70 | r1 | 70 |
c3 | 65 | r6 | 65 |
c4 | 55 | r7 | 55 |
c5 | 50 | r3 | 50 |
c6 | 40 | r2 | 40 |
NULL | NULL | r5 | 30 |
Notice that the answers are different. Dr. Codd never gave a definition of the T-Join in relational algebra, so I proposed that we need one. Informally, for each class, we want the smallest room that will hold it, while maintaining the T-JOIN condition. Or for each room, we want the largest class that will fill it, while maintaining the T-JOIN condition. These can be two different things, so you must decide which table is the driver. But either way, I am advocating a “best fit” over Codd’s “first fit” approach.
Other theta conditions can be used in place of the “less than” shown here. If “less than or equal” is used, all the classes are assigned to a room in this case, but not in all cases. This is left to the reader as an exercise.
The first attempts in Standard SQL are versions of grouped by queries. They can, however, produce some rows that would be left out of the answers Dr. Codd was expecting. The first JOIN can be written as
SELECT class_nbr, class_size, MIN(room_size)
  FROM Rooms, Classes
 WHERE Classes.class_size < Rooms.room_size
 GROUP BY class_nbr, class_size;
This will give a result table with the desired room sizes, but not the room numbers. You cannot put the other columns in the SELECT list, since it would conflict with the GROUP BY clause. But also note that the classroom with 85 seats (‘r4’) is used twice, once by class ‘c1’ and then by class ‘c2’:
If you do a little arithmetic on the data, you find that we have 360 students and 395 seats, 6 classes and 7 rooms. Do you want to use the smallest number of rooms?
As it works out, the best fit of rooms to classes will leave the smallest room empty and pack the other rooms to capacity, thus:
SELECT class_nbr, class_size, MIN(room_size)
  FROM Rooms, Classes
 WHERE Classes.class_size <= Rooms.room_size
 GROUP BY class_nbr, class_size;
The place to start is with all the possible legal room assignments for a class. We have already seen this query:
SELECT R.room_nbr, C.class_nbr
  FROM Rooms AS R, Classes AS C
 WHERE C.class_size <= R.room_size
 ORDER BY R.room_nbr, C.class_nbr;
At the extreme, if all the rooms and classes are the same size, then you have (n!) solutions. If all the rooms are different sizes, we can save ourselves a combinatorial explosion, so let us agree to juggle the data to get that condition. This query will give us the pairs that are an exact fit, and we know that there will not be any ties for room size.
SELECT R.room_nbr, C.class_nbr
  FROM Rooms AS R, Classes AS C
 WHERE C.class_size <= R.room_size
 GROUP BY R.room_nbr, C.class_nbr
HAVING MIN(R.room_size - C.class_size) = 0;
This leaves us with class {c1} and rooms {r4, r5} yet to be used. We can then repeat a limited version of the basic Pairs query.
SELECT R.room_nbr, C.class_nbr
  FROM Rooms AS R, Classes AS C
 WHERE C.class_size <= R.room_size
   AND R.room_nbr IN ('r4', 'r5')
   AND C.class_nbr IN ('c1');
We can now union these result sets and have an answer. I took some extra time to show the details to demonstrate how we can implement a best-fit, greedy algorithm. Let us get a bigger data set and work with it.
INSERT INTO Classes (class_nbr, class_size)
VALUES ('c01', 106), ('c02', 105), ('c03', 104), ('c04', 100), ('c05', 99),
       ('c06', 90), ('c07', 89), ('c08', 88), ('c09', 83), ('c10', 82),
       ('c11', 81), ('c12', 65), ('c13', 50), ('c14', 49), ('c15', 30),
       ('c16', 29), ('c17', 28), ('c18', 20), ('c19', 19);

INSERT INTO Rooms (room_nbr, room_size)
VALUES ('r01', 102), ('r02', 101), ('r03', 95), ('r04', 94), ('r05', 85),
       ('r06', 70), ('r07', 55), ('r08', 54), ('r09', 35), ('r10', 34),
       ('r11', 25), ('r12', 18);
To see how this will work, let’s add another column to the table of legal (class_nbr, room_nbr) pairs to see how well they fit.
WITH Pairs (room_nbr, class_nbr, fit)
AS (SELECT R.room_nbr, C.class_nbr, (R.room_size - C.class_size)
      FROM Rooms AS R, Classes AS C
     WHERE C.class_size <= R.room_size)
SELECT P1.room_nbr, P1.class_nbr, fit
  FROM Pairs AS P1
 WHERE P1.fit = (SELECT MIN(P2.fit)
                   FROM Pairs AS P2
                  WHERE P1.room_nbr = P2.room_nbr);
This time we did not have an exact fit, so let’s put the pairs with (fit = 1) into a working table and remove ‘r02’ and ‘c04’ from their tables. The second step is:
We can now remove the best-fit rooms and classes and look at the next set of remaining best fits.
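To make the “remove and repeat” step concrete, here is a sketch, assuming we are working on throwaway copies of the tables so the originals survive; the pair (‘r02’, ‘c04’) is the (fit = 1) match found above:

```sql
-- Greedy step: remove the matched room and class, then re-run the Pairs
-- query against the shrunken tables to find the next-best fits.
DELETE FROM Rooms WHERE room_nbr = 'r02';
DELETE FROM Classes WHERE class_nbr = 'c04';
```

In a production version, the matched pairs would be saved into the working table of assignments before each round of deletions.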
WITH Pairs (room_nbr, class_nbr, fit)
AS (SELECT R.room_nbr, C.class_nbr, (R.room_size - C.class_size)
      FROM Rooms AS R, Classes AS C
     WHERE C.class_size <= R.room_size)
SELECT P1.room_nbr, P1.class_nbr, fit
  FROM Pairs AS P1
 WHERE P1.fit = 2;
Repeat the same step for a fit of 3, 4, 5, and 6.
At this point, the original tables of rooms and classes cannot be paired. Notice that as we removed rooms and classes, the fit numbers changed. This is a characteristic of greedy algorithms; taking the “big bites” can leave sub-optimal leftovers on the plate.
In 2014, Dwain Camps published an article on filling gaps in a sequence with the nearest prior value and then doing a linear distribution of the delta (changes). His name for the problem was the “Data Smear,” and he found several solutions with window functions. Let’s begin with a table of fictional meter readings that is full of gaps.
CREATE TABLE Foo_Meters
(reading_nbr INTEGER NOT NULL PRIMARY KEY,
 foo_reading INTEGER); -- NULL means a missing reading

INSERT INTO Foo_Meters
VALUES (1, 121), (2, NULL), (3, NULL), (4, 312),
       (5, NULL), (6, NULL), (7, NULL), (8, 123),
       (9, NULL), (10, NULL), (11, 415), (12, NULL),
       (13, NULL), (14, NULL), (15, NULL), (16, 200);
The number of NULL (unknown) values between known values is not consistently the same. I arranged them on the same line to make it easy to see which values would be carried over which rows.
SELECT reading_nbr, foo_reading,
       MAX(foo_reading) OVER (PARTITION BY reading_cnt)
       AS prior_foo_reading
  FROM (SELECT reading_nbr, foo_reading,
               COUNT(foo_reading) OVER (ORDER BY reading_nbr) AS reading_cnt
          FROM Foo_Meters) AS F;
Suppose we want something slightly different to fill in the NULL values, for example, each row’s offset within its gap. The same PARTITION we constructed can be used in this case, along with the ROW_NUMBER() from the preceding example.
SELECT reading_nbr, foo_reading,
       COALESCE(foo_reading,
                ROW_NUMBER() OVER (PARTITION BY c ORDER BY reading_nbr) - 1)
       AS s
  FROM (SELECT reading_nbr, foo_reading,
               COUNT(foo_reading) OVER (ORDER BY reading_nbr) AS c
          FROM Foo_Meters) AS A;
Suppose we want a bit smoother transition from our trailing value to our following value. This could be particularly appealing if we have many NULL rows between sparse data points. Let’s illustrate this by looking at the first four rows of the above results set.
reading_nbr | foo_reading | linear_reading |
1 | 121 | 121 |
2 | NULL | 121 + (1/3) * (312 - 121) = 184.6667 |
3 | NULL | 121 + (2/3) * (312 - 121) = 248.3333 |
4 | 312 | 312 |
First let’s look at some intermediate results using ROW_NUMBER() and the COUNT window aggregate.
SELECT reading_nbr, foo_reading, s, m, x
  FROM (SELECT reading_nbr, foo_reading,
               MAX(foo_reading) OVER (PARTITION BY c) AS s,
               ROW_NUMBER() OVER (PARTITION BY c ORDER BY reading_nbr DESC)
                 AS n,
               ROW_NUMBER() OVER (PARTITION BY c ORDER BY reading_nbr) - 1
                 AS m,
               1 + COUNT(CASE WHEN foo_reading IS NULL THEN 1 END)
                   OVER (PARTITION BY c) AS x
          FROM (SELECT reading_nbr, foo_reading,
                       COUNT(foo_reading) OVER (ORDER BY reading_nbr) AS c
                  FROM Foo_Meters) AS A
       ) AS B;
Our intermediate results look pretty useful:
reading_nbr | foo_reading | s | m | x |
1 | 121 | 121 | 0 | 3 |
2 | NULL | 121 | 1 | 3 |
3 | NULL | 121 | 2 | 3 |
4 | 312 | 312 | 0 | 4 |
5 | NULL | 312 | 1 | 4 |
6 | NULL | 312 | 2 | 4 |
7 | NULL | 312 | 3 | 4 |
8 | 123 | 123 | 0 | 3 |
9 | NULL | 123 | 1 | 3 |
10 | NULL | 123 | 2 | 3 |
11 | 415 | 415 | 0 | 5 |
12 | NULL | 415 | 1 | 5 |
13 | NULL | 415 | 2 | 5 |
14 | NULL | 415 | 3 | 5 |
15 | NULL | 415 | 4 | 5 |
16 | 200 | 200 | 0 | 1 |
You can see that all three of our calculated columns are what we need to produce the smoothing calculation shown earlier. The first, s, is of course our original smear. The columns m and x, when m is divided by x, produce the multiplier we need. We have omitted the LEAD result, but we will use that in the end (so n is likewise still required).
Putting that all together, we get this.
SELECT reading_nbr, foo_reading,
       CASE WHEN foo_reading IS NOT NULL
            THEN foo_reading
            ELSE s + (1.0 * m / x)
                   * (LEAD(foo_reading, n, s) OVER (ORDER BY reading_nbr) - s)
       END AS computed_foo_reading
  FROM (SELECT reading_nbr, foo_reading,
               MAX(foo_reading) OVER (PARTITION BY c) AS s,
               ROW_NUMBER() OVER (PARTITION BY c ORDER BY reading_nbr DESC)
                 AS n,
               ROW_NUMBER() OVER (PARTITION BY c ORDER BY reading_nbr) - 1
                 AS m,
               1 + COUNT(CASE WHEN foo_reading IS NULL THEN 1 ELSE NULL END)
                   OVER (PARTITION BY c) AS x
          FROM (SELECT reading_nbr, foo_reading,
                       COUNT(foo_reading) OVER (ORDER BY reading_nbr) AS c
                  FROM Foo_Meters) AS F1
       ) AS F2;
Collecting data is not always easy. When you sample or stage data, you can wind up with rows that have missing data (usually shown with NULLs) or overlapping data that should be reduced to a single row. We will need business rules, and they will depend on the particulars of each problem. The rules we want in this example are:
1. If we have multiple starting dates, then use the earliest known date. If there is a NULL in the data, then use it since we are uncertain.
2. If we have multiple ending dates, then use the latest known date. If there is a NULL in the data, then use it since we are uncertain.
As an example, assume we are taking samples of some kind within a date range, but the data is messy and we have multiple rows for multiple samples we have to reduce to a single row. Here is the DDL and some sample data.
CREATE TABLE Samples
(sample_id INTEGER NOT NULL,
 first_date DATE,
 second_date DATE,
 UNIQUE (sample_id, first_date, second_date),
 CHECK (first_date < second_date));
Note the use of UNIQUE rather than PRIMARY KEY because of the NULL-able columns. Many new SQL programmers do not know that UNIQUE does not imply NOT NULL, as PRIMARY KEY does.
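As a quick illustration of that difference, these hypothetical rows (not part of the sample data below) show the Standard SQL behavior; be warned that some products treat NULLs as equal in unique indexes and will reject the second row:

```sql
-- UNIQUE does not imply NOT NULL: under Standard SQL, both inserts succeed,
-- because the constraint only rejects two rows whose constrained columns
-- are all non-NULL and equal, and NULLs never compare equal.
INSERT INTO Samples VALUES (6, NULL, NULL);
INSERT INTO Samples VALUES (6, NULL, NULL);
```

A PRIMARY KEY on the same columns would have rejected the first row outright, since every PRIMARY KEY column is implicitly NOT NULL.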
INSERT INTO Samples
VALUES (1, '2012-02-24', '2012-02-27'),
       (1, '2012-02-24', '2012-03-06'),
       (2, '2012-02-24', '2012-03-06'),
       (2, '2012-01-05', '2012-01-06'),
       (3, NULL, '2012-03-07'),
       (3, '2012-02-01', '2012-03-07'),
       (4, '2012-02-01', NULL),
       (4, '2012-01-01', '2012-02-10'),
       (5, NULL, NULL);
Samples one and two have known values on their date ranges, so they are easy. But the other samples have missing dates. Normally, doing an aggregate will drop the NULLs instead of making them dominate. I am calling it “first_date” and “second_date” since we have not decided which start and end dates we will finally use. The names “candidate_start_date” and “candidate_end_date” are a bit stuffy. The decision is to use the latest ending date.
SELECT sample_id,
       CASE WHEN COUNT(*) <> COUNT(first_date)
            THEN NULL ELSE MIN(first_date) END AS first_date,
       CASE WHEN COUNT(*) <> COUNT(second_date)
            THEN NULL ELSE MAX(second_date) END AS second_date
  FROM Samples
 GROUP BY sample_id;
The CASE expressions work because they are at the same level of aggregation as the GROUP BY; the grouping on sample_id is what allows this construct.