By set operations, I mean union, intersection, and set differences where the sets in SQL are tables. These are the basic operators used in elementary set theory, which has been taught in the United States public school systems for decades. Since the relational model is based on sets, you would expect that SQL would have had a good variety of set operators. But there is a problem in SQL that you did not have in high school set theory. SQL tables are multisets (also called bags), which means that, unlike sets, they allow duplicate elements (rows or tuples). SQL handles these duplicate rows with an ALL or DISTINCT modifier in different places in the language; ALL preserves duplicates and DISTINCT removes them.
So that we can discuss the result of each operator formally, let R be a row that is a duplicate of some row in TableA, or of some row in TableB, or of both. Let m be the number of duplicates of R in TableA and let n be the number of duplicates of R in TableB, where (m >=0) and (n >= 0). Informally, the engines will pair off the two tables on a row-per-row basis in set operations. We will see how this works for each operator.
For the rest of this discussion, let us create two skeleton tables with the same structure, which we can use for examples.
CREATE TABLE S1 (a1 CHAR(1)); INSERT INTO S1 VALUES ('a'), ('a'), ('b'), ('b'), ('c'), CREATE TABLE S2 (a2 CHAR(1)); INSERT INTO S2 VALUES ('a'), ('b'), ('b'), ('b'), ('c'), ('d'),
UNIONs have been supported since SQL-86, with this infixed syntax:
< table expression > UNION [ALL] < table expression >
The two versions of the UNION statement take two tables and build a result table from them. The two tables must be union compatible, which means that they have exactly the same number of columns, and that each column in the first table has the same data type (or automatically cast to it) as the column in the same position in the second table. That is, their rows have the same structure, so they can be put in the same final result table. Most implementations will do some data type conversions to create the result table, but this can be implementation-dependent and you should check it out for yourself.
There are two forms of the UNION statement: the UNION and the UNION ALL. The simple UNION is the same operator you had in high school set theory; it returns the rows that appear in either or both tables and removes redundant duplicates from the result table.
The phrase "redundant duplicates" sounds funny; but it means that you leave one copy of the row in the table. The problem is that string equality in SQL works by padding the shorter string with trailing blanks, then comparing them position by position. Which string is retained is implementation defined.
The sample tables will yield:
(SELECT a1 FROM S1 UNION SELECT a2 FROM S2)
In many early SQL implementations, merge-sorting the two tables discarding duplicates during the sorting did this removal. This had the side effect that the result table is sorted, but you cannot depend on that. Later implementations use hashing, indexing and parallel processing to find the duplicates.
The UNION ALL preserves the duplicates from both tables in the result table. Most early implementations simply appended one table to the other in physical storage. They used file systems based on physically contiguous storage, so this was easy and used the file system code. But, again, you cannot depend on any ordering in the results of either version of the UNION statement. Again, the sample tables will yield:
(SELECT a1 FROM S1 UNION ALL SELECT a2 FROM S2)
You can assign names to the columns by using the AS operator to make the result set into a derived table, thus:
SELECT rent, utilities, phone FROM (SELECT a, b, c FROM Old_Locations WHERE city_name = 'Boston' UNION SELECT x, y, z FROM New_Locations WHERE city_name = 'New York') AS Cities (rent, utilities, phone);
A few SQL products will attempt to optimize UNIONs if they are made on the same table. Those UNIONs can often be replaced with OR-ed predicates. For example:
SELECT city_name 'Western' FROM Cities WHERE market_code = 't' UNION ALL SELECT city_name 'Eastern' FROM Cities WHERE market_code = 'v';
could be rewritten (probably more efficiently) as follows:
SELECT city_name CASE market_code WHEN 't' THEN 'Western' WHEN 'v' THEN 'Eastern' END FROM Cities WHERE market_code IN ('v', 't'),
It takes system architecture based on domains rather than tables to optimize UNIONs if they are made on different tables.
Doing a UNION to the same table is the same as a SELECT DISTINCT, but the SELECT DISTINCT will probably run faster and preserve the column names too.
UNION and UNION ALL operators are executed from left to right unless parentheses change the order of execution. Since the UNION operator is associative and commutative, the order of a chain of UNIONs will not affect the results. However, order and grouping can affect performance. Consider two tables that have many duplicates between them. If the optimizer does not consider table sizes, this query
(SELECT * FROM Small_Table1) UNION (SELECT * FROM Big_Table) UNION (SELECT * FROM Small_Table2);
will merge Small_Table1 into Big_Table, and then merge Small_Table2 into that first result. If the rows of Small_Table1 are spread out in the first result table, locating duplicates from Small_Table2 will take longer than if we had written the query thus:
(SELECT * FROM Small_Table1) UNION (SELECT * FROM Small_Table2) UNION (SELECT * FROM Big_Table);
Again, optimization of UNIONs is highly product-dependent, so you should experiment with it.
If you know that there are no duplicates, or that duplicates are not a problem in your situation, use UNION ALL instead of UNION for speed. For example, if we are sure that Big_Table has no duplicates in common with Small_Table1 and Small_Table2, this query will produce the same results as before but should run much faster:
((SELECT * FROM Small_Table1) UNION (SELECT * FROM Small_Table2)) -– intermediate set UNION ALL (SELECT * FROM Big_Table);
But be careful when mixing UNION and UNION ALL operators. The left-to-right order of execution will cause the last operator in the chain to have an effect on the results.
A useful trick for building the union of columns from the same table is to use a CROSS JOIN and a CASE expression, thus
SELECT CASE WHEN S1.seq = 1 THEN F1.col1 WHEN S1.seq = 2 THEN F1.col2 ELSE NULL END FROM Foobar AS F1 CROSS JOIN Series AS S1(seq) WHERE S1.seq IN (1, 2)
This acts like the UNION ALL statement, but change the SELECT to SELECT DISTINCT and you have a UNION. The advantage of this statement over the more obvious UNION is that it makes one pass through the table. Given a large table, that can be important for good performance.
The INTERSECT and EXCEPT set operators take two tables and build a new table from them. The two tables must be union compatible, just like with the UNION [ALL] operator. Like the UNION [ALL], the result of an INTERSECT [ALL] or EXCEPT [ALL] should use an AS operator if you want to have names for the result table and its columns.
Oracle was the first major vendor to have the EXCEPT operator with the keyword MINUS. The set difference is the rows in the first table, except for those that also appear in the second table. It answers questions like "Give me all the employees except the salesmen" in a natural manner.
Let's take our two multisets and use them to explain the basic model, by making a mapping between them:
S1 = {a, a, b, b, c } ↕ ↕ ↕ ↕ S2 = {a, b, b, b, c, d}
The INTERSECT and EXCEPT operators remove all duplicates from both sets, so we would have:
S1 = {a, b, c } ↕ ↕ ↕ S2 = {a, b, c, d}
and therefore,
S1 INTERSECT S2 = {a, b, c}
and
S2 EXCEPT S1 = {d} S1 EXCEPT S2 = {}
When you add the ALL option, things are trickier. The mapped pairs become the unit of work. The INTERSECT ALL keeps each pairing, so that
S1 INTERSECT ALL S2 = {a, b, b, c}
and the EXCEPT ALL throws them away, retain what is left in the first set, thus.
S2 EXCEPT ALL S1 = {b, d}
Trying to write the INTERSECT and EXCEPT with other operators is trickier than it looks. It has to be general enough to handle situations where there is no key available and the number of columns is not known.
Standard SQL defines the actions for duplicates in terms of the count of duplicates of matching rows. Let (m) be the number of rows of one kind in S1 and (n) be the number in S2. The UNION ALL will have (m + n) copies of the row. The INTERSECT ALL will have LEAST(m, n) copies. EXCEPT ALL will have the greater of either the first table's count minus the second table's count or zero copies.
The immediate impulse of a programmer is to write the code with EXISTS() predicates. The bad news is that it does not work because of NULLs. This is easier to show with code. Let's redo our two sample tables.
CREATE TABLE S1 (a1 CHAR(1)); INSERT INTO S1 VALUES ('a'), ('a'), ('b'), ('b'), ('c'), (NULL), (NULL); CREATE TABLE S2 (a2 CHAR(1)); INSERT INTO S2 VALUES ('a'), ('b'), ('b'), ('b'), ('c'), ('d'), (NULL);
Now build a view to hold the duplication_cnt of each value in each table.
CREATE VIEW DupCounts (a, s1_dup, s2_dup) AS SELECT S.a, SUM(s1_dup), SUM(s2_dup) FROM (SELECT S1.a1, 1, 0 FROM S1 UNION ALL SELECT S2.a2, 0, 1 FROM S2) AS S(a, s1_dup, s2_dup) GROUP BY S.a, s1_dup, s2_dup;
The GROUP BY will put the NULLs into a separate group giving them the right tallies. Now code is a straightforward implementation of the definitions in Standard SQL.
-- S1 EXCEPT ALL S2 SELECT DISTINCT D1.a, (s1_dup - s2_dup) AS dups FROM DupCounts AS D1, Series AS S1 WHERE S1.seq <= (s1_dup - s2_dup); -- S1 INTERSECT ALL S2 SELECT DISTINCT D1.a, CASE WHEN s1_dup <= s2_dup THEN s1_dup ELSE s2_dup END AS duplication_cnt FROM DupCounts AS D1, Series AS S1 WHERE S1.seq <= CASE WHEN s1_dup <= s2_dup THEN s1_dup ELSE s2_dup END;
Notice that we had to use SELECT DISTINCT. Without it, the sample data will produce this table.
The nonduplicated versions are easy to write from the definitions in the Standards. In effect their duplication tallies are set to one.
-- S1 INTERSECT S2 SELECT D1.a FROM DupCounts AS D1 WHERE s1_dup > 0 AND s2_dup > 0; -- S1 EXCEPT S2 SELECT D1.a FROM DupCounts AS D1 WHERE s1_dup > 0 AND s2_dup = 0; -- S2 EXCEPT S1 SELECT D1.a FROM DupCounts AS D1 WHERE s2_dup > 0 AND s1_dup = 0;
INTERSECT and EXCEPT are much easier if each of the two tables does not have NULLs and duplicate values in them. Intersection is simply
SELECT * FROM S1 WHERE EXISTS (SELECT * FROM S2 WHERE S1.a1 = S2.a2);
or
SELECT * FROM S2 WHERE EXISTS (SELECT * FROM S1 WHERE S1.a1 = S2.a2);
You can also use
SELECT DISTINCT S2.* FROM (S2 INNER JOIN S1 ON S1.a1 = S2.a2);
This is given as a motivation for the next piece of code, but you may find that some SQL engines do joins faster than EXISTS() predicates and vice versa, so it is a good idea to have more than one trick in your bag.
The set difference can be written with an OUTER JOIN operator. This code is due to Jim Panttaja.
SELECT DISTINCT S2.* FROM (S2 LEFT OUTER JOIN S1 ON S1.a1 = S2.a2) WHERE S1.a1 IS NULL;
This was all we had in the early days, and you should not use it today. But you will see it in old code.
These versions of INTERSECT and EXCEPT are due to Itzak Ben-Gan. They make very good use of the UNION and DISTINCT operators to implement set theory definitions.
-- S1 INTERSECT S2 SELECT D.a FROM (SELECT DISTINCT a1 FROM S1 UNION ALL SELECT DISTINCT a2 FROM S2) AS D(a) GROUP BY D.a HAVING COUNT(*) > 1; -- S1 INTERSECT ALL S2 SELECT D2.a FROM (SELECT D1.a, MIN(cnt) AS mincnt FROM (SELECT a1, COUNT(*) FROM S1 GROUP BY a1 UNION ALL SELECT a2, COUNT(*) FROM S2 GROUP BY a2) AS D1(a, cnt) GROUP BY D1.a HAVING COUNT(*) > 1) AS D2 INNER JOIN Series ON seq <= min_cnt; -- S1 EXCEPT ALL S2 SELECT D2.a FROM (SELECT D1.a, SUM(cnt) FROM (SELECT a1, COUNT(*) FROM S1 GROUP BY a1 UNION ALL SELECT a2, -COUNT(*) FROM S2 GROUP BY a2) AS D1(a, cnt) GROUP BY D1.a HAVING SUM(cnt) > 0) AS D2(a, dups) INNER JOIN Series ON seq <= D2.dups;
The Series table is discussed in other places in this book. It is a table of integers from 1 to (n) that is used to replace iteration and counting in SQL. Obviously, (n) has to be large enough for these statements to work.
Here is a series of observations about the relationship between the ALL option in set operations and the SELECT DISTINCT options in a query from Beught Gunne. The multiset model does not behave like the traditional set theory we know from school.
Given two tables with duplicate values:
CREATE TABLE A (i INTEGER NOT NULL); INSERT INTO A VALUES (1), (1), (2), (2), (4), (4); CREATE TABLE B (i INTEGER NOT NULL); INSERT INTO B VALUES (2), (2), (3), (3);
The UNION and INTERSECT operations have regular behavior in that
(A UNION B) = SELECT DISTINCT (A UNION ALL B) = ((1), (2), (3)) and (A INTERSECT B) = SELECT DISTINCT (A INTERSECT ALL B) = (2)
However,
(A EXCEPT B) <> SELECT DISTINCT (A EXCEPT ALL B)
Or more literally, (1) <> ((1), (2)) for the tables given in the example. And likewise, we have
(B EXCEPT A) = SELECT DISTINCT (B EXCEPT ALL A) = (3)
by a coincidence of the particular values used in these tables.
At one point, when SQL was still in the laboratory at IBM, there was a CONTAINS operator that would tell you if one table was a subset of another. It disappeared in later versions of the language and no vendor picked it up. Set equality was never part of SQL as an operator, so you would have to have used the two expressions ((A CONTAINS B) AND (B CONTAINS A)) to find out.
Today, you can use the methods shown in the section on Relational Division to determine containment or equality. Or you can use set theory expressions, such as NOT EXISTS ((A EXCEPT B) UNION (B EXCEPT A)). Which method is fastest is going to depend on the SQL product and the access method used.
However, Itzak Ben-Gan came up with a novel approach for finding containment and equality that is worth a mention.
SELECT SUM(DISTINCT match_col) FROM (SELECT CASE WHEN S1.col IN (SELECT S2.col FROM S2) THEN 1 ELSE -1 END FROM S1) AS X(match_col) HAVING SUM(DISTINCT match_col) = :n;
You can set (:n) to 1, 0, or − 1 for each particular test.
When I find a matching row in S1, I get a one; when I find a mismatched row in S1, get a − 1 and they sum together to give me a zero. Therefore, S1 is a proper subset of S2. If they sum to one, then they are equal. If they sum to − 1, they are disjoint.
Another trick for equality is based on the principle that if two sets are equal, then they have the same cardinality. This is necessary but not sufficient. They have the same elements. We can use a windowed count(*) to get the cardinality
EXISTS (SELECT a, COUNT(a) OVER() AS alpha_cardinality FROM Alphas EXCEPT SELECT b, COUNT(b) OVER() AS beta_cardinality FROM Betas)
18.118.9.197