Chapter 16

Set Operations

Abstract

Since SQL is based on sets, it includes classic set operations, with extensions for multisets.

Keywords

ALL

DISTINCT

EXCEPT

EXCEPT ALL

INTERSECT

INTERSECT ALL

Multisets (bags)

Proper subsets

Redundant duplicates

Set equality

UNION

UNION ALL

Union compatible

By set operations, I mean union, intersection, and set differences where the sets in SQL are tables. These are the basic operators used in elementary set theory, which has been taught in the United States public school systems for decades. Since the relational model is based on sets, you would expect that SQL would have had a good variety of set operators. But there is a problem in SQL that you did not have in high school set theory. SQL tables are multisets (also called bags), which means that, unlike sets, they allow duplicate elements (rows or tuples). SQL handles these duplicate rows with an ALL or DISTINCT modifier in different places in the language; ALL preserves duplicates and DISTINCT removes them.

So that we can discuss the result of each operator formally, let R be a row that is a duplicate of some row in TableA, or of some row in TableB, or of both. Let m be the number of duplicates of R in TableA and let n be the number of duplicates of R in TableB, where (m >=0) and (n >= 0). Informally, the engines will pair off the two tables on a row-per-row basis in set operations. We will see how this works for each operator.

For the rest of this discussion, let us create two skeleton tables with the same structure, which we can use for examples.

CREATE TABLE S1 (a1 CHAR(1));
INSERT INTO S1 
VALUES ('a'), ('a'), ('b'), ('b'), ('c'),
CREATE TABLE S2 (a2 CHAR(1));
INSERT INTO S2 
VALUES ('a'), ('b'), ('b'), ('b'), ('c'), ('d'),

16.1 UNION and UNION ALL

UNIONs have been supported since SQL-86, with this infixed syntax:

< table expression > UNION [ALL] < table expression >

The two versions of the UNION statement take two tables and build a result table from them. The two tables must be union compatible, which means that they have exactly the same number of columns, and that each column in the first table has the same data type (or automatically cast to it) as the column in the same position in the second table. That is, their rows have the same structure, so they can be put in the same final result table. Most implementations will do some data type conversions to create the result table, but this can be implementation-dependent and you should check it out for yourself.

There are two forms of the UNION statement: the UNION and the UNION ALL. The simple UNION is the same operator you had in high school set theory; it returns the rows that appear in either or both tables and removes redundant duplicates from the result table.

The phrase "redundant duplicates" sounds funny; but it means that you leave one copy of the row in the table. The problem is that string equality in SQL works by padding the shorter string with trailing blanks, then comparing them position by position. Which string is retained is implementation defined.

The sample tables will yield:

(SELECT a1 FROM S1
UNION
SELECT a2 FROM S2)

a

b

c

d

In many early SQL implementations, merge-sorting the two tables discarding duplicates during the sorting did this removal. This had the side effect that the result table is sorted, but you cannot depend on that. Later implementations use hashing, indexing and parallel processing to find the duplicates.

The UNION ALL preserves the duplicates from both tables in the result table. Most early implementations simply appended one table to the other in physical storage. They used file systems based on physically contiguous storage, so this was easy and used the file system code. But, again, you cannot depend on any ordering in the results of either version of the UNION statement. Again, the sample tables will yield:

(SELECT a1 FROM S1
UNION ALL
SELECT a2 FROM S2)

a

a

a

b

b

b

b

b

c

c

d

You can assign names to the columns by using the AS operator to make the result set into a derived table, thus:

SELECT rent, utilities, phone
FROM (SELECT a, b, c 
FROM Old_Locations 
WHERE city_name = 'Boston'
UNION
SELECT x, y, z 
FROM New_Locations
WHERE city_name = 'New York')
 AS Cities (rent, utilities, phone);

A few SQL products will attempt to optimize UNIONs if they are made on the same table. Those UNIONs can often be replaced with OR-ed predicates. For example:

SELECT city_name 'Western'
FROM Cities
WHERE market_code = 't'
UNION ALL
SELECT city_name 'Eastern'
FROM Cities
WHERE market_code = 'v'; 

could be rewritten (probably more efficiently) as follows:

SELECT city_name
CASE market_code
WHEN 't' THEN 'Western'
WHEN 'v' THEN 'Eastern' END
FROM Cities
WHERE market_code IN ('v', 't'),

It takes system architecture based on domains rather than tables to optimize UNIONs if they are made on different tables.

Doing a UNION to the same table is the same as a SELECT DISTINCT, but the SELECT DISTINCT will probably run faster and preserve the column names too.

16.1.1 Order of Execution

UNION and UNION ALL operators are executed from left to right unless parentheses change the order of execution. Since the UNION operator is associative and commutative, the order of a chain of UNIONs will not affect the results. However, order and grouping can affect performance. Consider two tables that have many duplicates between them. If the optimizer does not consider table sizes, this query

(SELECT * FROM Small_Table1)
UNION
(SELECT * FROM Big_Table)
UNION
(SELECT * FROM Small_Table2);

will merge Small_Table1 into Big_Table, and then merge Small_Table2 into that first result. If the rows of Small_Table1 are spread out in the first result table, locating duplicates from Small_Table2 will take longer than if we had written the query thus:

(SELECT * FROM Small_Table1)
UNION
(SELECT * FROM Small_Table2)
UNION
(SELECT * FROM Big_Table);

Again, optimization of UNIONs is highly product-dependent, so you should experiment with it.

16.1.2 Mixed UNION and UNION ALL Operators

If you know that there are no duplicates, or that duplicates are not a problem in your situation, use UNION ALL instead of UNION for speed. For example, if we are sure that Big_Table has no duplicates in common with Small_Table1 and Small_Table2, this query will produce the same results as before but should run much faster:

((SELECT * FROM Small_Table1)
UNION
(SELECT * FROM Small_Table2)) -– intermediate set
UNION ALL
(SELECT * FROM Big_Table);

But be careful when mixing UNION and UNION ALL operators. The left-to-right order of execution will cause the last operator in the chain to have an effect on the results.

16.1.3 UNION of Columns from the Same Table

A useful trick for building the union of columns from the same table is to use a CROSS JOIN and a CASE expression, thus

SELECT CASE WHEN S1.seq = 1 THEN F1.col1
WHEN S1.seq = 2 THEN F1.col2
ELSE NULL END
FROM Foobar AS F1
CROSS JOIN
Series AS S1(seq)
 WHERE S1.seq IN (1, 2)

This acts like the UNION ALL statement, but change the SELECT to SELECT DISTINCT and you have a UNION. The advantage of this statement over the more obvious UNION is that it makes one pass through the table. Given a large table, that can be important for good performance.

16.2 INTERSECT and EXCEPT

The INTERSECT and EXCEPT set operators take two tables and build a new table from them. The two tables must be union compatible, just like with the UNION [ALL] operator. Like the UNION [ALL], the result of an INTERSECT [ALL] or EXCEPT [ALL] should use an AS operator if you want to have names for the result table and its columns.

Oracle was the first major vendor to have the EXCEPT operator with the keyword MINUS. The set difference is the rows in the first table, except for those that also appear in the second table. It answers questions like "Give me all the employees except the salesmen" in a natural manner.

Let's take our two multisets and use them to explain the basic model, by making a mapping between them:

S1 =  {a, a, b, b, c }
↕ ↕ ↕ ↕
S2 =  {a, b, b, b, c, d}

The INTERSECT and EXCEPT operators remove all duplicates from both sets, so we would have:

S1 = {a, b, c }
↕ ↕ ↕
S2 = {a, b, c, d}

and therefore,

S1 INTERSECT S2 = {a, b, c}

and

S2 EXCEPT S1 = {d}
S1 EXCEPT S2 = {}

When you add the ALL option, things are trickier. The mapped pairs become the unit of work. The INTERSECT ALL keeps each pairing, so that

S1 INTERSECT ALL S2 = {a, b, b, c}

and the EXCEPT ALL throws them away, retain what is left in the first set, thus.

S2 EXCEPT ALL S1 = {b, d}

Trying to write the INTERSECT and EXCEPT with other operators is trickier than it looks. It has to be general enough to handle situations where there is no key available and the number of columns is not known.

Standard SQL defines the actions for duplicates in terms of the count of duplicates of matching rows. Let (m) be the number of rows of one kind in S1 and (n) be the number in S2. The UNION ALL will have (m + n) copies of the row. The INTERSECT ALL will have LEAST(m, n) copies. EXCEPT ALL will have the greater of either the first table's count minus the second table's count or zero copies.

The immediate impulse of a programmer is to write the code with EXISTS() predicates. The bad news is that it does not work because of NULLs. This is easier to show with code. Let's redo our two sample tables.

CREATE TABLE S1 (a1 CHAR(1));
INSERT INTO S1 
VALUES ('a'), ('a'), ('b'), ('b'), ('c'), (NULL), (NULL);
CREATE TABLE S2 (a2 CHAR(1));
INSERT INTO S2 
VALUES ('a'), ('b'), ('b'), ('b'), ('c'), ('d'), (NULL);

Now build a view to hold the duplication_cnt of each value in each table.

CREATE VIEW DupCounts (a, s1_dup, s2_dup)
AS
SELECT S.a, SUM(s1_dup), SUM(s2_dup)
FROM (SELECT S1.a1, 1, 0
FROM S1
UNION ALL
 SELECT S2.a2, 0, 1
FROM S2) AS S(a, s1_dup, s2_dup)
GROUP BY S.a, s1_dup, s2_dup;

The GROUP BY will put the NULLs into a separate group giving them the right tallies. Now code is a straightforward implementation of the definitions in Standard SQL.

-- S1 EXCEPT ALL S2
SELECT DISTINCT D1.a, (s1_dup - s2_dup) AS dups
FROM DupCounts AS D1, 
Series AS S1
 WHERE S1.seq <= (s1_dup - s2_dup);
-- S1 INTERSECT ALL S2
SELECT DISTINCT D1.a, 
CASE WHEN s1_dup <= s2_dup
THEN s1_dup ELSE s2_dup END
AS duplication_cnt
FROM DupCounts AS D1, 
Series AS S1
WHERE S1.seq <= CASE WHEN s1_dup <= s2_dup
 THEN s1_dup ELSE s2_dup END;

Notice that we had to use SELECT DISTINCT. Without it, the sample data will produce this table.

aduplication_cnt
NULL1
a1
b2
b2 entity redundant row
c1

The nonduplicated versions are easy to write from the definitions in the Standards. In effect their duplication tallies are set to one.

-- S1 INTERSECT S2
SELECT D1.a
FROM DupCounts AS D1
 WHERE s1_dup > 0
AND s2_dup > 0;
-- S1 EXCEPT S2
SELECT D1.a
FROM DupCounts AS D1
 WHERE s1_dup > 0
AND s2_dup = 0;
-- S2 EXCEPT S1
SELECT D1.a
 FROM DupCounts AS D1
WHERE s2_dup > 0
AND s1_dup = 0;

16.2.1 INTERSECT and EXCEPT Without NULLs and Duplicates

INTERSECT and EXCEPT are much easier if each of the two tables does not have NULLs and duplicate values in them. Intersection is simply

SELECT *
FROM S1
WHERE EXISTS
(SELECT *
FROM S2
WHERE S1.a1 = S2.a2);

or

SELECT *
FROM S2
WHERE EXISTS
(SELECT *
FROM S1
 WHERE S1.a1 = S2.a2);

You can also use

SELECT DISTINCT S2.*
FROM (S2 INNER JOIN S1 ON S1.a1 = S2.a2);

This is given as a motivation for the next piece of code, but you may find that some SQL engines do joins faster than EXISTS() predicates and vice versa, so it is a good idea to have more than one trick in your bag.

The set difference can be written with an OUTER JOIN operator. This code is due to Jim Panttaja.

SELECT DISTINCT S2.*
FROM (S2 LEFT OUTER JOIN S1
ON S1.a1 = S2.a2)
WHERE S1.a1 IS NULL;

This was all we had in the early days, and you should not use it today. But you will see it in old code.

16.2.2 INTERSECT and EXCEPT with NULLs and Duplicates

These versions of INTERSECT and EXCEPT are due to Itzak Ben-Gan. They make very good use of the UNION and DISTINCT operators to implement set theory definitions.

-- S1 INTERSECT S2 
SELECT D.a
FROM (SELECT DISTINCT a1 FROM S1
UNION ALL
SELECT DISTINCT a2 FROM S2) AS D(a)
GROUP BY D.a
HAVING COUNT(*) > 1;
-- S1 INTERSECT ALL S2 
SELECT D2.a
FROM (SELECT D1.a, MIN(cnt) AS mincnt
FROM (SELECT a1, COUNT(*) 
FROM S1
 GROUP BY a1
 UNION ALL
 SELECT a2, COUNT(*)
 FROM S2
 GROUP BY a2) AS D1(a, cnt)
 GROUP BY D1.a
 HAVING COUNT(*) > 1) AS D2
INNER JOIN 
Series 
ON seq <= min_cnt;
-- S1 EXCEPT ALL S2 
SELECT D2.a
FROM (SELECT D1.a, SUM(cnt)
FROM (SELECT a1, COUNT(*) 
FROM S1
 GROUP BY a1
 UNION ALL
 SELECT a2, -COUNT(*)
FROM S2
 GROUP BY a2) 
 AS D1(a, cnt)
GROUP BY D1.a
 HAVING SUM(cnt) > 0) 
 AS D2(a, dups)
 INNER JOIN 
 Series ON seq <= D2.dups;

The Series table is discussed in other places in this book. It is a table of integers from 1 to (n) that is used to replace iteration and counting in SQL. Obviously, (n) has to be large enough for these statements to work.

16.3 A Note on ALL and SELECT DISTINCT

Here is a series of observations about the relationship between the ALL option in set operations and the SELECT DISTINCT options in a query from Beught Gunne. The multiset model does not behave like the traditional set theory we know from school.

Given two tables with duplicate values:

CREATE TABLE A (i INTEGER NOT NULL);
INSERT INTO A VALUES (1), (1), (2), (2), (4), (4);
CREATE TABLE B (i INTEGER NOT NULL);
INSERT INTO B VALUES (2), (2), (3), (3);

The UNION and INTERSECT operations have regular behavior in that

(A UNION B) = SELECT DISTINCT (A UNION ALL B) = ((1), (2), (3))
and
(A INTERSECT B) = SELECT DISTINCT (A INTERSECT ALL B) = (2)

However,

(A EXCEPT B) <> SELECT DISTINCT (A EXCEPT ALL B)

Or more literally, (1) <> ((1), (2)) for the tables given in the example. And likewise, we have

  (B EXCEPT A) = SELECT DISTINCT (B EXCEPT ALL A) = (3)

by a coincidence of the particular values used in these tables.

16.4 Equality and Proper Subsets

At one point, when SQL was still in the laboratory at IBM, there was a CONTAINS operator that would tell you if one table was a subset of another. It disappeared in later versions of the language and no vendor picked it up. Set equality was never part of SQL as an operator, so you would have to have used the two expressions ((A CONTAINS B) AND (B CONTAINS A)) to find out.

Today, you can use the methods shown in the section on Relational Division to determine containment or equality. Or you can use set theory expressions, such as NOT EXISTS ((A EXCEPT B) UNION (B EXCEPT A)). Which method is fastest is going to depend on the SQL product and the access method used.

However, Itzak Ben-Gan came up with a novel approach for finding containment and equality that is worth a mention.

SELECT SUM(DISTINCT match_col)
FROM (SELECT CASE 
WHEN S1.col
IN (SELECT S2.col FROM S2)
THEN 1 ELSE -1 END
FROM S1) AS X(match_col)
HAVING SUM(DISTINCT match_col) = :n;

You can set (:n) to 1, 0, or − 1 for each particular test.

When I find a matching row in S1, I get a one; when I find a mismatched row in S1, get a − 1 and they sum together to give me a zero. Therefore, S1 is a proper subset of S2. If they sum to one, then they are equal. If they sum to − 1, they are disjoint.

Another trick for equality is based on the principle that if two sets are equal, then they have the same cardinality. This is necessary but not sufficient. They have the same elements. We can use a windowed count(*) to get the cardinality

EXISTS (SELECT a, COUNT(a) OVER() AS alpha_cardinality
FROM Alphas
EXCEPT
SELECT b, COUNT(b) OVER() AS beta_cardinality
FROM Betas)
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.144.28.70