CHAPTER 4

image

Querying from Multiple Tables

by Jonathan Gennick

It is the rare database that has all its data in a single table. Data tends to be spread over multiple tables in ways that optimize storage and ensure consistency and integrity. Part of your job when writing a query is to deploy and link together T-SQL operations that can operate across tables in order to generate needed business results.

4-1. Correlating Parent and Child Rows

Problem

You want to bring together data from parent and child tables. For example, you have a list of people in a parent table named Person, and a list of phone numbers in a child table named PersonPhone. Each person may have zero, one, or several phone numbers. You want to return a list of each person having at least one phone number, along with all their numbers.

Image Note  It is also possible to return all persons, including those having zero phone numbers. Recipe 4-3 shows how.

Solution

Write an inner join to bring related information from two tables together into a single result set. Begin with a FROM clause and one of the tables:

FROM Person.Person

Add the keywords INNER JOIN followed by the second table:

FROM    Person.Person
        INNER JOIN Person.PersonPhone

Follow with an ON clause to specify the join condition. The join condition identifies the row combinations of interest. It is the BusinessEntityID that identifies a person. That same ID indicates the phone numbers for a person. For this example, you want all combinations of Person and PersonPhone rows sharing the same value for BusinessEntityID. The following ON clause gives that result:

FROM    Person.Person
        INNER JOIN Person.PersonPhone
            ON Person.BusinessEntityID = PersonPhone.BusinessEntityID

Specify the columns you wish to see in the output. All columns from both tables are available. The following final version of the query returns two columns from each table:

SELECT  PersonPhone.BusinessEntityID,
        FirstName,
        LastName,
        PhoneNumber
FROM    Person.Person
        INNER JOIN Person.PersonPhone
            ON Person.BusinessEntityID = PersonPhone.BusinessEntityID
ORDER BY LastName,
        FirstName,
        Person.BusinessEntityID;

The ORDER BY clause sorts the results so that all phone numbers for a given person group together. Results are as follows:

BusinessEntityID FirstName  LastName     PhoneNumber
---------------- ---------- ------------ -------------
             285 Syed       Abbas        926-555-0182
             293 Catherine  Abel         747-555-0171
              38 Kim        Abercrombie  208-555-0114
             295 Kim        Abercrombie  334-555-0137
            2170 Kim        Abercrombie  919-555-0100
             211 Hazem      Abolrous     869-555-0125
            2357 Sam        Abolrous     567-555-0100
             297 Humberto   Acevedo      599-555-0127
...

How It Works

The inner join is fundamental. Imagine the following, very simplified, two tables:

Person                                    PersonPhone
BusinessEntityID FirstName  LastName      BusinessEntityID PhoneNumber
---------------- ---------- --------      ---------------- ------------
             285 Syed       Abbas         285              926-555-0182
             293 Catherine  Abel          293              747-555-0171

From a conceptual standpoint, an inner join begins with all possible combinations of rows from the two tables. Some combinations make sense. Some do not. The set of all possible combinations is called the Cartesian product. Notice the bold rows in the following Cartesian product.

BusinessEntityID FirstName  LastName BusinessEntityID PhoneNumber
---------------- ---------- -------- ---------------- ------------
             285 Syed       Abbas    285              926-555-0182
             285 Syed       Abbas    293              747-555-0171
             293 Catherine  Abel     293              747-555-0171
             293 Catherine  Abel     285              926-555-0182

It makes sense to have Syed’s name in the same row as his phone number. Likewise, it is sensible to list Catherine with her phone number. There’s no logic at all in listing Syed’s name with Catherine’s number, or vice versa. Thus, the join condition is very sensibly written to specify the case in which the two BusinessEntityID values are the same:

ON Person.BusinessEntityID = PersonPhone.BusinessEntityID

The Cartesian product gives all possible results from an inner join. Picture the Cartesian product in your mind. Bring in the fishnet analogy from Recipe 1-4. Then write join conditions to trap the rows that you care about as the rest of the Cartesian product falls through your net.

Image Note  Database engines do not materialize the entire Cartesian product when executing an inner join. There are more efficient approaches for SQL Server to take. However, regardless of approach, the results will always be in line with the conceptual description given here in this recipe.

THE TERM “RELATIONAL”

One sometimes hears the claim that the word relational in relational database refers to the fact that one table can “relate” to another, in the sense that one joins the two tables together, as described in Recipe 4-1. That explanation sounds so very plausible, yet it is incorrect.

The term relation comes from set theory, and you can read in detail about what a relation is by visiting Wikipedia’s article on finitary relations:

http://en.wikipedia.org/wiki/Finitary_relation

The key statement from the current version of this article reads as follows (emphasis added):

“Typically, the property [a relation] describes a possible connection between the components of a k-tuple.”

The words “between the components of” tell the tale. A tuple’s analog is the row. The components of a tuple are its values, and thus the database analog would be the values in a row. The term relation speaks to a relationship, not between tables, but between the values in a row.

We encourage you to read the Wikipedia article. Then if you really want to dive deeper into set theory and how it can help you work with data, we recommend you read the book Applied Mathematics for Database Professionals by Lex de Haan and Toon Koppelaars (Apress, 2007).

4-2. Querying Many-to-Many Relationships

Problem

You have a many-to-many relationship with two detail tables on either side of an intersection table. You want to resolve the relationship across all three tables.

Solution

String two inner joins together. The following example joins three tables in order to return discount information on a specific product:

SELECT  p.Name,
        s.DiscountPct
FROM    Sales.SpecialOffer s
        INNER JOIN Sales.SpecialOfferProduct o
            ON s.SpecialOfferID = o.SpecialOfferID
        INNER JOIN Production.Product p
            ON o.ProductID = p.ProductID
WHERE   p.Name = 'All-Purpose Bike Stand';

The results of this query are as follows:

Name                                                         DiscountPct
-----------------------                                      ------------
All-Purpose Bike Stand                                              0.00

How It Works

A join starts after the first table in the FROM clause. In this example, three tables are joined together: Sales.SpecialOffer, Sales.SpecialOfferProduct, and Production.Product. The first table referenced in the FROM clause, Sales.SpecialOffer, contains a lookup of sales discounts:

FROM Sales.SpecialOffer s

Notice the letter s that trails the table name. This is a table alias. Once you begin using more than one table in a query, it is important to explicitly identify the data source of the individual columns. If the same column names exist in two different tables, you can get an error from the SQL compiler asking you to clarify which column you really want to return.

As a best practice, it is a good idea to use aliases whenever column names are specified in a query. For each of the referenced tables, an alias is used to symbolize the table name, saving you the trouble of spelling it out each time. This query uses a single character as a table alias, but you can use any valid identifier. A table alias, aside from allowing you to shorten or clarify the original table name, allows you to swap out the base table name if you ever have to replace it with a different table or view, or if you need to self-join the tables. Table aliases are optional, but recommended when your query has more than one table. (Because table aliases are optional, you can instead specify the entire table name every time you refer to the column in that table.)

Getting back to the example, the INNER JOIN keywords follow the first table reference, and then the table being joined to it is named, followed by its alias:

INNER JOIN Sales.SpecialOfferProduct o

After that, the ON keyword prefaces the column joins:

ON

This particular INNER JOIN is based on the equality of two columns, one from the first table and another from the second:

s.SpecialOfferID = o.SpecialOfferID

Next, the Production.Product table is inner joined as well:

INNER JOIN Production.Product p
    ON o.ProductID = p.ProductID

Finally, a WHERE clause is used to filter rows returned in the final result set:

WHERE   p.Name = 'All-Purpose Bike Stand';

4-3. Making One Side of a Join Optional

Problem

You want rows returned from one table in a join even when there are no corresponding rows in the other table. For example, you want to list states and provinces and their tax rates. Sometimes no tax rate is on file. In those cases, you still want to list the state or province.

Solution

Write an outer join rather than the inner join that you have seen in the recipes so far. You can designate an outer join as being either left or right. Following is a left outer join that produces a list of all states and provinces, including tax rates when they are available.

SELECT  s.CountryRegionCode,
        s.StateProvinceCode,
        t.TaxType,
        t.TaxRate
FROM    Person.StateProvince s
        LEFT OUTER JOIN Sales.SalesTaxRate t
            ON s.StateProvinceID = t.StateProvinceID;

This returns the following results:

CountryRegionCode StateProvinceCode TaxType               TaxRate
----------------- ----------------- ------- ---------------------
CA                AB                      1                 14.00
CA                AB                      2                  7.00
US                AK                   NULL                  NULL
US                AL                   NULL                  NULL
US                AR                   NULL                  NULL
AS                AS                   NULL                  NULL
US                AZ                      1                  7.75
CA                BC                      3                  7.00
...

How It Works

A left outer join causes the table named first to become the nonoptional table, or what I often term as the anchor table. The word “left” derives from the fact that English is written left to right. The left outer join in the solution makes StateProvince the anchor table, so all states are returned. The sales tax side of the join is then the optional side, and the database engine supplies nulls for the sales tax columns when no corresponding row exists for each state in question.

Change the join type in the solution from LEFT OUTER to INNER, and you’ll get only those rows for states that have tax rates defined in the SalesTaxRate table. This is because an inner join requires a row from each table involved. By making the join a left outer join, you make the right-hand table optional. Rows from the left-hand table are returned regardless of whether corresponding rows exist in the other table. Thus, you get all states and provinces; lack of a tax rate does not prevent a state or province from appearing in the results.

It is common to write outer joins with one optional table as left outer joins. However, you do have the option of looking at things from the other direction. For example:

FROM    Sales.SalesTaxRate t
        RIGHT OUTER JOIN Person.StateProvince s

This right outer join will yield the same results as the left outer join in the solution example. That’s because the order of the tables has been flipped. StateProvince is now on the right-hand side, and it is the anchor table once again because this time a right outer join is used.

Image Tip  Experiment! Take time to execute the solution query. Then change the join clause to read INNER JOIN. Note the difference in results. Then change the entire FROM clause to use a right outer join with the StateProvince table on the right-hand side. You should get the same results as you got from the solution query.

4-4. Making Both Sides of a Join Optional

Problem

You want the effect of a left and a right outer join at the same time.

Solution

Write a full outer join. Do that using the keywords FULL OUTER JOIN. For example:

SELECT  soh.SalesOrderID,
        sr.SalesReasonID,
        sr.Name
FROM    Sales.SalesOrderHeader soh
        FULL OUTER JOIN Sales.SalesOrderHeaderSalesReason sohsr
            ON soh.SalesOrderID = sohsr.SalesOrderID
        FULL OUTER JOIN Sales.SalesReason sr
            ON sr.SalesReasonID = sohsr.SalesReasonID;

This query follows the same pattern as that seen in Recipe 4-3 on querying many-to-many relationships. Only the join type and tables are different.

How It Works

The solution query returns sales orders and their associated reasons. The full outer join in the query guarantees the following:

  • All the results from an inner join
  • One additional row for each order not associated with a sale
  • One additional row for each sales reason not associated with an order

The additional rows have nulls from one side of the join or the other. If there is no order associated with a reason, then there is no value available for the SalesOrderID column in the result, and the value is null. Likewise, the SalesReasonID and Name values are null in the case of an order having no reason.

Results are as follows for orders associated with reasons:

SalesOrderID SalesReasonID Name
------------ ------------- -------------
       43697             5 Manufacturer
       43697             9 Quality
       43702             5 Manufacturer
...

Any reasons not associated with an order will come back with nulls in the order columns:

SalesOrderID SalesReasonID Name
------------ ------------- ----------------------
        NULL             3 Magazine Advertisement
        NULL             7 Demo Event
        NULL             8 Sponsorship
...

Any orders not given a reason will likewise come back with nulls in the reason columns:

SalesOrderID SalesReasonID Name
------------ ------------- ----
       45889          NULL NULL
       48806          NULL NULL
       51723          NULL NULL
...

All the preceding results will come back as a single result set.

4-5. Generating All Possible Row Combinations

Problem

You want to generate all possible combinations of rows from two tables. You want to generate the Cartesian product described in Recipe 4-1.

Solution

Write a cross join. In this example, the Person.StateProvince and Sales.SalesTaxRate tables are cross joined to generate all possible combinations of rows from the two tables:

SELECT  s.CountryRegionCode,
        s.StateProvinceCode,
        t.TaxType,
        t.TaxRate
FROM    Person.StateProvince s
        CROSS JOIN Sales.SalesTaxRate t;

This returns the following (abridged) results:

CountryRegionCode StateProvinceCode TaxType               TaxRate
----------------- ----------------- ------- ---------------------
CA                AB                      1                 14.00
US                AK                      1                 14.00
US                AL                      1                 14.00
...

How It Works

A cross join is essentially a join with no join conditions. Every row from one table is joined to every row in the other table, regardless of whether the resulting combination of values makes any sense. The result is termed a Cartesian product.

The solution results show StateProvince and SalesTaxRate information that doesn’t logically go together. Because the Person.StateProvince table had 181 rows, and the Sales.SalesTaxRate had 29 rows, the query returned 5249 rows.

4-6. Selecting from a Result Set

Problem

You find it easier to think in terms of selecting a set of rows, and then selecting again from that result set.

Solution

Create a derived table in your FROM clause by enclosing a SELECT statement within parentheses. For example, the following query joins SalesOrderHeader to the results from a query against SalesOrderDetail:

SELECT DISTINCT
        s.PurchaseOrderNumber
FROM    Sales.SalesOrderHeader s
        INNER JOIN (SELECT  SalesOrderID
                    FROM    Sales.SalesOrderDetail
                    WHERE   UnitPrice BETWEEN 1000 AND 2000
                   ) d
            ON s.SalesOrderID = d.SalesOrderID;

This returns the following abridged results:

PurchaseOrderNumber
-------------------
PO1595126190
PO9077115532
PO13340115824
PO11861162351
PO9222123146

...

How It Works

Derived tables are SELECT statements that act as tables in the FROM clause. A derived table is a separate query in itself and doesn’t require the use of a temporary table to store its results. Thus, queries that use derived tables can sometimes perform significantly better than the process of building a temporary table and querying from it, as you eliminate the steps needed for SQL Server to create and allocate a temporary table prior to use.

This example’s query returns the PurchaseOrderNumber from the Sales.SalesOrderHeader table for any order containing products with a UnitPrice between 1000 and 2000. The query joins a table to a derived table using an inner join operation. The derived table query is encapsulated in parentheses and is followed by a table alias.

4-7. Introducing New Columns

Problem

You are writing a query that returns an expression, and you seem to be stuck repeating the expression in multiple clauses of the query. You’d prefer to state the computation just one time.

Image Note  Thanks to Brad Schulz for graciously allowing us to copy the example in this recipe from his blog post about CROSS APPLY. It’s a wonderful post. We encourage you to read it at: http://bradsruminations.blogspot.com/2011/04/t-sql-tuesday-017-it-slices-it-dices-it.html

Solution

There’s a clever way to use CROSS APPLY that can sometimes help you avoid the need to redundantly write the expression for a computed column. For example, the following is a GROUP BY query with a computed column. You can see how part of the expression behind the column appears in the query redundantly three times:

SELECT
   DATENAME(MONTH,
      DATEADD(MONTH,
         DATEDIFF(MONTH,'19000101',OrderDate),
         '19000101')
   ) AS Mth,
   SUM(TotalDue) AS Total
FROM Sales.SalesOrderHeader
WHERE OrderDate>='20120101'
  AND OrderDate<'20140101'
GROUP BY DATEADD(MONTH,
            DATEDIFF(MONTH,'19000101',OrderDate),
            '19000101')
ORDER BY DATEADD(MONTH,
            DATEDIFF(MONTH,'19000101',OrderDate),
            '19000101'),

Redundancy is undesirable. It clutters the query and increases the chance of error. Any change or correction to the expression logic must be repeated three times or else the query won’t produce correct results. You can eliminate the undesired redundancy by placing the expression logic into a cross-joined subquery. Here’s how to do that:

SELECT DATENAME(MONTH,FirstDayOfMth) AS Mth,
       SUM(TotalDue) AS Total
FROM Sales.SalesOrderHeader
CROSS APPLY (
   SELECT DATEADD(MONTH,
             DATEDIFF(MONTH,'19000101',OrderDate),
             '19000101') AS FirstDayOfMth
   ) F_Mth
where OrderDate>='20120101'
  and OrderDate<'20140101'
group by FirstDayOfMth
order by FirstDayOfMth

How It Works

The queries in this example each group the sales by month. Results in both cases are:

Mth                                            Total
------------------------------ ---------------------
January                                 1462448.8986
February                                2749104.6546
March                                   2350568.1264
April                                   1727689.5793
May                                      3299799.233
June                                    1920506.6177
July                                    3253418.7629
August                                  4663508.0154
September                               3638980.3689
October                                 2488758.6715
November                                3809633.4035
December                                3099432.1035

The second query simplifies the logic by placing the column expression into a subquery. The execution plan is the same in both cases, so there is no performance impact, only an increase in readability and maintainability.

The subquery is a correlated subquery returning one row. Each individual row from the main query is cross-joined to the subquery. All possible combinations of one row and one row work out to be just one row. The same number of rows are thus returned, but those rows now each contain the computed column. Then the results are grouped and summed by month.

4-8. Testing for the Existence of a Row

Problem

You are writing a WHERE clause. You want to return rows from the table you are querying based upon the existence of related rows in some other table.

Solution

One solution is to write a subquery in conjunction with the EXISTS predicate:

SELECT  s.PurchaseOrderNumber
FROM    Sales.SalesOrderHeader s
WHERE   EXISTS ( SELECT SalesOrderID
                 FROM   Sales.SalesOrderDetail sod
                 WHERE  sod.UnitPrice BETWEEN 1000 AND 2000
                        AND sod.SalesOrderID = s.SalesOrderID );

This returns the following abridged results.

PurchaseOrderNumber
--------------------
PO12586178184
PO10440182311
PO13543119495
PO12586169040
PO2146113582
...

How It Works

The critical piece in the solution example is the subquery in the WHERE clause, which checks for the existence of SalesOrderIDs that have products with a UnitPrice between 1000 and 2000. A JOIN is essentially written into the WHERE clause of the subquery by stating sod.SalesOrderID = s.SalesOrderID. The subquery uses the SalesOrderID from each returned row in the outer query.

The subquery in this recipe is known as a correlated subquery. It is so called because the subquery accesses values from the parent query. It is certainly possible to write an EXISTS predicate with a noncorrelated subquery, however, it is unusual to do so.

Look back at Recipe 4-6. It solves the same problem and generates the same results, but uses a derived table in the FROM clause. Often you can solve such problems multiple ways. Pick the one that performs best. If performance is equal, then pick the approach with which you are most comfortable.

4-9. Testing Against the Result from a Query

Problem

You are writing a WHERE clause and wish to write a predicate involving the result from another query. For example, you wish to compare a value in a table against the maximum value in a related table.

Solution

Write a noncorrelated subquery. Make sure it returns a single value. Put the query where you would normally refer to the value. For example:

SELECT  BusinessEntityID,
        SalesQuota AS CurrentSalesQuota
FROM    Sales.SalesPerson
WHERE   SalesQuota = (SELECT    MAX(SalesQuota)
                      FROM      Sales.SalesPerson
                     );

This returns the three salespeople who reached the maximum sales quota of 300,000:

BusinessEntityID     CurrentSalesQuota
---------------- ---------------------
             275             300000.00
             279             300000.00
             284             300000.00
Warning: Null value is eliminated by an aggregate or other SET operation.

How It Works

There is no WHERE clause in the subquery, and the subquery does not reference values from the parent query. It is therefore not a correlated subquery. Instead, the maximum sales quota is retrieved once. That value is used to evaluate the WHERE clause for all rows tested by the parent query.

Ignore the warning message in the results. That message simply indicates that some of the SalesQuota values fed into the MAX function were null. You can avoid the message by adding WHERE SalesQuota IS NOT NULL to the subquery. You can also avoid the message by issuing the command set ANSI_WARNINGS OFF. However, there is no real need to avoid the message unless it offends your sense of tidiness to see it.

4-10. Stacking Two Row Sets Vertically

Problem

You are querying the same data from two different sources. You wish to combine the two sets of results. For example, you wish to combine current with historical sales quotas.

Solution

Write two queries. Glue them together with the UNION ALL operator. For example:

SELECT  BusinessEntityID,
        GETDATE() QuotaDate,
        SalesQuota
FROM    Sales.SalesPerson
WHERE   SalesQuota > 0
UNION ALL
SELECT  BusinessEntityID,
        QuotaDate,
        SalesQuota
FROM    Sales.SalesPersonQuotaHistory
WHERE   SalesQuota > 0
ORDER BY BusinessEntityID DESC,
        QuotaDate DESC;

Results are as follows.

BusinessEntityID QuotaDate                          SalesQuota
---------------- ----------------------- ---------------------
             290 2012-02-09 00:04:39.420             250000.00
             290 2008-04-01 00:00:00.000             908000.00
             290 2008-01-01 00:00:00.000             707000.00
             290 2007-10-01 00:00:00.000            1057000.00
...

How It Works

The solution query appends two result sets into a single result set. The first result set returns the BusinessEntitylD, the current date, and the SalesQuota. Because GETDATE()is a function, it doesn’t naturally generate a column name, so a QuotaDate column alias was used in its place:

SELECT  BusinessEntityID,
        GETDATE() QuotaDate,
        SalesQuota
FROM    Sales.SalesPerson

The WHERE clause filters data for those salespeople with a SalesQuota greater than zero:

WHERE   SalesQuota > 0

The next part of the query is the UNION ALL operator, which appends all results from the second query:

UNION ALL

The second query pulls data from the Sales.SalesPersonQuotaHistory, which keeps the history for a salesperson’s sales quota as it changes through time:

SELECT  BusinessEntityID,
        QuotaDate,
        SalesQuota
FROM    Sales.SalesPersonQuotaHistory

The ORDER BY clause sorts the result set by BusinessEntitylD and QuotaDate, both in descending order. The ORDER BY clause, when needed, must appear at the bottom of the entire statement. In the solution query, the clause is:

ORDER BY BusinessEntityID DESC,
        QuotaDate DESC;

You cannot write individual ORDER BY clauses for each of the SELECTs that you UNION together. ORDER BY can only appear once at the end, and applies to the combined result set.

Column names in the final, combined result set derive from the first SELECT in the overall statement. Thus, the ORDER BY clause should only refer to column names from the first result set.

Image Tip  UNION ALL is more efficient than UNION (described in the next recipe), because UNION ALL does not force a sort or similar operation in support of duplicate elimination. Use UNION ALL whenever possible, unless you really do need duplicate rows in the result set to be eliminated.

4-11. Eliminating Duplicate Values from a Union

Problem

You are writing a UNION query and prefer not to have duplicate rows in the results. For example, you wish to generate a list of unique surnames from among employees and salespersons.

Solution

Write a union query, but omit the ALL keyword and write just UNION instead. For example:

SELECT  P1.LastName
FROM    HumanResources.Employee E
        INNER JOIN Person.Person P1
            ON E.BusinessEntityID = P1.BusinessEntityID
UNION
SELECT  P2.LastName
FROM    Sales.SalesPerson SP
        INNER JOIN Person.Person P2
            ON SP.BusinessEntityID = P2.BusinessEntityID;

Results are as follows.

LastName
------------
Abbas
Abercrombie
Abolrous
Ackerman
Adams
...

How It Works

The behavior of the UNION operator is to remove all duplicate rows. The solution query uses that behavior to generate a list of unique surnames from among the combined group of employees and salespersons.

For large result sets, deduplication can be a very costly operation. It very often involves a sort. If you don’t need to deduplicate your data, or if your data is naturally distinct, write UNION ALL instead and your queries will run more efficiently. (See Recipe 4-10 for an example of UNION ALL).

Image Caution  Do you need your results sorted? Then be sure to write an ORDER BY clause. The solution results appear sorted, but that is a side effect from the deduplication operation. You should not count on such a side effect. The database engine might not drive the sort to completion. Other deduplication logic can be introduced in a future release and break your query. If you need ordering, write an ORDER BY clause into your query.

4-12. Subtracting One Row Set from Another

Problem

You want to subtract one set of rows from another. For example, you want to subtract component ID numbers from a list of product ID numbers so as to find those products that are at the top of the heap and are not themselves part of some larger product.

Solution

Write a union query involving the EXCEPT operator. Subtract products that are components from the total list of products, leaving only those products that are not components. For example:

SELECT  P.ProductID
FROM    Production.Product P
EXCEPT
SELECT  BOM.ComponentID
FROM    Production.BillOfMaterials BOM;
  ProductID
-----------
        378
        710
        879
        856
...

How It Works

EXCEPT begins with the first query and eliminates any rows that are also found in the second. It is considered to be a union operator, although the operation is more along the lines of a subtraction.

In the Adventure Works database, the BillOfMaterials table describes products that are made up of other products. The component products are recorded in the ComponentID column. Thus, subtracting the ComponentID values from the ProductID values in the Product table leaves only those products that are at the top and are not themselves part of some larger product.

Image Note  The EXCEPT operator implicitly deduplicates the final result set.

4-13. Finding Rows in Common Between Two Row Sets

Problem

You have two queries. You want to discover which rows are returned by both. For example, you wish to find products that have incurred both good and poor reviews.

Solution

Write a union query using the INTERSECT keyword. For example:

SELECT  PR1.ProductID
FROM    Production.ProductReview PR1
WHERE   PR1.Rating >= 4
INTERSECT
SELECT  PR1.ProductID
FROM    Production.ProductReview PR1
WHERE   PR1.Rating <= 2;

Results from this query show the one product that has both good and bad reviews:

  ProductID
-----------
        937

How It Works

The INTERSECT operator finds rows in common between two row sets. The solution example defines a good review as one with a rating of 4 and above. A bad review is a rating of 2 and lower. It’s easy to write a separate query to identify products falling into each case. The INTERSECT operator takes the results from both of those simple queries and returns a single result set showing the products—just one in this case—that both queries return.

Image Note  Like the EXCEPT operator, INTERSECT implicitly deduplicates the final results.

Sometimes you’ll find yourself wanting to include other columns in an INTERSECT query, and those columns cause the intersection operation to fail because that operation is performed taking all columns into account. One solution is to treat the intersection query as a derived table and join it to the Product table. For example:

SELECT  PR3.ProductID,
        PR3.Name
FROM    Production.Product PR3
        INNER JOIN (SELECT  PR1.ProductID
                    FROM    Production.ProductReview PR1
                    WHERE   PR1.Rating >= 4
                    INTERSECT
                    SELECT  PR1.ProductID
                    FROM    Production.ProductReview PR1
                    WHERE   PR1.Rating <= 2
                   ) SQ
            ON PR3.ProductID = SQ.ProductID;
  ProductID Name
----------- -----------------
        937 HL Mountain Pedal

Another approach is to move the intersection subquery into the WHERE clause and use it to generate an in-list. For example:

SELECT  ProductID,
        Name
FROM    Production.Product
WHERE   ProductID IN (SELECT    PR1.ProductID
                      FROM      Production.ProductReview PR1
                      WHERE     PR1.Rating >= 4
                      INTERSECT
                      SELECT    PR1.ProductID
                      FROM      Production.ProductReview PR1
                      WHERE     PR1.Rating <= 2);
  ProductID Name
----------- -----------------
        937 HL Mountain Pedal

In this version of the query, the subquery generates a list of product ID numbers. The database engine then treats that list as input into the IN predicate. There is only one product in this case, so you can think loosely in terms of the database engine ultimately executing a statement such as the following:

SELECT  ProductID,
        Name
FROM    Production.Product
WHERE   ProductID IN (937);

You can actually write an IN predicate that gives a list of hard-coded values, or you can choose to generate that list of values from a SELECT statement, as in this recipe.

4-14. Finding Rows that Are Missing

Problem

You want to find rows in one table or result set that have no corresponding rows in another. For example, you want to find all products in the Product table with no corresponding special offers.

Solution

Different approaches are possible, one of which is to write a query involving EXCEPT:

SELECT  ProductID
FROM    Production.Product
EXCEPT
SELECT  ProductID
FROM    Sales.SpecialOfferProduct;
  ProductID
-----------
          1
          2
          3
...

If you want to see more than just a list of ID numbers, you can write a query involving NOT EXISTS and a correlated subquery. For example:

SELECT  P.ProductID,
        P.Name
FROM    Production.Product P
WHERE   NOT EXISTS ( SELECT *
                     FROM   Sales.SpecialOfferProduct SOP
                     WHERE  SOP.ProductID = P.ProductID );
  ProductID Name
----------- --------------------------------------------------
          1 Adjustable Race
          2 Bearing Ball
          3 BB Ball Bearing
...

How It Works

The solution involving EXCEPT is simple to write and easy to understand. The top query generates a list of all possible products. The bottom query generates a list of products that have been given special offers. EXCEPT subtracts the second list from the first and returns a list of products having no corresponding rows in SpecialOfferProduct. The downside is that this approach limits the final results to just a list of ID numbers.

The second solution involves a NOT EXISTS predicate. You first read about EXISTS in Recipe 4-8. NOT EXISTS is a variation on that theme. Rather than testing for existence, the predicate tests for nonexistence. The parent query then returns all product rows not having corresponding special offers. You are able to include any columns that you desire from the Product table in the query results.

4-15. Comparing Two Tables

Problem

You have two copies of a table. You want to test for equality. Do both copies have the same row and column values?

Solution

Begin by creating a copy of the table. For purposes of example in this recipe, use the Password table

SELECT  *
INTO    Person.PasswordCopy
FROM    Person.Password;

Then execute the following UNION query to compare the data between the two tables and report on the differences.

SELECT  *,
        COUNT(*) DupeCount,
        'Password' TableName
FROM    Person.Password P
GROUP BY BusinessEntityID,
        PasswordHash,
        PasswordSalt,
        rowguid,
        ModifiedDate
HAVING  NOT EXISTS ( SELECT *,
                            COUNT(*)
                     FROM   Person.PasswordCopy PC
                     GROUP BY BusinessEntityID,
                            PasswordHash,
                            PasswordSalt,
                            rowguid,
                            ModifiedDate
                     HAVING PC.BusinessEntityID = P.BusinessEntityID
                            AND PC.PasswordHash = P.PasswordHash
                            AND PC.PasswordSalt = P.PasswordSalt
                            AND PC.rowguid = P.rowguid
                            AND PC.ModifiedDate = P.ModifiedDate
                            AND COUNT(*) = COUNT(ALL P.BusinessEntityID) )
UNION
SELECT  *,
        COUNT(*) DupeCount,
        'PasswordCopy' TableName
FROM    Person.PasswordCopy PC
GROUP BY BusinessEntityID,
        PasswordHash,
        PasswordSalt,
        rowguid,
        ModifiedDate
HAVING  NOT EXISTS ( SELECT *,
                            COUNT(*)
                     FROM   Person.Password P
                     GROUP BY BusinessEntityID,
                            PasswordHash,
                            PasswordSalt,
                            rowguid,
                            ModifiedDate
                     HAVING PC.BusinessEntityID = P.BusinessEntityID
                            AND PC.PasswordHash = P.PasswordHash
                            AND PC.PasswordSalt = P.PasswordSalt
                            AND PC.rowguid = P.rowguid
                            AND PC.ModifiedDate = P.ModifiedDate
                            AND COUNT(*) = COUNT(ALL PC.BusinessEntityID) );

The result from this query will be zero rows. That is because the tables are unchanged. You’ve made a copy of Password, but haven’t changed values in either table.

Now make some changes to the data in the two tables. BusinessEntityID numbers are in the range 1, . . . , 19972. Following are some statements to change data in each table, and to create one duplicate row in the copy:

UPDATE  Person.PasswordCopy
SET     PasswordSalt = 'Munising!'
WHERE   BusinessEntityID IN (9783, 221);

UPDATE  Person.Password
SET     PasswordSalt = 'Marquette!'
WHERE   BusinessEntityID IN (42, 4242);

INSERT  INTO Person.PasswordCopy
        SELECT  *
        FROM    Person.PasswordCopy
        WHERE   BusinessEntityID = 1;

Having changed the data, reissue the previous UNION query to compare the two tables. This time there are results indicating the differences just created:

BusinessEntityID PasswordHash  ... PasswordSalt ... DupeCount TableName
---------------- ------------- ... ------------ ... --------- ------------
               1 pbFwXWE99vobT ... bE3XiWw=     ...         1 Password
              42 HSLAA7MxklY4d ... Marquette!   ...         1 Password
             221 DFSEDLoy3em1I ... 5nzaMoQ=     ...         1 Password
            4242 YITAXaCQCapPi ... Marquette!   ...         1 Password
            9783 1gv08vLyjlhQY ... YcAxsQQ=     ...         1 Password
               1 pbFwXWE99vobT ... bE3XiWw=     ...         2 PasswordCopy
              42 HSLAA7MxklY4d ... uTuRBuI=     ...         1 PasswordCopy
             221 DFSEDLoy3em1I ... Munising!    ...         1 PasswordCopy
            4242 YITAXaCQCapPi ... mj6TQG4=     ...         1 PasswordCopy
            9783 1gv08vLyjlhQY ... Munising!    ...         1 PasswordCopy

These results indicate rows from each table that are not found in the other. They also indicate differences in duplication counts.

How It Works

The solution query is intimidating at first, and it is a lot to type. But it is a rote query once you get the hang of it, and the two halves are essentially mirror images of each other.

The grouping and counting is there to handle the possibility of duplicate rows. Each of the queries begins by grouping all columns and generating a duplication count. For example, the second subquery examines PasswordCopy:

SELECT  *,
        COUNT(*) DupeCount,
        'PasswordCopy' TableName
FROM    Person.PasswordCopy PC
GROUP BY BusinessEntityID,
        PasswordHash,
        PasswordSalt,
        rowguid,
        ModifiedDate;
BusinessEntityID PasswordHash       DupeCount TableName
---------------- ------------- ... ---------- ------------
               1 pbFwXWE99vobT ...          2 PasswordCopy
               2 bawRVNrZQYQ05 ...          1 PasswordCopy
...

Here, you can see that there are two rows containing the same set of values. Both rows are associated with BusinessEntityID 1. The DupeCount for that ID is 2.

Next comes a subquery in the HAVING clause to restrict the results to only those rows not also appearing in the Password table:

HAVING  NOT EXISTS ( SELECT *,
                            COUNT(*)
                     FROM   Person.PasswordCopy PC
                     GROUP BY BusinessEntityID,
                            PasswordHash,
                            PasswordSalt,
                            rowguid,
                            ModifiedDate
                     HAVING PC.BusinessEntityID = P.BusinessEntityID
                            AND PC.PasswordHash = P.PasswordHash
                            AND PC.PasswordSalt = P.PasswordSalt
                            AND PC.rowguid = P.rowguid
                            AND PC.ModifiedDate = P.ModifiedDate
                            AND COUNT(*) = COUNT(ALL P.BusinessEntityID) )

This HAVING clause is tedious to write, but it is conceptually simple. It compares all columns, looking for equality. It compares row counts to check for differences in the number of times a row is duplicated in either of the tables. The results are a list of rows in PasswordCopy that do not also exist the same number of times in Password.

Both queries do the same thing from different directions. The first query in the UNION finds rows in Password that are not also in PasswordCopy. The second query reverses things and finds rows in PasswordCopy that are not also in Password. Both queries will detect differences in duplication counts.

There is one row that is reported in the solution results because it occurs twice in the copy and once in the original. See here:

BusinessEntityID PasswordHash  ... PasswordSalt ... DupeCount TableName
---------------- ------------- ... ------------ ... --------- ------------
               1 pbFwXWE99vobT ... bE3XiWw=     ...         1 Password
...
               1 pbFwXWE99vobT ... bE3XiWw=     ...         2 PasswordCopy
...

The TableName column lets you see that Password contains just one row for BusinessEntityID 1. That makes sense, because that column is the primary key. The PasswordCopy table, however, has no primary key. Somehow, someone has duplicated the row for BusinessEntityID 1. That table has two copies of the row. Because the number of copies is different, the tables do not compare as being equal.

The solution query reports differences between the two tables. An empty result set indicates that the two tables contain the same rows, which have the same values and occur the same number of times.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.141.35.238