Summary Exercise—Relational Division

As a summary exercise, you’re given the following task for cases in which you might want to consider using temporary objects: in the Northwind database, you need to determine which customers have orders handled by the same set of employees. The result set should contain a row for each customer, with the customer ID, and a value representing the group of employees. The latter is expressed as the minimum customer ID out of all customers that share the same set of employees. That is, if customers A, B, and C were handled by the same set of employees (for example, 3, 7, 9), the result set would contain {(A, A), (B, A), (C, A)}. Return NULL for customers that have no orders. The desired result is shown in Table 2-5 (abbreviated).

Table 2-5. Desired Result of Summary Exercise (Abbreviated)

CustomerID

Grp

FISSA

NULL

PARIS

NULL

ALFKI

ALFKI

ANATR

ANATR

THECR

ANATR

ANTON

ANTON

TRADH

ANTON

AROUT

AROUT

BERGS

BERGS

HANAR

BERGS

BLAUS

BLAUS

BLONP

BLONP

...

...

You can observe, for example, that ANATR and THECR were handled by the same set of employees. Note that you should use the sample tables in the Northwind database only to check the accuracy of your result. For performance estimations, you need to create your own tables with substantially more rows to reflect realistic production environments. For example, you can run the code in Example 2-1 to create in tempdb a Customers table with 10,000 customers and an Orders table with 1,000,000 orders. The run times that I will mention were measured against these benchmark tables on my system.

Example 2-1. Code that creates realistic table sizes for summary exercise benchmark

SET NOCOUNT ON;
USE tempdb;
GO

IF OBJECT_ID('dbo.Orders') IS NOT NULL
  DROP TABLE dbo.Orders;
GO
IF OBJECT_ID('dbo.Customers') IS NOT NULL
  DROP TABLE dbo.Customers;
GO

SELECT n AS CustomerID
INTO dbo.Customers
FROM dbo.Nums
WHERE n <= 10000;

ALTER TABLE dbo.Customers ADD PRIMARY KEY(CustomerID);

SELECT n AS OrderID,
  1 + ABS(CHECKSUM(NEWID())) % 10000 AS CustomerID,
  1 + ABS(CHECKSUM(NEWID())) % 40    AS EmployeeID
INTO dbo.Orders
FROM dbo.Nums
WHERE n <= 1000000;

ALTER TABLE dbo.Orders ADD PRIMARY KEY(OrderID);
CREATE INDEX idx_cid_eid ON dbo.Orders(CustomerID, EmployeeID);

The first solution doesn’t make any use of temporary objects; rather, it implements a classic relational division approach applying reverse logic with subqueries:

SELECT CustomerID,
    CASE WHEN EXISTS(SELECT * FROM dbo.Orders AS O
                     WHERE O.CustomerID = C1.CustomerID)
      THEN COALESCE(
        (SELECT MIN(C2.CustomerID)
           FROM dbo.Customers AS C2
           WHERE C2.CustomerID < C1.CustomerID
             AND NOT EXISTS
               (SELECT * FROM dbo.Orders AS O1
                WHERE O1.CustomerID = C1.CustomerID
                  AND NOT EXISTS
                    (SELECT * FROM dbo.Orders AS O2
                     WHERE O2.CustomerID = C2.CustomerID
                       AND O2.EmployeeID = O1.EmployeeID))
             AND NOT EXISTS
               (SELECT * FROM dbo.Orders AS O2
                WHERE O2.CustomerID = C2.CustomerID
                  AND NOT EXISTS
                    (SELECT * FROM dbo.Orders AS O1
                     WHERE O1.CustomerID = C1.CustomerID
                       AND O1.EmployeeID = O2.EmployeeID))),
        CustomerID) END AS Grp
FROM dbo.Customers AS C1
ORDER BY Grp, CustomerID;

The query invokes a CASE expression for every customer from the Customers table (C1). The CASE expression invokes the COALESCE function for customers who placed orders, and returns NULL for customers who placed no orders. If the customer placed orders, COALESCE will substitute a NULL returned by the input expression with the current CustomerID. The input expression will return the result of the following:

  • Return the minimum CustomerID from a second instance of Customers (C2)

  • Where C2.CustomerID (Cust2) is smaller than C1.CustomerID (Cust1)

  • And you cannot find an employee in Cust1’s orders that does not appear in Cust2’s orders

  • And you cannot find an employee in Cust2’s orders that does not appear in Cust1’s orders

Logically, you could do without filtering Cust2 < Cust1, but this expression is used to avoid wasting resources. Anyway, you need to return the minimum CustomerID out of the ones with the same employee list. If customer A has the same employee list as customer B, both will end up with a Grp value of A. For customer B, there’s a point in comparing it to customer A (smaller ID), but for customer A there’s no point in comparing it to customer B (higher ID). Naturally, the minimum CustomerID within the group will not find customers with smaller IDs. In such a case, the expression will return a NULL, and the outer COALESCE will substitute the NULL with the current CustomerID. As for the rest, it’s a classical phrasing of relational division with reverse logic.

This solution is expensive because of the excessive scan count, which has to do with the large number of invocations of the correlated subqueries. To give you a sense, this solution ran over an hour before I gave up waiting for it to finish and stopped it. Most standard set-based solutions you can come up with for this problem that don’t use temporary objects will typically be expensive.

If you devise a solution in which you generate an interim set that can benefit from an index, you might want to consider using temporary tables. For example, you can materialize the distinct list of CustomerID, EmployeeID values; index the temporary table; and continue from there. The materialized data would substantially reduce the number of rows in the set you’ll query. Still, you won’t be dealing with a tiny set, and most probably your solution will access the table multiple times. You want efficient plans to be generated based on distribution statistics and accurate cardinality information. All this should lead you to use a local temporary table and not a table variable.

Example 2-2 has a solution that first creates the suggested local temporary table, indexes it, and then queries it.

Example 2-2. Solution based on temporary tables

SELECT DISTINCT CustomerID, EmployeeID
INTO #CustsEmps
FROM dbo.Orders;

CREATE UNIQUE CLUSTERED INDEX idx_cid_eid
  ON #CustsEmps(CustomerID, EmployeeID);
GO

WITH Agg AS
(
  SELECT CustomerID,
    MIN(EmployeeID) AS MN,
    MAX(EmployeeID) AS MX,
    COUNT(*)        AS CN,
    SUM(EmployeeID) AS SM,
    CHECKSUM_AGG(EmployeeID) AS CS
  FROM #CustsEmps
  GROUP BY CustomerID
),
AggJoin AS
(
  SELECT A1.CustomerID AS Cust1, A2.CustomerID AS Cust2, A1.CN
  FROM Agg AS A1
    JOIN Agg AS A2
      ON  A2.CustomerID <= A1.CustomerID
      AND A2.MN = A1.MN
      AND A2.MX = A1.MX
      AND A2.CN = A1.CN
      AND A2.SM = A1.SM
      AND A2.CS = A1.CS
),
CustGrp AS
(
  SELECT Cust1, MIN(Cust2) AS Grp
  FROM AggJoin AS AJ
  WHERE CN = (SELECT COUNT(*)
              FROM #CustsEmps AS C1
                JOIN #CustsEmps AS C2
                  ON C1.CustomerID = AJ.Cust1
                  AND C2.CustomerID = AJ.Cust2
                  AND C2.EmployeeID = C1.EmployeeID)
  GROUP BY Cust1
)
SELECT CustomerID, Grp
FROM dbo.Customers AS C
  LEFT OUTER JOIN CustGrp AS G
    ON C.CustomerID = G.Cust1
ORDER BY Grp, CustomerID;
GO

DROP TABLE #CustsEmps;

I also used table expressions here to build the solution in a modular approach. I used CTEs, but the solution can be easily revised to use derived tables in SQL Server 2000.

The first CTE (Agg) groups the data from the temporary table by CutomerID, and returns a bunch of aggregates based on EmployeeID for each customer (MIN, MAX, COUNT, SUM, CHECKSUM_AGG).

The second CTE (AggJoin) joins two instances of Agg (A1 and A2)–matching for each customer in A1, all customers from A2 with a lower CustomerID and the same values in all aggregates. The purpose of comparing aggregates is to return, for each customer, smaller than or equal to customer IDs that potentially share the same list of employees. The reasoning behind the use of the smaller than or equal to (<=) filter is the similar to the one in the previous solution. That is, comparing sets of employees between customers when A2.CustomerID (Cust2) is greater than A1.CustomerID (Cust1) is superfluous.

The third CTE (CustGrp), filters from AggJoin only pairs that are proven to share the exact list of employees, by verifying that the count of matching employees in both is identical to the count of employees in each. The query groups the filtered rows by Cust1, returning the minimum Cust2 for each Cust1. At this point, CustGrp contains the correct Grp value for each customer.

Finally the outer query performs a left outer join that adds customers with no orders.

This solution runs for 7 seconds. Note that you could use a CTE with the set of distinct CustomerID, EmployeeID combinations instead of the temporary table #CustEmps. This way, you could avoid using temporary tables altogether. I tested such a solution and it ran for about 12 seconds–almost twice the runtime of the solution that utilizes a temporary table. The advantage in the temporary table approach was that you could index it.

Considering the fastest solution we had so far–the one utilizing a temporary table–is this really the best you can get? Apparently not. You can use the FOR XML PATH option to concatenate all distinct EmployeeID values per customer. You can then group the data by the concatenated string, and return for each customer the minimum CustomerID within the group using the OVER clause. The fast and nifty concatenation technique was devised by Michael Rys, a program manager with the Microsoft SQL Server development team in charge of SQL Server XML technologies, and Eugene Kogan, a technical lead on the Microsoft SQL Server Engine team. The PATH mode provides an easier way to mix elements and attributes than the EXPLICIT directive. Here’s the complete solution:

WITH CustGroups AS
(
  SELECT CustomerID,
    (SELECT CAST(EmployeeID AS VARCHAR(10)) + ';' AS [text()]
     FROM (SELECT DISTINCT EmployeeID
           FROM dbo.Orders AS O
           WHERE O.CustomerID = C.CustomerID) AS D
     ORDER BY EmployeeID
     FOR XML PATH('')) AS CustEmps
  FROM dbo.Customers AS C
)
SELECT CustomerID,
  CASE WHEN CustEmps IS NULL THEN NULL
    ELSE MIN(CustomerID) OVER(PARTITION BY CustEmps) END AS Grp
FROM CustGroups
ORDER BY Grp, CustomerID;

The solution is short and slick, doesn’t use temporary tables at all, and runs for 7 seconds as well! This is to prove that you should revisit slow-running solutions from time to time; especially with new functionality available in SQL Server 2005.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.118.12.50