Advanced SELECT techniques

Aggregating data over the complete input rowset or aggregating in groups produces aggregated rows only—either one row for the whole input rowset or one row per group. Sometimes you need to return aggregates together with the detail data. One way to achieve this is by using subqueries, queries inside queries.

The following query shows an example of using two subqueries in a single query. In the SELECT clause, a subquery calculates the sum of quantity for each customer; it returns a scalar value. The subquery refers to the customer key from the outer query. The subquery can't execute without the outer query; this is a correlated subquery. There is another subquery in the FROM clause that calculates the overall quantity for all customers. This query returns a table, although it is a table with a single row and a single column. This query is a self-contained subquery, independent of the outer query. A subquery in the FROM clause is also called a derived table.

Another type of join is used to add the overall total to each detail row. A cross join is a Cartesian product of two input rowsets—each row from one side is associated with every single row from the other side. No join condition is needed. A cross join can produce an unwanted huge result set. For example, if you cross join just 1,000 rows from the left side of the join with 1,000 rows from the right side, you get 1,000,000 rows in the output. Therefore, typically, you want to avoid a cross join in production. However, in the example in the following query, 143,968 from the left side rows is cross joined to a single row from the subquery, therefore producing 143,968 only. Effectively, this means that the overall total column is added to each detail row:

SELECT c.Customer, 
  f.Quantity, 
  (SELECT SUM(f1.Quantity) FROM Fact.Sale AS f1 
   WHERE f1.[Customer Key] = c.[Customer Key]) AS TotalCustomerQuantity, 
  f2.TotalQuantity 
FROM (Fact.Sale AS f 
  INNER JOIN Dimension.Customer AS c 
    ON f.[Customer Key] = c.[Customer Key]) 
  CROSS JOIN  
    (SELECT SUM(f2.Quantity) FROM Fact.Sale AS f2 
    WHERE f2.[Customer Key] <> 0) AS f2(TotalQuantity) 
WHERE c.[Customer Key] <> 0 
ORDER BY c.Customer, f.Quantity DESC;

Here is an abbreviated output of the query:

In the previous example, the correlated subquery in the SELECT clause has to logically execute once per row of the outer query. The query was partially optimized by moving the self-contained subquery for the overall total in the FROM clause, where it logically executes only once. Although SQL Server can optimize correlated subqueries many times and convert them to joins, there also exists a much better and more efficient way to achieve the same result as the previous query returned. You can do this by using the window functions.

The following query is using the window aggregate function SUM to calculate the total over each customer and the overall total. The OVER clause defines the partitions, or the windows of the calculation. The first calculation is partitioned over each customer, meaning that the total quantity per customer is reset to zero for each new customer. The second calculation uses an OVER clause without specifying partitions, thus meaning the calculation is done over all the input rowset. This query produces exactly the same result as the previous one:

SELECT c.Customer, 
  f.Quantity, 
  SUM(f.Quantity) 
   OVER(PARTITION BY c.Customer) AS TotalCustomerQuantity, 
  SUM(f.Quantity) 
   OVER() AS TotalQuantity 
FROM Fact.Sale AS f 
  INNER JOIN Dimension.Customer AS c 
    ON f.[Customer Key] = c.[Customer Key] 
WHERE c.[Customer Key] <> 0 
ORDER BY c.Customer, f.Quantity DESC;

You can use many other functions for window calculations. For example, you can use the ranking functions, such as ROW_NUMBER(), to calculate some rank in the window or in the overall rowset. However, rank can be defined only over some order. You can specify the order of the calculation in the ORDER BY sub-clause inside the OVER clause. Please note that this ORDER BY clause defines only the logical order of the calculation, and not the order of the rows returned. A standalone, outer ORDER BY at the end of the query defines the order of the result.

The following query calculates a sequential number, the row number of each row in the output, for each detail row of the input rowset. The row number is calculated once in partitions for each customer and once over the whole input rowset. The logical order of the calculation is over quantity descending, meaning that row number one gets the largest quantity—either the largest for each customer or the largest in the whole input rowset:

SELECT c.Customer, 
  f.Quantity, 
  ROW_NUMBER() 
   OVER(PARTITION BY c.Customer 
        ORDER BY f.Quantity DESC) AS CustomerOrderPosition, 
  ROW_NUMBER() 
   OVER(ORDER BY f.Quantity DESC) AS TotalOrderPosition 
FROM Fact.Sale AS f 
  INNER JOIN Dimension.Customer AS c 
    ON f.[Customer Key] = c.[Customer Key] 
WHERE c.[Customer Key] <> 0 
ORDER BY c.Customer, f.Quantity DESC;

The query produces the following result, again abbreviated to a couple of rows only:

Note the position, or the row number, for the second customer. The order does not look to be completely correct—it is 1, 4, 3, 2, 5, and not 1, 2, 3, 4, 5, as you might expect. This is due to repeating the value for the second largest quantity—for the quantity 250. The quantity is not unique, and thus the order is not deterministic. The order of the result is defined over the quantity, and not over the row number. You can't know in advance which row will get which row number when the order of the calculation is not defined on unique values. Please also note that you might get a different order when you execute the same query on your SQL Server instance.

Window functions are useful for some advanced calculations, such as running totals and moving averages as well. However, the calculation of these values can't be performed over the complete partition. You can additionally frame the calculation to a subset of rows of each partition only.

The following query calculates the running total of the quantity per customer ordered by the sale key and framed differently for each row. The frame is defined from the first row in the partition to the current row. Therefore, the running total is calculated over one row for the first row, over two rows for the second row, and so on. Additionally, the query calculates the moving average of the quantity for the last three rows:

SELECT c.Customer, 
  f.[Sale Key] AS SaleKey, 
  f.Quantity, 
  SUM(f.Quantity) 
   OVER(PARTITION BY c.Customer 
        ORDER BY [Sale Key] 
       ROWS BETWEEN UNBOUNDED PRECEDING 
                 AND CURRENT ROW) AS Q_RT, 
  AVG(f.Quantity) 
   OVER(PARTITION BY c.Customer 
        ORDER BY [Sale Key] 
       ROWS BETWEEN 2 PRECEDING 
                 AND CURRENT ROW) AS Q_MA 
FROM Fact.Sale AS f 
  INNER JOIN Dimension.Customer AS c 
    ON f.[Customer Key] = c.[Customer Key] 
WHERE c.[Customer Key] <> 0 
ORDER BY c.Customer, f.[Sale Key];

The query returns the following (abbreviated) result:

 Customer                    SaleKey  Quantity    Q_RT        Q_M
--------------------        --------  -------    --------  ------------
Tailspin Toys (Absecon, NJ)   2869     216         216         216
Tailspin Toys (Absecon, NJ)   2870     2           218         109
Tailspin Toys (Absecon, NJ)   2871     2           220         73

Let's find the top three orders by quantity for the Tailspin Toys (Aceitunas, PR) customer! You can do this by using the OFFSET...FETCH clause after the ORDER BY clause, as the following query shows:

SELECT c.Customer, 
  f.[Sale Key] AS SaleKey, 
  f.Quantity 
FROM Fact.Sale AS f 
  INNER JOIN Dimension.Customer AS c 
    ON f.[Customer Key] = c.[Customer Key] 
WHERE c.Customer = N'Tailspin Toys (Aceitunas, PR)' 
ORDER BY f.Quantity DESC 
OFFSET 0 ROWS FETCH NEXT 3 ROWS ONLY;

This is the complete result of the query:

    Customer                        SaleKey  Quantity
    ------------------------------  -------- -----------
    Tailspin Toys (Aceitunas, PR)   36964    288
    Tailspin Toys (Aceitunas, PR)   126253   250
    Tailspin Toys (Aceitunas, PR)   79272    250

But wait, didn't the second largest quantity, the value 250, repeat three times? Which two rows were selected in the output? Again, because the calculation is done over a non-unique column, the result is somehow non-deterministic. SQL Server offers another possibility, the TOP clause. You can specify TOP n WITH TIES, meaning you can get all of the rows with ties on the last value in the output. However, this way you don't know the number of the rows in the output in advance. The following query shows this approach:

SELECT TOP 3 WITH TIES 
  c.Customer, 
  f.[Sale Key] AS SaleKey, 
  f.Quantity 
FROM Fact.Sale AS f 
  INNER JOIN Dimension.Customer AS c 
    ON f.[Customer Key] = c.[Customer Key] 
WHERE c.Customer = N'Tailspin Toys (Aceitunas, PR)' 
ORDER BY f.Quantity DESC;

This is the complete result of the previous query—this time it is four rows:

    Customer                        SaleKey  Quantity
    ------------------------------ -------- -----------
    Tailspin Toys (Aceitunas, PR)   36964    288
    Tailspin Toys (Aceitunas, PR)   223106   250
    Tailspin Toys (Aceitunas, PR)   126253   250
    Tailspin Toys (Aceitunas, PR)   79272    250

The next task is to get the top three orders by quantity for each customer. You need to perform the calculation for each customer. The APPLY Transact-SQL operator comes in handy here. You use it in the FROM clause. You apply, or execute, a table expression defined on the right side of the operator once for each row of the input rowset from the left side of the operator. There are two flavors of this operator. The CROSS APPLY version filters out the rows from the left rowset, if the tabular expression on the right side does not return any row. The OUTER APPLY version preserves the row from the left side, even if the tabular expression on the right side does not return any row, just as the LEFT OUTER JOIN does. Of course, columns for the preserved rows do not have known values from the right-side tabular expression. The following query uses the CROSS APPLY operator to calculate the top three orders by quantity for each customer that actually does have some orders:

SELECT c.Customer, 
  t3.SaleKey, t3.Quantity 
FROM Dimension.Customer AS c 
  CROSS APPLY (SELECT TOP(3)  
                 f.[Sale Key] AS SaleKey, 
                 f.Quantity 
                FROM Fact.Sale AS f 
                WHERE f.[Customer Key] = c.[Customer Key] 
                ORDER BY f.Quantity DESC) AS t3 
WHERE c.[Customer Key] <> 0 
ORDER BY c.Customer, t3.Quantity DESC;

Here is the result of this query, shortened to the first nine rows:

    Customer                            SaleKey  Quantity
    ----------------------------------  -------- -----------
    Tailspin Toys (Absecon, NJ)         5620     360
    Tailspin Toys (Absecon, NJ)         114397   324
    Tailspin Toys (Absecon, NJ)         82868    288
    Tailspin Toys (Aceitunas, PR)       36964    288
    Tailspin Toys (Aceitunas, PR)       126253   250
    Tailspin Toys (Aceitunas, PR)       79272    250
    Tailspin Toys (Airport Drive, MO)   43184    250
    Tailspin Toys (Airport Drive, MO)   70842    240
    Tailspin Toys (Airport Drive, MO)   630      225

For the final task in this section, assume that you need to calculate some statistics over totals of customers' orders. You need to calculate the average total amount for all customers, the standard deviation of this total amount, and the average count of total count of orders per customer. This means you need to calculate the totals over customers in advance, and then use aggregate functions AVG() and STDEV() on these aggregates. You could do aggregations over customers in advance in a derived table. However, there is another way to achieve this. You can define the derived table in advance, in the WITH clause of the SELECT statement. Such a subquery is called a CTE.

CTEs are more readable than derived tables, and might be also more efficient. You could use the result of the same CTE multiple times in the outer query. If you use derived tables, then you need to define them multiple times if you want to use them multiple times in the outer query. The following query shows the usage of a CTE to calculate the average total amount for all customers, the standard deviation of this total amount, and the average count of total count of orders per customer:

WITH CustomerSalesCTE AS 
( 
SELECT c.Customer,  
  SUM(f.[Total Excluding Tax]) AS TotalAmount, 
  COUNT(*) AS SalesCount 
FROM Fact.Sale AS f 
  INNER JOIN Dimension.Customer AS c 
    ON f.[Customer Key] = c.[Customer Key] 
WHERE c.[Customer Key] <> 0  
GROUP BY c.Customer 
) 
SELECT ROUND(AVG(TotalAmount), 6) AS AvgAmountPerCustomer, 
  ROUND(STDEV(TotalAmount), 6) AS StDevAmountPerCustomer,  
  AVG(SalesCount) AS AvgCountPerCustomer 
FROM CustomerSalesCTE;

It returns the following result:

    AvgAmountPerCustomer  StDevAmountPerCustomer AvgCountPerCustomer
    --------------------- ---------------------- -------------------
    270479.217661         38586.082621           358

Table of Contents for Advanced SELECT techniques

Create new playlist

Sign In

Sign Up

Table of Contents for
Advanced SELECT techniques