Chapter 5: Aggregations and Grouping

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

CHAPTER 5

Aggregations and Grouping

by Wayne Sheffield

Aggregate functions are used to perform a calculation on one or more values, and the result is a single value. If your query has any columns with nonwindowed aggregate functions, then a GROUP BY clause is required for the query. Table 5-1 shows the various aggregate functions.

Table 5-1. Aggregate Functions

Function Name	Description
AVG	The AVG aggregate function calculates the average of non-NULL values in a group.
CHECKSUM_AGG	The CHECKSUM_AGG function returns a checksum value based on a group of rows, allowing you to potentially track changes to a table. For example, adding a new row or changing the value of a column that is being aggregated will usually result in a new checksum integer value. The reason I say “usually” is because there is a possibility that the checksum value does not change even if values are modified.
COUNT	The COUNT aggregate function returns an integer data type showing the count of rows in a group, including rows with NULL values.
COUNT_BIG	The COUNT_BIG aggregate function returns a bigint data type showing the count of rows in a group, including rows with NULL values.
GROUPING
MAX	The MAX aggregate function returns the highest value in a set of non-NULL values.
MIN	The MIN aggregate function returns the lowest value in a group of non-NULL values.
STDEV	The STDEV function returns the standard deviation of all values provided in the expression based on a sample of the data population.
STDEVP	The STDEVP function also returns the standard deviation for all values in the provided expression, but does so based upon the entire data population.
SUM	The SUM aggregate function returns the summation of all non-NULL values in an expression.
VAR	The VAR function returns the statistical variance of values in an expression based upon a sample of the provided population.
VARP	The VARP function returns the statistical variance of values in an expression, but does so based upon the entire data population.

Note Window functions—and using aggregate functions with window functions—are discussed in Chapter 7.

The STDEV, STDEVP, VAR, and VARP are statistical functions. The use of these functions requires knowledge of how statistics works, which is beyond the scope of this book.

With the exception of the COUNT and COUNT_BIG functions, all of the aggregate functions have the same syntax (the syntax and usage of these functions will be discussed in a recipe later on in this chapter).

aggregate_function_name ( { [ [ ALL | DISTINCT ] expression ] } )

where expression is a series of expressions and operations that returns a single value (but does not include aggregate functions or subqueries) that the aggregate function will be calculated over. If the optional keyword DISTINCT is used, then only distinct values will be considered. If the optional keyword ALL is used, then all values will be considered. If neither is specified, then ALL is used by default. Aggregate functions and subqueries are not allowed for the expression parameter.

Frequently when aggregating data, you will want to perform the aggregation based upon a grouping of a set of columns in the query. Grouping is primarily performed in SQL Server by using the GROUP BY clause within a SELECT query to determine in which group the rows should be put. Grouping can also be performed at the column level with the use of window functions. Data is aggregated by using the appropriate aggregation function. The simplified syntax is as follows:

SELECT Column1, <aggregate_function>(Column2)
FROM table_list
[WHERE search_conditions]
GROUP BY Column1

GROUP BY follows the optional WHERE clause and is most often used when aggregate functions are being utilized in the SELECT statement.

5-1. Computing an Aggregation

Problem

You want to perform several aggregations on the ratings of your products.

Solution

Use the appropriate aggregation function to determine each aggregation:

SELECT  MIN(Rating)     Rating_Min,
        MAX(Rating)     Rating_Max,
        SUM(Rating)     Rating_Sum,
        AVG(Rating)     Rating_Avg
FROM    Production.ProductReview;

This query produces the following result set:

Rating_Min  Rating_Max  Rating_Sum  Rating_Avg
----------- ----------- ----------- -----------
2           5           16          4

How It Works

For the non-NULL values in the table in the Rating column, the MIN function calculates the lowest of these values; the MAX function calculates the highest of these values; the SUM function calculates the total of the ratings; and the AVG aggregate function calculates the average of the values. To demonstrate the use of DISTINCT, let’s compare the previous query with the columns returned from the following query:

SELECT  AVG(Grade) AS AvgGrade,
        AVG(DISTINCT Grade) AS AvgDistinctGrade
FROM    (VALUES (1, 100),
                (1, 100),
                (1, 100),
                (1, 100),
                (1, 100),
                (1, 30)
        ) dt (StudentId, Grade);

This query produces the following result set:

AvgGrade    AvgDistinctGrade
----------- ----------------
88          65

In this example, we have a student with six grades—five perfect grades of 100 and one failing grade of 30; the average of all of the grades is 88. Within these grades are two distinct grades, and the average of these distinct grades is 65.

When utilizing the AVG function, the expression parameter must be one of the numeric data types.

5-2. Creating Aggregations Based upon the Values of the Data

Problem

You want to aggregate one or more columns, with the aggregations applied to a set of columns whenever the data in those columns change. For example, for each order, you want to see the number of line items, as well as the average, minimum, maximum, and total for those line items.

Solution

The SalesOrderID column is included in the query in order to make the results meaningful (e.g. which order is this for?). Each desired aggregation is performed against the LineTotal column. Finally, the query utilizes the GROUP BY clause in order to group the data by the SalesOrderID. See the following:

SELECT  TOP (10)
        SalesOrderID,
        SUM(LineTotal)      AS OrderTotal,
        MIN(LineTotal)      AS MinLine,
        MAX(LineTotal)      AS MaxLine,
        AVG(LineTotal)      AS AvgLine,
        COUNT(LineTotal)    AS CountLine
FROM    [Sales].[SalesOrderDetail]
GROUP BY SalesOrderID
ORDER BY SalesOrderID;

This query returns the following result set:

SalesOrderID   OrderTotal     MinLine      MaxLine       AvgLine       CountLine
------------   ------------   ----------   -----------   -----------   ---------
43659          20565.620600   10.373000    6074.982000   1713.801716   12
43660          1294.252900    419.458900   874.794000    647.126450    2
43661          32726.478600   20.746000    8099.976000   2181.765240   15
43662          28832.528900   178.580800   5248.764000   1310.569495   22
43663          419.458900     419.458900   419.458900    419.458900    1
43664          24432.608800   28.840400    8099.976000   3054.076100   8
43665          14352.771300   10.373000    4049.988000   1435.277130   10
43666          5056.489600    356.898000   2146.962000   842.748266    6
43667          6107.082000    17.100000    2039.994000   1526.770500   4
43668          35944.156200   20.186500    5248.764000   1239.453662   29

How It Works

To determine the groups that rows should be put in, the GROUP BY clause is used in a SELECT query. When grouping a result set, the GROUP BY clause can specify multiple columns, and all columns listed in the SELECT clause must be either used in an aggregate function or referenced in the GROUP BY clause. If the query returns both aggregated and non-aggregated columns, and the GROUP BY clause doesn’t specify all of the non-aggregated columns, then the following error will be raised (this error was generated by removing the GROUP BY clause from the example in this recipe):

Msg 8120, Level 16, State 1, Line 9
Column 'Sales.SalesOrderDetail.SalesOrderID' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

This error is raised because any column being returned by the query that is not used in an aggregate function in the SELECT list must be listed in the GROUP BY clause for the query.

5-3. Counting the Rows in a Group

Problem

You want to see the number of rows for each value of a column—for instance, the number of products you have in inventory on each shelf for your first five shelves.

Solution

Utilize the COUNT or COUNT_BIG function to return the count of rows in a group:

SELECT  TOP (5)
        Shelf,
        COUNT(ProductID) AS ProductCount,
        COUNT_BIG(ProductID) AS ProductCountBig
FROM    Production.ProductInventory
GROUP BY Shelf
ORDER BY Shelf;

This query returns the following result set:

Shelf      ProductCount ProductCountBig
---------- ------------ ---------------
A          81           81
B          36           36
C          55           55
D          50           50
E          85           85

The results of this query show each shelf and the number of products on each of those shelves.

How It Works

The COUNT and COUNT_BIG functions are utilized to return a count of the number of items in a group. The difference between these functions is the data type returned: COUNT returns an INTEGER, while COUNT_BIG returns a BIGINT. You should utilize COUNT_BIG if you will be counting more rows than the INTEGER data type supports (2³¹−1). Throughout this book, the COUNT function will be used. The syntax for these functions is as follows:

COUNT | COUNT_BIG ( { [ [ ALL | DISTINCT ] expression ] | * } )

The difference between this syntax and the other aggregate functions is the optional asterisk (*) that can be specified. The use of COUNT(*)specifies that all rows should be counted so as to return the total number of rows within a table; conversely, if COUNT(<nullable column>) is used, then rows where that column is NULL will not be counted. COUNT(*) does not use any parameters, so it does not use any information about any column.

When utilizing the COUNT or COUNT_BIG functions, the expression parameter can be of any data type except for the text, image, or ntext data types. For instance, for the following table variable:

DECLARE @test TABLE (col1 TEXT);

this query:

SELECT COUNT(col1) FROM @test;

will return the following error:

Msg 8117, Level 16, State 1, Line 4
Operand data type text is invalid for count operator.

However, you can utilize COUNT(*) instead:

SELECT COUNT(*) FROM @test;

which returns this result set:

--
0

If you are using the COUNT function and you exceed the capacity for an integer, then the following error will be generated:

Msg 8115, Level 16, State 2, Line 1
Arithmetic overflow error converting expression to data type int.

5-4. Detecting Changes in a Table

Problem

You need to determine whether any changes have been made to the data in a column.

Solution

Utilize the CHECKSUM_AGG function to detect changes in a table. For example:

IF OBJECT_ID('tempdb.dbo.[#Recipe5.4]') IS NOT NULL DROP TABLE [#Recipe5.4];
CREATE TABLE [#Recipe5.4]
(
    StudentID INTEGER,
    Grade     INTEGER
);
INSERT INTO [#Recipe5.4] (StudentID, Grade)
VALUES  (1, 100),
        (1, 95)

SELECT  StudentID, CHECKSUM_AGG(Grade) AS GradeChecksumAgg
FROM    [#Recipe5.4]
GROUP BY StudentID;

UPDATE  [#Recipe5.4]
SET     Grade = 99
WHERE   Grade = 95;

SELECT  StudentID, CHECKSUM_AGG(Grade) AS GradeChecksumAgg
FROM    [#Recipe5.4]
GROUP BY StudentID;

These queries return the following result sets:

StudentID   GradeChecksumAgg
----------- ----------------
1           59

StudentID   GradeChecksumAgg
----------- ----------------
1           7

How It Works

The CHECKSUM_AGG function returns the checksum of the values in the group, in this case the values from the Grade column. In the second query, the last grade is changed, and when the first query is rerun, the aggregated checksum returns a different value.

When utilizing the CHECKSUM_AGG function, the expression parameter must be of an integer data type.

Note Because of the hashing algorithm being used, it is possible for the CHECKSUM_AGG function to return the same value with different data. You should use this function only if your application can tolerate occasionally missing a change.

5-5. Restricting a Result Set to Groups of Interest

Problem

You do not want to return all of the rows that could be returned by an aggregation; instead, you want only the rows where the aggregation itself is filtered. For example, you want to report on the reasons that the product was scrapped, but only for the reasons that have more than 50 occurrences.

Solution

Specify a HAVING clause, giving the conditions that the aggregated rows must meet in order to be returned.

This example queries two tables: Production.ScrapReason and Production.WorkOrder. The Production.ScrapReason table is a lookup table that contains manufacturing failure reasons, and the Production.WorkOrder table contains the manufacturing work orders that control which products are manufactured in the quantity and time period required in order to meet inventory and sales needs. A report is needed that shows which of the “failure reasons” have occurred more than 50 times, which can be achieved by the following code:

SELECT  s.Name,
        COUNT(w.WorkOrderID) AS Cnt
FROM    Production.ScrapReason s
        INNER JOIN Production.WorkOrder w
            ON s.ScrapReasonID = w.ScrapReasonID
GROUP BY s.Name
HAVING  COUNT(*) > 50;

This query returns the following result set:

Name                              Cnt
--------------------------------- ---
Gouge in metal                    54
Stress test failed                52
Thermoform temperature too low    63
Trim length too long              52
Wheel misaligned                  51

How It Works

The HAVING clause of the SELECT statement allows you to specify a search condition on a query that uses GROUP BY and/or an aggregated value. The syntax is as follows:

SELECT select_list
FROM table_list
[ WHERE search_conditions ]
[ GROUP BY group_by_list ]
[ HAVING search_conditions ]

The HAVING clause is used to qualify the results after the GROUP BY has been applied. The WHERE clause, in contrast, is used to qualify the rows that are returned from the tables specified in the FROM clause before the data is aggregated or grouped. HAVING qualifies the aggregated data after the data has been grouped.

In this recipe, the SELECT clause requests a count of WorkOrderIDs by failure name:

SELECT  s.Name,
        COUNT(w.WorkOrderID) AS Cnt

Two tables are joined by the ScrapReasonID column:

FROM    Production.ScrapReason s
        INNER JOIN Production.WorkOrder w
            ON s.ScrapReasonID = w.ScrapReasonID

Because an aggregate function is used in the SELECT clause, the nonaggregated columns must appear in the GROUP BY clause:

GROUP BY s.Name

Lastly, using the HAVING query ensures that, of the selected and grouped data, only those rows in the result set with a count of more than 50 will be returned:

HAVING COUNT(*)>50

5-6. Performing Aggregations against Unique Values Only

Problem

You need to know the quantity of unique values per date.

Solution

Add the DISTINCT clause to the COUNT function:

SELECT  [RateChangeDate],
        COUNT([Rate]) AS [Count],
        COUNT(DISTINCT Rate) AS [DistinctCount]
FROM    [HumanResources].[EmployeePayHistory]
WHERE   RateChangeDate >= '2008-12-01'
        AND RateChangeDate < '2008-12-10'
GROUP BY RateChangeDate;

This query returns the following result set:

RateChangeDate           Count       DistinctCount
-----------------------  ----------- -------------
2008-12-01 00:00:00.000  2           2
2008-12-02 00:00:00.000  3           2
2008-12-03 00:00:00.000  1           1
2008-12-04 00:00:00.000  3           3
2008-12-05 00:00:00.000  1           1
2008-12-06 00:00:00.000  2           2
2008-12-07 00:00:00.000  5           3
2008-12-08 00:00:00.000  2           2
2008-12-09 00:00:00.000  3           3

How It Works

This query utilizes two COUNT functions; the second one also uses the DISTINCT clause. This forces the COUNT function to count only the distinct values in the specified column, in this case the Rate column. To further understand the difference, let’s examine the data for a date where the two functions are returning different values—2008-12-02:

SELECT  RateChangeDate, Rate
FROM    HumanResources.EmployeePayHistory
WHERE   RateChangeDate = '2008-12-02';

This query returns the following three rows:

RateChangeDate          Rate
----------------------- -----
2008-12-02 00:00:00.000 6.50
2008-12-02 00:00:00.000 10.00
2008-12-02 00:00:00.000 10.00

When looking at the data, you can see that for this date there are three rows; however, there are two distinct values in the Rate column. Therefore, the COUNT function returned 3, while COUNT(DISTINCT) returned 2.

5-7. Creating Hierarchical Summaries

Problem

You need to return a data set with the detail data as well as with subtotal rows and a grand total row based upon the GROUP BY clause.

Solution

You need to include the ROLLUP argument after the GROUP BY clause. This example uses the ROLLUP argument to produce subtotal lines at the Shelf level, as well as a grand total line:

SELECT  i.Shelf,
        p.Name,
SUM(i.Quantity) AS Total
FROM    Production.ProductInventory i
INNER JOIN Production.Product p
ON i.ProductID = p.ProductID
WHERE   i.Shelf IN ('A','B')
AND     p.Name LIKE 'Metal%'
GROUP BY ROLLUP(i.Shelf, p.Name);

This query returns the following result set:

Shelf      Name        Total
---------- ----------- -----
A          Metal Angle 404
A          Metal Bar 1 353
A          Metal Bar 2 622
A          NULL        1379
B          Metal Angle 355
B          Metal Bar 1 403
B          Metal Bar 2 497
B          NULL        1255
NULL       NULL        2634

How It Works

The order in which you place the columns in the GROUP BY ROLLUP clause affects how data is aggregated. ROLLUP in this query aggregates the total quantity for each change in Shelf. Notice the row with shelf A and the NULL name; this holds the total quantity for shelf A. Also notice that the final row is the grand total of all product quantities. Whereas CUBE creates a result set that aggregates all combinations for the selected columns, ROLLUP generates the aggregates for a hierarchy of values.

GROUP BY ROLLUP (i.Shelf, p.Name)

ROLLUP aggregated both a grand total and totals by shelf. Totals were not generated for the product name, but would have been had CUBE been designated instead.

ROLLUP uses a slightly different syntax than previous versions of SQL Server used. ROLLUP comes directly after the GROUP BY clause, instead of trailing the GROUP BY clause with a WITH ROLLUP clause. Notice also that the column lists are contained within parentheses.

Note The GROUP BY WITH ROLLUP feature does not follow the ISO standard, and it will be removed in a future version of Microsoft SQL Server. You should avoid using this feature in any new development work, and you should modify any applications that currently use this feature to use the ROLLUP argument instead.

In this example, I’ve shown how the ROLLUP clause works across a set of columns. Let’s examine another scenario now: we want to include the LocationID column in the output, and then we want to perform the ROLLUP at just the name level. This would be performed with the following query:

SELECT  i.Shelf,
        i.LocationID,
        p.Name,
SUM(i.Quantity) AS Total
FROM    Production.ProductInventory i
INNER JOIN Production.Product p
ON i.ProductID = p.ProductID
WHERE   i.Shelf IN ('A','B')
AND     p.Name LIKE 'Metal%'
GROUP BY i.Shelf, i.LocationID, ROLLUP(i.Shelf, p.Name)
ORDER BY i.Shelf, i.LocationID;

This is accomplished by grouping by Shelf and LocationID and then applying ROLLUP.

5-8. Creating Summary Totals and Subtotals

Problem

You need to return a data set with the detail data as well as with the data summarized on each combination of columns specified in the GROUP BY clause.

Solution

You need to include the CUBE argument after the GROUP BY clause. This example uses the CUBE argument to produce subtotal lines at both the Shelf and LocationID levels, as well as a grand total line:

SELECT  Shelf,
        LocationID,
SUM(Quantity) AS Total
FROM    Production.ProductInventory
WHERE   Shelf IN ('A','B')
AND     LocationID IN (10, 20)
GROUP BY CUBE(Shelf, LocationID);

This query produces several levels of totals, the first being by LocationID.

Shelf      LocationID Total
---------- ---------- -----
A          10         1379
B          10         1648
NULL       10         3027
A          20         1680
B          20         355
NULL       20         2035
NULL       NULL       5062
A          NULL       3059
B          NULL       2003

How It Works

By using the CUBE argument, the query groups by the specified columns, and it creates additional rows that provide totals for each combination of the columns specified in the GROUP BY clause. The rows with NULL values indicate a totaling at either the subtotal or total level. When all of the columns specified in the GROUP BY CUBE are NULL, then this row is the total row. Rows with one or more, but not all, of the columns specified in the GROUP BY CUBE set to NULL are subtotals at the level of the non-null columns. See recipes 5-12 and 5-13 for differenting these groups from data when the columns specified in the GROUP BY CUBE clause contain legitimate NULL values.

CUBE uses a slightly different syntax than in previous versions of SQL Server: CUBE comes after the GROUP BY clause instead of trailing the GROUP BY clause with a WITH CUBE. Notice also that the column lists are contained within parentheses.

Note The GROUP BY WITH CUBE feature does not follow the ISO standard, and it will be removed in a future version of Microsoft SQL Server. You should avoid using this feature in any new development work, and you should modify any applications that currently use this feature to use the CUBE argument instead.

As with the ROLLUP feature, CUBE allows to you first group by columns, then a cube. For example:

SELECT  Shelf,
        LocationID,
        SUM(i.Quantity) AS Total
FROM    Production.ProductInventory
WHERE   Shelf in ('A','B')
AND     LocationID in (10,20)
GROUP BY shelf, CUBE(Shelf, LocationID);

5-9. Creating Custom Summaries

Problem

You need to have one result set with multiple custom aggregations.

Solution

You must include the GROUPING SETS argument after the GROUP BY clause and also include each of the custom aggregations that you want performed.

SQL Server gives you the ability to define your own grouping sets within a single query result set without having to resort to multiple UNION ALL queries. GROUPING SETS also provides you with more control over what is aggregated, as compared to the previously demonstrated CUBE and ROLLUP operations. This is performed by using the GROUPING SETS operator.

First, let’s define a business requirement for a query, which is to have a single result set that contains three different aggregate quantity summaries. Specifically, I would like to see quantity totals by shelf, quantity totals by both shelf and product name, and then also quantity totals by location and name.

We’ll use the GROUPING SETS operator to define the various aggregations we would like to have returned in a single result set:

SELECT  i.Shelf,
        i.LocationID,
        p.Name,
        SUM(i.Quantity) AS Total
FROM    Production.ProductInventory i
        INNER JOIN Production.Product p
            ON i.ProductID = p.ProductID
WHERE   Shelf IN ('A', 'C')
        AND Name IN ('Chain', 'Decal', 'Head Tube')
GROUP BY GROUPING SETS((i.Shelf), (i.Shelf, p.Name), (i.LocationID, p.Name));

This returns the following results:

Shelf      LocationID Name       Total
---------- ---------- ---------- -----
NULL       1          Chain      236
NULL       5          Chain      192
NULL       50         Chain      161
NULL       20         Head Tube  544
A          NULL       Chain      353
A          NULL       Head Tube  544
A          NULL       NULL       897
C          NULL       Chain      236
C          NULL       NULL       236

How It Works

The new GROUPING SETS operator allows you to define varying aggregate groups in a single query while avoiding having multiple queries attached together using the UNION ALL operator. The core of this recipe’s example is the following line of code:

GROUP BY GROUPING SETS ((i.Shelf), (i.Shelf, p.Name), (i.LocationID, p.Name))

Notice that, unlike a regular aggregated query, the GROUP BY clause is not followed by a list of columns. Instead, it is followed by GROUPING SETS. GROUPING SETS is then followed by parentheses and the groupings of column names, each of which is also encapsulated in parentheses.

To achieve this in previous versions of SQL Server, you would have needed to use the UNION ALL operator with multiple queries, as follows:

SELECT  NULL AS Shelf,
        i.LocationID,
        p.Name,
        SUM(i.Quantity) AS Total
FROM    Production.ProductInventory i
        INNER JOIN Production.Product p
            ON i.ProductID = p.ProductID
WHERE   Shelf IN ('A', 'C')
        AND Name IN ('Chain', 'Decal', 'Head Tube')
GROUP BY i.LocationID,
         p.Name
UNION ALL
SELECT  i.Shelf,
        NULL,
        NULL,
        SUM(i.Quantity) AS Total
FROM    Production.ProductInventory i
        INNER JOIN Production.Product p
            ON i.ProductID = p.ProductID
WHERE   Shelf IN ('A', 'C')
        AND Name IN ('Chain', 'Decal', 'Head Tube')
GROUP BY i.Shelf
UNION ALL
SELECT  i.Shelf,
        NULL,
        p.Name,
        SUM(i.Quantity) AS Total
FROM    Production.ProductInventory i
        INNER JOIN Production.Product p
            ON i.ProductID = p.ProductID
WHERE   Shelf IN ('A', 'C')
        AND Name IN ('Chain', 'Decal', 'Head Tube')
GROUP BY i.Shelf,
         p.Name;

This query returns the following result set, which has the same results as when using grouping sets (just ordered slightly differently by the database engine):

Shelf      LocationID Name        Total
---------- ---------- ----------- -----
NULL       1          Chain       236
NULL       5          Chain       192
NULL       50         Chain       161
NULL       20         Head Tube   544
A          NULL       NULL        897
C          NULL       NULL        236
A          NULL       Chain       353
C          NULL       Chain       236
A          NULL       Head Tube   544

As you can see, GROUPING SETS allows for quite a bit simpler code for the complex requirements. GROUPING SETS also allows you to use CUBE and ROLLUP as one of the sets.

5-10. Identifying Rows Generated by the GROUP BY Arguments

Problem

You need to differentiate between the rows that actually have stored NULL data and the total or subtotal rows generated by the GROUP BY arguments that have a NULL value generated for that column.

Solution

You need to utilize the GROUPING function in your query.

The following query uses a CASE statement to evaluate whether each row is a total by the ReorderPoint, total by Size, or a regular, noncubed row:

SELECT  CASE WHEN GROUPING(ReorderPoint) = 1 THEN '--GROUP--'
             ELSE CONVERT(VARCHAR(15), ReorderPoint)
        END AS ReorderPointCalc,
        ReorderPoint,
        CASE WHEN GROUPING(Size) = 1 THEN '--GROUP--'
             ELSE CONVERT(VARCHAR(15), Size)
        END AS SizeCalc,
        Size,
        CASE WHEN GROUPING(ReorderPoint) = 0 AND GROUPING(Size) = 1 THEN 'Size Total'
             WHEN GROUPING(ReorderPoint) = 1 AND GROUPING(Size) = 0 THEN 'ReorderPoint Total'
             WHEN GROUPING(ReorderPoint) = 1 AND GROUPING(Size) = 1 THEN 'Grand Total'
             ELSE 'Regular Row'
        END AS RowType,
        SUM(StandardCost) AS Total
FROM    Production.Product
WHERE   ReorderPoint = 3
GROUP BY CUBE(ReorderPoint, Size);

This query returns the following result set:

ReorderPointCalc ReorderPoint SizeCalc        Size  RowType            Total
---------------- ------------ --------------- ----- ------------------ ---------
3                3            NULL            NULL  Regular Row        290.7344
--GROUP--        NULL         NULL            NULL  ReorderPoint Total 290.7344
3                3            70              70    Regular Row        20.5663
--GROUP--        NULL         70              70    ReorderPoint Total 20.5663
3                3            L               L     Regular Row        254.3789
--GROUP--        NULL         L               L     ReorderPoint Total 254.3789
3                3            M               M     Regular Row        254.3789
--GROUP--        NULL         M               M     ReorderPoint Total 254.3789
3                3            S               S     Regular Row        247.6203
--GROUP--        NULL         S               S     ReorderPoint Total 247.6203
3                3            XL              XL    Regular Row        104.8105
--GROUP--        NULL         XL              XL    ReorderPoint Total 104.8105
--GROUP--        NULL         --GROUP--       NULL  Grand Total        1172.4893
3                3            --GROUP--       NULL  Size Total         1172.4893

How It Works

Notice how the rows grouped in the previous recipes have NULL values in the columns that aren’t participating in the aggregate totals. For example, when shelf C is totaled up in the previous recipe, the location and product name columns are NULL:

C  NULL  NULL  236

If the data contains NULL values, then it can become difficult to differentiate the NULL values from the data, and NULL values from the grouping. To address this issue, you can use the GROUPING function. This allows you to differentiate and act upon those rows that are generated automatically for aggregates using CUBE, ROLLUP, and GROUPING SETS. In this example, the SELECT statement evaluates whether the data in the column is NULL due to the grouping; if so, it returns "—GROUP—". The SELECT statement also calculates for each row whether it is a regular row (a row that contains data from the table), or whether it is added to the result set as the result of the grouping. If it is the result of the grouping, it determines which grouping (Size, ReorderPoint, or Grand Total) that the row represents.

Tip For more on CASE, see Chapter 2.

When GROUPING returns a 1 value (true), it means the column NULL is not an actual data value but rather is a result of the aggregate operation, standing in for the value all. So, for example, if the ReorderPoint value is not NULL and the Size is NULL because of the CUBE aggregation process and not the data itself, the string Size Total is returned:

CASE WHEN GROUPING(ReorderPoint) = 0 AND GROUPING(Size) = 1 THEN 'Size Total'

The statement continues with similar logic, only this time if the ReorderPoint value is NULL because of the CUBE aggregation process but the Size is not null, a ReorderPoint total is provided:

WHEN GROUPING(ReorderPoint) = 1 AND GROUPING(Size) = 0 THEN 'ReorderPoint Total'

The last WHEN states that when both ReorderPoint and Size are NULL because of the CUBE aggregation process, then the row contains the grand total for the result set:

WHEN GROUPING(ReorderPoint) = 1 AND GROUPING(Size) = 1 THEN 'Grand Total'

Notice that the first two rows returned in this result set have a value of NULL for the Size. For the row where the ReorderPoint is 3, the NULL is from the actual data. Without the GROUPING function, it would be difficult to determine by looking at the data returned whether the NULL was from the data or from the grouping.

GROUPING returns only a 1 or a 0; however, you also have the option of using GROUPING_ID to compute grouping at a finer grain, as demonstrated in the next recipe.

5-11. Identifying Summary Levels

Problem

You need to identify which columns are being considered in the grouping rows added to the result set; however, using the GROUPING function on the multiple columns being grouped is making the query complex and difficult to understand.

Solution

You need to utilize the GROUPING_ID function in your query.

The following query uses the GROUPING_ID function to return those columns used in the grouping of that particular row:

SELECT  Shelf,
        LocationID,
        Bin,
        CASE GROUPING_ID(Shelf, LocationID, Bin)
          WHEN 1 THEN 'Shelf/Location Total'
          WHEN 2 THEN 'Shelf/Bin Total'
          WHEN 3 THEN 'Shelf Total'
          WHEN 4 THEN 'Location/Bin Total'
          WHEN 5 THEN 'Location Total'
          WHEN 6 THEN 'Bin Total'
          WHEN 7 THEN 'Grand Total'
          ELSE 'Regular Row'
        END AS GroupingType,
        SUM(Quantity) AS Total
FROM    Production.ProductInventory
WHERE   LocationID IN (3)
        AND Bin IN (1, 2)
GROUP BY CUBE(Shelf, LocationID, Bin)
ORDER BY Shelf,
         LocationID,
         Bin;

The result set returned from this query has descriptions of the various aggregations that resulted from using CUBE.

Shelf      LocationID Bin  GroupingType         Total
---------- ---------- ---- -------------------- -----
NULL       NULL       NULL Grand Total          90
NULL       NULL       1    Bin Total            49
NULL       NULL       2    Bin Total            41
NULL       3          NULL Location Total       90
NULL       3          1    Location/Bin Total   49
NULL       3          2    Location/Bin Total   41
A          NULL       NULL Shelf Total          90
A          NULL       1    Shelf/Bin Total      49
A          NULL       2    Shelf/Bin Total      41
A          3          NULL Shelf/Location Total 90
A          3          1    Regular Row          49
A          3          2    Regular Row          41

How It Works

Note This recipe assumes an understanding of the binary/base-2 number system.

Identifying which rows belong to which type of aggregate becomes progressively more difficult for each new column you add to the GROUP BY clause and for each unique data value that can be grouped and aggregated. For example, this query shows the quantity of products in location 3 within bins 1 and 2:

SELECT  Shelf,
        LocationID,
        Bin,
        Quantity
FROM    Production.ProductInventory
WHERE   LocationID IN (3)
AND     Bin IN (1, 2);

This query returns only two rows:

Shelf      LocationID Bin  Quantity
---------- ---------- ---- --------
A          3          2    41
A          3          1    49

Now, what if we needed to report aggregations based on the various combinations of Shelf, Location, and Bin? We could use CUBE to give summaries of all these potential combinations:

SELECT  Shelf,
        LocationID,
        Bin,
        SUM(Quantity) AS Total
FROM    Production.ProductInventory
WHERE   LocationID IN (3)
AND     Bin IN (1, 2)
GROUP BY CUBE(Shelf, LocationID, Bin)
ORDER BY Shelf,
         LocationID,
         Bin;

Although the query returns the various aggregations expected from CUBE, the results are difficult to decipher.

Shelf      LocationID Bin  Total
---------- ---------- ---- -----
NULL       NULL       NULL 90
NULL       NULL       1    49
NULL       NULL       2    41
NULL       3          NULL 90
NULL       3          1    49
NULL       3          2    41
A          NULL       NULL 90
A          NULL       1    49
A          NULL       2    41
A          3          NULL 90
A          3          1    49
A          3          2    41

This is where GROUPING_ID comes in handy. Using this function, we can determine the level of grouping for the row. This function is more complicated than GROUPING, however, because GROUPING_ID takes one or more columns as its input and then returns the integer equivalent of the base-2 (binary) number calculation on the columns.

Stepping through this, the query starts off with the list of the three nonaggregated columns to be returned in the result set:

SELECT i.Shelf,
       i.LocationID,
       i.Bin,

Next, a CASE statement evaluates the return value of GROUPING_ID for the list of the three columns:

CASE GROUPING_ID(i.Shelf, i.LocationID, i.Bin)

Since there are three columns in the GROUP BY CUBE, the various potential aggregations are represented in the following WHENs/THENs:

CASE GROUPING_ID(i.Shelf,i.LocationID, i.Bin)
     WHEN 1 THEN 'Shelf/Location Total'
     WHEN 2 THEN 'Shelf/Bin Total'
     WHEN 3 THEN 'Shelf Total'
     WHEN 4 THEN 'Location/Bin Total'
     WHEN 5 THEN 'Location Total'
     WHEN 6 THEN 'Bin Total'
     WHEN 7 THEN 'Grand Total'
ELSE 'Regular Row'
END,

Each potential combination of aggregations is handled in the CASE statement. The rest of the query involves using an aggregate function on quantity and then using CUBE to find the various aggregation combinations for the shelf, location, and bin:

       SUM(i.Quantity) AS Total
  FROM Production.ProductInventory i
 WHERE i.LocationID IN (3)
   AND i.Bin IN (1, 2)
 GROUP BY CUBE (i.Shelf, i.LocationID, i.Bin)
 ORDER BY i.Shelf, i.LocationID, i.Bin;

To illustrate the concept of a base-2 conversion to an integer, let’s start by including the results of the GROUPING_ID function (for the set of columns defined above) and the individual GROUPING function outputs (for each of the three columns) to the query. The updated query that will be used is:

SELECT  Shelf,
        LocationID,
        Bin,
        CASE GROUPING_ID(Shelf, LocationID, Bin)
          WHEN 1 THEN 'Shelf/Location Total'
          WHEN 2 THEN 'Shelf/Bin Total'
          WHEN 3 THEN 'Shelf Total'
          WHEN 4 THEN 'Location/Bin Total'
          WHEN 5 THEN 'Location Total'
          WHEN 6 THEN 'Bin Total'
          WHEN 7 THEN 'Grand Total'
          ELSE 'Regular Row'
        END AS GroupingType,
        GROUPING_ID(Shelf, LocationID, Bin) AS [G_ID],
        GROUPING(Shelf) AS [G_Shelf],
        GROUPING(LocationID) AS [G_Loc],
        GROUPING(Bin) AS [G_Bin],
        (GROUPING(Shelf)*4) + (GROUPING(LocationID)*2) + GROUPING(Bin) AS [G_Total],
        SUM(Quantity) AS Total
FROM    Production.ProductInventory
WHERE   LocationID IN (3)
        AND Bin IN (1, 2)
GROUP BY CUBE(Shelf, LocationID, Bin)
ORDER BY Shelf,
LocationID,
Bin;

This query returns the following result set:

Shelf LocationID Bin  GroupingType         G_ID G_Shelf G_Loc G_Bin G_Total Total
----- ---------- ---- -------------------- ---- ------- ----- ----- ------- -----
NULL  NULL       NULL Grand Total          7    1       1     1     7       90
NULL  NULL       1    Bin Total            6    1       1     0     6       49
NULL  NULL       2    Bin Total            6    1       1     0     6       41
NULL  3          NULL Location Total       5    1       0     1     5       90
NULL  3          1    Location/Bin Total   4    1       0     0     4       49
NULL  3          2    Location/Bin Total   4    1       0     0     4       41
A     NULL       NULL Shelf Total          3    0       1     1     3       90
A     NULL       1    Shelf/Bin Total      2    0       1     0     2       49
A     NULL       2    Shelf/Bin Total      2    0       1     0     2       41
A     3          NULL Shelf/Location Total 1    0       0     1     1       90
A     3          1    Regular Row          0    0       0     0     0       49
A     3          2    Regular Row          0    0       0     0     0       41

In this query, the GROUPING function is used for each column in the same order as they are called in the GROUPING_ID function, and the results display a bit map for those three columns. Starting from the right-most column, the G_Bin column is taken as is. By their locations in the bit map, the G_Loc column has its value multiplied by two, and the G_Shelf column has its value multiplied by 4. The sum of these numbers is returned in the G_Total column, and you can see that it matches up exactly with the G_ID column.

It is possible to obtain the same information utilizing just the GROUPING function. If we modify the original query to also determine the Grouping Type by the GROUPING function, the query would become:

SELECT  i.Shelf,
        i.LocationID,
        i.Bin,
CASE GROUPING_ID(i.Shelf, i.LocationID, i.Bin)
WHEN 1 THEN 'Shelf/Location Total'
WHEN 2 THEN 'Shelf/Bin Total'
WHEN 3 THEN 'Shelf Total'
WHEN 4 THEN 'Location/Bin Total'
WHEN 5 THEN 'Location Total'
WHEN 6 THEN 'Bin Total'
WHEN 7 THEN 'Grand Total'
ELSE 'Regular Row'
END AS GroupingType,
CASE WHEN GROUPING(Shelf) = 0 AND GROUPING(LocationID) = 0 AND GROUPING(Bin) = 1 THEN 'Shelf/Location Total'
WHEN GROUPING(Shelf) = 0 AND GROUPING(LocationID) = 1 AND GROUPING(Bin) = 0 THEN 'Shelf/Bin Total'
WHEN GROUPING(Shelf) = 0 AND GROUPING(LocationID) = 1 AND GROUPING(Bin) = 1 THEN 'Shelf Total'
WHEN GROUPING(Shelf) = 1 AND GROUPING(LocationID) = 0 AND GROUPING(Bin) = 0 THEN 'Location/Bin Total'
WHEN GROUPING(Shelf) = 1 AND GROUPING(LocationID) = 0 AND GROUPING(Bin) = 1 THEN 'Location Total'
WHEN GROUPING(Shelf) = 1 AND GROUPING(LocationID) = 1 AND GROUPING(Bin) = 0 THEN 'Bin Total'
WHEN GROUPING(Shelf) = 1 AND GROUPING(LocationID) = 1 AND GROUPING(Bin) = 1 THEN 'Grand Total'
ELSE 'Regular Row'
END,
SUM(i.Quantity) AS Total
FROM    Production.ProductInventory i
WHERE   i.LocationID IN (3)
AND i.Bin IN (1, 2)
GROUP BY CUBE(i.Shelf, i.LocationID, i.Bin)
ORDER BY i.Shelf,
         i.LocationID,
         i.Bin;

When run, this query produces the following result set:

Shelf      LocationID Bin  GroupingType         GroupingTypeMod      Total
---------- ---------- ---- -------------------- -------------------- -----
NULL       NULL       NULL Grand Total          Grand Total          90
NULL       NULL       1    Bin Total            Bin Total            49
NULL       NULL       2    Bin Total            Bin Total            41
NULL       3          NULL Location Total       Location Total       90
NULL       3          1    Location/Bin Total   Location/Bin Total   49
NULL       3          2    Location/Bin Total   Location/Bin Total   41
A          NULL       NULL Shelf Total          Shelf Total          90
A          NULL       1    Shelf/Bin Total      Shelf/Bin Total      49
A          NULL       2    Shelf/Bin Total      Shelf/Bin Total      41
A          3          NULL Shelf/Location Total Shelf/Location Total 90
A          3          1    Regular Row          Regular Row          49
A          3          2    Regular Row          Regular Row          41

It can be seen that the same results can be produced utilizing just the GROUPING function. For either function, as you increase the number of columns that the grouping is being performed on, each additional column doubles the number of values being returned (and thus it doubles the number of WHEN expressions needed in the CASE statement). However, if you are utilizing the GROUPING function, then each column that you add also needs to be added into the WHEN expression, quickly making the use of the GROUPING_ID function simpler—with just the three columns being grouped on in this example, the section utilizing the GROUPING_ID function is already simpler to read and understand. Using the GROUPING_ID function is also more efficient; notice that there is just one call to the GROUPING_ID function, while there are twenty-one calls to the GROUPING function. Each call does take some additional CPU time—even if the GROUPING function call is extremely efficient, there will be a hit.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Chapter 5: Aggregations and Grouping

Create new playlist

Sign In

Sign Up

Table of Contents for
Chapter 5: Aggregations and Grouping