Aggregations and Grouping
Aggregate functions are used to perform a calculation on one or more values, and the result is a single value. If your query has any columns with nonwindowed aggregate functions, then a GROUP BY clause is required for the query. Table 5-1 shows the various aggregate functions.
Table 5-1. Aggregate Functions
Function Name |
Description |
---|---|
AVG |
The AVG aggregate function calculates the average of non-NULL values in a group. |
CHECKSUM_AGG |
The CHECKSUM_AGG function returns a checksum value based on a group of rows, allowing you to potentially track changes to a table. For example, adding a new row or changing the value of a column that is being aggregated will usually result in a new checksum integer value. The reason I say “usually” is because there is a possibility that the checksum value does not change even if values are modified. |
COUNT |
The COUNT aggregate function returns an integer data type showing the count of rows in a group, including rows with NULL values. |
COUNT_BIG |
The COUNT_BIG aggregate function returns a bigint data type showing the count of rows in a group, including rows with NULL values. |
GROUPING | |
MAX |
The MAX aggregate function returns the highest value in a set of non-NULL values. |
MIN |
The MIN aggregate function returns the lowest value in a group of non-NULL values. |
STDEV |
The STDEV function returns the standard deviation of all values provided in the expression based on a sample of the data population. |
STDEVP |
The STDEVP function also returns the standard deviation for all values in the provided expression, but does so based upon the entire data population. |
SUM |
The SUM aggregate function returns the summation of all non-NULL values in an expression. |
VAR |
The VAR function returns the statistical variance of values in an expression based upon a sample of the provided population. |
VARP |
The VARP function returns the statistical variance of values in an expression, but does so based upon the entire data population. |
Note Window functions—and using aggregate functions with window functions—are discussed in Chapter 7.
The STDEV, STDEVP, VAR, and VARP are statistical functions. The use of these functions requires knowledge of how statistics works, which is beyond the scope of this book.
With the exception of the COUNT and COUNT_BIG functions, all of the aggregate functions have the same syntax (the syntax and usage of these functions will be discussed in a recipe later on in this chapter).
aggregate_function_name ( { [ [ ALL | DISTINCT ] expression ] } )
where expression is a series of expressions and operations that returns a single value (but does not include aggregate functions or subqueries) that the aggregate function will be calculated over. If the optional keyword DISTINCT is used, then only distinct values will be considered. If the optional keyword ALL is used, then all values will be considered. If neither is specified, then ALL is used by default. Aggregate functions and subqueries are not allowed for the expression parameter.
Frequently when aggregating data, you will want to perform the aggregation based upon a grouping of a set of columns in the query. Grouping is primarily performed in SQL Server by using the GROUP BY clause within a SELECT query to determine in which group the rows should be put. Grouping can also be performed at the column level with the use of window functions. Data is aggregated by using the appropriate aggregation function. The simplified syntax is as follows:
SELECT Column1, <aggregate_function>(Column2)
FROM table_list
[WHERE search_conditions]
GROUP BY Column1
GROUP BY follows the optional WHERE clause and is most often used when aggregate functions are being utilized in the SELECT statement.
5-1. Computing an Aggregation
Problem
You want to perform several aggregations on the ratings of your products.
Solution
Use the appropriate aggregation function to determine each aggregation:
SELECT MIN(Rating) Rating_Min,
MAX(Rating) Rating_Max,
SUM(Rating) Rating_Sum,
AVG(Rating) Rating_Avg
FROM Production.ProductReview;
This query produces the following result set:
Rating_Min Rating_Max Rating_Sum Rating_Avg
----------- ----------- ----------- -----------
2 5 16 4
How It Works
For the non-NULL values in the table in the Rating column, the MIN function calculates the lowest of these values; the MAX function calculates the highest of these values; the SUM function calculates the total of the ratings; and the AVG aggregate function calculates the average of the values. To demonstrate the use of DISTINCT, let’s compare the previous query with the columns returned from the following query:
SELECT AVG(Grade) AS AvgGrade,
AVG(DISTINCT Grade) AS AvgDistinctGrade
FROM (VALUES (1, 100),
(1, 100),
(1, 100),
(1, 100),
(1, 100),
(1, 30)
) dt (StudentId, Grade);
This query produces the following result set:
AvgGrade AvgDistinctGrade
----------- ----------------
88 65
In this example, we have a student with six grades—five perfect grades of 100 and one failing grade of 30; the average of all of the grades is 88. Within these grades are two distinct grades, and the average of these distinct grades is 65.
When utilizing the AVG function, the expression parameter must be one of the numeric data types.
5-2. Creating Aggregations Based upon the Values of the Data
Problem
You want to aggregate one or more columns, with the aggregations applied to a set of columns whenever the data in those columns change. For example, for each order, you want to see the number of line items, as well as the average, minimum, maximum, and total for those line items.
Solution
The SalesOrderID column is included in the query in order to make the results meaningful (e.g. which order is this for?). Each desired aggregation is performed against the LineTotal column. Finally, the query utilizes the GROUP BY clause in order to group the data by the SalesOrderID. See the following:
SELECT TOP (10)
SalesOrderID,
SUM(LineTotal) AS OrderTotal,
MIN(LineTotal) AS MinLine,
MAX(LineTotal) AS MaxLine,
AVG(LineTotal) AS AvgLine,
COUNT(LineTotal) AS CountLine
FROM [Sales].[SalesOrderDetail]
GROUP BY SalesOrderID
ORDER BY SalesOrderID;
This query returns the following result set:
SalesOrderID OrderTotal MinLine MaxLine AvgLine CountLine
------------ ------------ ---------- ----------- ----------- ---------
43659 20565.620600 10.373000 6074.982000 1713.801716 12
43660 1294.252900 419.458900 874.794000 647.126450 2
43661 32726.478600 20.746000 8099.976000 2181.765240 15
43662 28832.528900 178.580800 5248.764000 1310.569495 22
43663 419.458900 419.458900 419.458900 419.458900 1
43664 24432.608800 28.840400 8099.976000 3054.076100 8
43665 14352.771300 10.373000 4049.988000 1435.277130 10
43666 5056.489600 356.898000 2146.962000 842.748266 6
43667 6107.082000 17.100000 2039.994000 1526.770500 4
43668 35944.156200 20.186500 5248.764000 1239.453662 29
How It Works
To determine the groups that rows should be put in, the GROUP BY clause is used in a SELECT query. When grouping a result set, the GROUP BY clause can specify multiple columns, and all columns listed in the SELECT clause must be either used in an aggregate function or referenced in the GROUP BY clause. If the query returns both aggregated and non-aggregated columns, and the GROUP BY clause doesn’t specify all of the non-aggregated columns, then the following error will be raised (this error was generated by removing the GROUP BY clause from the example in this recipe):
Msg 8120, Level 16, State 1, Line 9
Column 'Sales.SalesOrderDetail.SalesOrderID' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
This error is raised because any column being returned by the query that is not used in an aggregate function in the SELECT list must be listed in the GROUP BY clause for the query.
5-3. Counting the Rows in a Group
Problem
You want to see the number of rows for each value of a column—for instance, the number of products you have in inventory on each shelf for your first five shelves.
Solution
Utilize the COUNT or COUNT_BIG function to return the count of rows in a group:
SELECT TOP (5)
Shelf,
COUNT(ProductID) AS ProductCount,
COUNT_BIG(ProductID) AS ProductCountBig
FROM Production.ProductInventory
GROUP BY Shelf
ORDER BY Shelf;
This query returns the following result set:
Shelf ProductCount ProductCountBig
---------- ------------ ---------------
A 81 81
B 36 36
C 55 55
D 50 50
E 85 85
The results of this query show each shelf and the number of products on each of those shelves.
How It Works
The COUNT and COUNT_BIG functions are utilized to return a count of the number of items in a group. The difference between these functions is the data type returned: COUNT returns an INTEGER, while COUNT_BIG returns a BIGINT. You should utilize COUNT_BIG if you will be counting more rows than the INTEGER data type supports (231−1). Throughout this book, the COUNT function will be used. The syntax for these functions is as follows:
COUNT | COUNT_BIG ( { [ [ ALL | DISTINCT ] expression ] | * } )
The difference between this syntax and the other aggregate functions is the optional asterisk (*) that can be specified. The use of COUNT(*)specifies that all rows should be counted so as to return the total number of rows within a table; conversely, if COUNT(<nullable column>) is used, then rows where that column is NULL will not be counted. COUNT(*) does not use any parameters, so it does not use any information about any column.
When utilizing the COUNT or COUNT_BIG functions, the expression parameter can be of any data type except for the text, image, or ntext data types. For instance, for the following table variable:
DECLARE @test TABLE (col1 TEXT);
this query:
SELECT COUNT(col1) FROM @test;
will return the following error:
Msg 8117, Level 16, State 1, Line 4
Operand data type text is invalid for count operator.
However, you can utilize COUNT(*) instead:
SELECT COUNT(*) FROM @test;
which returns this result set:
--
0
If you are using the COUNT function and you exceed the capacity for an integer, then the following error will be generated:
Msg 8115, Level 16, State 2, Line 1
Arithmetic overflow error converting expression to data type int.
5-4. Detecting Changes in a Table
Problem
You need to determine whether any changes have been made to the data in a column.
Solution
Utilize the CHECKSUM_AGG function to detect changes in a table. For example:
IF OBJECT_ID('tempdb.dbo.[#Recipe5.4]') IS NOT NULL DROP TABLE [#Recipe5.4];
CREATE TABLE [#Recipe5.4]
(
StudentID INTEGER,
Grade INTEGER
);
INSERT INTO [#Recipe5.4] (StudentID, Grade)
VALUES (1, 100),
(1, 95)
SELECT StudentID, CHECKSUM_AGG(Grade) AS GradeChecksumAgg
FROM [#Recipe5.4]
GROUP BY StudentID;
UPDATE [#Recipe5.4]
SET Grade = 99
WHERE Grade = 95;
SELECT StudentID, CHECKSUM_AGG(Grade) AS GradeChecksumAgg
FROM [#Recipe5.4]
GROUP BY StudentID;
These queries return the following result sets:
StudentID GradeChecksumAgg
----------- ----------------
1 59
StudentID GradeChecksumAgg
----------- ----------------
1 7
How It Works
The CHECKSUM_AGG function returns the checksum of the values in the group, in this case the values from the Grade column. In the second query, the last grade is changed, and when the first query is rerun, the aggregated checksum returns a different value.
When utilizing the CHECKSUM_AGG function, the expression parameter must be of an integer data type.
Note Because of the hashing algorithm being used, it is possible for the CHECKSUM_AGG function to return the same value with different data. You should use this function only if your application can tolerate occasionally missing a change.
5-5. Restricting a Result Set to Groups of Interest
Problem
You do not want to return all of the rows that could be returned by an aggregation; instead, you want only the rows where the aggregation itself is filtered. For example, you want to report on the reasons that the product was scrapped, but only for the reasons that have more than 50 occurrences.
Solution
Specify a HAVING clause, giving the conditions that the aggregated rows must meet in order to be returned.
This example queries two tables: Production.ScrapReason and Production.WorkOrder. The Production.ScrapReason table is a lookup table that contains manufacturing failure reasons, and the Production.WorkOrder table contains the manufacturing work orders that control which products are manufactured in the quantity and time period required in order to meet inventory and sales needs. A report is needed that shows which of the “failure reasons” have occurred more than 50 times, which can be achieved by the following code:
SELECT s.Name,
COUNT(w.WorkOrderID) AS Cnt
FROM Production.ScrapReason s
INNER JOIN Production.WorkOrder w
ON s.ScrapReasonID = w.ScrapReasonID
GROUP BY s.Name
HAVING COUNT(*) > 50;
This query returns the following result set:
Name Cnt
--------------------------------- ---
Gouge in metal 54
Stress test failed 52
Thermoform temperature too low 63
Trim length too long 52
Wheel misaligned 51
How It Works
The HAVING clause of the SELECT statement allows you to specify a search condition on a query that uses GROUP BY and/or an aggregated value. The syntax is as follows:
SELECT select_list
FROM table_list
[ WHERE search_conditions ]
[ GROUP BY group_by_list ]
[ HAVING search_conditions ]
The HAVING clause is used to qualify the results after the GROUP BY has been applied. The WHERE clause, in contrast, is used to qualify the rows that are returned from the tables specified in the FROM clause before the data is aggregated or grouped. HAVING qualifies the aggregated data after the data has been grouped.
In this recipe, the SELECT clause requests a count of WorkOrderIDs by failure name:
SELECT s.Name,
COUNT(w.WorkOrderID) AS Cnt
Two tables are joined by the ScrapReasonID column:
FROM Production.ScrapReason s
INNER JOIN Production.WorkOrder w
ON s.ScrapReasonID = w.ScrapReasonID
Because an aggregate function is used in the SELECT clause, the nonaggregated columns must appear in the GROUP BY clause:
GROUP BY s.Name
Lastly, using the HAVING query ensures that, of the selected and grouped data, only those rows in the result set with a count of more than 50 will be returned:
HAVING COUNT(*)>50
5-6. Performing Aggregations against Unique Values Only
Problem
You need to know the quantity of unique values per date.
Solution
Add the DISTINCT clause to the COUNT function:
SELECT [RateChangeDate],
COUNT([Rate]) AS [Count],
COUNT(DISTINCT Rate) AS [DistinctCount]
FROM [HumanResources].[EmployeePayHistory]
WHERE RateChangeDate >= '2008-12-01'
AND RateChangeDate < '2008-12-10'
GROUP BY RateChangeDate;
This query returns the following result set:
RateChangeDate Count DistinctCount
----------------------- ----------- -------------
2008-12-01 00:00:00.000 2 2
2008-12-02 00:00:00.000 3 2
2008-12-03 00:00:00.000 1 1
2008-12-04 00:00:00.000 3 3
2008-12-05 00:00:00.000 1 1
2008-12-06 00:00:00.000 2 2
2008-12-07 00:00:00.000 5 3
2008-12-08 00:00:00.000 2 2
2008-12-09 00:00:00.000 3 3
How It Works
This query utilizes two COUNT functions; the second one also uses the DISTINCT clause. This forces the COUNT function to count only the distinct values in the specified column, in this case the Rate column. To further understand the difference, let’s examine the data for a date where the two functions are returning different values—2008-12-02:
SELECT RateChangeDate, Rate
FROM HumanResources.EmployeePayHistory
WHERE RateChangeDate = '2008-12-02';
This query returns the following three rows:
RateChangeDate Rate
----------------------- -----
2008-12-02 00:00:00.000 6.50
2008-12-02 00:00:00.000 10.00
2008-12-02 00:00:00.000 10.00
When looking at the data, you can see that for this date there are three rows; however, there are two distinct values in the Rate column. Therefore, the COUNT function returned 3, while COUNT(DISTINCT) returned 2.
5-7. Creating Hierarchical Summaries
Problem
You need to return a data set with the detail data as well as with subtotal rows and a grand total row based upon the GROUP BY clause.
Solution
You need to include the ROLLUP argument after the GROUP BY clause. This example uses the ROLLUP argument to produce subtotal lines at the Shelf level, as well as a grand total line:
SELECT i.Shelf,
p.Name,
SUM(i.Quantity) AS Total
FROM Production.ProductInventory i
INNER JOIN Production.Product p
ON i.ProductID = p.ProductID
WHERE i.Shelf IN ('A','B')
AND p.Name LIKE 'Metal%'
GROUP BY ROLLUP(i.Shelf, p.Name);
This query returns the following result set:
Shelf Name Total
---------- ----------- -----
A Metal Angle 404
A Metal Bar 1 353
A Metal Bar 2 622
A NULL 1379
B Metal Angle 355
B Metal Bar 1 403
B Metal Bar 2 497
B NULL 1255
NULL NULL 2634
How It Works
The order in which you place the columns in the GROUP BY ROLLUP clause affects how data is aggregated. ROLLUP in this query aggregates the total quantity for each change in Shelf. Notice the row with shelf A and the NULL name; this holds the total quantity for shelf A. Also notice that the final row is the grand total of all product quantities. Whereas CUBE creates a result set that aggregates all combinations for the selected columns, ROLLUP generates the aggregates for a hierarchy of values.
GROUP BY ROLLUP (i.Shelf, p.Name)
ROLLUP aggregated both a grand total and totals by shelf. Totals were not generated for the product name, but would have been had CUBE been designated instead.
ROLLUP uses a slightly different syntax than previous versions of SQL Server used. ROLLUP comes directly after the GROUP BY clause, instead of trailing the GROUP BY clause with a WITH ROLLUP clause. Notice also that the column lists are contained within parentheses.
Note The GROUP BY WITH ROLLUP feature does not follow the ISO standard, and it will be removed in a future version of Microsoft SQL Server. You should avoid using this feature in any new development work, and you should modify any applications that currently use this feature to use the ROLLUP argument instead.
In this example, I’ve shown how the ROLLUP clause works across a set of columns. Let’s examine another scenario now: we want to include the LocationID column in the output, and then we want to perform the ROLLUP at just the name level. This would be performed with the following query:
SELECT i.Shelf,
i.LocationID,
p.Name,
SUM(i.Quantity) AS Total
FROM Production.ProductInventory i
INNER JOIN Production.Product p
ON i.ProductID = p.ProductID
WHERE i.Shelf IN ('A','B')
AND p.Name LIKE 'Metal%'
GROUP BY i.Shelf, i.LocationID, ROLLUP(i.Shelf, p.Name)
ORDER BY i.Shelf, i.LocationID;
This is accomplished by grouping by Shelf and LocationID and then applying ROLLUP.
5-8. Creating Summary Totals and Subtotals
Problem
You need to return a data set with the detail data as well as with the data summarized on each combination of columns specified in the GROUP BY clause.
Solution
You need to include the CUBE argument after the GROUP BY clause. This example uses the CUBE argument to produce subtotal lines at both the Shelf and LocationID levels, as well as a grand total line:
SELECT Shelf,
LocationID,
SUM(Quantity) AS Total
FROM Production.ProductInventory
WHERE Shelf IN ('A','B')
AND LocationID IN (10, 20)
GROUP BY CUBE(Shelf, LocationID);
This query produces several levels of totals, the first being by LocationID.
Shelf LocationID Total
---------- ---------- -----
A 10 1379
B 10 1648
NULL 10 3027
A 20 1680
B 20 355
NULL 20 2035
NULL NULL 5062
A NULL 3059
B NULL 2003
How It Works
By using the CUBE argument, the query groups by the specified columns, and it creates additional rows that provide totals for each combination of the columns specified in the GROUP BY clause. The rows with NULL values indicate a totaling at either the subtotal or total level. When all of the columns specified in the GROUP BY CUBE are NULL, then this row is the total row. Rows with one or more, but not all, of the columns specified in the GROUP BY CUBE set to NULL are subtotals at the level of the non-null columns. See recipes 5-12 and 5-13 for differenting these groups from data when the columns specified in the GROUP BY CUBE clause contain legitimate NULL values.
CUBE uses a slightly different syntax than in previous versions of SQL Server: CUBE comes after the GROUP BY clause instead of trailing the GROUP BY clause with a WITH CUBE. Notice also that the column lists are contained within parentheses.
Note The GROUP BY WITH CUBE feature does not follow the ISO standard, and it will be removed in a future version of Microsoft SQL Server. You should avoid using this feature in any new development work, and you should modify any applications that currently use this feature to use the CUBE argument instead.
As with the ROLLUP feature, CUBE allows to you first group by columns, then a cube. For example:
SELECT Shelf,
LocationID,
SUM(i.Quantity) AS Total
FROM Production.ProductInventory
WHERE Shelf in ('A','B')
AND LocationID in (10,20)
GROUP BY shelf, CUBE(Shelf, LocationID);
5-9. Creating Custom Summaries
Problem
You need to have one result set with multiple custom aggregations.
Solution
You must include the GROUPING SETS argument after the GROUP BY clause and also include each of the custom aggregations that you want performed.
SQL Server gives you the ability to define your own grouping sets within a single query result set without having to resort to multiple UNION ALL queries. GROUPING SETS also provides you with more control over what is aggregated, as compared to the previously demonstrated CUBE and ROLLUP operations. This is performed by using the GROUPING SETS operator.
First, let’s define a business requirement for a query, which is to have a single result set that contains three different aggregate quantity summaries. Specifically, I would like to see quantity totals by shelf, quantity totals by both shelf and product name, and then also quantity totals by location and name.
We’ll use the GROUPING SETS operator to define the various aggregations we would like to have returned in a single result set:
SELECT i.Shelf,
i.LocationID,
p.Name,
SUM(i.Quantity) AS Total
FROM Production.ProductInventory i
INNER JOIN Production.Product p
ON i.ProductID = p.ProductID
WHERE Shelf IN ('A', 'C')
AND Name IN ('Chain', 'Decal', 'Head Tube')
GROUP BY GROUPING SETS((i.Shelf), (i.Shelf, p.Name), (i.LocationID, p.Name));
This returns the following results:
Shelf LocationID Name Total
---------- ---------- ---------- -----
NULL 1 Chain 236
NULL 5 Chain 192
NULL 50 Chain 161
NULL 20 Head Tube 544
A NULL Chain 353
A NULL Head Tube 544
A NULL NULL 897
C NULL Chain 236
C NULL NULL 236
How It Works
The new GROUPING SETS operator allows you to define varying aggregate groups in a single query while avoiding having multiple queries attached together using the UNION ALL operator. The core of this recipe’s example is the following line of code:
GROUP BY GROUPING SETS ((i.Shelf), (i.Shelf, p.Name), (i.LocationID, p.Name))
Notice that, unlike a regular aggregated query, the GROUP BY clause is not followed by a list of columns. Instead, it is followed by GROUPING SETS. GROUPING SETS is then followed by parentheses and the groupings of column names, each of which is also encapsulated in parentheses.
To achieve this in previous versions of SQL Server, you would have needed to use the UNION ALL operator with multiple queries, as follows:
SELECT NULL AS Shelf,
i.LocationID,
p.Name,
SUM(i.Quantity) AS Total
FROM Production.ProductInventory i
INNER JOIN Production.Product p
ON i.ProductID = p.ProductID
WHERE Shelf IN ('A', 'C')
AND Name IN ('Chain', 'Decal', 'Head Tube')
GROUP BY i.LocationID,
p.Name
UNION ALL
SELECT i.Shelf,
NULL,
NULL,
SUM(i.Quantity) AS Total
FROM Production.ProductInventory i
INNER JOIN Production.Product p
ON i.ProductID = p.ProductID
WHERE Shelf IN ('A', 'C')
AND Name IN ('Chain', 'Decal', 'Head Tube')
GROUP BY i.Shelf
UNION ALL
SELECT i.Shelf,
NULL,
p.Name,
SUM(i.Quantity) AS Total
FROM Production.ProductInventory i
INNER JOIN Production.Product p
ON i.ProductID = p.ProductID
WHERE Shelf IN ('A', 'C')
AND Name IN ('Chain', 'Decal', 'Head Tube')
GROUP BY i.Shelf,
p.Name;
This query returns the following result set, which has the same results as when using grouping sets (just ordered slightly differently by the database engine):
Shelf LocationID Name Total
---------- ---------- ----------- -----
NULL 1 Chain 236
NULL 5 Chain 192
NULL 50 Chain 161
NULL 20 Head Tube 544
A NULL NULL 897
C NULL NULL 236
A NULL Chain 353
C NULL Chain 236
A NULL Head Tube 544
As you can see, GROUPING SETS allows for quite a bit simpler code for the complex requirements. GROUPING SETS also allows you to use CUBE and ROLLUP as one of the sets.
5-10. Identifying Rows Generated by the GROUP BY Arguments
Problem
You need to differentiate between the rows that actually have stored NULL data and the total or subtotal rows generated by the GROUP BY arguments that have a NULL value generated for that column.
Solution
You need to utilize the GROUPING function in your query.
The following query uses a CASE statement to evaluate whether each row is a total by the ReorderPoint, total by Size, or a regular, noncubed row:
SELECT CASE WHEN GROUPING(ReorderPoint) = 1 THEN '--GROUP--'
ELSE CONVERT(VARCHAR(15), ReorderPoint)
END AS ReorderPointCalc,
ReorderPoint,
CASE WHEN GROUPING(Size) = 1 THEN '--GROUP--'
ELSE CONVERT(VARCHAR(15), Size)
END AS SizeCalc,
Size,
CASE WHEN GROUPING(ReorderPoint) = 0 AND GROUPING(Size) = 1 THEN 'Size Total'
WHEN GROUPING(ReorderPoint) = 1 AND GROUPING(Size) = 0 THEN 'ReorderPoint Total'
WHEN GROUPING(ReorderPoint) = 1 AND GROUPING(Size) = 1 THEN 'Grand Total'
ELSE 'Regular Row'
END AS RowType,
SUM(StandardCost) AS Total
FROM Production.Product
WHERE ReorderPoint = 3
GROUP BY CUBE(ReorderPoint, Size);
This query returns the following result set:
ReorderPointCalc ReorderPoint SizeCalc Size RowType Total
---------------- ------------ --------------- ----- ------------------ ---------
3 3 NULL NULL Regular Row 290.7344
--GROUP-- NULL NULL NULL ReorderPoint Total 290.7344
3 3 70 70 Regular Row 20.5663
--GROUP-- NULL 70 70 ReorderPoint Total 20.5663
3 3 L L Regular Row 254.3789
--GROUP-- NULL L L ReorderPoint Total 254.3789
3 3 M M Regular Row 254.3789
--GROUP-- NULL M M ReorderPoint Total 254.3789
3 3 S S Regular Row 247.6203
--GROUP-- NULL S S ReorderPoint Total 247.6203
3 3 XL XL Regular Row 104.8105
--GROUP-- NULL XL XL ReorderPoint Total 104.8105
--GROUP-- NULL --GROUP-- NULL Grand Total 1172.4893
3 3 --GROUP-- NULL Size Total 1172.4893
How It Works
Notice how the rows grouped in the previous recipes have NULL values in the columns that aren’t participating in the aggregate totals. For example, when shelf C is totaled up in the previous recipe, the location and product name columns are NULL:
C NULL NULL 236
If the data contains NULL values, then it can become difficult to differentiate the NULL values from the data, and NULL values from the grouping. To address this issue, you can use the GROUPING function. This allows you to differentiate and act upon those rows that are generated automatically for aggregates using CUBE, ROLLUP, and GROUPING SETS. In this example, the SELECT statement evaluates whether the data in the column is NULL due to the grouping; if so, it returns "—GROUP—". The SELECT statement also calculates for each row whether it is a regular row (a row that contains data from the table), or whether it is added to the result set as the result of the grouping. If it is the result of the grouping, it determines which grouping (Size, ReorderPoint, or Grand Total) that the row represents.
Tip For more on CASE, see Chapter 2.
When GROUPING returns a 1 value (true), it means the column NULL is not an actual data value but rather is a result of the aggregate operation, standing in for the value all. So, for example, if the ReorderPoint value is not NULL and the Size is NULL because of the CUBE aggregation process and not the data itself, the string Size Total is returned:
CASE WHEN GROUPING(ReorderPoint) = 0 AND GROUPING(Size) = 1 THEN 'Size Total'
The statement continues with similar logic, only this time if the ReorderPoint value is NULL because of the CUBE aggregation process but the Size is not null, a ReorderPoint total is provided:
WHEN GROUPING(ReorderPoint) = 1 AND GROUPING(Size) = 0 THEN 'ReorderPoint Total'
The last WHEN states that when both ReorderPoint and Size are NULL because of the CUBE aggregation process, then the row contains the grand total for the result set:
WHEN GROUPING(ReorderPoint) = 1 AND GROUPING(Size) = 1 THEN 'Grand Total'
Notice that the first two rows returned in this result set have a value of NULL for the Size. For the row where the ReorderPoint is 3, the NULL is from the actual data. Without the GROUPING function, it would be difficult to determine by looking at the data returned whether the NULL was from the data or from the grouping.
GROUPING returns only a 1 or a 0; however, you also have the option of using GROUPING_ID to compute grouping at a finer grain, as demonstrated in the next recipe.
5-11. Identifying Summary Levels
Problem
You need to identify which columns are being considered in the grouping rows added to the result set; however, using the GROUPING function on the multiple columns being grouped is making the query complex and difficult to understand.
Solution
You need to utilize the GROUPING_ID function in your query.
The following query uses the GROUPING_ID function to return those columns used in the grouping of that particular row:
SELECT Shelf,
LocationID,
Bin,
CASE GROUPING_ID(Shelf, LocationID, Bin)
WHEN 1 THEN 'Shelf/Location Total'
WHEN 2 THEN 'Shelf/Bin Total'
WHEN 3 THEN 'Shelf Total'
WHEN 4 THEN 'Location/Bin Total'
WHEN 5 THEN 'Location Total'
WHEN 6 THEN 'Bin Total'
WHEN 7 THEN 'Grand Total'
ELSE 'Regular Row'
END AS GroupingType,
SUM(Quantity) AS Total
FROM Production.ProductInventory
WHERE LocationID IN (3)
AND Bin IN (1, 2)
GROUP BY CUBE(Shelf, LocationID, Bin)
ORDER BY Shelf,
LocationID,
Bin;
The result set returned from this query has descriptions of the various aggregations that resulted from using CUBE.
Shelf LocationID Bin GroupingType Total
---------- ---------- ---- -------------------- -----
NULL NULL NULL Grand Total 90
NULL NULL 1 Bin Total 49
NULL NULL 2 Bin Total 41
NULL 3 NULL Location Total 90
NULL 3 1 Location/Bin Total 49
NULL 3 2 Location/Bin Total 41
A NULL NULL Shelf Total 90
A NULL 1 Shelf/Bin Total 49
A NULL 2 Shelf/Bin Total 41
A 3 NULL Shelf/Location Total 90
A 3 1 Regular Row 49
A 3 2 Regular Row 41
How It Works
Note This recipe assumes an understanding of the binary/base-2 number system.
Identifying which rows belong to which type of aggregate becomes progressively more difficult for each new column you add to the GROUP BY clause and for each unique data value that can be grouped and aggregated. For example, this query shows the quantity of products in location 3 within bins 1 and 2:
SELECT Shelf,
LocationID,
Bin,
Quantity
FROM Production.ProductInventory
WHERE LocationID IN (3)
AND Bin IN (1, 2);
This query returns only two rows:
Shelf LocationID Bin Quantity
---------- ---------- ---- --------
A 3 2 41
A 3 1 49
Now, what if we needed to report aggregations based on the various combinations of Shelf, Location, and Bin? We could use CUBE to give summaries of all these potential combinations:
SELECT Shelf,
LocationID,
Bin,
SUM(Quantity) AS Total
FROM Production.ProductInventory
WHERE LocationID IN (3)
AND Bin IN (1, 2)
GROUP BY CUBE(Shelf, LocationID, Bin)
ORDER BY Shelf,
LocationID,
Bin;
Although the query returns the various aggregations expected from CUBE, the results are difficult to decipher.
Shelf LocationID Bin Total
---------- ---------- ---- -----
NULL NULL NULL 90
NULL NULL 1 49
NULL NULL 2 41
NULL 3 NULL 90
NULL 3 1 49
NULL 3 2 41
A NULL NULL 90
A NULL 1 49
A NULL 2 41
A 3 NULL 90
A 3 1 49
A 3 2 41
This is where GROUPING_ID comes in handy. Using this function, we can determine the level of grouping for the row. This function is more complicated than GROUPING, however, because GROUPING_ID takes one or more columns as its input and then returns the integer equivalent of the base-2 (binary) number calculation on the columns.
Stepping through this, the query starts off with the list of the three nonaggregated columns to be returned in the result set:
SELECT i.Shelf,
i.LocationID,
i.Bin,
Next, a CASE statement evaluates the return value of GROUPING_ID for the list of the three columns:
CASE GROUPING_ID(i.Shelf, i.LocationID, i.Bin)
Since there are three columns in the GROUP BY CUBE, the various potential aggregations are represented in the following WHENs/THENs:
CASE GROUPING_ID(i.Shelf,i.LocationID, i.Bin)
WHEN 1 THEN 'Shelf/Location Total'
WHEN 2 THEN 'Shelf/Bin Total'
WHEN 3 THEN 'Shelf Total'
WHEN 4 THEN 'Location/Bin Total'
WHEN 5 THEN 'Location Total'
WHEN 6 THEN 'Bin Total'
WHEN 7 THEN 'Grand Total'
ELSE 'Regular Row'
END,
Each potential combination of aggregations is handled in the CASE statement. The rest of the query involves using an aggregate function on quantity and then using CUBE to find the various aggregation combinations for the shelf, location, and bin:
SUM(i.Quantity) AS Total
FROM Production.ProductInventory i
WHERE i.LocationID IN (3)
AND i.Bin IN (1, 2)
GROUP BY CUBE (i.Shelf, i.LocationID, i.Bin)
ORDER BY i.Shelf, i.LocationID, i.Bin;
To illustrate the concept of a base-2 conversion to an integer, let’s start by including the results of the GROUPING_ID function (for the set of columns defined above) and the individual GROUPING function outputs (for each of the three columns) to the query. The updated query that will be used is:
SELECT Shelf,
LocationID,
Bin,
CASE GROUPING_ID(Shelf, LocationID, Bin)
WHEN 1 THEN 'Shelf/Location Total'
WHEN 2 THEN 'Shelf/Bin Total'
WHEN 3 THEN 'Shelf Total'
WHEN 4 THEN 'Location/Bin Total'
WHEN 5 THEN 'Location Total'
WHEN 6 THEN 'Bin Total'
WHEN 7 THEN 'Grand Total'
ELSE 'Regular Row'
END AS GroupingType,
GROUPING_ID(Shelf, LocationID, Bin) AS [G_ID],
GROUPING(Shelf) AS [G_Shelf],
GROUPING(LocationID) AS [G_Loc],
GROUPING(Bin) AS [G_Bin],
(GROUPING(Shelf)*4) + (GROUPING(LocationID)*2) + GROUPING(Bin) AS [G_Total],
SUM(Quantity) AS Total
FROM Production.ProductInventory
WHERE LocationID IN (3)
AND Bin IN (1, 2)
GROUP BY CUBE(Shelf, LocationID, Bin)
ORDER BY Shelf,
LocationID,
Bin;
This query returns the following result set:
Shelf LocationID Bin GroupingType G_ID G_Shelf G_Loc G_Bin G_Total Total
----- ---------- ---- -------------------- ---- ------- ----- ----- ------- -----
NULL NULL NULL Grand Total 7 1 1 1 7 90
NULL NULL 1 Bin Total 6 1 1 0 6 49
NULL NULL 2 Bin Total 6 1 1 0 6 41
NULL 3 NULL Location Total 5 1 0 1 5 90
NULL 3 1 Location/Bin Total 4 1 0 0 4 49
NULL 3 2 Location/Bin Total 4 1 0 0 4 41
A NULL NULL Shelf Total 3 0 1 1 3 90
A NULL 1 Shelf/Bin Total 2 0 1 0 2 49
A NULL 2 Shelf/Bin Total 2 0 1 0 2 41
A 3 NULL Shelf/Location Total 1 0 0 1 1 90
A 3 1 Regular Row 0 0 0 0 0 49
A 3 2 Regular Row 0 0 0 0 0 41
In this query, the GROUPING function is used for each column in the same order as they are called in the GROUPING_ID function, and the results display a bit map for those three columns. Starting from the right-most column, the G_Bin column is taken as is. By their locations in the bit map, the G_Loc column has its value multiplied by two, and the G_Shelf column has its value multiplied by 4. The sum of these numbers is returned in the G_Total column, and you can see that it matches up exactly with the G_ID column.
It is possible to obtain the same information utilizing just the GROUPING function. If we modify the original query to also determine the Grouping Type by the GROUPING function, the query would become:
SELECT i.Shelf,
i.LocationID,
i.Bin,
CASE GROUPING_ID(i.Shelf, i.LocationID, i.Bin)
WHEN 1 THEN 'Shelf/Location Total'
WHEN 2 THEN 'Shelf/Bin Total'
WHEN 3 THEN 'Shelf Total'
WHEN 4 THEN 'Location/Bin Total'
WHEN 5 THEN 'Location Total'
WHEN 6 THEN 'Bin Total'
WHEN 7 THEN 'Grand Total'
ELSE 'Regular Row'
END AS GroupingType,
CASE WHEN GROUPING(Shelf) = 0 AND GROUPING(LocationID) = 0 AND GROUPING(Bin) = 1 THEN 'Shelf/Location Total'
WHEN GROUPING(Shelf) = 0 AND GROUPING(LocationID) = 1 AND GROUPING(Bin) = 0 THEN 'Shelf/Bin Total'
WHEN GROUPING(Shelf) = 0 AND GROUPING(LocationID) = 1 AND GROUPING(Bin) = 1 THEN 'Shelf Total'
WHEN GROUPING(Shelf) = 1 AND GROUPING(LocationID) = 0 AND GROUPING(Bin) = 0 THEN 'Location/Bin Total'
WHEN GROUPING(Shelf) = 1 AND GROUPING(LocationID) = 0 AND GROUPING(Bin) = 1 THEN 'Location Total'
WHEN GROUPING(Shelf) = 1 AND GROUPING(LocationID) = 1 AND GROUPING(Bin) = 0 THEN 'Bin Total'
WHEN GROUPING(Shelf) = 1 AND GROUPING(LocationID) = 1 AND GROUPING(Bin) = 1 THEN 'Grand Total'
ELSE 'Regular Row'
END,
SUM(i.Quantity) AS Total
FROM Production.ProductInventory i
WHERE i.LocationID IN (3)
AND i.Bin IN (1, 2)
GROUP BY CUBE(i.Shelf, i.LocationID, i.Bin)
ORDER BY i.Shelf,
i.LocationID,
i.Bin;
When run, this query produces the following result set:
Shelf LocationID Bin GroupingType GroupingTypeMod Total
---------- ---------- ---- -------------------- -------------------- -----
NULL NULL NULL Grand Total Grand Total 90
NULL NULL 1 Bin Total Bin Total 49
NULL NULL 2 Bin Total Bin Total 41
NULL 3 NULL Location Total Location Total 90
NULL 3 1 Location/Bin Total Location/Bin Total 49
NULL 3 2 Location/Bin Total Location/Bin Total 41
A NULL NULL Shelf Total Shelf Total 90
A NULL 1 Shelf/Bin Total Shelf/Bin Total 49
A NULL 2 Shelf/Bin Total Shelf/Bin Total 41
A 3 NULL Shelf/Location Total Shelf/Location Total 90
A 3 1 Regular Row Regular Row 49
A 3 2 Regular Row Regular Row 41
It can be seen that the same results can be produced utilizing just the GROUPING function. For either function, as you increase the number of columns that the grouping is being performed on, each additional column doubles the number of values being returned (and thus it doubles the number of WHEN expressions needed in the CASE statement). However, if you are utilizing the GROUPING function, then each column that you add also needs to be added into the WHEN expression, quickly making the use of the GROUPING_ID function simpler—with just the three columns being grouped on in this example, the section utilizing the GROUPING_ID function is already simpler to read and understand. Using the GROUPING_ID function is also more efficient; notice that there is just one call to the GROUPING_ID function, while there are twenty-one calls to the GROUPING function. Each call does take some additional CPU time—even if the GROUPING function call is extremely efficient, there will be a hit.
3.22.209.120