Advanced Grouping
The GROUP BY clause is a venerable member of the SQL statement family. After learning basic SELECT statements, it is one of the first specialized parts of SQL that many practitioners cut their teeth on, learning to create aggregations from raw data and transform that data into useful information.
At times in this chapter, the examples may seem trivial. They are constructed to demonstrate the results of different facets of the GROUP BY clause without forcing you to focus needlessly on the values in the output. Although there are many excellent examples based on financial data throughout the Oracle documentation, those examples are sometimes difficult to follow because too much attention goes to the output values rather than to how they were obtained. The goal in this chapter is to help you see the tremendous possibilities grouping techniques offer for summarizing data easily and effectively.
Basic GROUP BY Usage
If you needed to know the number of employees in each department of your company, you might use SQL such as that in Listing 7-1, which produces one row of output for each row in the DEPT table, along with a count of the employees in each department. The output includes the OPERATIONS department, which does not have any employees. This row would not appear in the output from a standard JOIN, so a LEFT OUTER JOIN was used to include rows from the DEPT table that do not have any matching rows in the EMP table.
Listing 7-1. Basic GROUP BY
SQL> select d.dname, count(empno) empcount
2 from scott.dept d
3 left outer join scott.emp e on d.deptno = e.deptno
4 group by d.dname
5 order by d.dname;
DNAME EMPCOUNT
-------------- ----------
ACCOUNTING 3
OPERATIONS 0
RESEARCH 5
SALES 6
The columns used in the GROUP BY must match the set of columns in the SELECT list on which no aggregation functions are used. In Listing 7-1, for example, there are two columns in the SELECT list, dname and empno. The COUNT() function performs aggregation on the EMPNO column so that the total number of employees in each department can be determined. The only other column in the SELECT list, dname, must then be included in the GROUP BY clause. Failure to include the correct columns results in an error condition, as seen in Listing 7-2.
Listing 7-2. GROUP BY Columns Requirement
SQL> select d.dname, d.loc, count(empno) empcount
2 from scott.emp e
3 join scott.dept d on d.deptno = e.deptno
4 group by d.dname;
select d.dname, d.loc, count(empno) empcount
*
ERROR at line 1:
ORA-00979: not a GROUP BY expression
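The fix is to include every nonaggregated column from the SELECT list in the GROUP BY clause. Here is a sketch of the corrected query (not run here; the counts will match Listing 7-1 for the departments that have employees):

```sql
-- ORA-00979 resolved: both nonaggregated columns in the
-- SELECT list (dname and loc) now appear in the GROUP BY
select d.dname, d.loc, count(empno) empcount
  from scott.emp e
  join scott.dept d on d.deptno = e.deptno
 group by d.dname, d.loc;
```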
There is a very important point you need to understand about GROUP BY: Although the output of a SELECT statement that includes a GROUP BY clause may always appear to be sorted, you cannot expect GROUP BY to return your data in sorted order. If the output must be sorted, you must use an ORDER BY clause. This has always been how Oracle behaves.
Although the sorting behavior of GROUP BY is not mentioned specifically in the Oracle 7 documentation, there was little room for doubt by the time the Oracle 9i documentation was published; it states specifically that GROUP BY does not guarantee the order of the result set.
Listing 7-3 provides a good example of GROUP BY not returning results in sorted order. Notice that the data are not sorted. The only way to guarantee sorted data is by including the ORDER BY clause, which must follow the GROUP BY clause.
Listing 7-3. GROUP BY Not Sorted
SQL> select deptno, count(*)
2 from emp
3 group by deptno;
DEPTNO COUNT(*)
---------- ----------
30 6
20 5
10 3
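To guarantee sorted output, simply append the ORDER BY clause after the GROUP BY. A sketch of the same query with the ordering made explicit (not run here):

```sql
-- ORDER BY, not GROUP BY, is what guarantees sorted output
select deptno, count(*)
  from emp
 group by deptno
 order by deptno;
```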
The GROUP BY clause may just be one of the most underappreciated workhorses of all the SELECT clauses. It is easy to take for granted because, after you understand how to include it in a SELECT statement, it is quite easy to use. Perhaps a better appreciation for just how much work it does (and how much work it saves you from doing) can be gained by trying to write the SELECT statement in Listing 7-1 without the GROUP BY clause. There are likely many different methods by which this can be done.
Think for just a moment about how you might write that SELECT statement. One such attempt, made by your intrepid author, appears in Listing 7-4. This is not SQL that most people would care to maintain. As you can see, it creates nearly the same output as Listing 7-1. In addition to being somewhat convoluted, it raises a question: What happens when a new department is added to the DEPT table? Just so there is no mistake, when I say this example is convoluted, what I really mean is, "Don't do this!" Although the query may provide the correct result set, it is a very poor way to get the job done; it is an example of how not to write a query.
The answer to the question "What happens when a new department is added to the DEPT table?" is, of course, that you then need to modify the SQL statement in Listing 7-4 to accommodate the change in the data. Although dynamic SQL could be used to cope with changes to the DEPT table data, doing so would create a piece of SQL that is even more difficult to follow and harder to maintain.
Listing 7-4. Convoluted SQL
SQL> select /* lst7-4 */
2 distinct dname, decode(
3 d.deptno,
4 10, (select count(*) from emp where deptno= 10),
5 20, (select count(*) from emp where deptno= 20),
6 30, (select count(*) from emp where deptno= 30),
7 (select count(*) from emp where deptno not in (10,20,30))
8 ) dept_count
9 from (select distinct deptno from emp) d
10 join dept d2 on d2.deptno = d.deptno;
DNAME DEPT_COUNT
-------------- ----------
SALES 6
ACCOUNTING 3
RESEARCH 5
SQL> @pln lst7-4
----------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows |
----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 3 |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |
|* 2 | TABLE ACCESS FULL | EMP | 1 | 3 | 3 |
| 3 | SORT AGGREGATE | | 1 | 1 | 1 |
|* 4 | TABLE ACCESS FULL | EMP | 1 | 5 | 5 |
| 5 | SORT AGGREGATE | | 1 | 1 | 1 |
|* 6 | TABLE ACCESS FULL | EMP | 1 | 6 | 6 |
| 7 | SORT AGGREGATE | | 0 | 1 | 0 |
|* 8 | TABLE ACCESS FULL | EMP | 0 | 4 | 0 |
| 9 | HASH UNIQUE | | 1 | 3 | 3 |
| 10 | MERGE JOIN SEMI | | 1 | 3 | 3 |
| 11 | TABLE ACCESS BY INDEX ROWID| DEPT | 1 | 4 | 4 |
| 12 | INDEX FULL SCAN | PK_DEPT | 1 | 4 | 4 |
|* 13 | SORT UNIQUE | | 4 | 14 | 3 |
| 14 | TABLE ACCESS FULL | EMP | 1 | 14 | 14 |
----------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("DEPTNO"=10)
4 - filter("DEPTNO"=20)
6 - filter("DEPTNO"=30)
8 - filter(("DEPTNO"<>30 AND "DEPTNO"<>20 AND "DEPTNO"<>10))
13 - access("D2"."DEPTNO"="DEPTNO")
filter("D2"."DEPTNO"="DEPTNO")
In addition to immensely simplifying the SQL that must be written, the GROUP BY clause eliminates unnecessary IO in the database. Take another look at Listing 7-4: a full table scan of the EMP table appears five times in the plan. If that seems rather excessive, you are on the right track. Listing 7-5 shows the same SQL as executed in Listing 7-1. There is still a full table scan against the EMP table, but it takes place only once, not five times as in the convoluted SQL in Listing 7-4.
Listing 7-5. GROUP BY Execution Plan
SQL> select /* lst7-5 */
2 d.dname
3 , count(empno) empcount
4 from scott.emp e
5 join scott.dept d on d.deptno = e.deptno
6 group by d.dname
7 order by d.dname;
DNAME EMPCOUNT
-------------- ----------
ACCOUNTING 3
RESEARCH 5
SALES 6
SQL> @pln lst7-5
----------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows |
----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 3 |
| 1 | SORT GROUP BY | | 1 | 4 | 3 |
| 2 | MERGE JOIN | | 1 | 14 | 14 |
| 3 | TABLE ACCESS BY INDEX ROWID| DEPT | 1 | 4 | 4 |
| 4 | INDEX FULL SCAN | PK_DEPT | 1 | 4 | 4 |
|* 5 | SORT JOIN | | 4 | 14 | 14 |
| 6 | TABLE ACCESS FULL | EMP | 1 | 14 | 14 |
----------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
5 - access("D"."DEPTNO"="E"."DEPTNO")
filter("D"."DEPTNO"="E"."DEPTNO")
Don't underestimate the power of the GROUP BY clause! As seen in this section, the GROUP BY clause does the following:
- Makes the SQL you write simpler, easier to read, and easier to maintain
- Reduces unnecessary IO in the database by eliminating repeated scans of the same table
HAVING Clause
Results generated by GROUP BY may be restricted by the criteria found in the HAVING clause. The HAVING clause is quite versatile, resembling the WHERE clause in the conditions that may be used. Functions, operators, and subqueries may all be used in the HAVING clause. Listing 7-6 shows a query that returns all departments that have hired at least five employees since the beginning of the first full year after hiring began.
It's important to note that the HAVING clause applies to the data after they are grouped, whereas the WHERE clause acts on the rows as they are fetched, before they are grouped. That the HAVING operation is executed after all data have been fetched can be seen in the FILTER in step 1 of the execution plan shown in Listing 7-6. Notice that an operator, a function, and subqueries have all been used in the HAVING clause.
Listing 7-6. HAVING Clause
SQL> select /* lst7-6 */
2 d.dname
3 , trunc(e.hiredate,'YYYY') hiredate
4 , count(empno) empcount
5 from scott.emp e
6 join scott.dept d on d.deptno = e.deptno
7 group by d.dname, trunc(e.hiredate,'YYYY')
8 having
9 count(empno) >= 5
10 and trunc(e.hiredate,'YYYY') between
11 (select min(hiredate) from scott.emp)
12 and
13 (select max(hiredate) from scott.emp)
14 order by d.dname;
DNAME HIREDATE EMPCOUNT
-------------- ------------------- ----------
SALES 01/01/1981 00:00:00 6
SQL> @pln lst7-6
-----------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows |
-----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |
|* 1 | FILTER | | 1 | | 1 |
| 2 | SORT GROUP BY | | 1 | 1 | 6 |
| 3 | MERGE JOIN | | 1 | 14 | 14 |
| 4 | TABLE ACCESS BY INDEX ROWID| DEPT | 1 | 4 | 4 |
| 5 | INDEX FULL SCAN | PK_DEPT | 1 | 4 | 4 |
|* 6 | SORT JOIN | | 4 | 14 | 14 |
| 7 | TABLE ACCESS FULL | EMP | 1 | 14 | 14 |
| 8 | SORT AGGREGATE | | 1 | 1 | 1 |
| 9 | TABLE ACCESS FULL | EMP | 1 | 14 | 14 |
| 10 | SORT AGGREGATE | | 1 | 1 | 1 |
| 11 | TABLE ACCESS FULL | EMP | 1 | 14 | 14 |
-----------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter((COUNT(*)>=5 AND TRUNC(INTERNAL_FUNCTION("E"."HIREDATE"),
'fmyyyy')>= AND TRUNC(INTERNAL_FUNCTION("E"."HIREDATE"),'fmyyyy')<=))
6 - access("D"."DEPTNO"="E"."DEPTNO")
filter("D"."DEPTNO"="E"."DEPTNO")
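Because WHERE filters rows before grouping and HAVING filters after, a condition that does not depend on an aggregate generally belongs in the WHERE clause, where it reduces the number of rows that must be sorted and grouped. As a sketch of that idea (the hire-date cutoff here is a hypothetical value, not part of the original request):

```sql
-- row-level predicate moved to WHERE (applied before grouping);
-- only the aggregate condition remains in HAVING
select d.dname
     , trunc(e.hiredate,'YYYY') hiredate
     , count(empno) empcount
  from scott.emp e
  join scott.dept d on d.deptno = e.deptno
 where e.hiredate >= date '1981-01-01'   -- hypothetical cutoff
 group by d.dname, trunc(e.hiredate,'YYYY')
having count(empno) >= 5
 order by d.dname;
```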
“New” GROUP BY Functionality
At times, it’s necessary to write SQL that appears as unruly as the convoluted example in Listing 7-4 so that the desired output can be obtained. The need for writing such unwieldy SQL has become much less frequent because of the advanced functionality Oracle has included in SQL during the past few years. Much of what is covered in this chapter is not actually new; it has been available for quite some time.
You can start exploring some of the advanced grouping functionality in Oracle Database by experimenting with the CUBE and ROLLUP extensions to GROUP BY, and the GROUPING function. It takes a little effort to get started, because the benefits of newer functionality are not always clear until you spend some time learning to use them.
The CUBE extension is not exactly a newcomer to Oracle; it was first introduced with Oracle 8i in 1999. When used with a GROUP BY clause, it causes all possible combinations of the elements in its argument list to be considered for each row. This operation generates more rows than actually exist in the table.
Let's look at an example that generates all possible combinations of FIRST_NAME and LAST_NAME for each row in the HR.EMPLOYEES table. The CUBE extension was intended for generating cross-tab reports with lots of numbers and dollar signs. When trying to understand new functionality, I find it helps to dumb down the SQL a bit so I can see what's going on without getting distracted by subtotals.
Examine Listing 7-7 to see the results of using CUBE with the HR.EMPLOYEES table as described. Notice that three rows are returned for most employees; as a result, 301 rows are returned even though there are only 107 rows in the table. The first query in the listing shows the basic GROUP BY, and resulting data, for comparison.
Listing 7-7. CUBE Operation on HR.EMPLOYEES
SQL> select last_name
2 , first_name
3 from hr.employees
4 group by first_name,last_name;
LAST_NAME FIRST_NAME
------------------------- --------------------
Abel Ellen
Ande Sundar
Atkinson Mozhe
Austin David
Baer Hermann
Baida Shelli
Banda Amit
Bates Elizabeth
Bell Sarah
Bernstein David
Bissot Laura
Bloom Harrison
Bull Alexis
Cabrio Anthony
Cambrault Gerald
Cambrault Nanette
Chen John
Chung Kelly
Colmenares Karen
Davies Curtis
De Haan Lex
Dellinger Julia
Dilly Jennifer
Doran Louise
Ernst Bruce
Errazuriz Alberto
Everett Britney
Faviet Daniel
Fay Pat
Feeney Kevin
Fleaur Jean
Fox Tayler
Fripp Adam
Gates Timothy
Gee Ki
Geoni Girard
Gietz William
Grant Douglas
Grant Kimberely
Greenberg Nancy
Greene Danielle
Hall Peter
Hartstein Michael
Higgins Shelley
Himuro Guy
Hunold Alexander
Hutton Alyssa
Johnson Charles
Jones Vance
Kaufling Payam
Khoo Alexander
King Janette
King Steven
Kochhar Neena
Kumar Sundita
Ladwig Renske
Landry James
Lee David
Livingston Jack
Lorentz Diana
Mallin Jason
Markle Steven
Marlow James
Marvins Mattea
Matos Randall
Mavris Susan
McCain Samuel
McEwen Allan
Mikkilineni Irene
Mourgos Kevin
Nayer Julia
OConnell Donald
Olsen Christopher
Olson TJ
Ozer Lisa
Partners Karen
Pataballa Valli
Patel Joshua
Perkins Randall
Philtanker Hazel
Popp Luis
Rajs Trenna
Raphaely Den
Rogers Michael
Russell John
Sarchand Nandita
Sciarra Ismael
Seo John
Sewall Sarath
Smith Lindsey
Smith William
Stiles Stephen
Sullivan Martha
Sully Patrick
Taylor Jonathon
Taylor Winston
Tobias Sigal
Tucker Peter
Tuvault Oliver
Urman Jose Manuel
Vargas Peter
Vishney Clara
Vollman Shanta
Walsh Alana
Weiss Matthew
Whalen Jennifer
Zlotkey Eleni
107 rows selected.
SQL> set autotrace on statistics
SQL> with emps as (
2 select /* lst7-7 */
3 last_name
4 , first_name
5 from hr.employees
6 group by cube(first_name,last_name)
7 )
8 select rownum
9 , last_name
10 , first_name
11 from emps;
ROWNUM LAST_NAME FIRST_NAME
---------- ------------------------- --------------------
1
2 Ki
3 TJ
4 Den
5 Guy
6 Lex
7 Pat
...
231 Vargas
232 Vargas Peter
233 Whalen
234 Whalen Jennifer
235 De Haan
236 De Haan Lex
237 Everett
238 Everett Britney
...
301 rows selected.
Statistics
---------------------------------------------------
759 recursive calls
0 db block gets
188 consistent gets
9 physical reads
0 redo size
5990 bytes sent via SQL*Net to client
557 bytes received via SQL*Net from client
5 SQL*Net roundtrips to/from client
7 sorts (memory)
0 sorts (disk)
301 rows processed
PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 301 |
| 1 | COUNT | | 1 | | 301 |
| 2 | VIEW | | 1 | 107 | 301 |
| 3 | SORT GROUP BY | | 1 | 107 | 301 |
| 4 | GENERATE CUBE | | 1 | 107 | 428 |
| 5 | SORT GROUP BY NOSORT| | 1 | 107 | 107 |
| 6 | INDEX FULL SCAN | EMP_NAME_IX | 1 | 107 | 107 |
---------------------------------------------------------------------------
Table 7-1 shows why three rows are returned for each name pair. For each (LAST_NAME, FIRST_NAME) pair, CUBE substitutes NULL for each element in turn. The rows generated by CUBE are referred to in the Oracle documentation as superaggregate rows, which are recognizable by the NULL values placed in the columns being operated on. The results described in Table 7-1 appear in the output of Listing 7-7 as a result of the GROUP BY CUBE(FIRST_NAME,LAST_NAME) operation.
Table 7-1. CUBE Operation
First Name      Last Name
----------      ---------
Vance           Jones
Vance           NULL
NULL            Jones
Did you notice that the first row returned in Listing 7-7 contains NULL for both LAST_NAME and FIRST_NAME? When all possible combinations of the pair of arguments to CUBE are considered, a (NULL, NULL) combination is generated for each input row by the GENERATE CUBE operation. These 428 rows are then processed by the SORT GROUP BY operation, which removes duplicates, including all but one of the (NULL, NULL) rows, to produce the final 301 rows that satisfy the query.
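One consequence of these superaggregate rows is that a NULL generated by CUBE is indistinguishable, by value alone, from a NULL stored in the data. The GROUPING function, mentioned at the start of this section, solves this: it returns 1 when the NULL in a column was generated by the grouping operation and 0 otherwise. A sketch of how it might be applied to the Listing 7-7 query (not run against a live instance; column aliases are my own):

```sql
-- GROUPING() = 1 flags NULLs produced by CUBE, so
-- superaggregate rows can be told apart from stored NULLs
select last_name
     , first_name
     , grouping(last_name)  last_name_is_agg
     , grouping(first_name) first_name_is_agg
  from hr.employees
 group by cube(first_name, last_name);
```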
Knowing how CUBE operates, you can predict how many rows a GROUP BY CUBE should create. Listing 7-8 shows that the number of rows returned can be predicted by adding the distinct counts of first names, last names, and full-name combinations, plus one to account for the (NULL, NULL) pair.
Listing 7-8. Predicting CUBE Return Rows
SQL> with counts as (
2 select
3 count(distinct first_name) first_name_count
4 , count(distinct last_name) last_name_count
5 , count(distinct(first_name||last_name)) full_name_count
6 from hr.employees
7 )
8 select first_name_count
9 ,last_name_count
10 ,full_name_count
11 ,first_name_count + last_name_count
12 + full_name_count + 1 total_count
13 from counts;
FIRST_NAME_COUNT LAST_NAME_COUNT FULL_NAME_COUNT TOTAL_COUNT
---------------- --------------- --------------- -----------
91 102 107 301
You can simulate the operation of CUBE by using SQL to reproduce the steps taken by the database, both to see how the operation works and to see just how much work the database is saving you by using GROUP BY CUBE.
By examining the execution plan shown in Listing 7-7, you can see that the SORT GROUP BY NOSORT operation (step 5) returns 107 rows to the GENERATE CUBE operation (step 4), which in turn generates 428 rows. Why are 428 rows generated? Listing 7-9 shows that 428 is the expected number of rows if all combinations of LAST_NAME and FIRST_NAME are generated. The GROUP BY then reduces the output to 301 rows, just as the CUBE extension did, but with an important difference: The manual method of UNION ALL and GROUP BY used in Listing 7-9 required three full scans of the EMP_NAME_IX index and one full scan of the EMP_EMAIL_UK index. Contrast this with the single full scan of the EMP_NAME_IX index in Listing 7-7 as performed by the GROUP BY extension.
The CUBE extension didn't merely reduce the amount of SQL required to generate the same data as the UNION ALL and GROUP BY combination; it also reduced the number of full index scans from four to one. (For the final branch in Listing 7-9, the optimizer chose the EMP_EMAIL_UK index rather than EMP_NAME_IX, resulting in ten physical reads rather than the nine seen in Listing 7-7.) With the small dataset in the Oracle demo schemas, there is no large difference in execution time between the example queries. With large datasets, however, the effect of four INDEX FULL SCAN operations rather than one becomes quite obvious.
Listing 7-9. Generate CUBE Rows with UNION ALL
SQL> with emps as (
2 select last_name, first_name from hr.employees
3 ) ,
4 mycube as (
5 select last_name, first_name from emps
6 union all
7 select last_name, null first_name from emps
8 union all
9 select null last_name, first_name from emps
10 union all
11 select null last_name, null first_name from emps
12 )
13 select /*+ gather_plan_statistics */ *
14 from mycube
15 group by last_name, first_name;
LAST_NAME FIRST_NAME
------------------------- --------------------
Atkinson Mozhe
Bissot Laura
Grant Kimberely
...
301 rows selected.
Statistics
----------------------------------------------------------
759 recursive calls
0 db block gets
191 consistent gets
10 physical reads
0 redo size
5477 bytes sent via SQL*Net to client
557 bytes received via SQL*Net from client
5 SQL*Net roundtrips to/from client
6 sorts (memory)
0 sorts (disk)
301 rows processed
PLAN_TABLE_OUTPUT
---------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows
---------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 301
| 1 | HASH GROUP BY | | 1 | 428 | 301
| 2 | VIEW | | 1 | 428 | 428
| 3 | UNION-ALL | | 1 | | 428
| 4 | INDEX FULL SCAN| EMP_NAME_IX | 1 | 107 | 107
| 5 | INDEX FULL SCAN| EMP_NAME_IX | 1 | 107 | 107
| 6 | INDEX FULL SCAN| EMP_NAME_IX | 1 | 107 | 107
| 7 | INDEX FULL SCAN| EMP_EMAIL_UK | 1 | 107 | 107
----------------------------------------------------------------------
Putting CUBE to Work
When teaching us a new word in fourth-grade English class, Mrs. Draper would say, "Now use it in a sentence." Much like that, it's time to put the CUBE extension to practical use. It was fun to see how it operates and just how much work it can save, but its real value shows when it is applied to an actual reporting problem.
When using the GROUP BY clause to perform aggregations, you have probably written several similar SQL statements just so you could see the aggregations based on different sets of columns, much like what is included in Listing 7-9. You already know that the CUBE extension can eliminate a lot of work in the database, so now let's put it to "real-world" practice, using the sales history demo data.
The sales history (SH) schema contains sales data for the years 1998 through 2001. You need to provide a report to satisfy the following request: Please show me all sales data for the year 2001. I would like to see sales summarized by product category, with aggregates based on ten-year customer age ranges and income levels, as well as summaries broken out by income level regardless of age group, and by age group regardless of income level.
Your task probably seems daunting at first, but you know all the data are available. You need to build a query using the COSTS, CUSTOMERS, PRODUCTS, SALES, and TIMES tables. (Now would be a good time to put this book down and try your hand at building such a query.) Perhaps you create a query like the one in Listing 7-10, because it is a common type of solution for such a request. Prior to the introduction of the CUBE extension, Listing 7-10 is the style of query that would most often be used to satisfy the request.
Listing 7-10. UNION ALL Query of Sales Data
SQL> with tsales as (
2 select /* lst7-10 */
3 s.quantity_sold
4 , s.amount_sold
5 , to_char(mod(cust_year_of_birth,10) * 10 ) || '-' ||
6 to_char((mod(cust_year_of_birth,10) * 10 ) + 10) age_range
7 , nvl(c.cust_income_level,'A: Below 30,000') cust_income_level
8 , p.prod_name
9 , p.prod_desc
10 , p.prod_category
11 , (pf.unit_cost * s.quantity_sold) total_cost
12 , s.amount_sold - (pf.unit_cost * s.quantity_sold) profit
13 from sh.sales s
14 join sh.customers c on c.cust_id = s.cust_id
15 join sh.products p on p.prod_id = s.prod_id
16 join sh.times t on t.time_id = s.time_id
17 join sh.costs pf on
18 pf.channel_id = s.channel_id
19 and pf.prod_id = s.prod_id
20 and pf.promo_id = s.promo_id
21 and pf.time_id = s.time_id
22 where (t.fiscal_year = 2001)
23 )
24 , gb as (
25 select -- Q1 - all categories by cust income and age range
26 'Q1' query_tag
27 , prod_category
28 , cust_income_level
29 , age_range
30 , sum(profit) profit
31 from tsales
32 group by prod_category, cust_income_level, age_range
33 union all
34 select -- Q2 - all categories by cust age range
35 'Q2' query_tag
36 , prod_category
37 , 'ALL INCOME' cust_income_level
38 , age_range
39 , sum(profit) profit
40 from tsales
41 group by prod_category, 'ALL INCOME', age_range
42 union all
43 select -- Q3 – all categories by cust income
44 'Q3' query_tag
45 , prod_category
46 , cust_income_level
47 , 'ALL AGE' age_range
48 , sum(profit) profit
49 from tsales
50 group by prod_category, cust_income_level, 'ALL AGE'
51 union all
52 select -- Q4 - all categories
53 'Q4' query_tag
54 , prod_category
55 , 'ALL INCOME' cust_income_level
56 , 'ALL AGE' age_range
57 , sum(profit) profit
58 from tsales
59 group by prod_category, 'ALL INCOME', 'ALL AGE'
60 )
61 select *
62 from gb
63 order by prod_category, profit;
QUERY AGE
TAG PRODUCT CATEGORY INCOME LEVEL RANGE PROFIT
------ ------------------ -------------------- -------- ---------------
...
Q2 Hardware K: 250,000 - 299,999 ALL AGE $26,678.00
Q2 Hardware L: 300,000 and above ALL AGE $28,974.28
Q1 Hardware F: 110,000 - 129,999 70-80 $30,477.16
Q2 Hardware J: 190,000 - 249,999 ALL AGE $43,761.47
Q2 Hardware B: 30,000 - 49,999 ALL AGE $53,612.04
Q2 Hardware A: Below 30,000 ALL AGE $55,167.88
Q2 Hardware I: 170,000 - 189,999 ALL AGE $57,089.05
Q2 Hardware C: 50,000 - 69,999 ALL AGE $76,612.64
Q3 Hardware ALL INCOME 60-70 $85,314.04
Q3 Hardware ALL INCOME 10-20 $90,849.87
Q3 Hardware ALL INCOME 0-10 $92,207.47
Q3 Hardware ALL INCOME 50-60 $93,811.96
Q3 Hardware ALL INCOME 80-90 $95,391.82
Q2 Hardware H: 150,000 - 169,999 ALL AGE $95,437.74
Q3 Hardware ALL INCOME 40-50 $97,492.51
Q3 Hardware ALL INCOME 20-30 $101,140.69
Q2 Hardware D: 70,000 - 89,999 ALL AGE $102,940.44
Q3 Hardware ALL INCOME 30-40 $102,946.85
Q3 Hardware ALL INCOME 90-100 $110,310.69
Q2 Hardware G: 130,000 - 149,999 ALL AGE $112,688.64
Q3 Hardware ALL INCOME 70-80 $117,920.88
Q2 Hardware E: 90,000 - 109,999 ALL AGE $135,154.59
Q2 Hardware F: 110,000 - 129,999 ALL AGE $199,270.01
Q4 Hardware ALL INCOME ALL AGE $987,386.78
...
714 rows selected.
Elapsed: 00:00:14.53
Statistics
----------------------------------------------------------
18464 recursive calls
4253 db block gets
22759 consistent gets
10521 physical reads
4216 redo size
25086 bytes sent via SQL*Net to client
601 bytes received via SQL*Net from client
9 SQL*Net roundtrips to/from client
174 sorts (memory)
0 sorts (disk)
714 rows processed
PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------
| Id |Operation |Name |Starts |E-Rows |A-Rows |
----------------------------------------------------------------------------
| 0|SELECT STATEMENT | | 1 | | 714 |
| 1| TEMP TABLE TRANSFORMATION| | 1 | | 714 |
| 2| LOAD AS SELECT | | 1 | | 0 |
|* 3| HASH JOIN | | 1 | 17116 | 258K|
| 4| TABLE ACCESS FULL |PRODUCTS | 1 | 72 | 72 |
|* 5| HASH JOIN | | 1 | 17116 | 258K|
|* 6| HASH JOIN | | 1 | 17116 | 258K|
|* 7| TABLE ACCESS FULL |TIMES | 1 | 304 | 364 |
| 8| PARTITION RANGE AND | | 1 | 82112 | 259K|
|* 9| HASH JOIN | | 4 | 82112 | 259K|
| 10| TABLE ACCESS FULL |COSTS | 4 | 82112 | 29766 |
| 11| TABLE ACCESS FULL |SALES | 4 | 918K| 259K|
| 12| TABLE ACCESS FULL |CUSTOMERS | 1 | 55500 | 55500 |
| 13| SORT ORDER BY | | 1 | 16 | 714 |
| 14| VIEW | | 1 | 16 | 714 |
| 15| UNION-ALL | | 1 | | 714 |
| 16| HASH GROUP BY | | 1 | 3 | 599 |
| 17| VIEW | | 1 | 17116 | 258K|
| 18| TABLE ACCESS FULL |SYS_TEMP_0FD9D6620| 1 | 17116 | 258K|
| 19| HASH GROUP BY | | 1 | 4 | 60 |
| 20| VIEW | | 1 | 17116 | 258K|
| 21| TABLE ACCESS FULL |SYS_TEMP_0FD9D6620| 1 | 17116 | 258K|
| 22| HASH GROUP BY | | 1 | 4 | 50 |
| 23| VIEW | | 1 | 17116 | 258K|
| 24| TABLE ACCESS FULL |SYS_TEMP_0FD9D6620| 1 | 17116 | 258K|
| 25| HASH GROUP BY | | 1 | 5 | 5 |
| 26| VIEW | | 1 | 17116 | 258K|
| 27| TABLE ACCESS FULL |SYS_TEMP_0FD9D6620| 1 | 17116 | 258K|
----------------------------------------------------------------------------
Looking at Listing 7-10, notice the four separate queries joined by the UNION ALL operator, labeled Q1 through Q4. The output includes a QUERY_TAG column so that the results from each query can be identified clearly. The customer is happy: the output is exactly what was asked for, and the query can easily be changed to report on any year.
The operations folks who run the data center, however, are not so happy with this new report. When you look at the query statistics for the SQL, you can understand why they may not hold it in high regard. Maybe it's the 10,521 physical reads that concern them. If the query were run only once, this would not be a problem, but the marketing folks are running it multiple times daily to report on different years, trying to discover sales trends, and it is causing all sorts of havoc as IO rates and response times increase for other users of the database.
Notice the four full scans of the temporary table in the execution plan. The factored subquery tsales allows the optimizer to create a temporary table that can then be used by all the queries in the gb subquery, but the UNION ALL makes it necessary to scan that table in full four times, resulting in a lot of database IO.
Thinking back on the earlier experiment with CUBE, you know that multiple queries, each doing a GROUP BY and joined by UNION ALL, can be replaced with a single query using GROUP BY with the CUBE extension. Because the report requires summaries for all possible combinations of the CUST_INCOME_LEVEL and AGE_RANGE columns output from the tsales subquery, CUBE can accomplish the same result with less code and less database IO.
Although the difference in IO rate and timing in the earlier experiment was not very significant, notice now that, with this larger dataset, the difference can be substantial. Listing 7-11 shows the query after it has been modified to use the CUBE extension to GROUP BY.
Listing 7-11. Replace UNION ALL with CUBE
SQL> with tsales as (
2 select /*+ gather_plan_statistics */
3 s.quantity_sold
4 , s.amount_sold
5 , to_char(mod(cust_year_of_birth,10) * 10 ) || '-' ||
6 to_char((mod(cust_year_of_birth,10) * 10 ) + 10) age_range
7 , nvl(c.cust_income_level,'A: Below 30,000') cust_income_level
8 , p.prod_name
9 , p.prod_desc
10 , p.prod_category
11 , (pf.unit_cost * s.quantity_sold) total_cost
12 , s.amount_sold - (pf.unit_cost * s.quantity_sold) profit
13 from sh.sales s
14 join sh.customers c on c.cust_id = s.cust_id
15 join sh.products p on p.prod_id = s.prod_id
16 join sh.times t on t.time_id = s.time_id
17 join sh.costs pf on
18 pf.channel_id = s.channel_id
19 and pf.prod_id = s.prod_id
20 and pf.promo_id = s.promo_id
21 and pf.time_id = s.time_id
22 where (t.fiscal_year = 2001)
23 )
24 select
25 'Q' || decode(cust_income_level,
26 null,decode(age_range,null,4,3),
27 decode(age_range,null,2,1)
28 ) query_tag
29 , prod_category
30 , cust_income_level
31 , age_range
32 , sum(profit) profit
33 from tsales
34 group by prod_category, cube(cust_income_level,age_range)
35 order by prod_category, profit;
QUERY AGE
TAG PRODUCT CATEGORY INCOME LEVEL RANGE PROFIT
------ ------------------ -------------------- -------- ---------------
...
Q2 Hardware K: 250,000 - 299,999 $26,678.00
Q2 Hardware L: 300,000 and above $28,974.28
Q1 Hardware F: 110,000 - 129,999 70-80 $30,477.16
Q2 Hardware J: 190,000 - 249,999 $43,761.47
Q2 Hardware B: 30,000 - 49,999 $53,612.04
Q2 Hardware A: Below 30,000 $55,167.88
Q2 Hardware I: 170,000 - 189,999 $57,089.05
Q2 Hardware C: 50,000 - 69,999 $76,612.64
Q3 Hardware 60-70 $85,314.04
Q3 Hardware 10-20 $90,849.87
Q3 Hardware 0-10 $92,207.47
Q3 Hardware 50-60 $93,811.96
Q3 Hardware 80-90 $95,391.82
Q2 Hardware H: 150,000 - 169,999 $95,437.74
Q3 Hardware 40-50 $97,492.51
Q3 Hardware 20-30 $101,140.69
Q2 Hardware D: 70,000 - 89,999 $102,940.44
Q3 Hardware 30-40 $102,946.85
Q3 Hardware 90-100 $110,310.69
Q2 Hardware G: 130,000 - 149,999 $112,688.64
Q3 Hardware 70-80 $117,920.88
Q2 Hardware E: 90,000 - 109,999 $135,154.59
Q2 Hardware F: 110,000 - 129,999 $199,270.01
Q4 Hardware $987,386.78
...
714 rows selected.
Elapsed: 00:00:08.98
Statistics
----------------------------------------------------------
17901 recursive calls
0 db block gets
5935 consistent gets
2169 physical reads
260 redo size
24694 bytes sent via SQL*Net to client
601 bytes received via SQL*Net from client
9 SQL*Net roundtrips to/from client
174 sorts (memory)
0 sorts (disk)
714 rows processed
PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------
| Id |Operation | Name | Sts | E-Rows| A-Rows|
-----------------------------------------------------------------------------
| 0 |SELECT STATEMENT | | 1 | | 714|
| 1 | SORT ORDER BY | | 1 | 2251| 714|
| 2 | SORT GROUP BY | | 1 | 2251| 714|
| 3 | GENERATE CUBE | | 1 | 2251| 2396|
| 4 | SORT GROUP BY | | 1 | 2251| 599|
|* 5 | HASH JOIN | | 1 | 17116| 258K|
| 6 | VIEW | index$_join$_004 | 1 | 72| 72|
|* 7 | HASH JOIN | | 1 | | 72|
| 8 | INDEX FAST FULL SCAN| PRODUCTS_PK | 1 | 72| 72|
| 9 | INDEX FAST FULL SCAN| PRODUCTS_PCAT_IX | 1 | 72| 72|
|* 10 | HASH JOIN | | 1 | 17116| 258K|
|* 11 | HASH JOIN | | 1 | 17116| 258K|
|* 12 | TABLE ACCESS FULL | TIMES | 1 | 304| 364|
| 13 | PARTITION RANGE AND | | 1 | 82112| 259K|
|* 14 | HASH JOIN | | 4 | 82112| 259K|
| 15 | TABLE ACCESS FULL | COSTS | 4 | 82112| 29766|
| 16 | TABLE ACCESS FULL | SALES | 4 | 918K| 259K|
| 17 | TABLE ACCESS FULL | CUSTOMERS | 1 | 55500| 55500|
-----------------------------------------------------------------------------
After running the new query, the first thing to look at is the statistics and the execution plan. Removing the entire gb subquery and using GROUP BY CUBE on the output from the tsales subquery reduced logical IO (consistent gets) from 22,759 to 5,935 (nearly a factor of four) and reduced physical reads from 10,521 to 2,169 (nearly a factor of five). Generally speaking, logical IO is the relevant metric when comparing SQL statements: if your SQL is written so that it requests fewer blocks, both the logical and physical reads should be reduced, but the prime metric remains logical IO (consistent gets). The reduction in IO alone is enough to recommend the use of CUBE; the fact that it results in much less SQL to write is a bonus.
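If you want to reproduce the kind of runtime statistics shown above, a minimal SQL*Plus setup is sketched below. It assumes your account has been granted the PLUSTRACE role (or equivalent access to the V$ views); the query shown is only a placeholder for the statement under test.

```sql
-- Sketch: capture execution statistics for a query in SQL*Plus.
-- Assumes the PLUSTRACE role has been granted; timing is optional.
set timing on
set autotrace traceonly statistics

-- Run the query under test here; only its statistics are displayed.
select count(*) from sh.sales;

set autotrace off
```

With TRACEONLY STATISTICS, the rows are fetched but not displayed, so the statistics reflect the work done by the query rather than by rendering its output.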
Eliminate NULLs with the GROUPING() Function
There seems to be a problem with the output from the new query seen in Listing 7-11. Although the numbers match the earlier query that used the UNION ALL operator, some of the rows have NULL values in the CUST_INCOME_LEVEL and AGE_RANGE columns, and one row has a NULL in both of these columns. You saw this type of result earlier in Table 7-1 as an expected part of the operation of CUBE. When generating the combinations of all columns included in the arguments to CUBE, a NULL value is generated n – 1 times for each column, where n is the number of columns in the list. In the example query, there are two columns, so you can expect to see a NULL value for CUST_INCOME_LEVEL generated once for each distinct value of AGE_RANGE. The same rule applies to the AGE_RANGE column.
These NULL values 2 can be a problem if there are rows in the data that have NULL values for either of these columns. How do you discern between NULLs in the data and NULLs inserted by the CUBE extension? The GROUPING() function was introduced in Oracle 8i, and it can be used to identify these superaggregate rows. The expression used as an argument to the GROUPING() function must match an expression that appears in the GROUP BY clause. For example, write decode(grouping(age_range),1,'ALL AGE',age_range) age_range to detect whether age_range is null as a result of a row generated by CUBE, or whether it is null as a result of a row in the database. The value returned is 1 if the current row is a superaggregate row generated by CUBE; the value is 0 for all other cases.
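The distinction between the two kinds of NULL can be seen with a contrived one-row example (not from the sales history schema): the source row deliberately contains a NULL, and CUBE then adds a superaggregate row. GROUPING() returns 0 for the data NULL and 1 for the CUBE-generated one.

```sql
-- Sketch: a data NULL vs. a superaggregate NULL.
-- The inline view supplies one row whose VAL column is NULL.
select val
     , grouping(val) grp   -- 0 = NULL from the data, 1 = NULL from CUBE
     , count(*) cnt
from (select cast(null as varchar2(1)) val from dual)
group by cube(val);
```

Both output rows show VAL as NULL, but only the superaggregate row has GROUPING(VAL) = 1.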
When used in combination with a CASE expression or the DECODE() function, the NULL values in superaggregate rows can be replaced with values that are useful in a report. In this case, DECODE() appears to be a better choice because of its simplicity and the fact that there are only two possible return values for the GROUPING() function. Listing 7-12 shows how GROUPING() was used to modify the SQL found in Listing 7-11. The relevant before-and-after parts of the SQL are shown, along with the output. Now the report is easier to read, and superaggregate NULLs are discernible from NULLs occurring in the data.
Listing 7-12. GROUPING() Function
Without GROUPING():
27 , cust_income_level
28 , age_range
With GROUPING():
27 -- either CASE or DECODE() works here. I prefer DECODE() for this
28 , case grouping(cust_income_level)
29 when 1 then 'ALL INCOME'
30 else cust_income_level
31 end cust_income_level
32 , decode(grouping(age_range),1,'ALL AGE',age_range) age_range
QUERY AGE
TAG PRODUCT CATEGORY INCOME LEVEL RANGE PROFIT
------ ------------------ -------------------- -------- ---------------
...
Q2 Hardware K: 250,000 - 299,999 ALL AGE $26,678.00
Q2 Hardware L: 300,000 and above ALL AGE $28,974.28
Q1 Hardware F: 110,000 - 129,999 70-80 $30,477.16
Q2 Hardware J: 190,000 - 249,999 ALL AGE $43,761.47
Q2 Hardware B: 30,000 - 49,999 ALL AGE $53,612.04
Q2 Hardware A: Below 30,000 ALL AGE $55,167.88
Q2 Hardware I: 170,000 - 189,999 ALL AGE $57,089.05
Q2 Hardware C: 50,000 - 69,999 ALL AGE $76,612.64
Q3 Hardware ALL INCOME 60-70 $85,314.04
Q3 Hardware ALL INCOME 10-20 $90,849.87
Q3 Hardware ALL INCOME 0-10 $92,207.47
Q3 Hardware ALL INCOME 50-60 $93,811.96
Q3 Hardware ALL INCOME 80-90 $95,391.82
Q2 Hardware H: 150,000 - 169,999 ALL AGE $95,437.74
Q3 Hardware ALL INCOME 40-50 $97,492.51
Q3 Hardware ALL INCOME 20-30 $101,140.69
Q2 Hardware D: 70,000 - 89,999 ALL AGE $102,940.44
Q3 Hardware ALL INCOME 30-40 $102,946.85
Q3 Hardware ALL INCOME 90-100 $110,310.69
Q2 Hardware G: 130,000 - 149,999 ALL AGE $112,688.64
Q3 Hardware ALL INCOME 70-80 $117,920.88
Q2 Hardware E: 90,000 - 109,999 ALL AGE $135,154.59
Q2 Hardware F: 110,000 - 129,999 ALL AGE $199,270.01
Q4 Hardware ALL INCOME ALL AGE $987,386.78
Extending Reports with GROUPING()
Another use of GROUPING() is in the HAVING clause, where it can control which aggregation levels appear in the output. The report seen in previous examples creates about five pages of output, which may be more than the customer cares to see. By using the GROUPING() function, the aggregations can be condensed to roll up the totals for any or all of the columns used in the CUBE extension. Several variations of GROUPING() have been used to modify the previous SQL. The modifications and resulting output are shown in Listing 7-13.
Examining the data in Listing 7-13, notice that testing GROUPING(CUST_INCOME_LEVEL) in the HAVING clause restricts the output to superaggregate rows that accumulate all income levels, leaving one total for each AGE_RANGE value. Testing the AGE_RANGE column has the complementary effect: totals are reported for each value of CUST_INCOME_LEVEL without regard to AGE_RANGE. Including all the columns from the CUBE extension as arguments to the GROUPING() function condenses the aggregations to a single row per product category, similar to what could be done with SUM(PROFIT) and a simple GROUP BY PROD_CATEGORY. Using the CUBE extension, however, allows simple changes to the HAVING clause to create several different reports.
Listing 7-13. GROUPING() in the HAVING Clause
CUST_INCOME_LEVEL
35 group by prod_category, cube(cust_income_level,age_range)
36 having grouping(cust_income_level)=1
QUERY AGE
TAG PRODUCT CATEGORY INCOME LEVEL RANGE PROFIT
------ ------------------ -------------------- -------- ---------------
Q3 Hardware ALL INCOME 60-70 $85,314.04
Q3 Hardware ALL INCOME 10-20 $90,849.87
Q3 Hardware ALL INCOME 0-10 $92,207.47
...
Q4 Hardware ALL INCOME ALL AGE $987,386.78
AGE_RANGE
35 group by prod_category, cube(cust_income_level,age_range)
36 having grouping(age_range)=1
QUERY AGE
TAG PRODUCT CATEGORY INCOME LEVEL RANGE PROFIT
------ ------------------ -------------------- -------- ---------------
Q2 Hardware K: 250,000 - 299,999 ALL AGE $26,678.00
Q2 Hardware L: 300,000 and above ALL AGE $28,974.28
Q2 Hardware J: 190,000 - 249,999 ALL AGE $43,761.47
...
Q4 Hardware ALL INCOME ALL AGE $987,386.78
CUST_INCOME_LEVEL, AGE_RANGE
35 group by prod_category, cube(cust_income_level,age_range)
36 having grouping(cust_income_level)=1 and grouping(age_range)=1
QUERY AGE
TAG PRODUCT CATEGORY INCOME LEVEL RANGE PROFIT
------ ---------------------------- ------------ -------- -------------
Q4 Electronics ALL INCOME ALL AGE $838,994.19
Q4 Hardware ALL INCOME ALL AGE $987,386.78
Q4 Peripherals and Accessories ALL INCOME ALL AGE $1,751,079.16
Q4 Photo ALL INCOME ALL AGE $1,570,866.04
Q4 Software/Other ALL INCOME ALL AGE $873,603.25
Extending Reports with GROUPING_ID()
The GROUPING_ID() function is relatively new compared with the GROUPING() function, having been introduced in Oracle 9i, and it is somewhat similar to GROUPING(). Whereas GROUPING() evaluates a single expression and returns a value of 0 or 1, GROUPING_ID() evaluates a list of expressions, determines which, if any, of the columns in its arguments are being used to generate a superaggregate row, creates a bit vector from those results, and returns the value of that vector as an integer.
Perhaps it is simpler to see how GROUPING_ID() works with an example. The SQL in Listing 7-14 first creates a single row consisting of two columns, BIT_1 and BIT_0, with values of 1 and 0, respectively. The subquery cubed uses GROUP BY CUBE to generate four rows from the single row of input. For each of those rows, the GROUPING_ID() function returns the decimal value of the bit vector that represents the actions of CUBE on that row. The first two uses of the GROUPING() function create a 1 or 0 based on the actions of CUBE on the row, and they are used to build a bit vector in the final output. The next two GROUPING() calls create values displayed in the final output that indicate which column CUBE is currently working on. The final output displays the decimal bit vector as well as a binary representation of the vector. As expected with two binary digits, there are four rows of output.
Listing 7-14. GROUPING_ID() Bit Vector
SQL> with rowgen as (
2 select 1 bit_1, 0 bit_0
3 from dual
4 ),
5 cubed as (
6 select
7 grouping_id(bit_1,bit_0) gid
8 , to_char(grouping(bit_1)) bv_1
9 , to_char(grouping(bit_0)) bv_0
10 , decode(grouping(bit_1),1,'GRP BIT 1') gb_1
11 , decode(grouping(bit_0),1,'GRP BIT 0') gb_0
12 from rowgen
13 group by cube(bit_1,bit_0)
14 )
15 select
16 gid
17 , bv_1 || bv_0 bit_vector
18 , gb_1
19 , gb_0
20 from cubed
21 order by gid;
BIT GROUPING GROUPING
GID VECTOR BIT 1 BIT 0
---- ------ --------- ---------
0 00
1 01 GRP BIT 0
2 10 GRP BIT 1
3 11 GRP BIT 1 GRP BIT 0
So, what good is this? You already know how to use GROUPING() to control output via the HAVING clause, so why learn another way? These are fair questions when you consider that the examples in Listing 7-13 can already create the output needed.
In the interest of database efficiency, a single GROUPING_ID() call can be used to replace all the different HAVING GROUPING() clauses from Listing 7-13. The GROUPING() function is limited in its ability to discriminate rows; it can return only a 0 or a 1. Because the GROUPING_ID() function returns a decimal value based on a bit vector, it can be used easily to make many different comparisons without any changes to the SQL.
Why should you care about changing comparisons without changing the SQL? If you are building an application based on the sales history examples, the user may be given four choices of output, and any one or more of them may be chosen. The user choices can be used as inputs to a single SQL statement that uses HAVING GROUPING_ID(), rather than multiple SQL statements based on different combinations of HAVING GROUPING(), so it requires less parsing of SQL by the database. It also results in fewer SQL statements to execute, less IO, and less memory usage.
Just as using CUBE eliminates multiple SQL statements joined by UNION ALL, GROUPING_ID() can eliminate multiple SQL statements in your application. The choices given to the user correspond to the four aggregation levels produced by CUBE: detail rows for each combination of income level and age range, rows aggregated across all age ranges (one for each income level), rows aggregated across all income levels (one for each age range), and a single summary row for each product category.
The application, a SQL*Plus script in this case, assigns values corresponding to the user’s choices to variables. The SQL statement, in turn, evaluates those variables via HAVING GROUPING_ID() to output the requested rows. Listing 7-15 simulates the choices a user might make and demonstrates how to use these inputs in the SQL. In the example, the only rows output are those aggregated across all age ranges for each income level (the ALL AGE rows) and the summary row for each product category (ALL AGE and ALL INCOME). This is accomplished by setting N_AGE_RANGE and N_SUMMARY to 2 and 4, respectively. These values correspond to the bit vector generated by the GROUPING_ID() function found in the HAVING clause.
As used in the HAVING clause, one is added to the value generated by GROUPING_ID(), which enables some consistency in setting the values of the variables that control the output. Without adding one to the value, the N_ALL_DATA variable would be set to 0 to enable output, and some other value, such as –1, to disable it. Increasing this comparison value by one makes it possible to use 0 consistently as a value to disable output.
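The effect of the offset can be sketched with a contrived one-row example similar to Listing 7-14. With two columns in the CUBE, GROUPING_ID() returns 0 through 3, so adding one yields tag values 1 through 4 and leaves 0 free to mean "disabled."

```sql
-- Sketch: GROUPING_ID()+1 maps the four aggregation levels to 1-4.
select grouping_id(a, b) gid
     , grouping_id(a, b) + 1 query_tag
from (select 1 a, 2 b from dual)
group by cube(a, b)
order by gid;
```

A control variable set to 0 can then never match any QUERY_TAG value, which is exactly how the script disables an output category.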
Listing 7-15. GROUPING_ID() to Control Report Output
SQL> variable N_ALL_DATA number
SQL> variable N_AGE_RANGE number
SQL> variable N_INCOME_LEVEL number
SQL> variable N_SUMMARY number
SQL>
SQL> begin
2 -- set values to 0 to disable
3 :N_ALL_DATA := 0; -- 1 to enable
4 :N_AGE_RANGE := 2; -- 2 to enable
5 :N_INCOME_LEVEL := 0; -- 3 to enable
6 :N_SUMMARY := 4; -- 4 to enable
7 end;
8 /
SQL> with tsales as (
2 select /* lst7-15 */
3 s.quantity_sold
4 , s.amount_sold
5 , to_char(mod(cust_year_of_birth,10) * 10 ) || '-' ||
6 to_char((mod(cust_year_of_birth,10) * 10 ) + 10) age_range
7 , nvl(c.cust_income_level,'A: Below 30,000') cust_income_level
8 , p.prod_name
9 , p.prod_desc
10 , p.prod_category
11 , (pf.unit_cost * s.quantity_sold) total_cost
12 , s.amount_sold - (pf.unit_cost * s.quantity_sold) profit
13 from sh.sales s
14 join sh.customers c on c.cust_id = s.cust_id
15 join sh.products p on p.prod_id = s.prod_id
16 join sh.times t on t.time_id = s.time_id
17 join sh.costs pf on
18 pf.channel_id = s.channel_id
19 and pf.prod_id = s.prod_id
20 and pf.promo_id = s.promo_id
21 and pf.time_id = s.time_id
22 where (t.fiscal_year = 2001)
23 )
24 select
25 'Q' || to_char(grouping_id(cust_income_level,age_range)+1) query_tag
26 , prod_category , decode(grouping(cust_income_level),1,
27 'ALL INCOME',cust_income_level) cust_income_level
28 , decode(grouping(age_range),1,'ALL AGE',age_range) age_range
29 , sum(profit) profit
30 from tsales
31 group by prod_category, cube(cust_income_level,age_range)
32 having grouping_id(cust_income_level,age_range)+1
33 in(:N_ALL_DATA,:N_AGE_RANGE,:N_INCOME_LEVEL,:N_SUMMARY)
34 order by prod_category, profit;
QUERY AGE
TAG PRODUCT CATEGORY INCOME LEVEL RANGE PROFIT
------ ------------------ -------------------- -------- ---------------
...
Q2 Hardware K: 250,000 - 299,999 ALL AGE $26,678.00
Q2 Hardware L: 300,000 and above ALL AGE $28,974.28
Q2 Hardware J: 190,000 - 249,999 ALL AGE $43,761.47
...
Q2 Hardware E: 90,000 - 109,999 ALL AGE $135,154.59
Q2 Hardware F: 110,000 - 129,999 ALL AGE $199,270.01
Q4 Hardware ALL INCOME ALL AGE $987,386.78
...
65 rows selected.
To be fair, it is possible to achieve the same results using the GROUPING() function, but doing so requires several tests to be placed in the HAVING clause. The queries of the sample sales history data include only two columns in the CUBE arguments. Because GROUPING() returns either a 1 or a 0, there are two possible values for each of your columns, resulting in four tests in the HAVING clause. This doesn’t seem too bad, but consider what happens when there are three columns: the number of tests goes up to eight. The number of tests required is 2^n, where n is the number of columns or expressions in the arguments to CUBE.
Listing 7-16 shows the HAVING clause as it might appear using GROUPING() rather than GROUPING_ID(). This approach soon becomes unwieldy as the number of arguments to the CUBE extension grows. The four separate tests shown should not be too much trouble to maintain, but if the number of columns in the CUBE arguments increases from two to three, there are then eight tests. This is not code that lends itself well to maintenance.
Listing 7-16. Using GROUPING() instead of GROUPING_ID()
32 having -- bin_to_num() requires 9i+
33 ( bin_to_num(grouping(cust_income_level), grouping(age_range))+1 = :N_ALL_DATA)
34 or ( bin_to_num(grouping(cust_income_level), grouping(age_range))+1 = :N_AGE_RANGE)
35 or ( bin_to_num(grouping(cust_income_level), grouping(age_range))+1 = :N_INCOME_LEVEL)
36 or ( bin_to_num(grouping(cust_income_level), grouping(age_range))+1 = :N_SUMMARY)
EXPERIMENT WITH GROUPING() AND GROUPING_ID()
As an exercise, modify the code from Listing 7-15 so that it adds another column to the arguments to CUBE. Then, modify the call to GROUPING_ID() in the HAVING clause to work with the new column. This requires a new variable as well in the PL/SQL block.
After you have that working, replace the GROUPING_ID() call with all the tests needed to accomplish the same output control with GROUPING(). Do you like the results? Is this code you would like to maintain?
GROUPING SETS() and ROLLUP()
There is yet another method that may be used to obtain the results seen in the previous two examples. The GROUPING SETS() extension to GROUP BY made its debut with Oracle 9i. The entire GROUP BY . . . HAVING clause of the previous example can be replaced with GROUP BY GROUPING SETS(). However, just because you can do something doesn’t mean you should. Let’s look at an example to understand just why you may not want to use GROUPING SETS(). Lines 31 through 33 in Listing 7-15 can be replaced with lines 31 through 36 in Listing 7-17.
Listing 7-17. GROUPING SETS()
SQL> with tsales as (
2 select /* lst7-17 */
3 s.quantity_sold
4 , s.amount_sold
5 , to_char(mod(cust_year_of_birth,10) * 10 ) || '-' ||
6 to_char((mod(cust_year_of_birth,10) * 10 ) + 10) age_range
7 , nvl(c.cust_income_level,'A: Below 30,000') cust_income_level
8 , p.prod_name
9 , p.prod_desc
10 , p.prod_category
11 , (pf.unit_cost * s.quantity_sold) total_cost
12 , s.amount_sold - (pf.unit_cost * s.quantity_sold) profit
13 from sh.sales s
14 join sh.customers c on c.cust_id = s.cust_id
15 join sh.products p on p.prod_id = s.prod_id
16 join sh.times t on t.time_id = s.time_id
17 join sh.costs pf on
18 pf.channel_id = s.channel_id
19 and pf.prod_id = s.prod_id
20 and pf.promo_id = s.promo_id
21 and pf.time_id = s.time_id
22 where (t.fiscal_year = 2001)
23 )
24 select
25 'Q' || to_char(grouping_id(cust_income_level,age_range)+1) query_tag
26 , prod_category, decode(grouping(cust_income_level),1,
27 'ALL INCOME',cust_income_level) cust_income_level
28 , decode(grouping(age_range),1,'ALL AGE',age_range) age_range
29 , sum(profit) profit
30 from tsales
31 group by prod_category, grouping sets(
32 rollup(prod_category), -- sub total by product category
33 (cust_income_level), -- agg by category and income levels only
34 (age_range), -- agg by category and age only
35 (cust_income_level,age_range) -- agg by category, all age and income
36 )
37 --having group_id() < 1
38 order by prod_category, profit;
QUERY AGE
TAG PRODUCT CATEGORY INCOME LEVEL RANGE PROFIT
------ ------------------ -------------------- -------- ---------------
...
Q2 Software/Other E: 90,000 - 109,999 ALL AGE $124,416.04
Q2 Software/Other F: 110,000 - 129,999 ALL AGE $169,482.11
Q4 Software/Other ALL INCOME ALL AGE $873,603.25
Q4 Software/Other ALL INCOME ALL AGE $873,603.25
756 rows selected.
The output shown in Listing 7-17 is similar to that seen when the SQL from Listing 7-15 is executed with all the output categories enabled. This is a major difference between using GROUP BY CUBE HAVING GROUPING_ID() and GROUP BY GROUPING SETS: the former can modify its output simply by setting variables to the correct values, whereas output from the latter cannot be changed except by modifying the SQL or generating it dynamically. Modifying the SQL means there is more code to maintain and more resources consumed in the database. Generating the SQL dynamically is, well, usually just not a good idea if it can be avoided; it consumes more database resources, and it is much harder to troubleshoot when problems arise.
As mentioned previously, the output in Listing 7-17 is similar to that in Listing 7-15, but it is not the same. The last two lines of the output shown in Listing 7-17 are duplicates. Sometimes the GROUPING SETS() extension can cause duplicates to appear in the output. In this case, the duplicates are caused by the ROLLUP(PROD_CATEGORY) line. You can prove this to yourself by removing ROLLUP() from the code in Listing 7-17 and rerunning it. The duplicate lines no longer appear, but the totals for each product category no longer appear either. The solution is to use the GROUP_ID() function to identify the duplicate rows and insert it into a HAVING clause.
The HAVING clause can be seen commented out in Listing 7-17. If you “uncomment” it and then rerun the script, the output appears as expected, without the duplicate rows. Interestingly, if the ROLLUP(PROD_CATEGORY) line is replaced with (NULL), the HAVING clause can be removed and the output appears as expected.
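The behavior of GROUP_ID() can be demonstrated without the sales history schema. In the contrived query below, the same grouping set is deliberately listed twice; GROUP_ID() returns 0 for the first occurrence of each duplicated group and 1 for the second.

```sql
-- Sketch: GROUP_ID() distinguishes duplicate groupings.
select dummy
     , group_id() gid   -- 0 for the first copy of a group, 1 for its duplicate
     , count(*) cnt
from dual
group by grouping sets (dummy, dummy);
-- Adding "having group_id() = 0" would keep only one copy of each group.
```

This is the same mechanism the commented-out HAVING clause in Listing 7-17 uses to suppress the duplicate product category totals.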
The ROLLUP() extension to GROUP BY can also be used by itself to create subtotals at multiple levels of aggregation that otherwise require multiple queries joined by UNION ALL. Suppose that someone from the sales department asks you to create a report showing totals of all purchases by customers whose last name begins with Sul. In addition, there need to be subtotals for each year by customer, each product category by customer, and a grand total of all sales. This kind of task is handled easily by ROLLUP(). Listing 7-18 shows one way to write a query to satisfy this request.
Notice that the DECODE() and GROUPING() functions are again used to indicate subtotal rows. Also, the grand total is forced to appear at the end of the report by ordering on GROUPING(M.CUST_NAME). Because this value is greater than 0 only when the total for all customers is calculated, the grand total appears at the end of the report, as expected.
Listing 7-18. ROLLUP() Subtotals
SQL> with mysales as (
2 select
3 c.cust_last_name ||',' || c.cust_first_name cust_name
4 , p.prod_category
5 , to_char(trunc(time_id,'YYYY'),'YYYY') sale_year
6 , p.prod_name
7 , s.amount_sold
8 from sh.sales s
9 join sh.products p on p.prod_id = s.prod_id
10 join sh.customers c on c.cust_id = s.cust_id
11 where c.cust_last_name like 'Sul%'
12 --where s.time_id = to_date('01/01/2001','mm/dd/yyyy')
13 )
14 select
15 decode(grouping(m.cust_name),1,'GRAND TOTAL',m.cust_name) cust_name
16 ,decode(grouping(m.sale_year),1,'TOTAL BY YEAR',m.sale_year) sale_year
17 ,decode(grouping(m.prod_category),1,'TOTAL BY CATEGORY',
18 m.prod_category) prod_category, sum(m.amount_sold) amount_sold
19 from mysales m
20 group by rollup(m.cust_name, m.prod_category, m.sale_year)
21 order by grouping(m.cust_name), 1,2,3;
CUSTOMER SALE_YEAR PRODUCT CATEGORY AMT SOLD
-------------- ------------- ----------------------------- ------------
...
Sullivan,Rue 1998 Peripherals and Accessories $259.90
Sullivan,Rue 1998 Software/Other $19.59
Sullivan,Rue 2000 Electronics $2,213.30
Sullivan,Rue 2000 Hardware $1,359.06
Sullivan,Rue 2000 Peripherals and Accessories $1,169.94
Sullivan,Rue 2000 Photo $331.33
Sullivan,Rue 2000 Software/Other $933.87
Sullivan,Rue TOTAL BY YEAR Electronics $2,213.30
Sullivan,Rue TOTAL BY YEAR Hardware $1,359.06
Sullivan,Rue TOTAL BY YEAR Peripherals and Accessories $1,429.84
Sullivan,Rue TOTAL BY YEAR Photo $331.33
Sullivan,Rue TOTAL BY YEAR Software/Other $953.46
Sullivan,Rue TOTAL BY YEAR TOTAL BY CATEGORY $6,286.99
GRAND TOTAL TOTAL BY YEAR TOTAL BY CATEGORY $86,994.89
68 rows selected.
GROUP BY Restrictions
Our study of GROUP BY is incomplete without considering what it cannot do. The list of restrictions placed on GROUP BY is not very long. The restrictions are listed in the Oracle 12c SQL Language Reference (http://www.oracle.com/technetwork/indexes/documentation/index.html) for Oracle 12.1. For example, note the following:
- LOB columns, nested tables, and varrays cannot appear in the GROUP BY expression list.
- A scalar subquery expression cannot be used as part of the GROUP BY clause.
- Queries cannot be parallelized if the GROUP BY clause references any object type columns.
SQL queries were constructed to demonstrate the first two restrictions, as shown in Listings 7-19 and 7-20. The error messages clearly show that LOB columns and scalar subqueries cannot be used as part of the GROUP BY clause.
Listing 7-19. GROUP BY Restrictions: LOB Not Allowed
SQL> with lobtest as (
2 select to_clob(d.dname ) dname
3 from scott.emp e
4 join scott.dept d on d.deptno = e.deptno
5 )
6 select l.dname
7 from lobtest l
8* group by l.dname
group by l.dname;
*
ERROR at line 8:
ORA-00932: inconsistent datatypes: expected - got CLOB
Listing 7-20. GROUP BY Restrictions: Scalar Subquery Not Allowed
SQL> select d.dname, count(empno) empcount
2 from scott.emp e
3 join scott.dept d on d.deptno = e.deptno
4 group by (select dname from scott.dept d2 where d2.dname = d.dname)
5 order by d.dname;
group by (select dname from scott.dept d2 where d2.dname = d.dname);
*
ERROR at line 4:
ORA-22818: subquery expressions not allowed here
The final restriction listed earlier appears to be a documentation error. Evidence for this can be seen in Listing 7-21, where the GROUP BY on an OBJECT datatype is being executed in parallel, contrary to what the documentation states. The member function match in the dept_location type body is used to compare the value for city, and this in turn is used by GROUP BY to group employees by CITY. Should you need to create aggregations based on data in an OBJECT column, you can certainly do so as of Oracle 11.1.0.7. Testing has shown that the GROUP BY of Listing 7-21 is not executed in parallel in Oracle 11.1.0.6.
Listing 7-21. GROUP BY on Object Column in Parallel
SQL> create type dept_location_type
2 as object
3 (
4 street_address VARCHAR2(40)
5 , postal_code VARCHAR2(10)
6 , city VARCHAR2(30)
7 , state_province VARCHAR2(10)
8 , country_id CHAR(2)
9 , order member function match (e dept_location_type) return integer
10 );
11 /
Type created.
SQL>
SQL> create or replace type body dept_location_type
2 as order member function match (e dept_location_type) return integer
3 is
4 begin
5 if city < e.city then
6 return -1;
7 elsif city > e.city then
8 return 1;
9 else
10 return 0;
11 end if;
12 end;
13 end;
14 /
Type body created.
SQL>
SQL> create table deptobj
2 as
3 select d.deptno,d.dname
4 from scott.dept d;
Table created.
SQL> alter table deptobj add (dept_location dept_location_type);
Table altered.
SQL> update deptobj set dept_location =
2 dept_location_type('1234 Money St', '97401','Eugene', 'OR', 'US')
3 where deptno=20;
1 row updated.
SQL> update deptobj set dept_location =
2 dept_location_type('459 Durella Street', '97463','Oakridge', 'OR', 'US')
3 where deptno=40;
1 row updated.
SQL> update deptobj set dept_location =
2 dept_location_type('12642 Rex Rd', '97006','Beavertown', 'OR', 'US')
3 where deptno=10;
1 row updated.
SQL> update deptobj set dept_location =
2 dept_location_type('9298 Hamilton Rd', '97140','George', 'WA', 'US')
3 where deptno=30;
1 row updated.
SQL> commit;
Commit complete.
SQL> select /*+ gather_plan_statistics parallel(e 2)*/
2 d.dept_location, count(e.ename) ecount
3 from scott.emp e, deptobj d
4 where e.deptno = d.deptno
5 group by dept_location
6 order by dept_location;
DEPT_LOCATION(STREET_ADDRESS, POSTAL_CODE, CITY, STATE_PROVI ECOUNT
------------------------------------------------------------ ------
DEPT_LOCATION_TYPE('1234 Money St', '97401', 'Eugene', 'OR', 'US') 5
DEPT_LOCATION_TYPE('12642 Rex Rd', '97006', 'Beavertown','OR','US') 3
DEPT_LOCATION_TYPE('9298 Hamilton Rd', '97140', 'George','WA','US') 6
3 rows selected.
PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 3 |
| 1 | PX COORDINATOR | | 1 | | 3 |
| 2 | PX SEND QC (ORDER) | :TQ10002 | 0 | 14 | 0 |
| 3 | SORT GROUP BY | | 0 | 14 | 0 |
| 4 | PX RECEIVE | | 0 | 14 | 0 |
| 5 | PX SEND RANGE | :TQ10001 | 0 | 14 | 0 |
| 6 | HASH GROUP BY | | 0 | 14 | 0 |
|* 7 | HASH JOIN | | 0 | 14 | 0 |
| 8 | BUFFER SORT | | 0 | | 0 |
| 9 | PX RECEIVE | | 0 | 4 | 0 |
| 10 | PX SEND BROADCAST | :TQ10000 | 0 | 4 | 0 |
| 11 | TABLE ACCESS FULL| DEPTOBJ | 1 | 4 | 4 |
| 12 | PX BLOCK ITERATOR | | 0 | 14 | 0 |
|* 13 | TABLE ACCESS FULL | EMP | 0 | 14 | 0 |
---------------------------------------------------------------------------
Summary
Oracle has provided some excellent tools for the SQL practitioner in the form of extensions to the GROUP BY clause. Not only do they reduce code, they improve database efficiency. They do, however, take some dedication and practice to learn how best to use them. The introduction here to the advanced grouping features is by no means comprehensive. Most of these features can be combined for many different effects—far more than is practical to include in a book. Please endeavor to make use of these features in your own applications and continue to experiment with them based on what you have learned in this chapter.
1 If there are no rows in the table, GROUP BY CUBE() returns zero rows.
2 The NVL() function is used to provide a default value for sh.customers.cust_income_level so that output of examples may be easier to compare.