CHAPTER 7

Advanced Grouping

The GROUP BY clause is a venerable member of the SQL statement family. After learning basic SELECT statements, it is one of the first specialized parts of SQL that many practitioners cut their teeth on when learning to create aggregations from raw data and transform that data into useful information.

At times in this chapter, you may think the examples seem trivial. These examples are constructed with the purpose of demonstrating the results of different facets of the GROUP BY clause, without any requirement to focus needlessly on the values in the output. Although there are many excellent examples based on financial data throughout the Oracle documentation, those examples are sometimes difficult to follow because too much attention is focused on the output values rather than how they were obtained. The goal in this chapter is to help you see the tremendous possibilities for how grouping techniques can assist you in summarizing data easily and effectively.

Basic GROUP BY Usage

If you needed to know the number of employees in each department of your company, you might use SQL such as that in Listing 7-1, which produces one row of output for each row in the DEPT table, along with a count of the employees in each department. The output includes the OPERATIONS department, which does not have any employees. This row would not appear in the output of a standard join, so a LEFT OUTER JOIN was used to include rows from the DEPT table that have no matching rows in the EMP table.

Listing 7-1.  Basic GROUP BY

SQL> select d.dname, count(empno) empcount
  2  from scott.dept d
  3  left outer join scott.emp e on d.deptno = e.deptno
  4  group by d.dname
  5  order by d.dname;

DNAME            EMPCOUNT
-------------- ----------
ACCOUNTING              3
OPERATIONS              0
RESEARCH                5
SALES                   6

The columns used in the GROUP BY must match the set of columns in the SELECT list on which no aggregation functions are used. In Listing 7-1, for example, there are two columns in the SELECT list: dname and empno. The COUNT() function is used to perform aggregation on the EMPNO column so that the total number of employees in each department can be determined. The only other column in the SELECT list, dname, must then be included in the GROUP BY clause. Failure to include the correct columns results in an error condition, as seen in Listing 7-2.

Listing 7-2.  GROUP BY Columns Requirement

SQL> select d.dname, d.loc, count(empno) empcount
  2    from scott.emp e
  3    join scott.dept d on d.deptno = e.deptno
  4    group by d.dname;
select d.dname, d.loc, count(empno) empcount
                *
ERROR at line 1:
ORA-00979: not a GROUP BY expression

There is a very important point you need to understand about GROUP BY: Although the output of a SELECT statement that includes a GROUP BY clause may always appear to be sorted, you cannot expect GROUP BY to always return your data in sorted order. If the output must be sorted, you must use an ORDER BY clause. This has always been the case with Oracle.

Although the sorting behavior of GROUP BY is not mentioned specifically in the Oracle 7 documentation, there was little room for doubt by the time the 9i documentation was published; it states specifically that GROUP BY does not guarantee the order of the result set.

Listing 7-3 provides a good example of GROUP BY not returning results in sorted order. Notice that the data are not sorted. The only way to guarantee sorted data is to include the ORDER BY clause, which must follow the GROUP BY clause.

Listing 7-3.  GROUP BY Not Sorted

SQL> select deptno, count(*)
  2  from emp
  3  group by deptno;

    DEPTNO   COUNT(*)
---------- ----------
        30          6
        20          5
        10          3

The GROUP BY clause may just be one of the most underappreciated workhorses of all the SELECT clauses. It is easy to take for granted because, after you understand how to include it in a SELECT statement, it is quite easy to use. Perhaps a better appreciation for just how much work it does (and how much work it saves you from doing) can be gained by trying to write the SELECT statement in Listing 7-1 without using the GROUP BY clause. There are likely many different methods by which this can be done.

Think for just a moment about how you might write that SELECT statement. One such attempt, made by me, your intrepid author, appears in Listing 7-4. This is not SQL that most people would care to maintain. As you can see, it creates nearly the same output as that found in Listing 7-1. In addition to being somewhat convoluted, it raises a question: What happens when a new department is added to the DEPT table? Just so there’s no mistake, when I say this example is convoluted, what I really mean is, “Don’t do this!” Although this query may provide the correct result set, it is a very poor way to get the job done; it is more an example of how not to write a query.

The answer to the question “What happens when a new department is added to the DEPT table?,” of course, is that you then need to modify the SQL statement in Listing 7-4 to accommodate the change in the data. Although it is possible to use dynamic SQL to duplicate the functionality of the SQL to cope with changes to the DEPT table data, doing so creates a piece of SQL that is even more difficult to follow and even harder to maintain.

Listing 7-4.  Convoluted SQL

SQL> select /* lst7-4 */
  2  distinct dname, decode(
  3     d.deptno,
  4     10, (select count(*) from emp where deptno= 10),
  5     20, (select count(*) from emp where deptno= 20),
  6     30, (select count(*) from emp where deptno= 30),
  7     (select count(*) from emp where deptno not in (10,20,30))
  8  ) dept_count
  9  from (select distinct deptno from emp) d
 10  join dept d2 on d2.deptno = d.deptno;

DNAME          DEPT_COUNT
-------------- ----------
SALES                   6
ACCOUNTING              3
RESEARCH                5

SQL> @pln lst7-4
----------------------------------------------------------------------------
| Id  | Operation                     | Name    | Starts | E-Rows | A-Rows |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |         |      1 |        |      3 |
|   1 |  SORT AGGREGATE               |         |      1 |      1 |      1 |
|*  2 |   TABLE ACCESS FULL           | EMP     |      1 |      3 |      3 |
|   3 |    SORT AGGREGATE             |         |      1 |      1 |      1 |
|*  4 |     TABLE ACCESS FULL         | EMP     |      1 |      5 |      5 |
|   5 |      SORT AGGREGATE           |         |      1 |      1 |      1 |
|*  6 |       TABLE ACCESS FULL       | EMP     |      1 |      6 |      6 |
|   7 |        SORT AGGREGATE         |         |      0 |      1 |      0 |
|*  8 |         TABLE ACCESS FULL     | EMP     |      0 |      4 |      0 |
|   9 |  HASH UNIQUE                  |         |      1 |      3 |      3 |
|  10 |   MERGE JOIN SEMI             |         |      1 |      3 |      3 |
|  11 |    TABLE ACCESS BY INDEX ROWID| DEPT    |      1 |      4 |      4 |
|  12 |     INDEX FULL SCAN           | PK_DEPT |      1 |      4 |      4 |
|* 13 |    SORT UNIQUE                |         |      4 |     14 |      3 |
|  14 |     TABLE ACCESS FULL         | EMP     |      1 |     14 |     14 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("DEPTNO"=10)
   4 - filter("DEPTNO"=20)
   6 - filter("DEPTNO"=30)
   8 - filter(("DEPTNO"<>30 AND "DEPTNO"<>20 AND "DEPTNO"<>10))
  13 - access("D2"."DEPTNO"="DEPTNO")
       filter("D2"."DEPTNO"="DEPTNO")

In addition to immensely simplifying the SQL that must be written, the GROUP BY clause eliminates unnecessary I/O in the database. Take another look at Listing 7-4. Notice that a full table scan was performed on the EMP table five times. If you think this seems rather excessive, you are on the right track. Listing 7-5 shows the same SQL as executed in Listing 7-1. There is still a full table scan taking place against the EMP table, but it takes place only once, not five times as in the convoluted SQL in Listing 7-4.

Listing 7-5.  GROUP BY Execution Plan

SQL> select /* lst7-5 */
  2     d.dname
  3     , count(empno) empcount
  4  from scott.emp e
  5  join scott.dept d on d.deptno = e.deptno
  6  group by d.dname
  7  order by d.dname;
DNAME            EMPCOUNT
-------------- ----------
ACCOUNTING              3
RESEARCH                5
SALES                   6

SQL> @pln lst7-5

----------------------------------------------------------------------------
| Id  | Operation                     | Name    | Starts | E-Rows | A-Rows |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |         |      1 |        |      3 |
|   1 |  SORT GROUP BY                |         |      1 |      4 |      3 |
|   2 |   MERGE JOIN                  |         |      1 |     14 |     14 |
|   3 |    TABLE ACCESS BY INDEX ROWID| DEPT    |      1 |      4 |      4 |
|   4 |     INDEX FULL SCAN           | PK_DEPT |      1 |      4 |      4 |
|*  5 |    SORT JOIN                  |         |      4 |     14 |     14 |
|   6 |     TABLE ACCESS FULL         | EMP     |      1 |     14 |     14 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   5 - access("D"."DEPTNO"="E"."DEPTNO")
       filter("D"."DEPTNO"="E"."DEPTNO")

Don’t underestimate the power of the GROUP BY clause! As seen in this section, the GROUP BY clause does the following:

  • Makes the SQL more readable.
  • Is easier to write than using many correlated subqueries.
  • Eliminates the need to access the same objects repeatedly (which leads to better performance).
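That last point, the single pass over the data, is the heart of it: a grouped count needs to touch each row only once. Here is a minimal sketch of that idea outside the database, in Python with made-up (dname, empno) rows; it illustrates the concept, not Oracle’s implementation:

```python
from collections import Counter

# Hypothetical (dname, empno) rows, as a join of DEPT and EMP might return them
rows = [("SALES", 7499), ("SALES", 7654), ("RESEARCH", 7369),
        ("ACCOUNTING", 7782), ("RESEARCH", 7566)]

# A single pass over the rows: increment a per-department counter as each
# row is fetched; no row is ever read twice
empcount = Counter(dname for dname, _ in rows)

print(sorted(empcount.items()))
# [('ACCOUNTING', 1), ('RESEARCH', 2), ('SALES', 2)]
```

Contrast this with Listing 7-4, which rereads the EMP data once per department.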

HAVING Clause

Results generated by GROUP BY may be restricted by the criteria found in the HAVING clause. The HAVING clause is quite versatile, resembling the WHERE clause in the conditions that may be used. Functions, operators, and subqueries may all be used in the HAVING clause. Listing 7-6 shows a query that returns all departments that have hired at least five employees since the beginning of the first full year after hiring began.

It’s important to note that the HAVING clause is applied to the data after they are grouped, whereas the WHERE clause acts on the rows as they are fetched, before they are grouped. That the HAVING operation is executed after all data have been fetched can be seen as the FILTER in step 1 of the execution plan shown in Listing 7-6. Notice that an operator, a function, and subqueries have all been used in the HAVING clause.

Listing 7-6.  HAVING Clause

SQL> select /* lst7-6 */
  2    d.dname
  3    , trunc(e.hiredate,'YYYY') hiredate
  4    , count(empno) empcount
  5  from scott.emp e
  6  join scott.dept d on d.deptno = e.deptno
  7  group by d.dname, trunc(e.hiredate,'YYYY')
  8  having
  9    count(empno) >= 5
 10    and trunc(e.hiredate,'YYYY') between
 11      (select min(hiredate) from scott.emp)
 12      and
 13      (select max(hiredate) from scott.emp)
 14  order by d.dname;

DNAME          HIREDATE              EMPCOUNT
-------------- ------------------- ----------
SALES          01/01/1981 00:00:00          6

SQL> @pln lst7-6

-----------------------------------------------------------------------------
| Id  | Operation                      | Name    | Starts | E-Rows | A-Rows |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |         |      1 |        |      1 |
|*  1 |  FILTER                        |         |      1 |        |      1 |
|   2 |   SORT GROUP BY                |         |      1 |      1 |      6 |
|   3 |    MERGE JOIN                  |         |      1 |     14 |     14 |
|   4 |     TABLE ACCESS BY INDEX ROWID| DEPT    |      1 |      4 |      4 |
|   5 |      INDEX FULL SCAN           | PK_DEPT |      1 |      4 |      4 |
|*  6 |     SORT JOIN                  |         |      4 |     14 |     14 |
|   7 |      TABLE ACCESS FULL         | EMP     |      1 |     14 |     14 |
|   8 |   SORT AGGREGATE               |         |      1 |      1 |      1 |
|   9 |    TABLE ACCESS FULL           | EMP     |      1 |     14 |     14 |
|  10 |   SORT AGGREGATE               |         |      1 |      1 |      1 |
|  11 |    TABLE ACCESS FULL           | EMP     |      1 |     14 |     14 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter((COUNT(*)>=5 AND TRUNC(INTERNAL_FUNCTION("E"."HIREDATE"),
        'fmyyyy')>= AND TRUNC(INTERNAL_FUNCTION("E"."HIREDATE"),'fmyyyy')<=))
   6 - access("D"."DEPTNO"="E"."DEPTNO")
       filter("D"."DEPTNO"="E"."DEPTNO")
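The execution order just described (WHERE trims rows as they are fetched; HAVING trims groups after aggregation) can be seen with a tiny in-memory table. The sketch below uses Python’s sqlite3 module purely for convenience, with hypothetical data; the WHERE/HAVING semantics shown are standard SQL, not anything Oracle specific:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table emp (deptno int, sal int)")
conn.executemany("insert into emp values (?, ?)",
                 [(10, 1000), (10, 3000), (20, 2000), (20, 2500), (30, 800)])

# WHERE discards the sal <= 900 row before any grouping occurs, so
# department 30 never forms a group; HAVING then filters the groups that remain
rows = conn.execute("""
    select deptno, count(*) empcount
    from emp
    where sal > 900
    group by deptno
    having count(*) >= 2
    order by deptno
""").fetchall()

print(rows)  # [(10, 2), (20, 2)]
```

Note that the count(*) >= 2 condition cannot be moved into the WHERE clause, because the count does not exist until the rows have been grouped.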

“New” GROUP BY Functionality

At times, it’s necessary to write SQL that appears as unruly as the convoluted example in Listing 7-4 so that the desired output can be obtained. The need for writing such unwieldy SQL has become much less frequent because of the advanced functionality Oracle has included in SQL during the past few years. Much of what is covered in this chapter is not actually new; it has been available for quite some time.

You can start exploring some of the advanced grouping functionality in Oracle Database by experimenting with the CUBE and ROLLUP extensions to GROUP BY, and the GROUPING function. It takes a little effort to get started, because the benefits of newer functionality are not always clear until you spend some time learning to use them.

CUBE Extension to GROUP BY

The CUBE extension is not exactly a newcomer to Oracle; it was first introduced in Oracle 8i in 1999. When used with a GROUP BY clause, it causes all possible combinations of the elements included in its arguments to be considered for each row. This operation generates more rows than actually exist in the table.

Let’s look at an example that generates all possible combinations of FIRST_NAME and LAST_NAME for each row in the HR.EMPLOYEES table. The CUBE extension was intended for use in generating cross-tab reports with lots of numbers and dollar signs. When trying to understand new functionality, I find it helps to dumb down the SQL a bit so I can see what’s going on without getting distracted by subtotals.

Examine Listing 7-7 to see the results of using CUBE as described with the HR.EMPLOYEES table. Notice that there are three rows returned for most employees. In all, 301 rows are returned, even though there are only 107 rows in the table. The first query in the listing shows the basic GROUP BY, and the resulting data, for comparison.

Listing 7-7.  CUBE Operation on HR.EMPLOYEES

SQL> select last_name
  2       , first_name
  3  from hr.employees
  4  group by first_name, last_name;

LAST_NAME                 FIRST_NAME
------------------------- --------------------
Abel                      Ellen
Ande                      Sundar
Atkinson                  Mozhe
Austin                    David
Baer                      Hermann
Baida                     Shelli
Banda                     Amit
Bates                     Elizabeth
Bell                      Sarah
Bernstein                 David
Bissot                    Laura
Bloom                     Harrison
Bull                      Alexis
Cabrio                    Anthony
Cambrault                 Gerald
Cambrault                 Nanette
Chen                      John
Chung                     Kelly
Colmenares                Karen
Davies                    Curtis
De Haan                   Lex
Dellinger                 Julia
Dilly                     Jennifer
Doran                     Louise
Ernst                     Bruce
Errazuriz                 Alberto
Everett                   Britney
Faviet                    Daniel
Fay                       Pat
Feeney                    Kevin
Fleaur                    Jean
Fox                       Tayler
Fripp                     Adam
Gates                     Timothy
Gee                       Ki
Geoni                     Girard
Gietz                     William
Grant                     Douglas
Grant                     Kimberely
Greenberg                 Nancy
Greene                    Danielle
Hall                      Peter
Hartstein                 Michael
Higgins                   Shelley
Himuro                    Guy
Hunold                    Alexander
Hutton                    Alyssa
Johnson                   Charles
Jones                     Vance
Kaufling                  Payam
Khoo                      Alexander
King                      Janette
King                      Steven
Kochhar                   Neena
Kumar                     Sundita
Ladwig                    Renske
Landry                    James
Lee                       David
Livingston                Jack
Lorentz                   Diana
Mallin                    Jason
Markle                    Steven
Marlow                    James
Marvins                   Mattea
Matos                     Randall
Mavris                    Susan
McCain                    Samuel
McEwen                    Allan
Mikkilineni               Irene
Mourgos                   Kevin
Nayer                     Julia
OConnell                  Donald
Olsen                     Christopher
Olson                     TJ
Ozer                      Lisa
Partners                  Karen
Pataballa                 Valli
Patel                     Joshua
Perkins                   Randall
Philtanker                Hazel
Popp                      Luis
Rajs                      Trenna
Raphaely                  Den
Rogers                    Michael
Russell                   John
Sarchand                  Nandita
Sciarra                   Ismael
Seo                       John
Sewall                    Sarath
Smith                     Lindsey
Smith                     William
Stiles                    Stephen
Sullivan                  Martha
Sully                     Patrick
Taylor                    Jonathon
Taylor                    Winston
Tobias                    Sigal
Tucker                    Peter
Tuvault                   Oliver
Urman                     Jose Manuel
Vargas                    Peter
Vishney                   Clara
Vollman                   Shanta
Walsh                     Alana
Weiss                     Matthew
Whalen                    Jennifer
Zlotkey                   Eleni

107 rows selected.

SQL> set autotrace on statistics

SQL> with emps as (
  2  select /* lst7-7 */
  3         last_name
  4       , first_name
  5     from hr.employees
  6     group by cube(first_name,last_name)
  7  )
  8  select rownum
  9     , last_name
 10     , first_name
 11  from emps;

    ROWNUM LAST_NAME                 FIRST_NAME
---------- ------------------------- --------------------
         1
         2                           Ki
         3                           TJ
         4                           Den
         5                           Guy
         6                           Lex
         7                           Pat
...
       231 Vargas
       232 Vargas                    Peter
       233 Whalen
       234 Whalen                    Jennifer
       235 De Haan
       236 De Haan                   Lex
       237 Everett
       238 Everett                   Britney
...
301 rows selected.

Statistics
---------------------------------------------------
        759  recursive calls
          0  db block gets
        188  consistent gets
          9  physical reads
          0  redo size
       5990  bytes sent via SQL*Net to client
        557  bytes received via SQL*Net from client
          5  SQL*Net roundtrips to/from client
          7  sorts (memory)
          0  sorts (disk)
        301  rows processed

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------
| Id  | Operation                | Name        | Starts | E-Rows | A-Rows |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT         |             |      1 |        |    301 |
|   1 |  COUNT                   |             |      1 |        |    301 |
|   2 |   VIEW                   |             |      1 |    107 |    301 |
|   3 |    SORT GROUP BY         |             |      1 |    107 |    301 |
|   4 |     GENERATE CUBE        |             |      1 |    107 |    428 |
|   5 |      SORT GROUP BY NOSORT|             |      1 |    107 |    107 |
|   6 |       INDEX FULL SCAN    | EMP_NAME_IX |      1 |    107 |    107 |
---------------------------------------------------------------------------

Table 7-1 shows why there are three rows returned for each name pair. For each LAST_NAME, FIRST_NAME pair, CUBE substitutes NULL for each element in turn. The rows generated by CUBE are referred to in the Oracle documentation as superaggregate rows, which are recognizable by the NULL values placed in the columns being operated on. The results described in Table 7-1 appear in the output in Listing 7-7 as a result of the GROUP BY CUBE(FIRST_NAME,LAST_NAME) operation.

Table 7-1. CUBE Operation

First Name   Last Name
----------   ---------
Vance        Jones
Vance        NULL
NULL         Jones

Did you notice that the first row returned in Listing 7-7 contains NULL for both LAST_NAME and FIRST_NAME? When all possible combinations of a pair of arguments to CUBE are considered, as in Listing 7-7, a (NULL, NULL) combination is generated for each row by the GENERATE CUBE operation. These 428 rows are then processed by the SORT GROUP BY operation, which removes all but one of the (NULL, NULL) rows to produce the final 301 rows that satisfy the query.
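The expand-then-deduplicate sequence can be mimicked in a few lines of Python. This is an illustration of the logic only, using made-up names; it is not how Oracle implements GENERATE CUBE internally:

```python
from itertools import product

# Hypothetical sample rows: (first_name, last_name) pairs
rows = [("Vance", "Jones"), ("Lex", "De Haan")]

def generate_cube(rows):
    """For each row, emit every combination of the two grouping columns
    with NULL (None) substituted in turn -- including the (NULL, NULL) pair."""
    expanded = []
    for first, last in rows:
        for keep_first, keep_last in product((True, False), repeat=2):
            expanded.append((first if keep_first else None,
                             last if keep_last else None))
    return expanded

expanded = generate_cube(rows)   # 4 combinations per source row: 8 rows
distinct = set(expanded)         # grouping collapses duplicates, leaving
                                 # a single (None, None) row: 7 rows
print(len(expanded), len(distinct))  # 8 7
```

Applied to the 107 distinct name pairs in HR.EMPLOYEES, the expansion step yields 428 rows and the deduplication step 301, matching the A-Rows shown for the GENERATE CUBE and SORT GROUP BY operations in Listing 7-7.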

Knowing how CUBE operates, you can predict how many rows should be created when using GROUP BY CUBE. Listing 7-8 shows that the number of rows returned can be predicted by summing the counts of distinct first names, distinct last names, and distinct full names, and then adding one to account for the (NULL, NULL) pair.

Listing 7-8.  Predicting CUBE Return Rows

SQL> with counts as (
  2     select
  3             count(distinct first_name) first_name_count
  4             , count(distinct last_name) last_name_count
  5             , count(distinct(first_name||last_name)) full_name_count
  6     from hr.employees
  7  )
  8  select first_name_count
  9        ,last_name_count
 10        ,full_name_count
 11        ,first_name_count + last_name_count
 12         + full_name_count + 1 total_count
 13  from counts;

FIRST_NAME_COUNT LAST_NAME_COUNT FULL_NAME_COUNT TOTAL_COUNT
---------------- --------------- --------------- -----------
              91             102             107         301

You can simulate the operation of CUBE by using SQL to reproduce the steps taken by the database, both to see how the operation works and to see just how much work the database is saving you by using GROUP BY CUBE.

By examining the execution plan shown in Listing 7-7, you can see that the SORT GROUP BY NOSORT operation (step 5) returns 107 rows to the GENERATE CUBE operation (step 4), which in turn generates 428 rows. Why 428 rows? Listing 7-9 shows that 428 is the expected number of rows when all combinations of LAST_NAME and FIRST_NAME are generated. The GROUP BY then reduces the output to 301 rows, just as the CUBE extension did, but with an important difference: The manual method of UNION ALL and GROUP BY used in Listing 7-9 required three full scans of the EMP_NAME_IX index and one full scan of the EMP_EMAIL_UK index. Contrast this with the single full scan of the EMP_NAME_IX index in Listing 7-7 as performed by the CUBE extension.

The CUBE extension didn’t merely reduce the SQL required to generate the same data as the UNION ALL and GROUP BY combination; it also reduced the number of full index scans from four to one. For one branch, the optimizer chose to use the EMP_EMAIL_UK index rather than the EMP_NAME_IX index, resulting in ten physical reads rather than the nine seen in Listing 7-7. With the small dataset in the Oracle demo schemas, there is no large difference in execution time between the example queries. With large datasets, however, the effect of using four INDEX FULL SCAN operations rather than just one is quite obvious.

Listing 7-9.  Generate CUBE Rows with UNION ALL

SQL> with emps as (
  2     select last_name, first_name from hr.employees
  3  ) ,
  4  mycube as (
  5    select last_name, first_name from emps
  6    union all
  7    select last_name, null first_name from emps
  8    union all
  9    select null last_name, first_name from emps
 10    union all
 11    select null last_name, null first_name from emps
 12  )
 13  select /*+ gather_plan_statistics */ *
 14  from mycube
 15  group by last_name, first_name;

LAST_NAME                 FIRST_NAME
------------------------- --------------------
Atkinson                  Mozhe
Bissot                    Laura
Grant                     Kimberely
...
301 rows selected.

Statistics
----------------------------------------------------------
        759  recursive calls
          0  db block gets
        191  consistent gets
         10  physical reads
          0  redo size
       5477  bytes sent via SQL*Net to client
        557  bytes received via SQL*Net from client
          5  SQL*Net roundtrips to/from client
          6  sorts (memory)
          0  sorts (disk)
        301  rows processed

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------
| Id  | Operation          | Name         | Starts | E-Rows | A-Rows
---------------------------------------------------------------------
|   0 | SELECT STATEMENT   |              |      1 |        |    301
|   1 |  HASH GROUP BY     |              |      1 |    428 |    301
|   2 |   VIEW             |              |      1 |    428 |    428
|   3 |    UNION-ALL       |              |      1 |        |    428
|   4 |     INDEX FULL SCAN| EMP_NAME_IX  |      1 |    107 |    107
|   5 |     INDEX FULL SCAN| EMP_NAME_IX  |      1 |    107 |    107
|   6 |     INDEX FULL SCAN| EMP_NAME_IX  |      1 |    107 |    107
|   7 |     INDEX FULL SCAN| EMP_EMAIL_UK |      1 |    107 |    107
----------------------------------------------------------------------

Putting CUBE to Work

When teaching us a new word in fourth-grade English class, Mrs. Draper would say, “Now use it in a sentence.” Much like that, you now need to put the CUBE extension to practical use. It was fun to see what it does and just how much work it saves you, but now you need to see how it applies to real requirements.

When using the GROUP BY clause to perform aggregations, you have probably written several similar SQL statements, just so you could see the aggregations based on different sets of columns, much like what is included in Listing 7-9. You already know that the CUBE extension can eliminate a lot of work in the database, so now let’s put it to “real-world” use with the sales history demo data.

The sales history (SH) schema contains sales data for the years 1998 through 2001. You need to provide a report to satisfy the following request: Please show me all sales data for the year 2001. I would like to see sales summarized by product category, with aggregates based on ten-year customer age ranges and income levels, as well as summaries broken out by income level regardless of age group, and by age group regardless of income level.

Your task probably seems daunting at first, but you know all the data are available. You need to build a query using the COSTS, CUSTOMERS, PRODUCTS, SALES, and TIMES tables. (Now would be a good time to put this book down and try your hand at building such a query.) Perhaps you would create a query like the one in Listing 7-10, because this is a common type of solution for such a request; prior to the introduction of the CUBE extension, it is the style of query that was most often used to satisfy such requests.

Listing 7-10.  UNION ALL Query of Sales Data

SQL> with tsales as (
  2  select /* lst7-10 */
  3      s.quantity_sold
  4      , s.amount_sold
  5      , to_char(mod(cust_year_of_birth,10) * 10 ) || '-' ||
  6        to_char((mod(cust_year_of_birth,10) * 10 ) + 10) age_range
  7      , nvl(c.cust_income_level,'A: Below 30,000') cust_income_level
  8      , p.prod_name
  9      , p.prod_desc
 10      , p.prod_category
 11      , (pf.unit_cost * s.quantity_sold)  total_cost
 12      , s.amount_sold - (pf.unit_cost * s.quantity_sold)  profit
 13    from sh.sales s
 14    join sh.customers c on c.cust_id = s.cust_id
 15    join sh.products p on p.prod_id = s.prod_id
 16    join sh.times t on t.time_id = s.time_id
 17    join sh.costs pf on
 18      pf.channel_id = s.channel_id
 19      and pf.prod_id = s.prod_id
 20      and pf.promo_id = s.promo_id
 21      and pf.time_id = s.time_id
 22    where  (t.fiscal_year = 2001)
 23  )
 24  , gb as (
 25    select -- Q1 - all categories by cust income and age range
 26      'Q1' query_tag
 27      , prod_category
 28      , cust_income_level
 29      , age_range
 30      , sum(profit) profit
 31    from tsales
 32    group by prod_category, cust_income_level, age_range
 33    union all
 34    select -- Q2 - all categories by cust age range
 35      'Q2' query_tag
 36      , prod_category
 37      , 'ALL INCOME' cust_income_level
 38      , age_range
 39      , sum(profit) profit
 40    from tsales
 41    group by prod_category, 'ALL INCOME', age_range
 42    union all
  43    select -- Q3 - all categories by cust income
 44      'Q3' query_tag
 45      , prod_category
 46      , cust_income_level
 47      , 'ALL AGE' age_range
 48      , sum(profit) profit
 49    from tsales
 50    group by prod_category, cust_income_level, 'ALL AGE'
 51    union all
 52    select -- Q4 - all categories
 53      'Q4' query_tag
 54      , prod_category
 55      , 'ALL INCOME' cust_income_level
 56      , 'ALL AGE' age_range
 57      , sum(profit) profit
 58    from tsales
 59    group by prod_category, 'ALL INCOME', 'ALL AGE'
 60  )
 61  select *
 62  from gb
 63 order by prod_category, profit;

QUERY                                          AGE
TAG    PRODUCT CATEGORY   INCOME LEVEL         RANGE             PROFIT
------ ------------------ -------------------- -------- ---------------
...
Q2     Hardware           K: 250,000 - 299,999 ALL AGE       $26,678.00
Q2     Hardware           L: 300,000 and above ALL AGE       $28,974.28
Q1     Hardware           F: 110,000 - 129,999 70-80         $30,477.16
Q2     Hardware           J: 190,000 - 249,999 ALL AGE       $43,761.47
Q2     Hardware           B: 30,000 - 49,999   ALL AGE       $53,612.04
Q2     Hardware           A: Below 30,000      ALL AGE       $55,167.88
Q2     Hardware           I: 170,000 - 189,999 ALL AGE       $57,089.05
Q2     Hardware           C: 50,000 - 69,999   ALL AGE       $76,612.64
Q3     Hardware           ALL INCOME           60-70         $85,314.04
Q3     Hardware           ALL INCOME           10-20         $90,849.87
Q3     Hardware           ALL INCOME           0-10          $92,207.47
Q3     Hardware           ALL INCOME           50-60         $93,811.96
Q3     Hardware           ALL INCOME           80-90         $95,391.82
Q2     Hardware           H: 150,000 - 169,999 ALL AGE       $95,437.74
Q3     Hardware           ALL INCOME           40-50         $97,492.51
Q3     Hardware           ALL INCOME           20-30        $101,140.69
Q2     Hardware           D: 70,000 - 89,999   ALL AGE      $102,940.44
Q3     Hardware           ALL INCOME           30-40        $102,946.85
Q3     Hardware           ALL INCOME           90-100       $110,310.69
Q2     Hardware           G: 130,000 - 149,999 ALL AGE      $112,688.64
Q3     Hardware           ALL INCOME           70-80        $117,920.88
Q2     Hardware           E: 90,000 - 109,999  ALL AGE      $135,154.59
Q2     Hardware           F: 110,000 - 129,999 ALL AGE      $199,270.01
Q4     Hardware           ALL INCOME           ALL AGE      $987,386.78
...
714 rows selected.
Elapsed: 00:00:14.53

Statistics
----------------------------------------------------------
      18464  recursive calls
       4253  db block gets
      22759  consistent gets
      10521  physical reads
       4216  redo size
      25086  bytes sent via SQL*Net to client
        601  bytes received via SQL*Net from client
          9  SQL*Net roundtrips to/from client
        174  sorts (memory)
          0  sorts (disk)
        714  rows processed

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------
| Id |Operation                 |Name              |Starts |E-Rows |A-Rows |
----------------------------------------------------------------------------
|   0|SELECT STATEMENT          |                  |     1 |       |   714 |
|   1| TEMP TABLE TRANSFORMATION|                  |     1 |       |   714 |
|   2|  LOAD AS SELECT          |                  |     1 |       |     0 |
|*  3|   HASH JOIN              |                  |     1 | 17116 |   258K|
|   4|    TABLE ACCESS FULL     |PRODUCTS          |     1 |    72 |    72 |
|*  5|    HASH JOIN             |                  |     1 | 17116 |   258K|
|*  6|     HASH JOIN            |                  |     1 | 17116 |   258K|
|*  7|      TABLE ACCESS FULL   |TIMES             |     1 |   304 |   364 |
|   8|      PARTITION RANGE AND |                  |     1 | 82112 |   259K|
|*  9|       HASH JOIN          |                  |     4 | 82112 |   259K|
|  10|        TABLE ACCESS FULL |COSTS             |     4 | 82112 | 29766 |
|  11|        TABLE ACCESS FULL |SALES             |     4 |   918K|   259K|
|  12|     TABLE ACCESS FULL    |CUSTOMERS         |     1 | 55500 | 55500 |
|  13|  SORT ORDER BY           |                  |     1 |    16 |   714 |
|  14|   VIEW                   |                  |     1 |    16 |   714 |
|  15|    UNION-ALL             |                  |     1 |       |   714 |
|  16|     HASH GROUP BY        |                  |     1 |     3 |   599 |
|  17|      VIEW                |                  |     1 | 17116 |   258K|
|  18|       TABLE ACCESS FULL  |SYS_TEMP_0FD9D6620|     1 | 17116 |   258K|
|  19|     HASH GROUP BY        |                  |     1 |     4 |    60 |
|  20|      VIEW                |                  |     1 | 17116 |   258K|
|  21|       TABLE ACCESS FULL  |SYS_TEMP_0FD9D6620|     1 | 17116 |   258K|
|  22|     HASH GROUP BY        |                  |     1 |     4 |    50 |
|  23|      VIEW                |                  |     1 | 17116 |   258K|
|  24|       TABLE ACCESS FULL  |SYS_TEMP_0FD9D6620|     1 | 17116 |   258K|
|  25|     HASH GROUP BY        |                  |     1 |     5 |     5 |
|  26|      VIEW                |                  |     1 | 17116 |   258K|
|  27|       TABLE ACCESS FULL  |SYS_TEMP_0FD9D6620|     1 | 17116 |   258K|
----------------------------------------------------------------------------

Looking at Listing 7-10, notice four separate queries joined by the UNION ALL operator. These queries are labeled Q1 through Q4. The output from the query includes a QUERY_TAG column so that the results from each separate query can be identified clearly in the output. The customer is happy: the output is exactly what was asked for. The query can also be changed easily to report data for any year.

The operations folks who run the data center, however, are not so happy with this new report. When you take a look at the query statistics for the SQL, you can understand why they may not hold this report in high regard. Maybe it’s the 10,521 physical reads that concern them. If the query were run only once, this would not be a problem, but the marketing folks are running this query multiple times daily to report on different years, trying to discover sales trends, and it is causing all sorts of havoc as IO rates and response times increase for other users of the database.

Notice there are four table scans taking place in the execution plan. The factored subquery tsales allows the optimizer to create a temporary table that can then be used by all the queries in the gb subquery, but the use of UNION ALL makes it necessary to do four full table scans on that table, resulting in a lot of database IO.

Thinking back on your earlier experiment with CUBE, you know that multiple queries, each doing a GROUP BY and joined by UNION ALL, can be replaced with one query using GROUP BY with the CUBE extension. The report requires summaries based on all possible combinations of the CUST_INCOME_LEVEL and AGE_RANGE columns output from the tsales subquery, which is exactly what CUBE generates. The CUBE extension can accomplish the same result, but with less code and less database IO.
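In general terms, the rewrite looks like this (a schematic sketch, not the report query itself; t, c1, and c2 are placeholders standing in for tsales, CUST_INCOME_LEVEL, and AGE_RANGE):

```sql
-- four UNION ALL branches, each requiring its own pass over t ...
select c1, c2, sum(profit) from t group by c1, c2   -- Q1: detail
union all
select c1, null, sum(profit) from t group by c1     -- Q2: all ages
union all
select null, c2, sum(profit) from t group by c2     -- Q3: all incomes
union all
select null, null, sum(profit) from t               -- Q4: summary
-- ... produce the same grouping sets as a single CUBE,
-- which needs only one pass over t:
select c1, c2, sum(profit) from t group by cube(c1, c2);
```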

Although the difference in IO rate and timing in the earlier experiment was not very significant, notice now that, when used with larger datasets, the difference can be substantial. Listing 7-11 shows the query after it has been modified to use the CUBE extension to GROUP BY.

Listing 7-11.  Replace UNION ALL with CUBE

SQL> with tsales as (
  2  select /*+ gather_plan_statistics */
  3      s.quantity_sold
  4      , s.amount_sold
  5      , to_char(mod(cust_year_of_birth,10) * 10 ) || '-' ||
  6        to_char((mod(cust_year_of_birth,10) * 10 ) + 10) age_range
  7      , nvl(c.cust_income_level,'A: Below 30,000') cust_income_level
  8      , p.prod_name
  9      , p.prod_desc
 10      , p.prod_category
 11      , (pf.unit_cost * s.quantity_sold)  total_cost
 12      , s.amount_sold - (pf.unit_cost * s.quantity_sold)  profit
 13    from sh.sales s
 14    join sh.customers c on c.cust_id = s.cust_id
 15    join sh.products p on p.prod_id = s.prod_id
 16    join sh.times t on t.time_id = s.time_id
 17    join sh.costs pf on
 18      pf.channel_id = s.channel_id
 19      and pf.prod_id = s.prod_id
 20      and pf.promo_id = s.promo_id
 21      and pf.time_id = s.time_id
 22    where  (t.fiscal_year = 2001)
 23  )
 24  select
 25    'Q' || decode(cust_income_level,
 26       null,decode(age_range,null,4,3),
 27       decode(age_range,null,2,1)
 28     ) query_tag
 29    , prod_category
 30    , cust_income_level
 31    , age_range
 32    , sum(profit) profit
 33  from tsales
 34  group by prod_category, cube(cust_income_level,age_range)
 35  order by prod_category, profit;

QUERY                                          AGE
TAG    PRODUCT CATEGORY   INCOME LEVEL         RANGE             PROFIT
------ ------------------ -------------------- -------- ---------------
...
Q2     Hardware           K: 250,000 - 299,999               $26,678.00
Q2     Hardware           L: 300,000 and above               $28,974.28
Q1     Hardware           F: 110,000 - 129,999 70-80         $30,477.16
Q2     Hardware           J: 190,000 - 249,999               $43,761.47
Q2     Hardware           B: 30,000 - 49,999                 $53,612.04
Q2     Hardware           A: Below 30,000                    $55,167.88
Q2     Hardware           I: 170,000 - 189,999               $57,089.05
Q2     Hardware           C: 50,000 - 69,999                 $76,612.64
Q3     Hardware                                60-70         $85,314.04
Q3     Hardware                                10-20         $90,849.87
Q3     Hardware                                0-10          $92,207.47
Q3     Hardware                                50-60         $93,811.96
Q3     Hardware                                80-90         $95,391.82
Q2     Hardware           H: 150,000 - 169,999               $95,437.74
Q3     Hardware                                40-50         $97,492.51
Q3     Hardware                                20-30        $101,140.69
Q2     Hardware           D: 70,000 - 89,999                $102,940.44
Q3     Hardware                                30-40        $102,946.85
Q3     Hardware                                90-100       $110,310.69
Q2     Hardware           G: 130,000 - 149,999              $112,688.64
Q3     Hardware                                70-80        $117,920.88
Q2     Hardware           E: 90,000 - 109,999               $135,154.59
Q2     Hardware           F: 110,000 - 129,999              $199,270.01
Q4     Hardware                                             $987,386.78
...
714 rows selected.
Elapsed: 00:00:08.98

Statistics
----------------------------------------------------------
      17901  recursive calls
          0  db block gets
       5935  consistent gets
       2169  physical reads
        260  redo size
      24694  bytes sent via SQL*Net to client
        601  bytes received via SQL*Net from client
          9  SQL*Net roundtrips to/from client
        174  sorts (memory)
          0  sorts (disk)
        714  rows processed

PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------
| Id  |Operation                   | Name             | Sts | E-Rows| A-Rows|
-----------------------------------------------------------------------------
|   0 |SELECT STATEMENT            |                  |   1 |       |    714|
|   1 | SORT ORDER BY              |                  |   1 |   2251|    714|
|   2 |  SORT GROUP BY             |                  |   1 |   2251|    714|
|   3 |   GENERATE CUBE            |                  |   1 |   2251|   2396|
|   4 |    SORT GROUP BY           |                  |   1 |   2251|    599|
|*  5 |     HASH JOIN              |                  |   1 |  17116|   258K|
|   6 |      VIEW                  | index$_join$_004 |   1 |     72|     72|
|*  7 |       HASH JOIN            |                  |   1 |       |     72|
|   8 |        INDEX FAST FULL SCAN| PRODUCTS_PK      |   1 |     72|     72|
|   9 |        INDEX FAST FULL SCAN| PRODUCTS_PCAT_IX |   1 |     72|     72|
|* 10 |      HASH JOIN             |                  |   1 |  17116|   258K|
|* 11 |       HASH JOIN            |                  |   1 |  17116|   258K|
|* 12 |        TABLE ACCESS FULL   | TIMES            |   1 |    304|    364|
|  13 |        PARTITION RANGE AND |                  |   1 |  82112|   259K|
|* 14 |         HASH JOIN          |                  |   4 |  82112|   259K|
|  15 |          TABLE ACCESS FULL | COSTS            |   4 |  82112|  29766|
|  16 |          TABLE ACCESS FULL | SALES            |   4 |   918K|   259K|
|  17 |       TABLE ACCESS FULL    | CUSTOMERS        |   1 |  55500|  55500|
-----------------------------------------------------------------------------

After running the new query, the first things to look at are the statistics and the execution plan. Removing the entire gb subquery and using GROUP BY CUBE on the output from the tsales subquery reduced logical IO (consistent gets) from 22,759 to 5935, nearly a factor of four, and reduced physical IO from 10,521 physical reads to 2169, nearly a factor of five. When comparing the performance of different SQL statements, the prime metric is logical IO (consistent gets): if your SQL is written so that it requests fewer blocks, both the logical and physical reads should be reduced. The reduction in physical reads alone is enough to recommend the use of CUBE; the fact that it results in much less SQL to write is a bonus.

Eliminate NULLs with the GROUPING() Function

There seems to be a problem with the output from the new query seen in Listing 7-11. Although the numbers match the earlier query that used the UNION ALL operator, some of the rows have NULL values in the CUST_INCOME_LEVEL and AGE_RANGE columns, and one row has a NULL in both of these columns. You saw this type of result earlier in Table 7-1 as an expected part of the operation of CUBE. When generating the combinations of all columns included in the arguments to CUBE, a NULL value is generated n – 1 times for each column, where n is the number of columns in the list. In the example query, there are two columns, so you can expect to see a NULL value for CUST_INCOME_LEVEL generated once for each distinct value of AGE_RANGE. The same rule applies to the AGE_RANGE column.
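You can watch CUBE generate these NULLs with a trivial query against dual (a throwaway sketch; the column names and values are arbitrary):

```sql
select c1, c2, count(*) cnt
from (select 'a' c1, 'b' c2 from dual)
group by cube(c1, c2);

-- four rows, one per grouping combination:
--   ('a',  'b' )  the detail row
--   ('a',  NULL)  c2 grouped out
--   (NULL, 'b' )  c1 grouped out
--   (NULL, NULL)  the grand total
```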

These NULL values can be a problem if there are rows in the data that have NULL values for either of these columns. How do you discern between NULLs in the data and NULLs inserted by the CUBE extension? The GROUPING() function was introduced in Oracle 8i, and it can be used to identify these superaggregate rows. The expression used as an argument to the GROUPING() function must match an expression that appears in the GROUP BY clause. The value returned is 1 if the current row is a superaggregate row generated by CUBE; the value is 0 for all other cases. For example, write decode(grouping(age_range),1,'ALL AGE',age_range) age_range to detect whether age_range is null as a result of a row generated by CUBE, or whether it is null as a result of a row in the database.
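A minimal sketch of the distinction, using one data row that contains a genuine NULL (the column name is arbitrary):

```sql
select c1, grouping(c1) g
from (select 'a' c1 from dual
      union all
      select null from dual)
group by cube(c1);

-- ('a',  0)  value from the data
-- (NULL, 0)  NULL that exists in the data
-- (NULL, 1)  superaggregate NULL generated by CUBE
```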

When used in combination with a CASE expression or the DECODE() function, the NULL values in superaggregate rows can be replaced with values that are useful in a report. In this case, DECODE() appears to be a better choice because of its simplicity and the fact that there are only two possible return values for the GROUPING() function. Listing 7-12 shows how GROUPING() was used to modify the SQL found in Listing 7-11. The relevant before-and-after parts of the SQL are shown, along with the output. Now the report is easier to read, and superaggregate NULLs are discernible from NULLs occurring in the data.

Listing 7-12.  GROUPING() Function

Without GROUPING():

27    , cust_income_level
28    , age_range

With GROUPING():

27    -- either CASE or DECODE() works here. I prefer DECODE() for this
28    , case grouping(cust_income_level)
29        when 1 then 'ALL INCOME'
30        else cust_income_level
31      end cust_income_level
32    , decode(grouping(age_range),1,'ALL AGE',age_range) age_range

QUERY                                          AGE
TAG    PRODUCT CATEGORY   INCOME LEVEL         RANGE             PROFIT
------ ------------------ -------------------- -------- ---------------
...
Q2     Hardware           K: 250,000 - 299,999 ALL AGE       $26,678.00
Q2     Hardware           L: 300,000 and above ALL AGE       $28,974.28
Q1     Hardware           F: 110,000 - 129,999 70-80         $30,477.16
Q2     Hardware           J: 190,000 - 249,999 ALL AGE       $43,761.47
Q2     Hardware           B: 30,000 - 49,999   ALL AGE       $53,612.04
Q2     Hardware           A: Below 30,000      ALL AGE       $55,167.88
Q2     Hardware           I: 170,000 - 189,999 ALL AGE       $57,089.05
Q2     Hardware           C: 50,000 - 69,999   ALL AGE       $76,612.64
Q3     Hardware           ALL INCOME           60-70         $85,314.04
Q3     Hardware           ALL INCOME           10-20         $90,849.87
Q3     Hardware           ALL INCOME           0-10          $92,207.47
Q3     Hardware           ALL INCOME           50-60         $93,811.96
Q3     Hardware           ALL INCOME           80-90         $95,391.82
Q2     Hardware           H: 150,000 - 169,999 ALL AGE       $95,437.74
Q3     Hardware           ALL INCOME           40-50         $97,492.51
Q3     Hardware           ALL INCOME           20-30        $101,140.69
Q2     Hardware           D: 70,000 - 89,999   ALL AGE      $102,940.44
Q3     Hardware           ALL INCOME           30-40        $102,946.85
Q3     Hardware           ALL INCOME           90-100       $110,310.69
Q2     Hardware           G: 130,000 - 149,999 ALL AGE      $112,688.64
Q3     Hardware           ALL INCOME           70-80        $117,920.88
Q2     Hardware           E: 90,000 - 109,999  ALL AGE      $135,154.59
Q2     Hardware           F: 110,000 - 129,999 ALL AGE      $199,270.01
Q4     Hardware           ALL INCOME           ALL AGE      $987,386.78

Extending Reports with GROUPING()

Another use of GROUPING() is in the HAVING clause, where it can be used to control which aggregation levels appear in the output. The report seen in previous examples creates about five pages of output, which may be more than the customer cares to see. By using the GROUPING() function, these aggregations can be condensed to roll up the totals for either or all the columns used in the CUBE extension. Several variations of GROUPING() have been used to modify the previous SQL. The modifications and resulting output are shown in Listing 7-13.

Examining the data in Listing 7-13, notice that applying GROUPING() to the CUST_INCOME_LEVEL column restricts the output to rows in which profits are accumulated across all income levels, one row for each AGE_RANGE value. Doing so for the AGE_RANGE column has the similar effect of restricting the output to totals for each value of CUST_INCOME_LEVEL, aggregated across all age ranges. Including all the columns from the CUBE extension as arguments to the GROUPING() function condenses the aggregations to a single row per product category, similar to what could be done with SUM(PROFIT) and a simple GROUP BY PROD_CATEGORY. Using the CUBE extension, however, allows simple changes to the HAVING clause to create several different reports.

Listing 7-13.  GROUPING() in the HAVING Clause

CUST_INCOME_LEVEL

35  group by prod_category, cube(cust_income_level,age_range)
36    having grouping(cust_income_level)=1

QUERY                                          AGE
TAG    PRODUCT CATEGORY   INCOME LEVEL         RANGE             PROFIT
------ ------------------ -------------------- -------- ---------------
Q3     Hardware           ALL INCOME           60-70         $85,314.04
Q3     Hardware           ALL INCOME           10-20         $90,849.87
Q3     Hardware           ALL INCOME           0-10          $92,207.47
...
Q4     Hardware           ALL INCOME           ALL AGE      $987,386.78

AGE_RANGE

35  group by prod_category, cube(cust_income_level,age_range)
36    having grouping(age_range)=1

QUERY                                          AGE
TAG    PRODUCT CATEGORY   INCOME LEVEL         RANGE             PROFIT
------ ------------------ -------------------- -------- ---------------
Q2     Hardware           K: 250,000 - 299,999 ALL AGE       $26,678.00
Q2     Hardware           L: 300,000 and above ALL AGE       $28,974.28
Q2     Hardware           J: 190,000 - 249,999 ALL AGE       $43,761.47
...
Q4     Hardware           ALL INCOME           ALL AGE      $987,386.78


CUST_INCOME_LEVEL, AGE_RANGE

35  group by prod_category, cube(cust_income_level,age_range)
36    having grouping(cust_income_level)=1 and grouping(age_range)=1

QUERY                                            AGE
TAG    PRODUCT CATEGORY             INCOME LEVEL RANGE           PROFIT
------ ---------------------------- ------------ -------- -------------
Q4     Electronics                  ALL INCOME   ALL AGE    $838,994.19
Q4     Hardware                     ALL INCOME   ALL AGE    $987,386.78
Q4     Peripherals and Accessories  ALL INCOME   ALL AGE  $1,751,079.16
Q4     Photo                        ALL INCOME   ALL AGE  $1,570,866.04
Q4     Software/Other               ALL INCOME   ALL AGE    $873,603.25

Extending Reports with GROUPING_ID()

The GROUPING_ID() function is relatively new compared with the GROUPING() function, having been introduced in Oracle 9i, and it is somewhat similar to GROUPING(). Although GROUPING() evaluates a single expression and returns a 0 or a 1, the GROUPING_ID() function evaluates a list of expressions, determines which, if any, of the columns in its arguments are being used to generate a superaggregate row, creates a bit vector from that information, and returns the vector’s value as an integer.

Perhaps it is simpler to see how GROUPING_ID() works with an example. The SQL in Listing 7-14 first creates a single row consisting of two columns, BIT_1 and BIT_0, with values of 1 and 0, respectively. The subquery cubed uses GROUP BY CUBE to generate four rows from the single row of input. The GROUPING_ID() function returns the decimal value of the bit vector that represents the actions of CUBE on the current row. The first two uses of the GROUPING() function create a 1 or a 0 based on those actions, and they are concatenated to form a bit vector in the final output. The next two GROUPING() functions create values, displayed in the final output, that indicate on which column CUBE is currently working. The final output displays the decimal bit vector as well as its binary representation. As expected with two binary digits, there are four rows of output.

Listing 7-14.  GROUPING_ID() Bit Vector

SQL> with rowgen as (
  2    select 1 bit_1, 0 bit_0
  3    from dual
  4  ),
  5  cubed as (
  6    select
  7      grouping_id(bit_1,bit_0) gid
  8      , to_char(grouping(bit_1)) bv_1
  9      , to_char(grouping(bit_0)) bv_0
 10      , decode(grouping(bit_1),1,'GRP BIT 1') gb_1
 11      , decode(grouping(bit_0),1,'GRP BIT 0') gb_0
 12    from rowgen
 13    group by cube(bit_1,bit_0)
 14  )
 15  select
 16    gid
 17    , bv_1 || bv_0 bit_vector
 18    , gb_1
 19    , gb_0
 20  from cubed
 21  order by gid;

     BIT    GROUPING  GROUPING
 GID VECTOR BIT 1     BIT 0
---- ------ --------- ---------
   0 00
   1 01               GRP BIT 0
   2 10     GRP BIT 1
   3 11     GRP BIT 1 GRP BIT 0

So, what good is this? You already know how to use GROUPING() to control output via the HAVING clause, so why learn another way? This is a fair question when you consider that the examples in Listing 7-13 can already create the output needed.

In the interest of database efficiency, a single GROUPING_ID() call can be used to replace all the different HAVING GROUPING() clauses from Listing 7-13. The GROUPING() function is limited in its ability to discriminate rows; it can return only a 0 or a 1. Because the GROUPING_ID() function returns a decimal value based on a bit vector, it can be used easily to make many different comparisons without any changes to the SQL.

Why should you care about changing comparisons without changing the SQL? If you are building an application based on the sales history examples, the user may be given four choices of output, and any one or more of them may be chosen. The user choices can be used as inputs to a single SQL statement that uses HAVING GROUPING_ID(), rather than multiple SQL statements based on different combinations of HAVING GROUPING(), so it requires less parsing of SQL by the database. It also results in fewer SQL statements to execute, less IO, and less memory usage.

Just as using CUBE eliminates multiple SQL statements joined by UNION ALL, GROUPING_ID() can eliminate multiple SQL statements in your application. The choices given to the user are as follows:

  • All data: Display all income level and age range aggregations.
  • All age: Aggregate all age ranges together.
  • All income: Aggregate all income levels together.
  • Summary: Display a summary only.

The application, a SQL*Plus script in this case, assigns values corresponding to the user’s choices to variables. The SQL statement, in turn, evaluates those variables via HAVING GROUPING_ID() to output the requested rows. Listing 7-15 simulates the choices a user might make and demonstrates how to use these inputs in the SQL. In the example, the only rows to be output are those that aggregate all age ranges for each income level (ALL AGE) and the summary rows for each product category (ALL AGE and ALL INCOME). This is accomplished by setting N_AGE_RANGE and N_SUMMARY to 2 and 4, respectively. These values correspond to the bit vector generated by the GROUPING_ID() function found in the HAVING clause.

As used in the HAVING clause, one is added to the value generated by GROUPING_ID(), which enables some consistency in setting the values of the variables that control the output. Without adding one to the value, the N_ALL_DATA variable would be set to 0 to enable output, and some other value, such as –1, to disable it. Increasing this comparison value by one makes it possible to use 0 consistently as a value to disable output.
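Taken together, the bit vector and the +1 convention give the following mapping (a summary of the inputs to Listing 7-15, not additional code to run):

```sql
-- grouping_id(cust_income_level, age_range) =
--   grouping(cust_income_level) * 2 + grouping(age_range)
--
--  gid  gid+1  rows kept                               variable
--  ---  -----  --------------------------------------  ---------------
--   0     1    detail rows (Q1)                        :N_ALL_DATA
--   1     2    ALL AGE, one row per income level (Q2)  :N_AGE_RANGE
--   2     3    ALL INCOME, one row per age range (Q3)  :N_INCOME_LEVEL
--   3     4    ALL INCOME and ALL AGE summary (Q4)     :N_SUMMARY
```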

Listing 7-15.  GROUPING_ID() to Control Report Output

SQL> variable N_ALL_DATA number
SQL> variable N_AGE_RANGE number
SQL> variable N_INCOME_LEVEL number
SQL> variable N_SUMMARY number
SQL>
SQL> begin
  2    -- set values to 0 to disable
  3    :N_ALL_DATA     := 0; -- 1 to enable
  4    :N_AGE_RANGE    := 2; -- 2 to enable
  5    :N_INCOME_LEVEL := 0; -- 3 to enable
  6    :N_SUMMARY      := 4; -- 4 to enable
  7  end;
  8  /

SQL> with tsales as (
  2  select /* lst7-15 */
  3      s.quantity_sold
  4      , s.amount_sold
  5      , to_char(mod(cust_year_of_birth,10) * 10 ) || '-' ||
  6        to_char((mod(cust_year_of_birth,10) * 10 ) + 10) age_range
  7      , nvl(c.cust_income_level,'A: Below 30,000') cust_income_level
  8      , p.prod_name
  9      , p.prod_desc
 10      , p.prod_category
 11      , (pf.unit_cost * s.quantity_sold)  total_cost
 12      , s.amount_sold - (pf.unit_cost * s.quantity_sold)  profit
 13    from sh.sales s
 14    join sh.customers c on c.cust_id = s.cust_id
 15    join sh.products p on p.prod_id = s.prod_id
 16    join sh.times t on t.time_id = s.time_id
 17    join sh.costs pf on
 18      pf.channel_id = s.channel_id
 19      and pf.prod_id = s.prod_id
 20      and pf.promo_id = s.promo_id
 21      and pf.time_id = s.time_id
 22    where  (t.fiscal_year = 2001)
 23  )
 24  select
 25    'Q' || to_char(grouping_id(cust_income_level,age_range)+1) query_tag
 26    , prod_category , decode(grouping(cust_income_level),1,
 27    'ALL INCOME',cust_income_level) cust_income_level
 28    , decode(grouping(age_range),1,'ALL AGE',age_range) age_range
 29    , sum(profit) profit
 30  from tsales
 31  group by prod_category, cube(cust_income_level,age_range)
 32    having grouping_id(cust_income_level,age_range)+1
 33      in(:N_ALL_DATA,:N_AGE_RANGE,:N_INCOME_LEVEL,:N_SUMMARY)
 34  order by prod_category, profit;

QUERY                                          AGE
TAG    PRODUCT CATEGORY   INCOME LEVEL         RANGE             PROFIT
------ ------------------ -------------------- -------- ---------------
...
Q2     Hardware           K: 250,000 - 299,999 ALL AGE       $26,678.00
Q2     Hardware           L: 300,000 and above ALL AGE       $28,974.28
Q2     Hardware           J: 190,000 - 249,999 ALL AGE       $43,761.47
...
Q2     Hardware           E: 90,000 - 109,999  ALL AGE      $135,154.59
Q2     Hardware           F: 110,000 - 129,999 ALL AGE      $199,270.01
Q4     Hardware           ALL INCOME           ALL AGE      $987,386.78
...
65 rows selected.

To be fair, it is possible to achieve the same results using the GROUPING() function, but doing so requires several tests to be placed in the HAVING clause. The queries of sample sales history data include only two columns in the CUBE arguments. Because the GROUPING() function returns either a 1 or a 0, there are two possible values for each of your columns, so the total number of tests required in the HAVING clause is four. This doesn’t seem too bad, but consider what happens when there are three columns; the number of tests goes up to eight. In general, the number of tests required is 2^n, where n is the number of columns or expressions in the arguments to CUBE.

Listing 7-16 shows the HAVING clause as it might appear using GROUPING() rather than GROUPING_ID(). This approach soon becomes unwieldy if many arguments are required for the CUBE extension. The four separate tests shown should not be too much trouble to maintain. However, if the number of columns in the CUBE arguments increases from two to three, there are then eight tests. This is not code that lends itself well to maintenance.

Listing 7-16.  Using GROUPING() instead of GROUPING_ID()

32 having  -- bin_to_num() requires 9i+
33   ( bin_to_num(grouping(cust_income_level), grouping(age_range))+1 = :N_ALL_DATA)
34   or ( bin_to_num(grouping(cust_income_level), grouping(age_range))+1 = :N_AGE_RANGE)
35   or ( bin_to_num(grouping(cust_income_level), grouping(age_range))+1 = :N_INCOME_LEVEL)
36   or ( bin_to_num(grouping(cust_income_level), grouping(age_range))+1 = :N_SUMMARY)

EXPERIMENT WITH GROUPING() AND GROUPING_ID()

As an exercise, modify the code from Listing 7-15 so that it adds another column to the arguments to CUBE. Then, modify the call to GROUPING_ID() in the HAVING clause to work with the new column. This requires a new variable as well in the PL/SQL block.

After you have that working, replace the GROUPING_ID() call with all the tests needed to accomplish the same output control with GROUPING(). Do you like the results? Is this code you would like to maintain?

GROUPING SETS () and ROLLUP()

There is yet another method that may be used to obtain the results seen in the previous two examples. The GROUPING SETS() extension to GROUP BY made its debut with Oracle 9i. The entire GROUP BY . . . HAVING clause of the previous example can be replaced with GROUP BY GROUPING SETS(). However, just because you can do something doesn’t mean you should. Let’s look at an example to understand just why you may not want to use GROUPING SETS(). Lines 31 through 33 in Listing 7-15 can be replaced with lines 31 through 36 in Listing 7-17.

Listing 7-17.  GROUPING SETS()

SQL> with tsales as (
  2  select /* lst7-17 */
  3      s.quantity_sold
  4      , s.amount_sold
  5      , to_char(mod(cust_year_of_birth,10) * 10 ) || '-' ||
  6        to_char((mod(cust_year_of_birth,10) * 10 ) + 10) age_range
  7      , nvl(c.cust_income_level,'A: Below 30,000') cust_income_level
  8      , p.prod_name
  9      , p.prod_desc
 10      , p.prod_category
 11      , (pf.unit_cost * s.quantity_sold)  total_cost
 12      , s.amount_sold - (pf.unit_cost * s.quantity_sold)  profit
 13    from sh.sales s
 14    join sh.customers c on c.cust_id = s.cust_id
 15    join sh.products p on p.prod_id = s.prod_id
 16    join sh.times t on t.time_id = s.time_id
 17    join sh.costs pf on
 18      pf.channel_id = s.channel_id
 19      and pf.prod_id = s.prod_id
 20      and pf.promo_id = s.promo_id
 21      and pf.time_id = s.time_id
 22    where  (t.fiscal_year = 2001)
 23  )
 24  select
 25    'Q' || to_char(grouping_id(cust_income_level,age_range)+1) query_tag
 26    , prod_category, decode(grouping(cust_income_level),1,
 27    'ALL INCOME',cust_income_level) cust_income_level
 28    , decode(grouping(age_range),1,'ALL AGE',age_range) age_range
 29    , sum(profit) profit
 30  from tsales
 31  group by prod_category, grouping sets(
 32    rollup(prod_category), -- sub total by product category
 33    (cust_income_level),  -- agg by category and income levels only
 34    (age_range),          -- agg by category and age only
 35    (cust_income_level,age_range) -- agg by category, all age and income
 36  )
 37  --having group_id() < 1
 38 order by prod_category, profit;

QUERY                                          AGE
TAG    PRODUCT CATEGORY   INCOME LEVEL         RANGE             PROFIT
------ ------------------ -------------------- -------- ---------------
...
Q2     Software/Other     E: 90,000 - 109,999  ALL AGE      $124,416.04
Q2     Software/Other     F: 110,000 - 129,999 ALL AGE      $169,482.11
Q4     Software/Other     ALL INCOME           ALL AGE      $873,603.25
Q4     Software/Other     ALL INCOME           ALL AGE      $873,603.25

756 rows selected.

The output shown in Listing 7-17 is similar to that seen when the SQL from Listing 7-15 is executed with all the output categories enabled. This is a major difference between using GROUP BY CUBE with HAVING GROUPING_ID() and GROUP BY GROUPING SETS: the former can be made to modify its output simply by setting variables to the correct values, whereas the output of the latter cannot be changed except by modifying the SQL or generating it dynamically. Modifying the SQL means there is more code to maintain and more resources consumed in the database. Generating the SQL dynamically is, well, usually just not a good idea if it can be avoided; it consumes more database resources and it is much harder to troubleshoot when problems arise.

As mentioned previously, the output in Listing 7-17 is similar to that in Listing 7-15, but it is not the same: the last two lines of the output shown in Listing 7-17 are duplicates. Sometimes the GROUPING SETS() extension can cause duplicates to appear in the output. In this case, the duplicates are caused by the ROLLUP(PROD_CATEGORY) line. You can prove this to yourself by removing ROLLUP() from the code in Listing 7-17 and rerunning it. The duplicate lines no longer appear; however, the totals for each product category no longer appear either. The solution is to use the GROUP_ID() function to identify the duplicate rows and then filter them out with a HAVING clause.

The HAVING clause can be seen commented out in Listing 7-17. If you “uncomment” it and then rerun the script, the output appears as expected, without the duplicate rows. Interestingly, if the ROLLUP(PROD_CATEGORY) line is replaced with (NULL), the HAVING clause can be removed and the output appears as expected.
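A sketch of that alternative, showing only the rewritten tail of Listing 7-17 (the WITH clause and SELECT list are unchanged):

```sql
-- Sketch: (null) in place of rollup(prod_category). The empty grouping set
-- still yields the subtotal by product category, but only once, so the
-- HAVING group_id() filter is no longer needed.
group by prod_category, grouping sets(
    (null)                            -- subtotal by product category only
  , (cust_income_level)               -- by category and income level
  , (age_range)                       -- by category and age range
  , (cust_income_level, age_range)    -- by category, income level, and age
)
order by prod_category, profit;
```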

The ROLLUP() extension to GROUP BY can also be used by itself to create running subtotals that otherwise require multiple queries joined by UNION ALL. Suppose that someone from the sales department asks you to create a report showing totals of all purchases by customers whose last name begins with Sul. In addition, there need to be subtotals for each year by customer, each product category by customer, and a grand total of all sales. This kind of task is handled easily by ROLLUP(). Listing 7-18 shows one way to write a query to satisfy this request.

Notice that the DECODE() and GROUPING() functions are again used to indicate subtotal rows. Also, the grand total is forced to appear at the end of the report by the use of GROUPING(M.CUST_NAME) in the ORDER BY clause. Because this value is greater than 0 only when the total for all customers is calculated, the grand total appears at the end of the report, as expected.

Listing 7-18.  ROLLUP() Subtotals

SQL> with mysales as (
  2    select
  3      c.cust_last_name ||',' || c.cust_first_name cust_name
  4      , p.prod_category
  5      , to_char(trunc(time_id,'YYYY'),'YYYY') sale_year
  6      , p.prod_name
  7      , s.amount_sold
  8    from sh.sales s
  9    join sh.products p on p.prod_id = s.prod_id
 10    join sh.customers c on c.cust_id = s.cust_id
 11    where c.cust_last_name like 'Sul%'
 12    --where s.time_id = to_date('01/01/2001','mm/dd/yyyy')
 13  )
 14  select
 15    decode(grouping(m.cust_name),1,'GRAND TOTAL',m.cust_name) cust_name
 16    ,decode(grouping(m.sale_year),1,'TOTAL BY YEAR',m.sale_year) sale_year
 17    ,decode(grouping(m.prod_category),1,'TOTAL BY CATEGORY',
 18    m.prod_category) prod_category, sum(m.amount_sold) amount_sold
 19  from mysales m
 20  group by rollup(m.cust_name, m.prod_category, m.sale_year)
 21  order by grouping(m.cust_name), 1,2,3;

CUSTOMER       SALE_YEAR     PRODUCT CATEGORY                  AMT SOLD
-------------- ------------- ----------------------------- ------------
...
Sullivan,Rue   1998          Peripherals and Accessories        $259.90
Sullivan,Rue   1998          Software/Other                      $19.59
Sullivan,Rue   2000          Electronics                      $2,213.30
Sullivan,Rue   2000          Hardware                         $1,359.06
Sullivan,Rue   2000          Peripherals and Accessories      $1,169.94
Sullivan,Rue   2000          Photo                              $331.33
Sullivan,Rue   2000          Software/Other                     $933.87
Sullivan,Rue   TOTAL BY YEAR Electronics                      $2,213.30
Sullivan,Rue   TOTAL BY YEAR Hardware                         $1,359.06
Sullivan,Rue   TOTAL BY YEAR Peripherals and Accessories      $1,429.84
Sullivan,Rue   TOTAL BY YEAR Photo                              $331.33
Sullivan,Rue   TOTAL BY YEAR Software/Other                     $953.46
Sullivan,Rue   TOTAL BY YEAR TOTAL BY CATEGORY                $6,286.99
GRAND TOTAL    TOTAL BY YEAR TOTAL BY CATEGORY               $86,994.89

68 rows selected.
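Without ROLLUP(), the report in Listing 7-18 would need one query per aggregation level, glued together with UNION ALL. A sketch of that equivalent, reusing the mysales subquery defined in Listing 7-18, shows why ROLLUP() is the better choice:

```sql
-- Sketch: the four grouping levels that rollup(cust_name, prod_category,
-- sale_year) produces in a single pass, written out as separate queries.
-- mysales is the same WITH subquery defined in Listing 7-18.
select cust_name, sale_year, prod_category, sum(amount_sold) amount_sold
  from mysales group by cust_name, prod_category, sale_year
union all  -- subtotal per customer and category, across years
select cust_name, 'TOTAL BY YEAR', prod_category, sum(amount_sold)
  from mysales group by cust_name, prod_category
union all  -- subtotal per customer, across categories and years
select cust_name, 'TOTAL BY YEAR', 'TOTAL BY CATEGORY', sum(amount_sold)
  from mysales group by cust_name
union all  -- grand total
select 'GRAND TOTAL', 'TOTAL BY YEAR', 'TOTAL BY CATEGORY', sum(amount_sold)
  from mysales;
```

Besides being harder to read and maintain, this version scans the sales data four times, whereas ROLLUP() computes all four levels in a single pass.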

GROUP BY Restrictions

Our study of GROUP BY is incomplete without considering what it cannot do. The list of restrictions placed on GROUP BY is not very long; they are documented in the Oracle 12c SQL Language Reference (http://www.oracle.com/technetwork/indexes/documentation/index.html). For example, note the following:

  • LOB columns, nested tables, or arrays may not be used as part of a GROUP BY expression.
  • Scalar subquery expressions are not allowed.
  • Queries cannot be “parallelized” if the GROUP BY clause references any object type columns.

SQL queries were constructed to demonstrate the first two restrictions, as shown in Listings 7-19 and 7-20. The error messages clearly show that LOB columns and scalar subqueries cannot be used as part of the GROUP BY clause.

Listing 7-19.  GROUP BY Restrictions: LOB Not Allowed

SQL> with lobtest as (
  2     select to_clob(d.dname ) dname
  3     from scott.emp e
  4     join scott.dept d on d.deptno = e.deptno
  5  )
  6  select l.dname
  7  from lobtest l
  8* group by l.dname
group by l.dname;
         *
ERROR at line 8:
ORA-00932: inconsistent datatypes: expected - got CLOB
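Although a CLOB cannot appear in the GROUP BY clause directly, you can group on a VARCHAR2 rendering of it. The following is a sketch of one workaround, assuming the values of interest fit in the first 4000 characters of the CLOB:

```sql
-- Sketch: convert the CLOB to VARCHAR2 with DBMS_LOB.SUBSTR before grouping.
with lobtest as (
  select to_clob(d.dname) dname
  from scott.emp e
  join scott.dept d on d.deptno = e.deptno
)
select dbms_lob.substr(l.dname, 4000, 1) dname, count(*) cnt
from lobtest l
group by dbms_lob.substr(l.dname, 4000, 1);
```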

Listing 7-20.  GROUP BY Restrictions: Scalar Subquery Not Allowed

SQL> select d.dname, count(empno) empcount
  2  from scott.emp e
  3  join scott.dept d on d.deptno = e.deptno
  4  group by (select dname from scott.dept d2 where d2.dname = d.dname)
  5  order by d.dname;
group by (select dname from scott.dept d2 where d2.dname = d.dname);
          *
ERROR at line 4:
ORA-22818: subquery expressions not allowed here
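If the value of a scalar subquery is needed for grouping, one workaround is to evaluate the subquery in the SELECT list of an inline view, where it is legal, and then group on the resulting column in the outer query. A sketch:

```sql
-- Sketch: the scalar subquery runs in the inline view; the outer query
-- then groups on the plain column it produced.
select dname, count(empno) empcount
from (
  select (select d2.dname
          from scott.dept d2
          where d2.deptno = e.deptno) dname
       , e.empno
  from scott.emp e
)
group by dname
order by dname;
```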

The final restriction listed earlier appears to be a documentation error. Evidence for this can be seen in Listing 7-21, where a GROUP BY on an OBJECT datatype is executed in parallel, contrary to what the documentation states. The member function match in the dept_location_type body is used to compare the values for city, and this in turn is used by GROUP BY to group employees by city. Should you need to create aggregations based on data in an OBJECT column, you can certainly do so as of Oracle 11.1.0.7. Testing has shown that the GROUP BY of Listing 7-21 is not executed in parallel in Oracle 11.1.0.6.

Listing 7-21.  GROUP BY on Object Column in Parallel

SQL> create type dept_location_type
  2  as object
  3    (
  4      street_address     VARCHAR2(40)
  5      , postal_code        VARCHAR2(10)
  6      , city               VARCHAR2(30)
  7      , state_province     VARCHAR2(10)
  8      , country_id         CHAR(2)
  9      , order member function match (e dept_location_type) return integer
 10  );
 11  /

Type created.

SQL>
SQL> create or replace type body dept_location_type
  2  as order member function match (e dept_location_type) return integer
  3  is
  4    begin
  5      if city < e.city then
  6        return -1;
  7      elsif city > e.city then
  8        return 1;
  9      else
 10        return 0;
 11      end if;
 12    end;
 13  end;
 14  /

Type body created.

SQL>
SQL> create table deptobj
  2  as
  3  select d.deptno,d.dname
  4  from scott.dept d;
Table created.
SQL> alter table deptobj add (dept_location dept_location_type);
Table altered.
SQL> update deptobj set dept_location =
  2    dept_location_type('1234 Money St', '97401','Eugene', 'OR', 'US')
  3  where deptno=20;
1 row updated.
SQL> update deptobj set dept_location =
  2    dept_location_type('459 Durella Street', '97463','Oakridge', 'OR', 'US')
  3  where deptno=40;
1 row updated.
SQL> update deptobj set dept_location =
  2    dept_location_type('12642 Rex Rd', '97006','Beavertown', 'OR', 'US')
  3  where deptno=10;
1 row updated.
SQL> update deptobj set dept_location =
  2    dept_location_type('9298 Hamilton Rd', '97140','George', 'WA', 'US')
  3  where deptno=30;
1 row updated.

SQL> commit;
Commit complete.

SQL> select /*+ gather_plan_statistics parallel(e 2)*/
  2    d.dept_location, count(e.ename) ecount
  3  from scott.emp e, deptobj d
  4  where e.deptno = d.deptno
  5  group by dept_location
  6  order by dept_location;

DEPT_LOCATION(STREET_ADDRESS, POSTAL_CODE, CITY, STATE_PROVI       ECOUNT
------------------------------------------------------------       ------
DEPT_LOCATION_TYPE('1234 Money St', '97401', 'Eugene', 'OR', 'US')      5
DEPT_LOCATION_TYPE('12642 Rex Rd', '97006', 'Beavertown','OR','US')     3
DEPT_LOCATION_TYPE('9298 Hamilton Rd', '97140', 'George','WA','US')     6

3 rows selected.

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------
| Id  | Operation                   | Name     | Starts | E-Rows | A-Rows |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |          |      1 |        |      3 |
|   1 |  PX COORDINATOR             |          |      1 |        |      3 |
|   2 |   PX SEND QC (ORDER)        | :TQ10002 |      0 |     14 |      0 |
|   3 |    SORT GROUP BY            |          |      0 |     14 |      0 |
|   4 |     PX RECEIVE              |          |      0 |     14 |      0 |
|   5 |      PX SEND RANGE          | :TQ10001 |      0 |     14 |      0 |
|   6 |       HASH GROUP BY         |          |      0 |     14 |      0 |
|*  7 |        HASH JOIN            |          |      0 |     14 |      0 |
|   8 |         BUFFER SORT         |          |      0 |        |      0 |
|   9 |          PX RECEIVE         |          |      0 |      4 |      0 |
|  10 |           PX SEND BROADCAST | :TQ10000 |      0 |      4 |      0 |
|  11 |            TABLE ACCESS FULL| DEPTOBJ  |      1 |      4 |      4 |
|  12 |         PX BLOCK ITERATOR   |          |      0 |     14 |      0 |
|* 13 |          TABLE ACCESS FULL  | EMP      |      0 |     14 |      0 |
---------------------------------------------------------------------------

Summary

Oracle has provided some excellent tools for the SQL practitioner in the form of extensions to the GROUP BY clause. Not only do they reduce code, they improve database efficiency. They do, however, take some dedication and practice to learn how best to use them. The introduction here to the advanced grouping features is by no means comprehensive. Most of these features can be combined for many different effects—far more than is practical to include in a book. Please endeavor to make use of these features in your own applications and continue to experiment with them based on what you have learned in this chapter.

1 If there are no rows in the table, GROUP BY CUBE() returns zero rows.

2 The NVL() function is used to provide a default value for sh.customers.cust_income_level so that output of examples may be easier to compare.
