CHAPTER 10: Subquery Factoring

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

CHAPTER 10

Subquery Factoring

You may not be familiar with the term subquery factoring. Prior to the release of Oracle 11gR2, the official Oracle documentation barely mentions it, providing just a brief synopsis of its use, a couple of restrictions, and a single example. If I instead refer to the WITH clause of the SELECT statement, you probably know immediately what I mean, because these terms are more recognizable. Both terms are used in this chapter.

In Oracle 11gR2 (version 11.2), the WITH clause was enhanced with the ability to recurse; that is, the factored subquery is allowed to call itself within some limitation. The value of this may not be readily apparent. If you have used the CONNECT BY clause to create hierarchical queries, you can appreciate that recursive subqueries allow the same functionality to be implemented in an ANSI standard format.

If the term subquery factoring is not known to you, perhaps you have heard of the ANSI standard term common table expression (commonly called CTE). Common table expressions were first specified in the 1999 ANSI SQL Standard. For some reason, Oracle has chosen to obfuscate this name. Other database vendors refer to common table expressions, so perhaps Oracle chose subquery factoring just to be different.

Standard Usage

One of the most useful features of the WITH clause when it was first introduced was to clean up complex SQL queries. When a large number of tables and columns are involved in a query, it can become difficult to follow the flow of data through the query. Via the use of subquery factoring, a query can be made more understandable by moving some of the complexity away from the main body of the query.

The query in Listing 10-1 generates a cross-tab report using the PIVOT operator. The formatting helps make the SQL somewhat readable, but there is quite a bit going on here. The innermost query is creating a set of aggregates on key sales columns, whereas the next most outer query simply provides column names that are presented to the PIVOT operator, where the final values of sales by channel and quarter for each product are generated.

Listing 10-1. Cross-tab without Subquery Factoring

select *
from (
   select /*+ gather_plan_statistics */
      product
      , channel
      , quarter
      , country
      , quantity_sold
   from
   (
      select
         prod_name product
         , country_name country
         , channel_id channel
         , substr(calendar_quarter_desc, 6,2) quarter
         , sum(amount_sold) amount_sold
         , sum(quantity_sold) quantity_sold
      from
         sh.sales
         join sh.times on times.time_id = sales.time_id
         join sh.customers on customers.cust_id = sales.cust_id
         join sh.countries on countries.country_id = customers.country_id
         join sh.products on products.prod_id = sales.prod_id
      group by
        prod_name
        , country_name
        , channel_id
        , substr(calendar_quarter_desc, 6, 2)
   )
) PIVOT (
   sum(quantity_sold)
   FOR (channel, quarter) IN
   (
      (5, '02') AS CATALOG_Q2,
      (4, '01') AS INTERNET_Q1,
      (4, '04') AS INTERNET_Q4,
      (2, '02') AS PARTNERS_Q2,
      (9, '03') AS TELE_Q3
   )
)
order by product, country;

Now let’s use the WITH clause to break the query into byte-size chunks that are easier to comprehend. The SQL has been rewritten in Listing 10-2 using the WITH clause to create three subfactored queries: sales_countries, top_sales, and sales_rpt. Notice that both the top_sales and sales_rpt subqueries refer to other subqueries by name, as if they are a table or a view. By choosing names that make the intent of each subquery easy to follow, the readability of the SQL is improved. For instance, the subquery name sales_countries refers to the countries in which the sales took place, top_sales collects the sales data, and the sales_rpt subquery aggregates the data. The results of the sales_rpt subquery are used in the main query, which answers the question “What is the breakdown of sales by product and country per quarter?” If you were not told the intent of the SQL in Listing 10-1, it would take some time to discern its purpose; on the other hand, the structure of the SQL in Listing 10-2 with subfactored queries makes it easier to understand the intent of the code.

In addition, the statements associated directly with the PIVOT operator are in the same section of the SQL statement at the bottom, further enhancing readability.

Listing 10-2. Cross-tab with Subquery Factoring

with
sales_countries as (
   select /*+ gather_plan_statistics */
      cu.cust_id
      , co.country_name
   from  sh.countries co, sh.customers cu
   where cu.country_id = co.country_id
),
top_sales as (
   select
      p.prod_name
      , sc.country_name
      , s.channel_id
      , t.calendar_quarter_desc
      , s.amount_sold
      , s.quantity_sold
   from
      sh.sales s
      join sh.times t on t.time_id = s.time_id
      join sh.customers c on c.cust_id = s.cust_id
      join sales_countries sc on sc.cust_id = c.cust_id
      join sh.products p on p.prod_id = s.prod_id
),
sales_rpt as (
   select
      prod_name product
      , country_name country
      , channel_id channel
      , substr(calendar_quarter_desc, 6,2) quarter
      , sum(amount_sold) amount_sold
      , sum(quantity_sold) quantity_sold
   from top_sales
   group by
      prod_name
      , country_name
      , channel_id
      , substr(calendar_quarter_desc, 6, 2)
)
select * from
(
  select product, channel, quarter, country, quantity_sold
  from sales_rpt
) pivot (
   sum(quantity_sold)
   for (channel, quarter) in
   (
      (5, '02') as catalog_q2,
      (4, '01') as internet_q1,
      (4, '04') as internet_q4,
      (2, '02') as partners_q2,
      (9, '03') as tele_q3
   )
)
order by product, country;

Although this is not an extremely complex SQL example, it does serve to illustrate the point of how the WITH clause can be used to make a statement more readable and easier to maintain. Large, complex queries can be made more understandable by using this technique.

WITH Using a PL/SQL Function

Oracle 12c introduced the ability to declare and define PL/SQL functions and procedures using the WITH clause. Once defined, you can reference the PL/SQL functions in the query in which you specify this clause (including any of its subqueries). Listing 10-3 walks through a simple example of how to create and use this new feature.

Listing 10-3. WITH Function Clause

SQL>WITH
  2  function calc_markup(p_markup number, p_price number) return number
  3  is
  4  begin
  5      return p_markup*p_price;
  6    end;
  7  select prod_name,
  8  prod_list_price cur_price,
  9  calc_markup(.05,prod_list_price) mup5,
 10  round(prod_list_price + calc_markup(.10,prod_list_price),2) new_price
 11  from sh.products;
 12  /

PROD_NAME                                                CUR_PRICE            MUP5       NEW_PRICE
-------------------------------------------------- --------------- --------------- ---------------
5MP Telephoto Digital Camera                                899.99         44.9995          989.99
17" LCD w/built-in HDTV Tuner                               999.99         49.9995         1099.99
Envoy 256MB - 40GB                                          999.99         49.9995         1099.99
Y Box                                                       299.99         14.9995          329.99
Mini DV Camcorder with 3.5" Swivel LCD                     1099.99         54.9995         1209.99
...
Finding Fido                                                 12.99           .6495           14.29
Adventures with Numbers                                      11.99           .5995           13.19
Extension Cable                                               7.99           .3995            8.79
Xtend Memory                                                 20.99          1.0495           23.09

72 rows selected.

Elapsed: 00:00:00.02

-- Using a WITH function within an outer query block

SQL>SELECT prod_name, cur_price, mup5, new_price
  2  FROM (
  3  WITH
  4    function calc_markup(p_markup number, p_price number) return number
  5    is
  6    begin
  7      return p_markup*p_price;
  8    end;
  9  select  prod_name,
 10  prod_list_price cur_price,
 11  calc_markup(.05,prod_list_price) mup5,
 12  round(prod_list_price + calc_markup(.10,prod_list_price),2) new_price
from sh.products
)
WHERE cur_price < 1000
AND   new_price >= 1000 ;
 17  /
WITH
*
ERROR at line 3:
ORA-32034: unsupported use of WITH clause

Elapsed: 00:00:00.01
SQL>
SQL>SELECT /*+ WITH_PLSQL */ prod_name, cur_price, mup5, new_price
  2  FROM (
  3  WITH
  4    function calc_markup(p_markup number, p_price number) return number
  5    is
  6    begin
  7      return p_markup*p_price;
  8    end;
  9  select  prod_name,
 10  prod_list_price cur_price,
 11  calc_markup(.05,prod_list_price) mup5,
 12  round(prod_list_price + calc_markup(.10,prod_list_price),2) new_price
 13  from sh.products
)
 15  WHERE cur_price < 1000
 16  AND   new_price >= 1000 ;
 17  /

PROD_NAME                                                CUR_PRICE            MUP5       NEW_PRICE
-------------------------------------------------- --------------- --------------- ---------------
17" LCD w/built-in HDTV Tuner                               999.99         49.9995         1099.99
Envoy 256MB - 40GB                                          999.99         49.9995         1099.99

2 rows selected.

Elapsed: 00:00:00.02

To run the statement, you must use the forward slash (/), which resembles how you would execute an anonymous PL/SQL block. This example shows only a single function declaration, but you may create as many functions and/or procedures as you wish. As also shown, one problem you may face occurs when you wish to place the WITH function inside an outer query block. When the WITH function is not the first declaration before the top-level query, you get an “ORA-32034: unsupported use of WITH clause” error. Fortunately, you can use the WITH_PLSQL hint to correct the problem. The hint, however, enables you to specify only the WITH function as shown; it is not an optimizer hint in that it does not have any bearing on the optimizer’s execution plan decisions.

Although this example is quite simple, imagine the possibilities of how you could use the WITH function clause. There are some expressions that are difficult to build without multiple steps, and you may have created user-defined PL/SQL functions that you use in your SQL statements. However, doing so requires that context switches must occur between the SQL and PL/SQL “engines” when your SQL statement is executed. These context switches incur overhead and can cause performance problems if used frequently. But now, with the advent of the WITH function clause in 12c, we have a construct built into the SQL language that can reduce or eliminate the need for user-defined PL/SQL functions and can allow you to do everything you need directly inline in your SQL statement.

Optimizing SQL

When a SQL query is designed or modified to take advantage of subquery factoring, there are some not-so-subtle changes that may take place when the optimizer creates an execution plan for the query. The following quote comes from the Oracle documentation in the Oracle Database SQL Language Reference (http://www.oracle.com/technetwork/indexes/documentation/index.html) for SELECT, under the subquery_factoring_clause heading: “The WITH query_name clause lets you assign a name to a subquery block. You can then reference the subquery block multiple places in the query by specifying query_name. Oracle Database optimizes the query by treating the query name as either an inline view or as a temporary table.”

Notice that Oracle may treat the factored subquery as a temporary table. In queries in which a table is referenced more than once, this could be a distinct performance advantage, because Oracle can materialize result sets from the query, thereby avoiding performing some expensive database operations more than once. The caveat here is that it “could be” a distinct performance advantage. Keep in mind that materializing the result set requires creating a temporary table and inserting the rows into it. Doing so may be of value if the same result set is referred to many times, or it may be a big performance penalty.

Testing Execution Plans

When examining the execution plans for subfactored queries, it may not be readily apparent whether Oracle is choosing the best execution plan. It may seem that the use of the INLINE or MATERIALIZE ¹ hint would result in better performing SQL. In some cases it may, but the use of these hints needs to be tested and considered in the context of overall application performance.

The need to test for optimum query performance can be illustrated by a report that management has requested. The report must show the distribution of customers by country and income level, showing only those countries and income levels that make up 1 percent or more of the entire customer base. A country and income level should also be reported if the number of customers in an income level bracket is greater than or equal to 25 percent of all customers in that income bracket. ²

The query in Listing 10-4 is the end result. ³ The cust factored subquery has been retained from previous queries. New are the subqueries in the HAVING clause; these are used to enforce the rules stipulated for the report.

Listing 10-4. WITH and MATERIALIZE

  1  with cust as (
  2   select /*+ materialize gather_plan_statistics */
  3     b.cust_income_level,
  4     a.country_name
  5   from sh.customers b
  6   join sh.countries a on a.country_id = b.country_id
  7  )
  8  select country_name, cust_income_level, count(country_name) country_cust_count
  9  from cust c
 10  having count(country_name) >
 11   (
 12     select count(*) * .01
 13     from cust c2
 14   )
 15   or count(cust_income_level) >=
 16   (
 17     select median(income_level_count)
 18     from (
 19       select cust_income_level, count(*) *.25 income_level_count
 20       from cust
 21       group by cust_income_level
 22     )
 23   )
 24  group by country_name, cust_income_level
 25 order by 1,2;
                                                    CUSTOMER
COUNTRY                        INCOME LEVEL            COUNT
------------------------------ -------------------- --------
France                         E: 90,000 - 109,999       585
France                         F: 110,000 - 129,999      651
...
United States of America       H: 150,000 - 169,999     1857
United States of America       I: 170,000 - 189,999     1395
...
35 rows selected.

Elapsed: 00:00:01.37

Statistics
-----------------------------------------------------
       1854  recursive calls
        307  db block gets
       2791  consistent gets
       1804  physical reads
        672  redo size
       4609  bytes sent via SQL*Net to client
        700  bytes received via SQL*Net from client
         18  SQL*Net roundtrips to/from client
         38  sorts (memory)
          0  sorts (disk)
         35  rows processed

----------------------------------------------------------------------------
| Id  | Operation                  | Name       | Starts | E-Rows | A-Rows |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |            |      1 |        |     35 |
|   1 |  TEMP TABLE TRANSFORMATION |            |      1 |        |     35 |
|   2 |   LOAD AS SELECT           |            |      1 |        |      0 |
|*  3 |    HASH JOIN               |            |      1 |  55500 |  55500 |
|   4 |     TABLE ACCESS FULL      | COUNTRIES  |      1 |     23 |     23 |
|   5 |     TABLE ACCESS FULL      | CUSTOMERS  |      1 |  55500 |  55500 |
|*  6 |   FILTER                   |            |      1 |        |     35 |
|   7 |    SORT GROUP BY           |            |      1 |     18 |    209 |
|   8 |     VIEW                   |            |      1 |  55500 |  55500 |
|   9 |      TABLE ACCESS FULL     | SYS_TEMP_0F|      1 |  55500 |  55500 |
|  10 |    SORT AGGREGATE          |            |      1 |      1 |      1 |
|  11 |     VIEW                   |            |      1 |  55500 |  55500 |
|  12 |      TABLE ACCESS FULL     | SYS_TEMP_0F|      1 |  55500 |  55500 |
|  13 |    SORT GROUP BY           |            |      1 |      1 |      1 |
|  14 |     VIEW                   |            |      1 |     11 |     13 |
|  15 |      SORT GROUP BY         |            |      1 |     11 |     13 |
|  16 |       VIEW                 |            |      1 |  55500 |  55500 |
|  17 |        TABLE ACCESS FULL   | SYS_TEMP_0F|      1 |  55500 |  55500 |
----------------------------------------------------------------------------

When executing ⁴ the SQL, all appears as you expect, then you check the execution plan and find that the join of the customers and countries tables underwent a TEMP TABLE TRANSFORMATION, and the rest of the query was satisfied by using the temporary table sys_temp_of. ⁵ At this point, you might rightly wonder if the execution plan chosen was a reasonable one. This can be tested easily, thanks to the MATERIALIZED and INLINE hints.

By using the INLINE hint, Oracle can be instructed to satisfy all portions of the query without using a TEMP TABLE TRANSFORMATION. The results of doing so are shown in Listing 10-5. Only the relevant portion of the SQL that has changed is shown here (the rest is identical to that in Listing 10-4).

Listing 10-5. WITH and INLINE Hint

1  with cust as (
2     select /*+ inline gather_plan_statistics */
3       b.cust_income_level,
4       a.country_name
5     from sh.customers b
6     join sh.countries a on a.country_id = b.country_id
7  )
...

COUNTRY                        INCOME LEVEL            COUNT
------------------------------ -------------------- --------
France                         E: 90,000 - 109,999       585
France                         F: 110,000 - 129,999      651
...
United States of America       I: 170,000 - 189,999     1395
United States of America       J: 190,000 - 249,999     1390
...
35 rows selected.

Elapsed: 00:00:00.62

Statistics
----------------------------------------------------------
       1501  recursive calls
          0  db block gets
       4758  consistent gets
       1486  physical reads
          0  redo size
       4609  bytes sent via SQL*Net to client
        700  bytes received via SQL*Net from client
         18  SQL*Net roundtrips to/from client
         34  sorts (memory)
          0  sorts (disk)
         35  rows processed

--------------------------------------------------------------------------
| Id  | Operation              | Name         | Starts | E-Rows | A-Rows |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |              |      1 |        |     35 |
|*  1 |  FILTER                |              |      1 |        |     35 |
|   2 |   SORT GROUP BY        |              |      1 |     20 |    236 |
|*  3 |    HASH JOIN           |              |      1 |  55500 |  55500 |
|   4 |     TABLE ACCESS FULL  | COUNTRIES    |      1 |     23 |     23 |
|   5 |     TABLE ACCESS FULL  | CUSTOMERS    |      1 |  55500 |  55500 |
|   6 |   SORT AGGREGATE       |              |      1 |      1 |      1 |
|*  7 |    HASH JOIN           |              |      1 |  55500 |  55500 |
|   8 |     INDEX FULL SCAN    | COUNTRIES_PK |      1 |     23 |     23 |
|   9 |     TABLE ACCESS FULL  | CUSTOMERS    |      1 |  55500 |  55500 |
|  10 |   SORT GROUP BY        |              |      1 |      1 |      1 |
|  11 |    VIEW                |              |      1 |     12 |     13 |
|  12 |     SORT GROUP BY      |              |      1 |     12 |     13 |
|* 13 |      HASH JOIN         |              |      1 |  55500 |  55500 |
|  14 |       INDEX FULL SCAN  | COUNTRIES_PK |      1 |     23 |     23 |
|  15 |       TABLE ACCESS FULL| CUSTOMERS    |      1 |  55500 |  55500 |
--------------------------------------------------------------------------

From the execution plan in Listing 10-5, you can see that three full scans were performed on the customers table and one full scan on the countries table. Two of the executions against the cust subquery required only the information in the COUNTRIES_PK index, so a full scan of the index was performed rather than a full scan of the table, saving time and resources. In this example, the query that used the MATERIALIZE hint had 2791 consistent gets whereas the second had 4758. However, it’s also important to note there was another difference: The second query used less resources for redo generation: 0 vs. 672. When deciding which format works best for your environment, you must consider carefully the decreased CPU resulting from lower consistent gets against the higher redo required to materialize the temporary table.

What may be surprising is that the execution using full table scans took 0.75 second, which is about 100 percent faster than when a temporary table is used. Of course, the cache was cold for both queries, because both the buffer cache and the shared pool were flushed prior to running each query. From these simple tests you might feel safe using the INLINE hint

in this bit of code, convinced that it will perform well based on the amount of physical IO required in the first test outweighing the memory usage and the logical IO required for the second test. If you know for sure that the size of the datasets will not grow and that the system load will remain fairly constant, using the INLINE hint in this query is probably a good idea. The problem, however, is that data are rarely static; often, data grow to a larger size than originally intended when developing a query. In this event, retesting these queries is in order to determine whether the use of the INLINE hint is still valid.

Testing the Effects of Query Changes

Even as data do not remain static, SQL is not always static. Sometimes requirements change, so code must be modified. What if the requirements changed for the examples in Listings 10-4 and 10-5? Would minor changes invalidate the use of the hints embedded in the SQL? This is probably something worth investigating, so let’s do so.

Previously, we were reporting on income brackets when the count of them for any country was greater than or equal to 25 percent of the total global count for that bracket. Now we are asked to include an income bracket if it is among those income brackets the number of which is greater than the median, based on the number of customers per bracket. This SQL is seen in Listing 10-6. Notice that the INLINE hint has been left in. So now there’s an additional full table scan and index scan compared with the execution plan in Listing 10-5. Although the elapsed time has increased, it still seems reasonable.

Listing 10-6. Modified Income Search: INLINE

  1  with cust as (
  2   select /*+ inline gather_plan_statistics */
  3     b.cust_income_level,
  4     a.country_name
  5   from sh.customers b
  6   join sh.countries a on a.country_id = b.country_id
  7  ),
  8  median_income_set as (
  9   select /*+ inline */ cust_income_level, count(*) income_level_count
 10   from cust
 11   group by cust_income_level
 12   having count(cust_income_level) > (
 13      select median(income_level_count) income_level_count
 14      from (
 15         select cust_income_level, count(*) income_level_count
 16         from cust
 17         group by cust_income_level
 18      )
 19   )
 20  )
 21  select country_name, cust_income_level, count(country_name) country_cust_count
 22  from cust c
 23  having count(country_name) >
 24   (
 25     select count(*) * .01
 26     from cust c2
 27   )
 28   or cust_income_level in ( select mis.cust_income_level from median_income_set mis)
 29  group by country_name, cust_income_level;
                                                    CUSTOMER
COUNTRY                        INCOME LEVEL            COUNT
------------------------------ -------------------- --------
Argentina                      D: 70,000 - 89,999         25
Argentina                      E: 90,000 - 109,999        39
...
United States of America       K: 250,000 - 299,999     1062
United States of America       L: 300,000 and above      982

123 rows selected.

Elapsed: 00:00:01.26

Statistics
----------------------------------------------------------
       1524  recursive calls
          0  db block gets
      23362  consistent gets
       1486  physical reads
          0  redo size
      15570  bytes sent via SQL*Net to client
       1195  bytes received via SQL*Net from client
         63  SQL*Net roundtrips to/from client
          3  sorts (memory)
          0  sorts (disk)
        123  rows processed

---------------------------------------------------------------------
| Id  | Operation               | Name        |Starts|E-Rows|A-Rows |
---------------------------------------------------------------------
|   0 | SELECT STATEMENT        |             |     1|      |   123 |
|*  1 |  FILTER                 |             |     1|      |   123 |
|   2 |   SORT GROUP BY         |             |     1|    20|   236 |
|*  3 |    HASH JOIN            |             |     1| 55500| 55500 |
|   4 |     TABLE ACCESS FULL   | COUNTRIES   |     1|    23|    23 |
|   5 |     TABLE ACCESS FULL   | CUSTOMERS   |     1| 55500| 55500 |
|   6 |   SORT AGGREGATE        |             |     1|     1|     1 |
|*  7 |    HASH JOIN            |             |     1| 55500| 55500 |
|   8 |     INDEX FULL SCAN     | COUNTRIES_PK|     1|    23|    23 |
|   9 |     TABLE ACCESS FULL   | CUSTOMERS   |     1| 55500| 55500 |
|* 10 |   FILTER                |             |    13|      |     6 |
|  11 |    HASH GROUP BY        |             |    13|     1|   133 |
|* 12 |     HASH JOIN           |             |    13| 55500|   721K|
|  13 |      INDEX FULL SCAN    | COUNTRIES_PK|    13|    23|   299 |
|  14 |      TABLE ACCESS FULL  | CUSTOMERS   |    13| 55500|   721K|
|  15 |    SORT GROUP BY        |             |     1|     1|     1 |
|  16 |     VIEW                |             |     1|    12|    13 |
|  17 |      SORT GROUP BY      |             |     1|    12|    13 |
|* 18 |       HASH JOIN         |             |     1| 55500| 55500 |
|  19 |        INDEX FULL SCAN  | COUNTRIES_PK|     1|    23|    23 |
|  20 |        TABLE ACCESS FULL| CUSTOMERS   |     1| 55500| 55500 |
---------------------------------------------------------------------

Now that there’s an additional table scan and index scan, how do you think the performance of this query fares if temporary table transformations are allowed to take place? The results can be seen in Listing 10-7.

Listing 10-7. Modified Income Search: MATERIALIZE

1  with cust as (
2   select /*+ materialize gather_plan_statistics */
3     b.cust_income_level,
4     a.country_name
5   from sh.customers b
6   join sh.countries a on a.country_id = b.country_id
7  ),
...
                                                    CUSTOMER
COUNTRY                        INCOME LEVEL            COUNT
------------------------------ -------------------- --------
Argentina                      D: 70,000 - 89,999         25
Argentina                      E: 90,000 - 109,999        39
...
United States of America       K: 250,000 - 299,999     1062
United States of America       L: 300,000 and above      982

123 rows selected.

Elapsed: 00:00:00.87

Statistics
----------------------------------------------------------
       2001  recursive calls
        324  db block gets
       3221  consistent gets
       1822  physical reads
       1244  redo size
      15570  bytes sent via SQL*Net to client
       1195  bytes received via SQL*Net from client
         63  SQL*Net roundtrips to/from client
         38  sorts (memory)
          0  sorts (disk)
        123  rows processed

-------------------------------------------------------------------
| Id |Operation                 |Name       |Starts|E-Rows|A-Rows |
-------------------------------------------------------------------
|   0|SELECT STATEMENT          |           |     1|      |   123 |
|   1| TEMP TABLE TRANSFORMATION|           |     1|      |   123 |
|   2|  LOAD AS SELECT          |           |     1|      |     0 |
|*  3|   HASH JOIN              |           |     1| 55500| 55500 |
|   4|    TABLE ACCESS FULL     |COUNTRIES  |     1|    23|    23 |
|   5|    TABLE ACCESS FULL     |CUSTOMERS  |     1| 55500| 55500 |
|   6|  LOAD AS SELECT          |           |     1|      |     0 |
|*  7|   FILTER                 |           |     1|      |     6 |
|   8|    HASH GROUP BY         |           |     1|     1|    13 |
|   9|     VIEW                 |           |     1| 55500| 55500 |
|  10|      TABLE ACCESS FULL   |SYS_TEMP_0F|     1| 55500| 55500 |
|  11|    SORT GROUP BY         |           |     1|     1|     1 |
|  12|     VIEW                 |           |     1|    12|    13 |
|  13|      SORT GROUP BY       |           |     1|    12|    13 |
|  14|       VIEW               |           |     1| 55500| 55500 |
|  15|        TABLE ACCESS FULL |SYS_TEMP_0F|     1| 55500| 55500 |
|* 16|  FILTER                  |           |     1|      |   123 |
|  17|   SORT GROUP BY          |           |     1|    20|   236 |
|  18|    VIEW                  |           |     1| 55500| 55500 |
|  19|     TABLE ACCESS FULL    |SYS_TEMP_0F|     1| 55500| 55500 |
|  20|   SORT AGGREGATE         |           |     1|     1|     1 |
|  21|    VIEW                  |           |     1| 55500| 55500 |
|  22|     TABLE ACCESS FULL    |SYS_TEMP_0F|     1| 55500| 55500 |
|* 23|   VIEW                   |           |    13|     1|     6 |
|  24|    TABLE ACCESS FULL     |SYS_TEMP_0F|    13|     1|    63 |
-------------------------------------------------------------------

Because there’s that additional scan taking place in the modified version of the query, the overhead of logical IO becomes more apparent. It is significantly more efficient with this query to allow Oracle to perform table transformations, writing the results of the hash join to a temporary table on disk, where they can be reused throughout the query.

Seizing Other Optimization Opportunities

There are other opportunities when subquery factoring may be used to your advantage. If you are working on applications that were originally written several years ago, you may find that some of the SQL could use a bit of improvement based on the features offered by Oracle versions 9i and later. The query in Listing 10-8, for example, does exactly what it was asked to do, which is to find the average, minimum, and maximum costs for each product that was produced in the year 2000, with the costs calculated for each of the sale channels in which the product was sold. This SQL is not only difficult to read and hard to modify, but also it is somewhat inefficient.

Listing 10-8. Old SQL to Calculate Costs

  1  select /*+ gather_plan_statistics */
  2     substr(prod_name,1,30) prod_name
  3     , channel_desc
  4     , (
  5        select avg(c2.unit_cost)
  6        from sh.costs c2
  7        where c2.prod_id = c.prod_id and c2.channel_id = c.channel_id
  8        and c2.time_id between to_date('01/01/2000','mm/dd/yyyy')
  9         and to_date('12/31/2000')
 10        ) avg_cost
 11     , (
 12        select min(c2.unit_cost)
 13        from sh.costs c2
 14        where c2.prod_id = c.prod_id and c2.channel_id = c.channel_id
 15        and c2.time_id between to_date('01/01/2000','mm/dd/yyyy')
 16         and to_date('12/31/2000')
 17        ) min_cost
 18     , (
 19        select max(c2.unit_cost)
 20        from sh.costs c2
 21        where c2.prod_id = c.prod_id and c2.channel_id = c.channel_id
 22        and c2.time_id between to_date('01/01/2000','mm/dd/yyyy')
 23         and to_date('12/31/2000')
 24        ) max_cost
 25  from (
 26     select distinct pr.prod_id, pr.prod_name, ch.channel_id, ch.channel_desc
 27     from sh.channels ch
 28        , sh.products pr
 29        , sh.costs co
 30     where ch.channel_id = co.channel_id
 31        and co.prod_id = pr.prod_id
 32        and co.time_id between to_date('01/01/2000','mm/dd/yyyy')
 33         and to_date('12/31/2000')
 34  ) c
 35 order by prod_name, channel_desc;

PRODUCT                        CHANNEL_DESC           AVG COST   MIN COST   MAX COST
------------------------------ -------------------- ---------- ---------- ----------
1.44MB External 3.5" Diskette  Direct Sales               8.36       7.43       9.17
1.44MB External 3.5" Diskette  Internet                   8.59       7.42       9.55
...
Y Box                          Internet                 266.73     245.00     282.30
Y Box                          Partners                 272.62     242.79     293.68
                                                    ---------- ---------- ----------
sum                                                  27,961.39  24,407.85  34,478.10

216 rows selected.
COLD CACHE Elapsed: 00:00:02.30
WARM CACHE Elapsed: 00:00:01.09
-----------------------------------------------------------------------------------
| Id  | Operation                            |Name          |Starts|E-Rows|A-Rows |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |              |     1|      |   216 |
|   1 |  SORT AGGREGATE                      |              |   216|     1|   216 |
|*  2 |   FILTER                             |              |   216|      | 17373 |
|   3 |    PARTITION RANGE ITERATOR          |              |   216|    96| 17373 |
|*  4 |     TABLE ACCESS BY LOCAL INDEX ROWID|COSTS         |   864|    96| 17373 |
|   5 |      BITMAP CONVERSION TO ROWIDS     |              |   864|      | 52119 |
|   6 |       BITMAP AND                     |              |   864|      |   840 |
|   7 |        BITMAP MERGE                  |              |   864|      |   864 |
|*  8 |         BITMAP INDEX RANGE SCAN      |COSTS_TIME_BIX|   864|      | 79056 |
|*  9 |        BITMAP INDEX SINGLE VALUE     |COSTS_PROD_BIX|   864|      |   840 |
|  10 |  SORT AGGREGATE                      |              |   216|     1|   216 |
|* 11 |   FILTER                             |              |   216|      | 17373 |
|  12 |    PARTITION RANGE ITERATOR          |              |   216|    96| 17373 |
|* 13 |     TABLE ACCESS BY LOCAL INDEX ROWID|COSTS         |   864|    96| 17373 |
|  14 |      BITMAP CONVERSION TO ROWIDS     |              |   864|      | 52119 |
|  15 |       BITMAP AND                     |              |   864|      |   840 |
|  16 |        BITMAP MERGE                  |              |   864|      |   864 |
|* 17 |         BITMAP INDEX RANGE SCAN      |COSTS_TIME_BIX|   864|      | 79056 |
|* 18 |        BITMAP INDEX SINGLE VALUE     |COSTS_PROD_BIX|   864|      |   840 |
|  19 |  SORT AGGREGATE                      |              |   216|     1|   216 |
|* 20 |   FILTER                             |              |   216|      | 17373 |
|  21 |    PARTITION RANGE ITERATOR          |              |   216|    96| 17373 |
|* 22 |     TABLE ACCESS BY LOCAL INDEX ROWID|COSTS         |   864|    96| 17373 |
|  23 |      BITMAP CONVERSION TO ROWIDS     |              |   864|      | 52119 |
|  24 |       BITMAP AND                     |              |   864|      |   840 |
|  25 |        BITMAP MERGE                  |              |   864|      |   864 |
|* 26 |         BITMAP INDEX RANGE SCAN      |COSTS_TIME_BIX|   864|      | 79056 |
|* 27 |        BITMAP INDEX SINGLE VALUE     |COSTS_PROD_BIX|   864|      |   840 |
|  28 |  SORT ORDER BY                       |              |     1| 20640|   216 |
|  29 |   VIEW                               |              |     1| 20640|   216 |
|  30 |    HASH UNIQUE                       |              |     1| 20640|   216 |
|* 31 |     FILTER                           |              |     1|      | 17373 |
|* 32 |      HASH JOIN                       |              |     1| 20640| 17373 |
|  33 |       TABLE ACCESS FULL              |PRODUCTS      |     1|    72|    72 |
|* 34 |       HASH JOIN                      |              |     1| 20640| 17373 |
|  35 |        TABLE ACCESS FULL             |CHANNELS      |     1|     5|     5 |
|  36 |        PARTITION RANGE ITERATOR      |              |     1| 20640| 17373 |
|* 37 |         TABLE ACCESS FULL            |COSTS         |     4| 20640| 17373 |
-----------------------------------------------------------------------------------

Note In several of the following listings, you will see an elapsed time for COLD CACHE and WARM CACHE. The COLD CACHE time is the first execution of the statement immediately following a flush of both the buffer cache and the shared pool.

Examining the output of Listing 10-8, note that the elapsed execution time on a cold cache is 2.30 seconds, and 1.09 seconds on a warm cache. These times don’t seem all that bad at first; but, when you examine the execution plan, you find that this query can be improved on from a performance perspective as well as a readability perspective.

The starts column is telling. Each execution against the costs table is executed 864 times. This is because there are 216 rows produced by a join between channels, products, and costs. Also, the costs table is queried in four separate places for the same information. By using subquery factoring, not only can this SQL be cleaned up and made easier to read, but also it can be made more efficient.

As seen in Listing 10-9, you can start by putting the begin_date and end_date columns in a separate query bookends, leaving only one place that the values need to be set. The data for products are placed in the prodmaster subquery. Although this bit of the SQL worked fine as a subquery in the FROM clause, the readability of the SQL statement as a whole is greatly improved by moving it to a factored subquery.

Listing 10-9. Old SQL Refactored Using the WITH Clause

  1  with bookends as (
  2     select
  3      to_date('01/01/2000','mm/dd/yyyy') begin_date
  4        ,to_date('12/31/2000','mm/dd/yyyy') end_date
  5     from dual
  6  ),
  7  prodmaster as (
  8     select distinct pr.prod_id, pr.prod_name, ch.channel_id, ch.channel_desc
  9     from sh.channels ch
 10        , sh.products pr
 11        , sh.costs co
 12     where ch.channel_id = co.channel_id
 13        and co.prod_id = pr.prod_id
 14        and co.time_id between (select begin_date from bookends)
 15         and (select end_date from bookends)
 16  ),
 17  cost_compare as (
 18        select
 19           prod_id
 20           , channel_id
 21           , avg(c2.unit_cost) avg_cost
 22           , min(c2.unit_cost) min_cost
 23           , max(c2.unit_cost) max_cost
 24        from sh.costs c2
 25        where c2.time_id between (select begin_date from bookends)
 26         and (select end_date from bookends)
 27        group by c2.prod_id, c2.channel_id
 28  )
 29  select /*+ gather_plan_statistics */
 30     substr(pm.prod_name,1,30) prod_name
 31     , pm.channel_desc
 32     , cc.avg_cost
 33     , cc.min_cost
 34     , cc.max_cost
 35  from prodmaster pm
 36  join cost_compare cc on cc.prod_id = pm.prod_id
 37     and cc.channel_id = pm.channel_id
 38 order by pm.prod_name, pm.channel_desc;

PRODUCT                        CHANNEL_DESC           AVG COST   MIN COST   MAX COST
------------------------------ -------------------- ---------- ---------- ----------
1.44MB External 3.5" Diskette  Direct Sales               8.36       7.43       9.17
1.44MB External 3.5" Diskette  Internet                   8.59       7.42       9.55

Y Box                          Internet                 266.73     245.00     282.30
Y Box                          Partners                 272.62     242.79     293.68
                                                    ---------- ---------- ----------
sum                                                  27,961.39  24,407.85  34,478.10

216 rows selected.

COLD CACHE Elapsed: 00:00:01.48
WARM CACHE Elapsed: 00:00:00.17
----------------------------------------------------------------------------------
| Id |Operation                              |Name          |Starts|E-Rows|A-Rows|
----------------------------------------------------------------------------------
|   0|SELECT STATEMENT                       |              |     1|      |   216|
|   1| SORT ORDER BY                         |              |     1| 17373|   216|
|*  2|  HASH JOIN                            |              |     1| 17373|   216|
|   3|   VIEW                                |              |     1|   216|   216|
|   4|    HASH GROUP BY                      |              |     1|   216|   216|
|   5|     PARTITION RANGE ITERATOR          |              |     1| 17373| 17373|
|   6|      TABLE ACCESS BY LOCAL INDEX ROWID|COSTS         |     4| 17373| 17373|
|   7|       BITMAP CONVERSION TO ROWIDS     |              |     4|      | 17373|
|*  8|        BITMAP INDEX RANGE SCAN        |COSTS_TIME_BIX|     4|      |   366|
|   9|         FAST DUAL                     |              |     1|     1|     1|
|  10|         FAST DUAL                     |              |     1|     1|     1|
|  11|   VIEW                                |              |     1| 17373|   216|
|  12|    HASH UNIQUE                        |              |     1| 17373|   216|
|* 13|     HASH JOIN                         |              |     1| 17373| 17373|
|  14|      TABLE ACCESS FULL                |PRODUCTS      |     1|    72|    72|
|  15|      MERGE JOIN                       |              |     1| 17373| 17373|
|  16|       TABLE ACCESS BY INDEX ROWID     |CHANNELS      |     1|     5|     4|
|  17|        INDEX FULL SCAN                |CHANNELS_PK   |     1|     5|     4|
|* 18|       SORT JOIN                       |              |     4| 17373| 17373|
|  19|        PARTITION RANGE ITERATOR       |              |     1| 17373| 17373|
|  20|         TABLE ACCESS BY LOCAL INDEX RO|COSTS         |     4| 17373| 17373|
|  21|          BITMAP CONVERSION TO ROWIDS  |              |     4|      | 17373|
|* 22|           BITMAP INDEX RANGE SCAN     |COSTS_TIME_BIX|     4|      |   366|
|  23|            FAST DUAL                  |              |     1|     1|     1|
|  24|            FAST DUAL                  |              |     1|     1|     1|
----------------------------------------------------------------------------------

The calculations for the average, minimum, and maximum costs are replaced with a single subquery called cost_compare. Last, the SQL that joins the prodmaster and cost_compare subqueries is added. The structure of the SQL is now much easier on the eyes and the overworked developer’s brain. It’s also simpler for the DBA to understand. The DBA will be especially happy with the execution statistics.

Where the old SQL queried the costs table and costs_time_bix index several hundred times, the new SQL queries each only eight times. This is quite an improvement, and it shows in the elapsed times. The query time on a cold cache is 1.48 seconds, about 25 percent better than the old SQL. On a warm cache, however, the refactored SQL really shines, running at 0.17 second whereas the old SQL managed only 1.09 seconds.

Applying Subquery Factoring to PL/SQL

We discussed the new 12c WITH PL/SQL function earlier, but there are other ways that PL/SQL can present golden opportunities for optimization using subquery factoring. Something that most of us have done at one time or another is to write a PL/SQL routine when we cannot figure out how to do what we want in a single SQL query. Sometimes it can be very difficult to capture everything in a single statement. It’s often just easier to think procedurally rather than in sets of data, and just write some code to do what we need. As you gain experience, you will rely less and less on thinking in terms of “How would I code this in PL/SQL?” and will rely more along the lines of “How do I capture this problem in a single SQL statement?” The more advanced features that Oracle has packed into SQL can help as well.

Here’s an example. You’ve been asked to create a report with the following criteria:

Only include customers who have purchased products in at least three different years.
Compute total aggregate sales per customer, broken down by product category.

At first, this doesn’t seem too difficult, but you may struggle for a bit when trying to capture this in one SQL statement. So, you decide to use a PL/SQL routine to get the needed data. The results may be similar to those in Listing 10-10. The logic is simple. Find all customers that fit the criteria and store their IDs in a temporary table. Then, loop through the newly saved customer IDs and find all their sales, sum them up, and add them to another temporary table. The results are then joined to the customers and products tables to generate the report.

Listing 10-10. PL/SQL to Generate Customer Report

SQL> create global temporary table cust3year ( cust_id number );
Table created.

SQL> create global temporary table sales3year(
  2     cust_id number ,
  3     prod_category varchar2(50),
  4     total_sale number
  5  )
  6  /
Table created.

SQL> begin
  2     execute immediate 'truncate table cust3year';
  3     execute immediate 'truncate table sales3year';
  4
  5     insert into cust3year
  6     select cust_id --, count(cust_years) year_count
  7     from (
  8             select distinct cust_id, trunc(time_id,'YEAR') cust_years
  9             from sh.sales
 10     )
 11     group by cust_id
 12     having count(cust_years) >= 3;
 13
 14     for crec in ( select cust_id from cust3year)
 15     loop
 16             insert into sales3year
 17             select s.cust_id,p.prod_category, sum(co.unit_price * s.quantity_sold)
 18             from sh.sales s
 19             join sh.products p on p.prod_id = s.prod_id
 20             join sh.costs co on co.prod_id = s.prod_id
 21                     and co.time_id = s.time_id
 22             join sh.customers cu on cu.cust_id = s.cust_id
 23             where s.cust_id = crec.cust_id
 24             group by s.cust_id, p.prod_category;
 25     end loop;
 26  end;
 27  /
PL/SQL procedure successfully completed.
Elapsed: 00:01:17.48
SQL> break on report
SQL> compute sum of total_sale on report

SQL> select c3.cust_id, c.cust_last_name, c.cust_first_name, s.prod_category, s.total_sale
  2  from cust3year c3
  3  join sales3year s on s.cust_id = c3.cust_id
  4  join sh.customers c on c.cust_id = c3.cust_id
  5  order by 1,4;

  CUST ID LAST NAME       FIRST NAME      PRODUCT CATEGORY                    TOTAL SALE
--------- --------------- --------------- ------------------------------ ---------------
        6 Charles         Harriett        Electronics                           2,838.57
        6 Charles         Harriett        Hardware                             19,535.38
    ...
    50833 Gravel          Grover          Photo                                15,469.64
    50833 Gravel          Grover          Software/Other                        9,028.87
                                                                         ---------------
sum                                                                       167,085,605.71
16018 rows selected.

The code in Listing 10-10 is fairly succinct and it only takes 1:17 minutes to run. This isn’t too bad, is it? Although this is a nice little chunk of PL/SQL, take another look at it and think in terms of subfactored subqueries. The section that determines the correct customer IDs can be captured in a WITH clause fairly easily. Once the customers are identified, it is then a fairly easy job to use the results of the subquery to look up the needed sales, product, and customer information to create the report.

Listing 10-11 has a single SQL statement that captures what is done with the PL/SQL routine from Listing 10-10—without the need to create temporary tables manually or use PL/SQL loops. Should the use of temporary tables make for a more efficient query, Oracle does so automatically, or you can choose how Oracle preserves the subquery results via the INLINE and MATERIALIZE hints. It is more efficient, too, with an elapsed time of 6.13 seconds.

Listing 10-11. Use of the WITH Clause to Generate the Customer Report

  1  with custyear as (
  2   select cust_id, extract(year from time_id) sales_year
  3   from sh.sales
  4   where extract(year from time_id) between 1998 and 2002
  5   group by cust_id,  extract(year from time_id)
  6  ),
  7  custselect as (
  8   select distinct cust_id
  9   from (
 10      select cust_id, count(*) over ( partition by cust_id) year_count
 11      from custyear
 12   )
 13   where  year_count >= 3 -- 3 or more years as a customer during period
 14  )
 15  select cu.cust_id, cu.cust_last_name, cu.cust_first_name, p.prod_category, sum(co.unit_price * s.quantity_sold) total_sale
 16  from custselect cs
 17  join sh.sales s on s.cust_id = cs.cust_id
 18  join sh.products p on p.prod_id = s.prod_id
 19  join sh.costs co on co.prod_id = s.prod_id
 20   and co.time_id = s.time_id
 21  join sh.customers cu on cu.cust_id = cs.cust_id
 22  group by cu.cust_id, cu.cust_last_name, cu.cust_first_name, p.prod_category
 23  order by cu.cust_id;

  CUST ID LAST NAME       FIRST NAME      PRODUCT CATEGORY                    TOTAL SALE
--------- --------------- --------------- ------------------------------ ---------------
        6 Charles         Harriett        Electronics                           2,838.57
        6 Charles         Harriett        Hardware                             19,535.38
...
    50833 Gravel          Grover          Photo                                15,469.64
    50833 Gravel          Grover          Software/Other                        9,028.87
                                                                         ---------------
sum                                                                       167,085,605.71

16018 rows selected.

Elapsed: 00:00:06.13

The WITH clause in Listing 10-11 actually uses two subqueries that can be combined into a single query, but I thought it easier to read when they are broken into two queries. Notice the use of the EXTRACT() function;it simplifies comparing years by extracting the year from a date and converting it to an integer.

The SQL examples in this section of the chapter are not meant to be tuning exercises, but merely demonstrations that show how subquery factoring may be used. When refactoring legacy SQL to take advantage of the WITH clause, be sure to test the results. Subquery factoring can be used to organize some queries better, and in some cases can even be used as an optimization tool. Learning to use it adds another tool to your Oracle toolbox.

EXPERIMENT WITH SUBQUERY FACTORING

Included in this chapter are two scripts in the Exercises folder that you may want to experiment with. These scripts both run against the SH demo schema.

Exercises/l_10_exercise_1.sql
Exercises/l_10_exercise_2.sql

Run these scripts with both the MATERIALIZE and INLINE hints to compare performance. In the tsales subquery, a WHERE clause limits the data returned to a single year. Comment out the WHERE clause and run the queries again. How does the efficiency of the two hints compare now? Would you feel comfortable using these hints when the size of the data set is set at runtime by user input?

Recursive Subqueries

Beginning in Oracle 11.2, recursive subquery factoring (RSF for the remainder of this chapter) was added. As you can probably guess, the ANSI name for this feature is recursive common table expression. Regardless of what you call it, Oracle has had a similar feature for a very long time in the form of the CONNECT BY ⁶ clause of the SELECT statement. This feature has been enhanced in Oracle 11gR2.

A CONNECT BY Example

Let’s begin by looking at a traditional CONNECT BY query such as that in Listing 10-12. The emp inline view is used to join the employee and department tables, and then the single dataset is presented to the SELECT . . . CONNECT BY statement. The PRIOR operator is used to match the current EMPLOYEE_ID to rows where this value is in the MANAGER_ID column. Doing so iteratively creates a recursive query.

Listing 10-12 contains a number of extra columns in the output to help explain how the PRIOR operator works. Let’s take a look at the output beginning with the row for Lex De Haan. You can see that the employee_id for Lex is 102. The PRIOR operator finds all rows for which the manager_id is 102 and includes them under the hierarchy for Lex De Haan. The only row that meets these criteria is the one for Alexander Hunold, with an employee_id of 103. The process is then repeated for Alexander Hunold: Are there any rows for which the manager_id is 103? There are four rows found with a manager_id of 103—those for employees Valli Pattaballa, Diana Lorentz, Bruce Ernst, and David Austin—so these are included in the output below Alexander Hunold. Because there are no rows for which any of the employee_id values for these four employees appears as a manager_id, Oracle moves back up to a level for which the rows have not yet been processed (in this case, for Alberto Errazuriz) and continues on to the end until all rows have been processed.

Listing 10-12. Basic CONNECT BY

  1  select lpad(' ', level*2-1,' ') || emp.emp_last_name emp_last_name
  2   , emp.emp_first_name
  3   , emp.employee_id
  4   , emp.mgr_last_name, emp.mgr_first_name
  5   , emp.manager_id
  6   , department_name
  7  from (
  8   select /*+ inline gather_plan_statistics */
  9      e.last_name emp_last_name, e.first_name emp_first_name
 10      , e.employee_id, d.department_id
 11      , e.manager_id, d.department_name
 12      , es.last_name mgr_last_name, es.first_name mgr_first_name
 13   from hr.employees e
 14   left outer join hr.departments d on d.department_id = e.department_id
 15   left outer join hr.employees es on es.employee_id = e.manager_id
 16  ) emp
 17  connect by prior emp.employee_id = emp.manager_id
 18  start with emp.manager_id is null
 19 order siblings by emp.emp_last_name;

EMP_LAST_NAME    EMP_FIRST_NAME  EMP ID MGR_LAST_NAME    MGR_FIRST_NAME  MGR ID DEPARTMENT
---------------- --------------- ------ ---------------- --------------- ------ ------------
 King            Steven             100                                     Executive
   Cambrault     Gerald             148 King             Steven             100 Sales
     Bates       Elizabeth          172 Cambrault        Gerald             148 Sales
     Bloom       Harrison           169 Cambrault        Gerald             148 Sales
     Fox         Tayler             170 Cambrault        Gerald             148 Sales
     Kumar       Sundita            173 Cambrault        Gerald             148 Sales
     Ozer         Lisa               168 Cambrault        Gerald             148 Sales
     Smith        William            171 Cambrault        Gerald             148 Sales
   De Haan        Lex                102 King             Steven             100 Executive
     Hunold       Alexander          103 De Haan          Lex                102 IT
       Austin     David              105 Hunold           Alexander          103 IT
       Ernst      Bruce              104 Hunold           Alexander          103 IT
       Lorentz    Diana              107 Hunold           Alexander          103 IT
       Pataballa  Valli              106 Hunold           Alexander          103 IT
   Errazuriz      Alberto            147 King             Steven             100 Sales
     Ande         Sundar             166 Errazuriz        Alberto            147 Sales
     Banda        Amit               167 Errazuriz        Alberto            147 Sales
...

107 rows selected.

The START WITH clause is instructed to begin with a value for which manager_id is null. Because this is an organizational hierarchy with a single person at the top of the hierarchy, this causes the query to start with Stephen King. As the chief executive officer, King does not have a manager, so the manager_id column is set to NULL for his row.

The level pseudocolumn holds the value for the depth of the recursion, allowing for a simple method to indent the output so that the organizational hierarchy is visible.

The Example Using an RSF

The example query on the employees table has been rewritten in Listing 10-13 to use RSF, in which the main subquery is emp_recurse. The anchor member in this case simply selects the topmost row in the hierarchy by selecting the only row where manager_id is null. This is equivalent to start with emp.manager_id is null in Listing 10-12. The recursive member references the defining query emp_recurse by joining it to emp query. This join is used to locate the row corresponding to each employee’s manager, which is equivalent to connect by prior emp.employee_id = emp.manager_id in Listing 10-12. The results in Listing 10-13 are identical to those in Listing 10-12.

Listing 10-13. Basic Recursive Subquery Factoring

  1  with emp as (
  2   select /*+ inline gather_plan_statistics */
  3      e.last_name, e.first_name, e.employee_id, e.manager_id, d.department_name
  4   from hr.employees e
  5   left outer join hr.departments d on d.department_id = e.department_id
  6  ),
  7  emp_recurse (last_name,first_name,employee_id,manager_id,department_name,lvl) as (
  8   select e.last_name, e.first_name
  9      , e.employee_id, e.manager_id
 10      , e.department_name, 1 as lvl
 11   from emp e where e.manager_id is null
 12   union all
 13   select emp.last_name, emp.first_name
 14   , emp.employee_id, emp.manager_id
 15   ,emp.department_name, empr.lvl + 1 as lvl
 16   from emp
 17   join emp_recurse empr on empr.employee_id = emp.manager_id
 18  )
 19   search depth first by last_name set order1
 20  select lpad(' ', lvl*2-1,' ') || er.last_name last_name
 21   , er.first_name
 22   , er.department_name
 23 from emp_recurse er;

LAST_NAME                 FIRST_NAME           DEPARTMENT
------------------------- -------------------- ------------
 King                     Steven               Executive
   Cambrault              Gerald               Sales
     Bates                Elizabeth            Sales
     Bloom                Harrison             Sales
     Fox                  Tayler               Sales
     Kumar                Sundita              Sales
     Ozer                 Lisa                 Sales
     Smith                William              Sales
   De Haan                Lex                  Executive
     Hunold               Alexander            IT
       Austin             David                IT
       Ernst              Bruce                IT
       Lorentz            Diana                IT
       Pataballa          Valli                IT
   Errazuriz              Alberto              Sales
     Ande                 Sundar               Sales
     Banda                Amit                 Sales
    ...

107 rows selected.

Although the new RSF method may at first appear verbose, the basis of how it works is simpler to understand than CONNECT BY, and it allows for more complex queries. The recursive WITH clause requires two query blocks: the anchor member and the recursive member. These two query blocks must be combined with the UNION ALL set operator. The anchor member is the query prior to the UNION ALL, whereas the recursive member is the query that follows. The recursive member must reference the defining subquery; in doing so, it is recursive.

Restrictions on RSF

As you might imagine, the use of RSF is quite a bit more flexible than CONNECT BY. There are some restrictions on its use, however. Per the Oracle documentation for the SELECT statement, the following elements cannot be used in the recursive member of an RSF:

The DISTINCT keyword or a GROUP BY clause
The MODEL clause
An aggregate function; however, analytic functions are permitted in the select list
Subqueries that refer to query_name
Outer joins that refer to query_name as the right table

Differences from CONNECT BY

There are several differences when using RSF compared with CONNECT BY, and some of them are apparent in Listing 10-13. You may have wondered what happened to the level pseudocolumn because it is missing in this query, replaced by the lvl column. I explain this a little later on. Also, notice that the columns returned by an RSF query must be specified in the query definition, as seen in line 7 of Listing 10-13. One more new feature is SEARCH DEPTH FIRST seen on line 19. The default search is BREADTH FIRST, which is not usually the output you want from a hierarchical query. Listing 10-14 shows the output when the SEARCH clause is not used or it is set to BREADTH FIRST. This search returns rows of all siblings at each level before returning any child rows. Specifying SEARCH DEPTH FIRST returns the rows in hierarchical order. The SET ORDER1 portion of the SEARCH clause sets the value of the order1 pseudocolumn to the value of the order in which the rows are returned, similar to what you might see with ROWNUM, but you get to name the column. This is also used in later examples.

Listing 10-14. Default BREADTH FIRST Search

...
        search breadth first by last_name set order1
select lpad(' ', lvl*2-1,' ') || er.last_name last_name
...

LAST_NAME                 FIRST_NAME           DEPARTMENT_NAME
------------------------- -------------------- -----------------
 King                     Steven               Executive
   Cambrault              Gerald               Sales
   De Haan                Lex                  Executive
   Errazuriz              Alberto              Sales
   Fripp                  Adam                 Shipping
   Hartstein              Michael              Marketing
   Kaufling               Payam                Shipping
   Kochhar                Neena                Executive
   Mourgos                Kevin                Shipping
   Partners               Karen                Sales
   Raphaely               Den                  Purchasing
   Russell                John                 Sales
   Vollman                Shanta               Shipping
   Weiss                  Matthew              Shipping
   Zlotkey                Eleni                Sales
     Abel                 Ellen                Sales
     Ande                 Sundar               Sales
...

Notice that the SEARCH clause, as it is used in Listings 10-13 and 10-14, specifies that the search be by last_name. This could also be by first_name, or by a column list, such as last_name,first_name. Doing so controls the order of the rows within each level. The SEARCH clause ends with SET ORDER1, which effectively adds the order1 pseudocolumn to the column list returned by the recursive subquery. You will see it used more in some of the following examples.

Duplicating CONNECT BY Functionality

As Oracle Database has progressed through several versions, the functionality of the CONNECT BY clause has progressed as well. There are a number of hierarchical query operators, pseudocolumns, and one function available to CONNECT BY that are not natively available to RSF. The functionality these provide, however, can be duplicated in RSF. The functionality may not mimic exactly what occurs when CONNECT BY is used, but it can likely be made to do what you need. The trick to getting what you want from RSF sometimes requires stepping away from the keyboard and thinking about the results you want to achieve, rather than thinking about how you are going to code it. It is amazing how the change in perspective helps you achieve the desired output easily from the SQL you write.

The operators and pseudocolumns for CONNECT BY are listed in Table 10-1. I go through each of these as needed, showing example usages for CONNECT BY, and then duplicating that functionality with RSF. Keep in mind that RSF is quite versatile, so TMTOWTDI (There’s More Than One Way to Do It) is definitely in force. Feel free to experiment and find other methods to achieve the same results.

Table 10-1. CONNECT BY Functions, Operators, and Pseudocolumns

Type	Name	Purpose
Function	SYS_CONNECT_BY_PATH	Returns all ancestors for the current row.
Operator	CONNECT_BY_ROOT	Returns the value from a root row.
Operator	PRIOR	Used to indicate hierarchical query. Not needed in a recursive subquery.
Pseudocolumn	connect_by_iscycle	Detects cycles in the hierarchy.
Parameter	NOCYCLE	Used with CONNECT_BY_ISCYCLE. Parameter for CONNECT BY.
Pseudocolumn	connect_by_isleaf	Identifies leaf rows.
Pseudocolumn	level	Used to indicate level of depth in the hierarchy.

I also cover the SEARCH clause of RSF because it is instrumental in solving some problems.

The level Pseudocolumn

Let’s start with the level pseudocolumn, which is used frequently in hierarchical queries to indent the output, creating a visual representation of the hierarchy. Listing 10-15 contains a simple example showing how level is generated. As the hierarchy increases in depth, level is incremented. Likewise, level is decremented when the hierarchy goes back a level.

Listing 10-15. The level Pseudocolumn

  1  select lpad(' ', level*2-1,' ') || e.last_name last_name, level
  2  from hr.employees e
  3  connect by prior e.employee_id = e.manager_id
  4  start with e.manager_id is null
  5  order siblings by e.last_name;

LAST_NAME                      LEVEL
------------------------- ----------
 King                              1
   Cambrault                       2
     Bates                         3
     Bloom                         3
     Fox                           3
     Kumar                         3
     Ozer                          3
     Smith                         3
   De Haan                         2
...

107 rows selected.

This can also be accomplished in RSF (see Listing 10-16), although it does require a little effort on your part. It may be somewhat surprising to see that this actually works. The value for lvl is never decremented, only incremented. Recall that the default search method for RSF is BREADTH FIRST. It is apparent that Oracle is processing the rows in sibling order, with the top of the hierarchy (King), followed by the child rows at the next level, continuing until the last row is reached. This behavior allows you to solve some other problems as well.

Listing 10-16. Create a lvl Column

  1  with emp_recurse(employee_id,manager_id,last_name, lvl) as (
  2   select e.employee_id, null, e.last_name, 1 as lvl
  3   from hr.employees e
  4   where e.manager_id is null
  5   union all
  6   select e1.employee_id, e1.manager_id, e1.last_name, e2.lvl + 1 as lvl
  7   from hr.employees e1
  8   join emp_recurse e2 on e2.employee_id= e1.manager_id
  9  )
 10  search depth first by last_name set last_name_order
 11  select lpad(' ', r.lvl*2-1,' ') || r.last_name last_name, r.lvl
 12  from emp_recurse r
 13  order by last_name_order;

LAST_NAME                        LVL
------------------------- ----------
 King                              1
   Cambrault                       2
     Bates                         3
     Bloom                         3
     Fox                           3
     Kumar                         3
     Ozer                          3
     Smith                         3
   De Haan                         2
...
107 rows selected.

The SYS_CONNECT_BY_PATH Function

The SYS_CONNECT_BY_PATH function is used to return the values that comprise the hierarchy up to the current row. It’s best explained with an example, such as the one in Listing 10-17, In which the SYS_CONNECT_BY_PATH function is used to build a colon-delimited list of the hierarchy, complete from root to node.

Listing 10-17. SYS_CONNECT_BY_PATH

  1  select lpad(' ',2*(level-1)) || e.last_name last_name
  2   , sys_connect_by_path(last_name,':') path
  3  from hr.employees e
  4  start with e.manager_id is null
  5  connect by prior e.employee_id = e.manager_id
  6  order siblings by e.last_name;

LAST_NAME                 PATH
------------------------- -----------------------
King                      :King
  Cambrault               :King:Cambrault
    Bates                 :King:Cambrault:Bates
    Bloom                 :King:Cambrault:Bloom
    Fox                   :King:Cambrault:Fox
    Kumar                 :King:Cambrault:Kumar
    Ozer                  :King:Cambrault:Ozer
    Smith                 :King:Cambrault:Smith
  De Haan                 :King:De Haan
...
107 rows selected.

Although the SYS_CONNECT_BY_PATH function is not available to RSF queries, this function can be duplicated using much the same method that was used to reproduce the level pseudocolumn. Rather than incrementing a counter, however, you now append to a string value. Listing 10-18 shows how this is done.

Listing 10-18. Build Your Own SYS_CONNECT_BY_PATH

  1  with emp_recurse(employee_id,manager_id,last_name,lvl, path) as (
  2   select e.employee_id, null, e.last_name
  3      , 1 as lvl
  4      ,':' || to_char(e.last_name) as path
  5   from hr.employees e
  6   where e.manager_id is null
  7   union all
  8   select e1.employee_id, e1.manager_id, e1.last_name
  9      ,e2.lvl + 1 as lvl
 10      ,e2.path || ':' || e1.last_name as path
 11   from hr.employees e1
 12   join emp_recurse e2 on e2.employee_id= e1.manager_id
 13  )
 14  search depth first by last_name set last_name_order
 15  select lpad(' ', r.lvl*2-1,' ') || r.last_name last_name, r.path
 16  from emp_recurse r
 17  order by last_name_order;

LAST_NAME                 PATH
------------------------- ----------------------
 King                     :King
   Cambrault              :King:Cambrault
     Bates                :King:Cambrault:Bates
     Bloom                :King:Cambrault:Bloom
     Fox                  :King:Cambrault:Fox
     Kumar                :King:Cambrault:Kumar
     Ozer                 :King:Cambrault:Ozer
     Smith                :King:Cambrault:Smith
   De Haan                :King:De Haan
...
107 rows selected.

The output of the SYS_CONNECT_BY_PATH as seen in Listing 10-17 is duplicated by the roll-your-own version using RSF in Listing 10-18. Take another look at this SQL. You may notice that there’s something here that SYS_CONNECT_BY_PATH cannot do. Consider, for instance, if you want the hierarchy to be displayed as a comma-delimited list. This is accomplished simply enough by changing “:” to “,”. The problem with SYS_CONNECT_BY_PATH is that the first character in the output is always a comma.

Using the RSF method, you can simply remove the delimiter in the anchor member and then change the delimiter in the recursive member to a comma. This is shown in Listing 10-19, along with a sample of the output. Should you feel inclined, the first character of the path could remain a colon, with the values delimited by commas.

Listing 10-19. Comma-Delimited PATH

  1  with emp_recurse(employee_id,manager_id,last_name,lvl,path) as (
  2   select e.employee_id, null, e.last_name
  3      , 1 as lvl
  4      ,e.last_name as path
  5   from hr.employees e
  6   where e.manager_id is null
  7   union all
  8   select e1.employee_id, e1.manager_id, e1.last_name
  9      ,e2.lvl + 1 as lvl
 10      ,e2.path || ',' || e1.last_name as path
 11   from hr.employees e1
 12   join emp_recurse e2 on e2.employee_id= e1.manager_id
 13  )
 14  search depth first by last_name set last_name_order
 15  select lpad(' ', r.lvl*2-1,' ') || r.last_name last_name, r.path
 16  from emp_recurse r
 17  order by last_name_order;

LAST_NAME                 PATH
------------------------- ----------------------
 King                     King
   Cambrault              King,Cambrault
     Bates                King,Cambrault,Bates
     Bloom                King,Cambrault,Bloom
     Fox                  King,Cambrault,Fox
     Kumar                King,Cambrault,Kumar
     Ozer                 King,Cambrault,Ozer
     Smith                King,Cambrault,Smith
   De Haan                King,De Haan
...
107 rows selected.

The CONNECT_BY_ROOT Operator

The CONNECT_BY_ROOT operator enhances the CONNECT BY syntax by returning the root node of the current row. In the example of the hr.employees table, all rows return King as the root. You can change it up a bit, however, by modifying the row temporarily for Neena Kochhar, putting her on the same level as the company president, Steven King. Then, the hierarchy can be shown for Kochhar by using the CONNECT_BY_ROOT operator to restrict the output. The results are shown in Listing 10-20.

Listing 10-20. CONNECT_BY_ROOT

  1* update hr.employees set manager_id= null where last_name ='Kochhar';
  1 row updated.

  1  select /*+ inline gather_plan_statistics */
  2   level
  3   , lpad(' ',2*(level-1)) || last_name last_name
  4   , first_name
  5   , CONNECT_BY_ROOT last_name as root
  6   , sys_connect_by_path(last_name,':') path
  7  from hr.employees
  8  where connect_by_root last_name = 'Kochhar'
  9  connect by prior employee_id = manager_id
 10  start with manager_id is null;

LEVEL LAST_NAME    FIRST_NAME   ROOT         PATH
----- ------------ ------------ ------------ ------------------------------
    1 Kochhar      Neena        Kochhar      :Kochhar
    2   Greenberg  Nancy        Kochhar      :Kochhar:Greenberg
    3     Faviet   Daniel       Kochhar      :Kochhar:Greenberg:Faviet
    3     Chen     John         Kochhar      :Kochhar:Greenberg:Chen
    3     Sciarra  Ismael       Kochhar      :Kochhar:Greenberg:Sciarra
    3     Urman    Jose Manuel  Kochhar      :Kochhar:Greenberg:Urman
    3     Popp     Luis         Kochhar      :Kochhar:Greenberg:Popp
    2   Whalen     Jennifer     Kochhar      :Kochhar:Whalen
    2   Mavris     Susan        Kochhar      :Kochhar:Mavris
    2   Baer       Hermann      Kochhar      :Kochhar:Baer
    2   Higgins    Shelley      Kochhar      :Kochhar:Higgins
    3     Gietz    William      Kochhar      :Kochhar:Higgins:Gietz

12 rows selected.
1 rollback;

This functionality can be duplicated in RSF, but it does require a little more SQL. The code in Listing 10-21 is based on the SYS_CONNECT_BY_PATH example, with some minor changes and additions. The delimiting character is now prepended and appended to the value for PATH in the anchor member. In the recursive member, the delimiter is appended to the PATH, whereas previously it was prepended to the last_name column. Doing so ensures that the root records always have a delimiting character at the end of the value, allowing the SUBSTR() function in the emps subquery to parse the root correctly from the string when the path comes from the anchor member only, such as the rows for King and Kochar. This is probably better explained by examining the output from the query.

Listing 10-21. Duplicate CONNECT_BY_ROOT

1 update hr.employees set manager_id= null where last_name ='Kochhar';
1 row updated.

  1  with emp_recurse(employee_id,manager_id,last_name,lvl,path) as (
  2   select /*+ gather_plan_statistics */
  3      e.employee_id
  4      , null as manager_id
  5      , e.last_name
  6      , 1 as lvl
  7      , ':' || e.last_name || ':' as path
  8   from hr.employees  e
  9   where e.manager_id is null
 10   union all
 11   select
 12      e.employee_id
 13      , e.manager_id
 14      , e.last_name
 15      , er.lvl + 1 as lvl
 16      , er .path || e.last_name  || ':' as path
 17   from hr.employees e
 18   join emp_recurse er on er.employee_id = e.manager_id
 19   join hr.employees e2 on e2.employee_id = e.manager_id
 20  )
 21  search depth first by last_name set order1 ,
 22  emps as (
 23   select lvl
 24      , last_name
 25      , path
 26      , substr(path,2,instr(path,':',2)-2) root
 27   from emp_recurse
 28  )
 29  select
 30   lvl
 31   , lpad(' ',2*(lvl-1)) || last_name last_name
 32   , root
 33   , path
 34  from emps
 35  where root = 'Kochhar';

       LVL LAST_NAME       ROOT            PATH
---------- --------------- --------------- ------------------------------
         1 Kochhar         Kochhar         :Kochhar:
         2   Baer          Kochhar         :Kochhar:Baer:
         2   Greenberg     Kochhar         :Kochhar:Greenberg:
         3     Chen        Kochhar         :Kochhar:Greenberg:Chen:
         3     Faviet      Kochhar         :Kochhar:Greenberg:Faviet:
         3     Popp        Kochhar         :Kochhar:Greenberg:Popp:
         3     Sciarra     Kochhar         :Kochhar:Greenberg:Sciarra:
         3     Urman       Kochhar         :Kochhar:Greenberg:Urman:
         2   Higgins       Kochhar         :Kochhar:Higgins:
         3     Gietz       Kochhar         :Kochhar:Higgins:Gietz:
         2   Mavris        Kochhar         :Kochhar:Mavris:
         2   Whalen        Kochhar         :Kochhar:Whalen:

12 rows selected.
1* rollback;

This is not a perfect duplication of the CONNECT_BY_ROOT operator. In this case, it does exactly what is needed. The built-in operator, however, does allow some flexibility in specifying the level and returning the root at that level. The example given needs more modification to match this ability; however, you may find that this example works well for most cases.

The connect_by_iscycle Pseudocolumn and NOCYCLE Parameter

The connect_by_iscycle pseudocolumn makes it easy to detect loops in a hierarchy. This is illustrated by the SQL in Listing 10-22. There, an intentional error has been introduced by updating the hr.employees row for the president, assigning Smith as King’s manager, which causes an error in the CONNECT BY.

Listing 10-22. Cycle Error in CONNECT BY

1 update hr.employees set manager_id = 171 where employee_id = 100;
1 row updated.
Elapsed: 00:00:00.02

  1  select lpad(' ',2*(level-1)) || last_name last_name
  2   ,first_name, employee_id, level
  3  from hr.employees
  4  start with employee_id = 100
  5* connect by prior employee_id = manager_id

LAST_NAME                 FIRST_NAME   EMPLOYEE_ID LEVEL
------------------------- ------------ ----------- -----
King                      Steven               100     1
  Kochhar                 Neena                101     2
    Greenberg             Nancy                108     3
...
    Smith                 William              171     3
      King                Steven               100     4
...
ERROR:
ORA-01436: CONNECT BY loop in user data

187 rows selected.
  1 rollback;

In the output, Smith appears as the manager of King, which you know to be incorrect. But, if you didn’t already know what the problem is, how would you find it? This is where the NOCYCLE parameter and CONNECT_BY_ISCYCLE operator come into play. These are used to detect a cycle in the hierarchy. The NOCYCLE parameter prevents the ORA-1436 error from occurring, allowing all rows to be output. The CONNECT_BY_ISCYCLE operator allows you to find the row causing the error easily.

As seen in Listing 10-23, the value of CONNECT_BY_ISCYCLE is 1, indicating that the row for Smith is somehow causing the error. The next query looks up the data for Smith, and all appears normal. Last, you query the table again, this time using Smith’s employee ID, to find all employees that he manages. The error becomes apparent—the president of the company does not have a manager—so the solution is to set the manager_id back to NULL for this row.

Listing 10-23. Detect the Cycle with CONNECT_BY_ISCYCLE

1* update hr.employees set manager_id = 171 where employee_id = 100
1 row updated.

  1  select lpad(' ',2*(level-1)) || last_name last_name
  2   ,first_name, employee_id, level
  3   , connect_by_iscycle
  4  from hr.employees
  5  start with employee_id = 100
  6  connect by nocycle prior employee_id = manager_id;

LAST_NAME                 FIRST_NAME   EMPLOYEE_ID LEVEL CONNECT_BY_ISCYCLE
------------------------- ------------ ----------- ----- ------------------
King                      Steven               100     1                  0
  Kochhar                 Neena                101     2                  0
...
    Smith                 William              171     3                  1
...
107 rows selected.

Elapsed: 00:00:00.03
  1  select last_name, first_name, employee_id, manager_id
  2  from hr.employees
  3* where employee_id = 171

LAST_NAME                 FIRST_NAME   EMPLOYEE_ID MANAGER_ID
------------------------- ------------ ----------- ----------
Smith                     William              171        148

  1  select last_name, first_name, employee_id, manager_id
  2  from hr.employees
  3* where manager_id = 171

LAST_NAME                 FIRST_NAME   EMPLOYEE_ID MANAGER_ID
------------------------- ------------ ----------- ----------
King                      Steven               100        171

  1 rollback;

So, how do you do this with RSF? It’s really quite simple, because Oracle has provided the built-in CYCLE clause that makes short work of detecting cycles in recursive queries. It is somewhat more robust than the connect_by_iscycle pseudocolumn in that it lets you determine which values are used to indicate a cycle, as well as provides a column name at the same time. Listing 10-24 uses the same data error in Listing 10-23, but this time it uses a recursive subfactored query.

Listing 10-24. Detect Cycles in Recursive Queries

1 update hr.employees set manager_id = 171 where employee_id = 100;
1 row updated.
Elapsed: 00:00:00.00

  1  with emp(employee_id,manager_id,last_name,first_name,lvl) as (
  2   select e.employee_id
  3      , null as manager_id
  4      , e.last_name
  5      , e.first_name
  6      , 1 as lvl
  7   from hr.employees  e
  8   where e.employee_id =100
  9   union all
 10   select e.employee_id
 11      , e.manager_id
 12      , e.last_name
 13      , e.first_name
 14      , emp.lvl + 1 as lvl
 15   from hr.employees e
 16   join emp on emp.employee_id = e.manager_id
 17  )
 18  search depth first by last_name set order1
 19  CYCLE employee_id SET is_cycle TO '1' DEFAULT '0'
 20  select lpad(' ',2*(lvl-1)) || last_name last_name
 21   , first_name
 22   , employee_id
 23   , lvl
 24   , is_cycle
 25  from emp
 26  order by order1;

LAST_NAME                 FIRST_NAME   EMPLOYEE_ID        LVL I
------------------------- ------------ ----------- ---------- -
King                      Steven               100          1 0
  Cambrault               Gerald               148          2 0
    Bates                 Elizabeth            172          3 0
    Bloom                 Harrison             169          3 0
    Fox                   Tayler               170          3 0
    Kumar                 Sundita              173          3 0
    Ozer                  Lisa                 168          3 0
    Smith                 William              171          3 0
      King                Steven               100          4 1
...

108 rows selected.

Elapsed: 00:00:00.04

  1  select last_name, first_name, employee_id, manager_id
  2  from hr.employees
  3  where employee_id = 100;

LAST_NAME                 FIRST_NAME   EMPLOYEE_ID MANAGER_ID
------------------------- ------------ ----------- ----------
King                      Steven               100        171
1 row selected.

 1  rollback;

Notice how the CYCLE clause lets you set the two possible values for the is_cycle column to 0 or 1. Only single-value characters are allowed here. The name of the column is also user defined and is set to is_cycle in this example. Examining the output, it appears that the CYCLE clause in RSF does a somewhat better job of identifying the row that causes the data cycle. The row with the error is identified clearly as that of King, so you can query that row and determine the error immediately.

The connect_by_isleaf Pseudocolumn

Last, there is the connect_by_isleaf pseudocolumn, which permits easy identification of leaf ⁷ nodes in hierarchical data. You can see that leaf nodes are identified in the output of Listing 10-25 when the value of connect_by_isleaf is 1.

Listing 10-25. CONNECT_BY_ISLEAF

  1  select lpad(' ',2*(level-1)) || e.last_name last_name, connect_by_isleaf
  2  from hr.employees e
  3  start with e.manager_id is null
  4  connect by prior e.employee_id = e.manager_id
  5  order siblings by e.last_name;
LAST_NAME                 CONNECT_BY_ISLEAF
------------------------- -----------------
King                                      0
  Cambrault                               0
    Bates                                 1
    Bloom                                 1
    Fox                                   1
    Kumar                                 1
    Ozer                                  1
    Smith                                 1
  De Haan                                 0
    Hunold                                0
      Austin                              1
      Ernst                               1
      Lorentz                             1
      Pataballa                           1
...

107 rows selected.

--------------------------------------------------------------------------
| Id  | Operation                                  | Name       | E-Rows |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT                           |            |        |
|*  1 |  CONNECT BY NO FILTERING WITH START-WITH   |            |        |
|   2 |   TABLE ACCESS FULL                        | EMPLOYEES  |    107 |
-------------------------------------------------------------------------

Duplicating this in RSF is somewhat of a challenge. There are probably many methods that can be used to accomplish this, with some limitations. This is one of those problems that may require a little extra thought to solve, where “solve” means you get the output you desire, but you won’t necessarily duplicate the functionality of CONNECT_BY_ISLEAF completely.

In this case, you want to identify the leaf nodes in the employee hierarchy. By definition, none of the leaf nodes can be managers, so one way to accomplish this is to determine which rows are those of managers. All rows that are not those of managers are then leaf nodes.

Listing 10-26 uses this approach to solve the problem. The cost of solving it is two more extra scans of the hr.employees table and three index scans, but if RSF must be used, this is one way to get the desired results. The leaves subquery is used find the leaf nodes. This is then left outer joined to the employees table, and the value (or lack of a value) of the leaves.employee_id column indicates whether the current row is a leaf.

Listing 10-26. Finding Leaf Nodes in a Recursive Query

  1  with leaves as (
  2   select employee_id
  3   from hr.employees
  4   where employee_id not in (
  5      select manager_id
  6      from hr.employees
  7      where manager_id is not null
  8   )
  9  ),
 10  emp(manager_id,employee_id,last_name,lvl,isleaf) as (
 11   select e.manager_id, e.employee_id, e.last_name, 1 as lvl, 0 as isleaf
 12   from hr.employees e
 13   where e.manager_id is null
 14   union all
 15   select e.manager_id,nvl(e.employee_id,null) employee_id,e.last_name,emp.lvl + 1 as lvl
 16      , decode(l.employee_id,null,0,1) isleaf
 17   from hr.employees e
 18   join emp on emp.employee_id = e.manager_id
 19   left outer join leaves l on l.employee_id = e.employee_id
 20  )
 21  search depth first by last_name set order1
 22  select lpad(' ',2*(lvl-1)) || last_name last_name, isleaf
 23  from emp;

LAST_NAME                     ISLEAF
------------------------- ----------
King                               0
  Cambrault                        0
    Bates                          1
    Bloom                          1
    Fox                            1
    Kumar                          1
    Ozer                           1
    Smith                          1
  De Haan                          0
    Hunold                         0
      Austin                       1
      Ernst                        1
      Lorentz                      1
      Pataballa                    1
...
107 rows selected.

---------------------------------------------------------------------------
| Id  | Operation                               | Name           | E-Rows |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT                        |                |        |
|   1 |  VIEW                                   |                |      7 |
|   2 |   UNION ALL (RECURSIVE WITH) DEPTH FIRST|                |        |
|*  3 |    TABLE ACCESS FULL                    | EMPLOYEES      |      1 |
|   4 |    NESTED LOOPS OUTER                   |                |      6 |
|   5 |     NESTED LOOPS                        |                |      6 |
|   6 |      RECURSIVE WITH PUMP                |                |        |
|   7 |      TABLE ACCESS BY INDEX ROWID        | EMPLOYEES      |      6 |
|*  8 |       INDEX RANGE SCAN                  | EMP_MANAGER_IX |      6 |
|   9 |     VIEW PUSHED PREDICATE               |                |      1 |
|  10 |      NESTED LOOPS ANTI                  |                |      1 |
|* 11 |       INDEX UNIQUE SCAN                 | EMP_EMP_ID_PK  |      1 |
|* 12 |       INDEX RANGE SCAN                  | EMP_MANAGER_IX |      6 |
---------------------------------------------------------------------------

Another way to accomplish this is presented in Listing 10-27, in which the analytic function LEAD() uses the value of the lvl column to determine whether a row is a leaf node. Although it does avoid two of the index scans that were seen in Listing 10-26, determining correctly whether a row is a leaf node is dependent on the order of the output, as seen in line 16. The LEAD() function depends on the value of the last_name_order column that is set in the SEARCH clause.

Listing 10-27. Using LEAD() to Find Leaf Nodes

  1  with emp(manager_id,employee_id,last_name,lvl) as (
  2   select e.manager_id, e.employee_id, e.last_name, 1 as lvl
  3   from hr.employees e
  4   where e.manager_id is null
  5   union all
  6   select e.manager_id, nvl(e.employee_id,null) employee_id
  7      ,  e.last_name, emp.lvl + 1 as lvl
  8   from hr.employees e
  9   join emp on emp.employee_id = e.manager_id
 10  )
 11  search depth first by last_name set last_name_order
 12  select lpad(' ',2*(lvl-1)) || last_name last_name,
 13   lvl,
 14   lead(lvl) over (order by last_name_order) leadlvlorder,
 15   case
 16   when ( lvl - lead(lvl) over (order by last_name_order) ) < 0
 17   then 0
 18   else 1
 19   end isleaf
 20  from emp;

LAST_NAME                        LVL LEADLVLORDER     ISLEAF
------------------------- ---------- ------------ ----------
King                               1            2          0
  Cambrault                        2            3          0
    Bates                          3            3          1
    Bloom                          3            3          1
    Fox                            3            3          1
    Kumar                          3            3          1
    Ozer                           3            3          1
    Smith                          3            2          1
  De Haan                          2            3          0
    Hunold                         3            4          0
      Austin                       4            4          1
      Ernst                        4            4          1
      Lorentz                      4            4          1
      Pataballa                    4            2          1
...
107 rows selected.

----------------------------------------------------------------------------
| Id  | Operation                                | Name           | E-Rows |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |                |        |
|   1 |  WINDOW BUFFER                           |                |      7 |
|   2 |   VIEW                                   |                |      7 |
|   3 |    UNION ALL (RECURSIVE WITH) DEPTH FIRST|                |        |
|*  4 |     TABLE ACCESS FULL                    | EMPLOYEES      |      1 |
|   5 |     NESTED LOOPS                         |                |        |
|   6 |      NESTED LOOPS                        |                |      6 |
|   7 |       RECURSIVE WITH PUMP                |                |        |
|*  8 |       INDEX RANGE SCAN                   | EMP_MANAGER_IX |      6 |
|   9 |      TABLE ACCESS BY INDEX ROWID         | EMPLOYEES      |      6 |
----------------------------------------------------------------------------

What might happen if the SEARCH clause is changed from DEPTH FIRST to BREADTH FIRST? The results are shown in Listing 10-28. The use of the LEAD() function appears, at first, to be an elegant solution, but it is somewhat fragile in its dependency on the order of the data. The example in Listing 10-26 works regardless of the SEARCH clause parameters. It is readily apparent that the output in Listing 10-28 is incorrect.

Listing 10-28. LEAD() with BREADTH FIRST

  1  with emp(manager_id,employee_id,last_name,lvl) as (
  2   select e.manager_id, e.employee_id, e.last_name, 1 as lvl
  3   from hr.employees e
  4   where e.manager_id is null
  5   union all
  6   select e.manager_id, nvl(e.employee_id,null) employee_id
  7      ,  e.last_name, emp.lvl + 1 as lvl
  8   from hr.employees e
  9   join emp on emp.employee_id = e.manager_id
 10  )
 11  search breadth first by last_name set last_name_order
 12  select lpad(' ',2*(lvl-1)) || last_name last_name,
 13   lvl,
 14   lead(lvl) over (order by last_name_order) leadlvlorder,
 15   case
 16   when ( lvl - lead(lvl) over (order by last_name_order) ) < 0
 17   then 0
 18   else 1
 19   end isleaf
 20  from emp;
LAST_NAME                        LVL LEADLVLORDER     ISLEAF
------------------------- ---------- ------------ ----------
King                               1            2          0
  Cambrault                        2            2          1
  De Haan                          2            2          1
  Errazuriz                        2            2          1
  Fripp                            2            2          1
  Hartstein                        2            2          1
  Kaufling                         2            2          1

Summary

Although the functionality of CONNECT BY can be duplicated for most practical purposes in recursive subfactored queries, the question is, should you do so? In many cases, the CONNECT BY syntax is simpler to use, although the syntax does take some getting used to. Doing the same things in RSF requires quite a bit more SQL in most cases. In addition, CONNECT BY may produce better execution plans than RSF, especially for relatively simple queries. Keep in mind, however, that RSF is a new feature, and will likely improve in later versions of Oracle.

In addition, there may be good reasons not to use CONNECT BY. Perhaps you need to maintain ANSI compatibility in your application. Or perhaps the ability to write hierarchical queries that work in other databases that support recursive common table expressions simplifies the code for an application that runs on databases from different vendors. In this circumstance, RSF is quite useful. Whatever the need for hierarchical queries, with a little ingenuity you can write suitable queries on hierarchical data using recursive subfactored queries, and they will be capable of doing everything that can be done currently with CONNECT BY.

¹ Although well known in the Oracle community for some time now, the INLINE and MATERIALIZE hints remain undocumented by Oracle.

² If you run these examples on different versions of Oracle, the output may appear differently because the test data sometimes change with versions of Oracle.

³ The MATERIALIZE hint was used to ensure that the example works as expected, given that you may be testing on a different version or patch level of Oracle. On the test system I used, this was the default action by Oracle.

⁴ Initial executions are run after first flushing shared_pool and buffer_cache.

⁵ The actual table name was sys_temp_0fd9d66a2_453290, but was shortened in the listing for formatting purposes.

⁶CONNECT BY was first available in Oracle version 2 or, in others words, from the very beginning.

⁷ A leaf node is a node in the hierarchical tree that has no children.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for CHAPTER 10: Subquery Factoring

Create new playlist

Sign In

Sign Up

Table of Contents for
CHAPTER 10: Subquery Factoring