You may not be familiar with the term subquery factoring. Prior to the release of Oracle 11gR2, the official Oracle documentation barely mentions it, providing just a brief synopsis of its use, a couple of restrictions, and a single example. If I instead refer to the WITH clause of the SELECT statement, you probably know immediately what I mean, because these terms are more recognizable. Both terms are used in this chapter.
In Oracle 11gR2 (version 11.2), the WITH clause was enhanced with the ability to recurse; that is, the factored subquery is allowed to call itself within some limitation. The value of this may not be readily apparent. If you have used the CONNECT BY clause to create hierarchical queries, you can appreciate that recursive subqueries allow the same functionality to be implemented in an ANSI standard format.
If the term subquery factoring is not known to you, perhaps you have heard of the ANSI standard term common table expression (commonly called CTE). Common table expressions were first specified in the 1999 ANSI SQL Standard. For some reason, Oracle has chosen to obfuscate this name. Other database vendors refer to common table expressions, so perhaps Oracle chose subquery factoring just to be different.
Standard Usage
One of the most useful features of the WITH clause when it was first introduced was to clean up complex SQL queries. When a large number of tables and columns are involved in a query, it can become difficult to follow the flow of data through the query. Via the use of subquery factoring, a query can be made more understandable by moving some of the complexity away from the main body of the query.
The query in Listing 10-1 generates a cross-tab report using the PIVOT operator. The formatting helps make the SQL somewhat readable, but there is quite a bit going on here. The innermost query is creating a set of aggregates on key sales columns, whereas the next most outer query simply provides column names that are presented to the PIVOT operator, where the final values of sales by channel and quarter for each product are generated.
Listing 10-1. Cross-tab without Subquery Factoring
select *
from (
select /*+ gather_plan_statistics */
product
, channel
, quarter
, country
, quantity_sold
from
(
select
prod_name product
, country_name country
, channel_id channel
, substr(calendar_quarter_desc, 6,2) quarter
, sum(amount_sold) amount_sold
, sum(quantity_sold) quantity_sold
from
sh.sales
join sh.times on times.time_id = sales.time_id
join sh.customers on customers.cust_id = sales.cust_id
join sh.countries on countries.country_id = customers.country_id
join sh.products on products.prod_id = sales.prod_id
group by
prod_name
, country_name
, channel_id
, substr(calendar_quarter_desc, 6, 2)
)
) PIVOT (
sum(quantity_sold)
FOR (channel, quarter) IN
(
(5, '02') AS CATALOG_Q2,
(4, '01') AS INTERNET_Q1,
(4, '04') AS INTERNET_Q4,
(2, '02') AS PARTNERS_Q2,
(9, '03') AS TELE_Q3
)
)
order by product, country;
Now let’s use the WITH clause to break the query into byte-size chunks that are easier to comprehend. The SQL has been rewritten in Listing 10-2 using the WITH clause to create three subfactored queries: sales_countries, top_sales, and sales_rpt. Notice that both the top_sales and sales_rpt subqueries refer to other subqueries by name, as if they are a table or a view. By choosing names that make the intent of each subquery easy to follow, the readability of the SQL is improved. For instance, the subquery name sales_countries refers to the countries in which the sales took place, top_sales collects the sales data, and the sales_rpt subquery aggregates the data. The results of the sales_rpt subquery are used in the main query, which answers the question “What is the breakdown of sales by product and country per quarter?” If you were not told the intent of the SQL in Listing 10-1, it would take some time to discern its purpose; on the other hand, the structure of the SQL in Listing 10-2 with subfactored queries makes it easier to understand the intent of the code.
In addition, the statements associated directly with the PIVOT operator are in the same section of the SQL statement at the bottom, further enhancing readability.
Listing 10-2. Cross-tab with Subquery Factoring
with
sales_countries as (
select /*+ gather_plan_statistics */
cu.cust_id
, co.country_name
from sh.countries co, sh.customers cu
where cu.country_id = co.country_id
),
top_sales as (
select
p.prod_name
, sc.country_name
, s.channel_id
, t.calendar_quarter_desc
, s.amount_sold
, s.quantity_sold
from
sh.sales s
join sh.times t on t.time_id = s.time_id
join sh.customers c on c.cust_id = s.cust_id
join sales_countries sc on sc.cust_id = c.cust_id
join sh.products p on p.prod_id = s.prod_id
),
sales_rpt as (
select
prod_name product
, country_name country
, channel_id channel
, substr(calendar_quarter_desc, 6,2) quarter
, sum(amount_sold) amount_sold
, sum(quantity_sold) quantity_sold
from top_sales
group by
prod_name
, country_name
, channel_id
, substr(calendar_quarter_desc, 6, 2)
)
select * from
(
select product, channel, quarter, country, quantity_sold
from sales_rpt
) pivot (
sum(quantity_sold)
for (channel, quarter) in
(
(5, '02') as catalog_q2,
(4, '01') as internet_q1,
(4, '04') as internet_q4,
(2, '02') as partners_q2,
(9, '03') as tele_q3
)
)
order by product, country;
Although this is not an extremely complex SQL example, it does serve to illustrate the point of how the WITH clause can be used to make a statement more readable and easier to maintain. Large, complex queries can be made more understandable by using this technique.
Oracle 12c introduced the ability to declare and define PL/SQL functions and procedures using the WITH clause. Once defined, you can reference the PL/SQL functions in the query in which you specify this clause (including any of its subqueries). Listing 10-3 walks through a simple example of how to create and use this new feature.
Listing 10-3. WITH Function Clause
SQL>WITH
2 function calc_markup(p_markup number, p_price number) return number
3 is
4 begin
5 return p_markup*p_price;
6 end;
7 select prod_name,
8 prod_list_price cur_price,
9 calc_markup(.05,prod_list_price) mup5,
10 round(prod_list_price + calc_markup(.10,prod_list_price),2) new_price
11 from sh.products;
12 /
PROD_NAME CUR_PRICE MUP5 NEW_PRICE
-------------------------------------------------- --------------- --------------- ---------------
5MP Telephoto Digital Camera 899.99 44.9995 989.99
17" LCD w/built-in HDTV Tuner 999.99 49.9995 1099.99
Envoy 256MB - 40GB 999.99 49.9995 1099.99
Y Box 299.99 14.9995 329.99
Mini DV Camcorder with 3.5" Swivel LCD 1099.99 54.9995 1209.99
...
Finding Fido 12.99 .6495 14.29
Adventures with Numbers 11.99 .5995 13.19
Extension Cable 7.99 .3995 8.79
Xtend Memory 20.99 1.0495 23.09
72 rows selected.
Elapsed: 00:00:00.02
-- Using a WITH function within an outer query block
SQL>SELECT prod_name, cur_price, mup5, new_price
2 FROM (
3 WITH
4 function calc_markup(p_markup number, p_price number) return number
5 is
6 begin
7 return p_markup*p_price;
8 end;
9 select prod_name,
10 prod_list_price cur_price,
11 calc_markup(.05,prod_list_price) mup5,
12 round(prod_list_price + calc_markup(.10,prod_list_price),2) new_price
from sh.products
)
WHERE cur_price < 1000
AND new_price >= 1000 ;
17 /
WITH
*
ERROR at line 3:
ORA-32034: unsupported use of WITH clause
Elapsed: 00:00:00.01
SQL>
SQL>SELECT /*+ WITH_PLSQL */ prod_name, cur_price, mup5, new_price
2 FROM (
3 WITH
4 function calc_markup(p_markup number, p_price number) return number
5 is
6 begin
7 return p_markup*p_price;
8 end;
9 select prod_name,
10 prod_list_price cur_price,
11 calc_markup(.05,prod_list_price) mup5,
12 round(prod_list_price + calc_markup(.10,prod_list_price),2) new_price
13 from sh.products
)
15 WHERE cur_price < 1000
16 AND new_price >= 1000 ;
17 /
PROD_NAME CUR_PRICE MUP5 NEW_PRICE
-------------------------------------------------- --------------- --------------- ---------------
17" LCD w/built-in HDTV Tuner 999.99 49.9995 1099.99
Envoy 256MB - 40GB 999.99 49.9995 1099.99
2 rows selected.
Elapsed: 00:00:00.02
To run the statement, you must use the forward slash (/), which resembles how you would execute an anonymous PL/SQL block. This example shows only a single function declaration, but you may create as many functions and/or procedures as you wish. As also shown, one problem you may face occurs when you wish to place the WITH function inside an outer query block. When the WITH function is not the first declaration before the top-level query, you get an “ORA-32034: unsupported use of WITH clause” error. Fortunately, you can use the WITH_PLSQL hint to correct the problem. The hint, however, enables you to specify only the WITH function as shown; it is not an optimizer hint in that it does not have any bearing on the optimizer’s execution plan decisions.
Although this example is quite simple, imagine the possibilities of how you could use the WITH function clause. There are some expressions that are difficult to build without multiple steps, and you may have created user-defined PL/SQL functions that you use in your SQL statements. However, doing so requires that context switches must occur between the SQL and PL/SQL “engines” when your SQL statement is executed. These context switches incur overhead and can cause performance problems if used frequently. But now, with the advent of the WITH function clause in 12c, we have a construct built into the SQL language that can reduce or eliminate the need for user-defined PL/SQL functions and can allow you to do everything you need directly inline in your SQL statement.
When a SQL query is designed or modified to take advantage of subquery factoring, there are some not-so-subtle changes that may take place when the optimizer creates an execution plan for the query. The following quote comes from the Oracle documentation in the Oracle Database SQL Language Reference (http://www.oracle.com/technetwork/indexes/documentation/index.html) for SELECT, under the subquery_factoring_clause heading: “The WITH query_name clause lets you assign a name to a subquery block. You can then reference the subquery block multiple places in the query by specifying query_name. Oracle Database optimizes the query by treating the query name as either an inline view or as a temporary table.”
Notice that Oracle may treat the factored subquery as a temporary table. In queries in which a table is referenced more than once, this could be a distinct performance advantage, because Oracle can materialize result sets from the query, thereby avoiding performing some expensive database operations more than once. The caveat here is that it “could be” a distinct performance advantage. Keep in mind that materializing the result set requires creating a temporary table and inserting the rows into it. Doing so may be of value if the same result set is referred to many times, or it may be a big performance penalty.
Testing Execution Plans
When examining the execution plans for subfactored queries, it may not be readily apparent whether Oracle is choosing the best execution plan. It may seem that the use of the INLINE or MATERIALIZE 1 hint would result in better performing SQL. In some cases it may, but the use of these hints needs to be tested and considered in the context of overall application performance.
The need to test for optimum query performance can be illustrated by a report that management has requested. The report must show the distribution of customers by country and income level, showing only those countries and income levels that make up 1 percent or more of the entire customer base. A country and income level should also be reported if the number of customers in an income level bracket is greater than or equal to 25 percent of all customers in that income bracket. 2
The query in Listing 10-4 is the end result. 3 The cust factored subquery has been retained from previous queries. New are the subqueries in the HAVING clause; these are used to enforce the rules stipulated for the report.
Listing 10-4. WITH and MATERIALIZE
1 with cust as (
2 select /*+ materialize gather_plan_statistics */
3 b.cust_income_level,
4 a.country_name
5 from sh.customers b
6 join sh.countries a on a.country_id = b.country_id
7 )
8 select country_name, cust_income_level, count(country_name) country_cust_count
9 from cust c
10 having count(country_name) >
11 (
12 select count(*) * .01
13 from cust c2
14 )
15 or count(cust_income_level) >=
16 (
17 select median(income_level_count)
18 from (
19 select cust_income_level, count(*) *.25 income_level_count
20 from cust
21 group by cust_income_level
22 )
23 )
24 group by country_name, cust_income_level
25 order by 1,2;
CUSTOMER
COUNTRY INCOME LEVEL COUNT
------------------------------ -------------------- --------
France E: 90,000 - 109,999 585
France F: 110,000 - 129,999 651
...
United States of America H: 150,000 - 169,999 1857
United States of America I: 170,000 - 189,999 1395
...
35 rows selected.
Elapsed: 00:00:01.37
Statistics
-----------------------------------------------------
1854 recursive calls
307 db block gets
2791 consistent gets
1804 physical reads
672 redo size
4609 bytes sent via SQL*Net to client
700 bytes received via SQL*Net from client
18 SQL*Net roundtrips to/from client
38 sorts (memory)
0 sorts (disk)
35 rows processed
----------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows |
----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 35 |
| 1 | TEMP TABLE TRANSFORMATION | | 1 | | 35 |
| 2 | LOAD AS SELECT | | 1 | | 0 |
|* 3 | HASH JOIN | | 1 | 55500 | 55500 |
| 4 | TABLE ACCESS FULL | COUNTRIES | 1 | 23 | 23 |
| 5 | TABLE ACCESS FULL | CUSTOMERS | 1 | 55500 | 55500 |
|* 6 | FILTER | | 1 | | 35 |
| 7 | SORT GROUP BY | | 1 | 18 | 209 |
| 8 | VIEW | | 1 | 55500 | 55500 |
| 9 | TABLE ACCESS FULL | SYS_TEMP_0F| 1 | 55500 | 55500 |
| 10 | SORT AGGREGATE | | 1 | 1 | 1 |
| 11 | VIEW | | 1 | 55500 | 55500 |
| 12 | TABLE ACCESS FULL | SYS_TEMP_0F| 1 | 55500 | 55500 |
| 13 | SORT GROUP BY | | 1 | 1 | 1 |
| 14 | VIEW | | 1 | 11 | 13 |
| 15 | SORT GROUP BY | | 1 | 11 | 13 |
| 16 | VIEW | | 1 | 55500 | 55500 |
| 17 | TABLE ACCESS FULL | SYS_TEMP_0F| 1 | 55500 | 55500 |
----------------------------------------------------------------------------
When executing 4 the SQL, all appears as you expect, then you check the execution plan and find that the join of the customers and countries tables underwent a TEMP TABLE TRANSFORMATION, and the rest of the query was satisfied by using the temporary table sys_temp_of. 5 At this point, you might rightly wonder if the execution plan chosen was a reasonable one. This can be tested easily, thanks to the MATERIALIZED and INLINE hints.
By using the INLINE hint, Oracle can be instructed to satisfy all portions of the query without using a TEMP TABLE TRANSFORMATION. The results of doing so are shown in Listing 10-5. Only the relevant portion of the SQL that has changed is shown here (the rest is identical to that in Listing 10-4).
Listing 10-5. WITH and INLINE Hint
1 with cust as (
2 select /*+ inline gather_plan_statistics */
3 b.cust_income_level,
4 a.country_name
5 from sh.customers b
6 join sh.countries a on a.country_id = b.country_id
7 )
...
COUNTRY INCOME LEVEL COUNT
------------------------------ -------------------- --------
France E: 90,000 - 109,999 585
France F: 110,000 - 129,999 651
...
United States of America I: 170,000 - 189,999 1395
United States of America J: 190,000 - 249,999 1390
...
35 rows selected.
Elapsed: 00:00:00.62
Statistics
----------------------------------------------------------
1501 recursive calls
0 db block gets
4758 consistent gets
1486 physical reads
0 redo size
4609 bytes sent via SQL*Net to client
700 bytes received via SQL*Net from client
18 SQL*Net roundtrips to/from client
34 sorts (memory)
0 sorts (disk)
35 rows processed
--------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 35 |
|* 1 | FILTER | | 1 | | 35 |
| 2 | SORT GROUP BY | | 1 | 20 | 236 |
|* 3 | HASH JOIN | | 1 | 55500 | 55500 |
| 4 | TABLE ACCESS FULL | COUNTRIES | 1 | 23 | 23 |
| 5 | TABLE ACCESS FULL | CUSTOMERS | 1 | 55500 | 55500 |
| 6 | SORT AGGREGATE | | 1 | 1 | 1 |
|* 7 | HASH JOIN | | 1 | 55500 | 55500 |
| 8 | INDEX FULL SCAN | COUNTRIES_PK | 1 | 23 | 23 |
| 9 | TABLE ACCESS FULL | CUSTOMERS | 1 | 55500 | 55500 |
| 10 | SORT GROUP BY | | 1 | 1 | 1 |
| 11 | VIEW | | 1 | 12 | 13 |
| 12 | SORT GROUP BY | | 1 | 12 | 13 |
|* 13 | HASH JOIN | | 1 | 55500 | 55500 |
| 14 | INDEX FULL SCAN | COUNTRIES_PK | 1 | 23 | 23 |
| 15 | TABLE ACCESS FULL| CUSTOMERS | 1 | 55500 | 55500 |
--------------------------------------------------------------------------
From the execution plan in Listing 10-5, you can see that three full scans were performed on the customers table and one full scan on the countries table. Two of the executions against the cust subquery required only the information in the COUNTRIES_PK index, so a full scan of the index was performed rather than a full scan of the table, saving time and resources. In this example, the query that used the MATERIALIZE hint had 2791 consistent gets whereas the second had 4758. However, it’s also important to note there was another difference: The second query used less resources for redo generation: 0 vs. 672. When deciding which format works best for your environment, you must consider carefully the decreased CPU resulting from lower consistent gets against the higher redo required to materialize the temporary table.
What may be surprising is that the execution using full table scans took 0.75 second, which is about 100 percent faster than when a temporary table is used. Of course, the cache was cold for both queries, because both the buffer cache and the shared pool were flushed prior to running each query. From these simple tests you might feel safe using the INLINE hint
in this bit of code, convinced that it will perform well based on the amount of physical IO required in the first test outweighing the memory usage and the logical IO required for the second test. If you know for sure that the size of the datasets will not grow and that the system load will remain fairly constant, using the INLINE hint in this query is probably a good idea. The problem, however, is that data are rarely static; often, data grow to a larger size than originally intended when developing a query. In this event, retesting these queries is in order to determine whether the use of the INLINE hint is still valid.
Testing the Effects of Query Changes
Even as data do not remain static, SQL is not always static. Sometimes requirements change, so code must be modified. What if the requirements changed for the examples in Listings 10-4 and 10-5? Would minor changes invalidate the use of the hints embedded in the SQL? This is probably something worth investigating, so let’s do so.
Previously, we were reporting on income brackets when the count of them for any country was greater than or equal to 25 percent of the total global count for that bracket. Now we are asked to include an income bracket if it is among those income brackets the number of which is greater than the median, based on the number of customers per bracket. This SQL is seen in Listing 10-6. Notice that the INLINE hint has been left in. So now there’s an additional full table scan and index scan compared with the execution plan in Listing 10-5. Although the elapsed time has increased, it still seems reasonable.
Listing 10-6. Modified Income Search: INLINE
1 with cust as (
2 select /*+ inline gather_plan_statistics */
3 b.cust_income_level,
4 a.country_name
5 from sh.customers b
6 join sh.countries a on a.country_id = b.country_id
7 ),
8 median_income_set as (
9 select /*+ inline */ cust_income_level, count(*) income_level_count
10 from cust
11 group by cust_income_level
12 having count(cust_income_level) > (
13 select median(income_level_count) income_level_count
14 from (
15 select cust_income_level, count(*) income_level_count
16 from cust
17 group by cust_income_level
18 )
19 )
20 )
21 select country_name, cust_income_level, count(country_name) country_cust_count
22 from cust c
23 having count(country_name) >
24 (
25 select count(*) * .01
26 from cust c2
27 )
28 or cust_income_level in ( select mis.cust_income_level from median_income_set mis)
29 group by country_name, cust_income_level;
CUSTOMER
COUNTRY INCOME LEVEL COUNT
------------------------------ -------------------- --------
Argentina D: 70,000 - 89,999 25
Argentina E: 90,000 - 109,999 39
...
United States of America K: 250,000 - 299,999 1062
United States of America L: 300,000 and above 982
123 rows selected.
Elapsed: 00:00:01.26
Statistics
----------------------------------------------------------
1524 recursive calls
0 db block gets
23362 consistent gets
1486 physical reads
0 redo size
15570 bytes sent via SQL*Net to client
1195 bytes received via SQL*Net from client
63 SQL*Net roundtrips to/from client
3 sorts (memory)
0 sorts (disk)
123 rows processed
---------------------------------------------------------------------
| Id | Operation | Name |Starts|E-Rows|A-Rows |
---------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1| | 123 |
|* 1 | FILTER | | 1| | 123 |
| 2 | SORT GROUP BY | | 1| 20| 236 |
|* 3 | HASH JOIN | | 1| 55500| 55500 |
| 4 | TABLE ACCESS FULL | COUNTRIES | 1| 23| 23 |
| 5 | TABLE ACCESS FULL | CUSTOMERS | 1| 55500| 55500 |
| 6 | SORT AGGREGATE | | 1| 1| 1 |
|* 7 | HASH JOIN | | 1| 55500| 55500 |
| 8 | INDEX FULL SCAN | COUNTRIES_PK| 1| 23| 23 |
| 9 | TABLE ACCESS FULL | CUSTOMERS | 1| 55500| 55500 |
|* 10 | FILTER | | 13| | 6 |
| 11 | HASH GROUP BY | | 13| 1| 133 |
|* 12 | HASH JOIN | | 13| 55500| 721K|
| 13 | INDEX FULL SCAN | COUNTRIES_PK| 13| 23| 299 |
| 14 | TABLE ACCESS FULL | CUSTOMERS | 13| 55500| 721K|
| 15 | SORT GROUP BY | | 1| 1| 1 |
| 16 | VIEW | | 1| 12| 13 |
| 17 | SORT GROUP BY | | 1| 12| 13 |
|* 18 | HASH JOIN | | 1| 55500| 55500 |
| 19 | INDEX FULL SCAN | COUNTRIES_PK| 1| 23| 23 |
| 20 | TABLE ACCESS FULL| CUSTOMERS | 1| 55500| 55500 |
---------------------------------------------------------------------
Now that there’s an additional table scan and index scan, how do you think the performance of this query fares if temporary table transformations are allowed to take place? The results can be seen in Listing 10-7.
Listing 10-7. Modified Income Search: MATERIALIZE
1 with cust as (
2 select /*+ materialize gather_plan_statistics */
3 b.cust_income_level,
4 a.country_name
5 from sh.customers b
6 join sh.countries a on a.country_id = b.country_id
7 ),
...
CUSTOMER
COUNTRY INCOME LEVEL COUNT
------------------------------ -------------------- --------
Argentina D: 70,000 - 89,999 25
Argentina E: 90,000 - 109,999 39
...
United States of America K: 250,000 - 299,999 1062
United States of America L: 300,000 and above 982
123 rows selected.
Elapsed: 00:00:00.87
Statistics
----------------------------------------------------------
2001 recursive calls
324 db block gets
3221 consistent gets
1822 physical reads
1244 redo size
15570 bytes sent via SQL*Net to client
1195 bytes received via SQL*Net from client
63 SQL*Net roundtrips to/from client
38 sorts (memory)
0 sorts (disk)
123 rows processed
-------------------------------------------------------------------
| Id |Operation |Name |Starts|E-Rows|A-Rows |
-------------------------------------------------------------------
| 0|SELECT STATEMENT | | 1| | 123 |
| 1| TEMP TABLE TRANSFORMATION| | 1| | 123 |
| 2| LOAD AS SELECT | | 1| | 0 |
|* 3| HASH JOIN | | 1| 55500| 55500 |
| 4| TABLE ACCESS FULL |COUNTRIES | 1| 23| 23 |
| 5| TABLE ACCESS FULL |CUSTOMERS | 1| 55500| 55500 |
| 6| LOAD AS SELECT | | 1| | 0 |
|* 7| FILTER | | 1| | 6 |
| 8| HASH GROUP BY | | 1| 1| 13 |
| 9| VIEW | | 1| 55500| 55500 |
| 10| TABLE ACCESS FULL |SYS_TEMP_0F| 1| 55500| 55500 |
| 11| SORT GROUP BY | | 1| 1| 1 |
| 12| VIEW | | 1| 12| 13 |
| 13| SORT GROUP BY | | 1| 12| 13 |
| 14| VIEW | | 1| 55500| 55500 |
| 15| TABLE ACCESS FULL |SYS_TEMP_0F| 1| 55500| 55500 |
|* 16| FILTER | | 1| | 123 |
| 17| SORT GROUP BY | | 1| 20| 236 |
| 18| VIEW | | 1| 55500| 55500 |
| 19| TABLE ACCESS FULL |SYS_TEMP_0F| 1| 55500| 55500 |
| 20| SORT AGGREGATE | | 1| 1| 1 |
| 21| VIEW | | 1| 55500| 55500 |
| 22| TABLE ACCESS FULL |SYS_TEMP_0F| 1| 55500| 55500 |
|* 23| VIEW | | 13| 1| 6 |
| 24| TABLE ACCESS FULL |SYS_TEMP_0F| 13| 1| 63 |
-------------------------------------------------------------------
Because there’s that additional scan taking place in the modified version of the query, the overhead of logical IO becomes more apparent. It is significantly more efficient with this query to allow Oracle to perform table transformations, writing the results of the hash join to a temporary table on disk, where they can be reused throughout the query.
Seizing Other Optimization Opportunities
There are other opportunities when subquery factoring may be used to your advantage. If you are working on applications that were originally written several years ago, you may find that some of the SQL could use a bit of improvement based on the features offered by Oracle versions 9i and later. The query in Listing 10-8, for example, does exactly what it was asked to do, which is to find the average, minimum, and maximum costs for each product that was produced in the year 2000, with the costs calculated for each of the sale channels in which the product was sold. This SQL is not only difficult to read and hard to modify, but also it is somewhat inefficient.
Listing 10-8. Old SQL to Calculate Costs
1 select /*+ gather_plan_statistics */
2 substr(prod_name,1,30) prod_name
3 , channel_desc
4 , (
5 select avg(c2.unit_cost)
6 from sh.costs c2
7 where c2.prod_id = c.prod_id and c2.channel_id = c.channel_id
8 and c2.time_id between to_date('01/01/2000','mm/dd/yyyy')
9 and to_date('12/31/2000')
10 ) avg_cost
11 , (
12 select min(c2.unit_cost)
13 from sh.costs c2
14 where c2.prod_id = c.prod_id and c2.channel_id = c.channel_id
15 and c2.time_id between to_date('01/01/2000','mm/dd/yyyy')
16 and to_date('12/31/2000')
17 ) min_cost
18 , (
19 select max(c2.unit_cost)
20 from sh.costs c2
21 where c2.prod_id = c.prod_id and c2.channel_id = c.channel_id
22 and c2.time_id between to_date('01/01/2000','mm/dd/yyyy')
23 and to_date('12/31/2000')
24 ) max_cost
25 from (
26 select distinct pr.prod_id, pr.prod_name, ch.channel_id, ch.channel_desc
27 from sh.channels ch
28 , sh.products pr
29 , sh.costs co
30 where ch.channel_id = co.channel_id
31 and co.prod_id = pr.prod_id
32 and co.time_id between to_date('01/01/2000','mm/dd/yyyy')
33 and to_date('12/31/2000')
34 ) c
35 order by prod_name, channel_desc;
PRODUCT CHANNEL_DESC AVG COST MIN COST MAX COST
------------------------------ -------------------- ---------- ---------- ----------
1.44MB External 3.5" Diskette Direct Sales 8.36 7.43 9.17
1.44MB External 3.5" Diskette Internet 8.59 7.42 9.55
...
Y Box Internet 266.73 245.00 282.30
Y Box Partners 272.62 242.79 293.68
---------- ---------- ----------
sum 27,961.39 24,407.85 34,478.10
216 rows selected.
COLD CACHE Elapsed: 00:00:02.30
WARM CACHE Elapsed: 00:00:01.09
-----------------------------------------------------------------------------------
| Id | Operation |Name |Starts|E-Rows|A-Rows |
-----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1| | 216 |
| 1 | SORT AGGREGATE | | 216| 1| 216 |
|* 2 | FILTER | | 216| | 17373 |
| 3 | PARTITION RANGE ITERATOR | | 216| 96| 17373 |
|* 4 | TABLE ACCESS BY LOCAL INDEX ROWID|COSTS | 864| 96| 17373 |
| 5 | BITMAP CONVERSION TO ROWIDS | | 864| | 52119 |
| 6 | BITMAP AND | | 864| | 840 |
| 7 | BITMAP MERGE | | 864| | 864 |
|* 8 | BITMAP INDEX RANGE SCAN |COSTS_TIME_BIX| 864| | 79056 |
|* 9 | BITMAP INDEX SINGLE VALUE |COSTS_PROD_BIX| 864| | 840 |
| 10 | SORT AGGREGATE | | 216| 1| 216 |
|* 11 | FILTER | | 216| | 17373 |
| 12 | PARTITION RANGE ITERATOR | | 216| 96| 17373 |
|* 13 | TABLE ACCESS BY LOCAL INDEX ROWID|COSTS | 864| 96| 17373 |
| 14 | BITMAP CONVERSION TO ROWIDS | | 864| | 52119 |
| 15 | BITMAP AND | | 864| | 840 |
| 16 | BITMAP MERGE | | 864| | 864 |
|* 17 | BITMAP INDEX RANGE SCAN |COSTS_TIME_BIX| 864| | 79056 |
|* 18 | BITMAP INDEX SINGLE VALUE |COSTS_PROD_BIX| 864| | 840 |
| 19 | SORT AGGREGATE | | 216| 1| 216 |
|* 20 | FILTER | | 216| | 17373 |
| 21 | PARTITION RANGE ITERATOR | | 216| 96| 17373 |
|* 22 | TABLE ACCESS BY LOCAL INDEX ROWID|COSTS | 864| 96| 17373 |
| 23 | BITMAP CONVERSION TO ROWIDS | | 864| | 52119 |
| 24 | BITMAP AND | | 864| | 840 |
| 25 | BITMAP MERGE | | 864| | 864 |
|* 26 | BITMAP INDEX RANGE SCAN |COSTS_TIME_BIX| 864| | 79056 |
|* 27 | BITMAP INDEX SINGLE VALUE |COSTS_PROD_BIX| 864| | 840 |
| 28 | SORT ORDER BY | | 1| 20640| 216 |
| 29 | VIEW | | 1| 20640| 216 |
| 30 | HASH UNIQUE | | 1| 20640| 216 |
|* 31 | FILTER | | 1| | 17373 |
|* 32 | HASH JOIN | | 1| 20640| 17373 |
| 33 | TABLE ACCESS FULL |PRODUCTS | 1| 72| 72 |
|* 34 | HASH JOIN | | 1| 20640| 17373 |
| 35 | TABLE ACCESS FULL |CHANNELS | 1| 5| 5 |
| 36 | PARTITION RANGE ITERATOR | | 1| 20640| 17373 |
|* 37 | TABLE ACCESS FULL |COSTS | 4| 20640| 17373 |
-----------------------------------------------------------------------------------
Note In several of the following listings, you will see an elapsed time for COLD CACHE and WARM CACHE. The COLD CACHE time is the first execution of the statement immediately following a flush of both the buffer cache and the shared pool.
Examining the output of Listing 10-8, note that the elapsed execution time on a cold cache is 2.30 seconds, and 1.09 seconds on a warm cache. These times don’t seem all that bad at first; but, when you examine the execution plan, you find that this query can be improved on from a performance perspective as well as a readability perspective.
The starts column is telling. Each execution against the costs table is executed 864 times. This is because there are 216 rows produced by a join between channels, products, and costs. Also, the costs table is queried in four separate places for the same information. By using subquery factoring, not only can this SQL be cleaned up and made easier to read, but also it can be made more efficient.
As seen in Listing 10-9, you can start by putting the begin_date and end_date columns in a separate query bookends, leaving only one place that the values need to be set. The data for products are placed in the prodmaster subquery. Although this bit of the SQL worked fine as a subquery in the FROM clause, the readability of the SQL statement as a whole is greatly improved by moving it to a factored subquery.
Listing 10-9. Old SQL Refactored Using the WITH Clause
1 with bookends as (
2 select
3 to_date('01/01/2000','mm/dd/yyyy') begin_date
4 ,to_date('12/31/2000','mm/dd/yyyy') end_date
5 from dual
6 ),
7 prodmaster as (
8 select distinct pr.prod_id, pr.prod_name, ch.channel_id, ch.channel_desc
9 from sh.channels ch
10 , sh.products pr
11 , sh.costs co
12 where ch.channel_id = co.channel_id
13 and co.prod_id = pr.prod_id
14 and co.time_id between (select begin_date from bookends)
15 and (select end_date from bookends)
16 ),
17 cost_compare as (
18 select
19 prod_id
20 , channel_id
21 , avg(c2.unit_cost) avg_cost
22 , min(c2.unit_cost) min_cost
23 , max(c2.unit_cost) max_cost
24 from sh.costs c2
25 where c2.time_id between (select begin_date from bookends)
26 and (select end_date from bookends)
27 group by c2.prod_id, c2.channel_id
28 )
29 select /*+ gather_plan_statistics */
30 substr(pm.prod_name,1,30) prod_name
31 , pm.channel_desc
32 , cc.avg_cost
33 , cc.min_cost
34 , cc.max_cost
35 from prodmaster pm
36 join cost_compare cc on cc.prod_id = pm.prod_id
37 and cc.channel_id = pm.channel_id
38 order by pm.prod_name, pm.channel_desc;
PRODUCT CHANNEL_DESC AVG COST MIN COST MAX COST
------------------------------ -------------------- ---------- ---------- ----------
1.44MB External 3.5" Diskette Direct Sales 8.36 7.43 9.17
1.44MB External 3.5" Diskette Internet 8.59 7.42 9.55
Y Box Internet 266.73 245.00 282.30
Y Box Partners 272.62 242.79 293.68
---------- ---------- ----------
sum 27,961.39 24,407.85 34,478.10
216 rows selected.
COLD CACHE Elapsed: 00:00:01.48
WARM CACHE Elapsed: 00:00:00.17
----------------------------------------------------------------------------------
| Id |Operation |Name |Starts|E-Rows|A-Rows|
----------------------------------------------------------------------------------
| 0|SELECT STATEMENT | | 1| | 216|
| 1| SORT ORDER BY | | 1| 17373| 216|
|* 2| HASH JOIN | | 1| 17373| 216|
| 3| VIEW | | 1| 216| 216|
| 4| HASH GROUP BY | | 1| 216| 216|
| 5| PARTITION RANGE ITERATOR | | 1| 17373| 17373|
| 6| TABLE ACCESS BY LOCAL INDEX ROWID|COSTS | 4| 17373| 17373|
| 7| BITMAP CONVERSION TO ROWIDS | | 4| | 17373|
|* 8| BITMAP INDEX RANGE SCAN |COSTS_TIME_BIX| 4| | 366|
| 9| FAST DUAL | | 1| 1| 1|
| 10| FAST DUAL | | 1| 1| 1|
| 11| VIEW | | 1| 17373| 216|
| 12| HASH UNIQUE | | 1| 17373| 216|
|* 13| HASH JOIN | | 1| 17373| 17373|
| 14| TABLE ACCESS FULL |PRODUCTS | 1| 72| 72|
| 15| MERGE JOIN | | 1| 17373| 17373|
| 16| TABLE ACCESS BY INDEX ROWID |CHANNELS | 1| 5| 4|
| 17| INDEX FULL SCAN |CHANNELS_PK | 1| 5| 4|
|* 18| SORT JOIN | | 4| 17373| 17373|
| 19| PARTITION RANGE ITERATOR | | 1| 17373| 17373|
| 20| TABLE ACCESS BY LOCAL INDEX RO|COSTS | 4| 17373| 17373|
| 21| BITMAP CONVERSION TO ROWIDS | | 4| | 17373|
|* 22| BITMAP INDEX RANGE SCAN |COSTS_TIME_BIX| 4| | 366|
| 23| FAST DUAL | | 1| 1| 1|
| 24| FAST DUAL | | 1| 1| 1|
----------------------------------------------------------------------------------
The calculations for the average, minimum, and maximum costs are replaced with a single subquery called cost_compare. Last, the SQL that joins the prodmaster and cost_compare subqueries is added. The structure of the SQL is now much easier on the eyes and the overworked developer’s brain. It’s also simpler for the DBA to understand. The DBA will be especially happy with the execution statistics.
Where the old SQL queried the costs table and costs_time_bix index several hundred times, the new SQL queries each only eight times. This is quite an improvement, and it shows in the elapsed times. The query time on a cold cache is 1.48 seconds, about 25 percent better than the old SQL. On a warm cache, however, the refactored SQL really shines, running at 0.17 second whereas the old SQL managed only 1.09 seconds.
Applying Subquery Factoring to PL/SQL
We discussed the new 12c WITH PL/SQL function earlier, but there are other ways that PL/SQL can present golden opportunities for optimization using subquery factoring. Something that most of us have done at one time or another is to write a PL/SQL routine when we cannot figure out how to do what we want in a single SQL query. Sometimes it can be very difficult to capture everything in a single statement. It’s often just easier to think procedurally rather than in sets of data, and just write some code to do what we need. As you gain experience, you will rely less and less on thinking in terms of “How would I code this in PL/SQL?” and will rely more along the lines of “How do I capture this problem in a single SQL statement?” The more advanced features that Oracle has packed into SQL can help as well.
Here’s an example. You’ve been asked to create a report with the following criteria:
At first, this doesn’t seem too difficult, but you may struggle for a bit when trying to capture this in one SQL statement. So, you decide to use a PL/SQL routine to get the needed data. The results may be similar to those in Listing 10-10. The logic is simple. Find all customers that fit the criteria and store their IDs in a temporary table. Then, loop through the newly saved customer IDs and find all their sales, sum them up, and add them to another temporary table. The results are then joined to the customers and products tables to generate the report.
Listing 10-10. PL/SQL to Generate Customer Report
SQL> create global temporary table cust3year ( cust_id number );
Table created.
SQL> create global temporary table sales3year(
2 cust_id number ,
3 prod_category varchar2(50),
4 total_sale number
5 )
6 /
Table created.
SQL> begin
2 execute immediate 'truncate table cust3year';
3 execute immediate 'truncate table sales3year';
4
5 insert into cust3year
6 select cust_id --, count(cust_years) year_count
7 from (
8 select distinct cust_id, trunc(time_id,'YEAR') cust_years
9 from sh.sales
10 )
11 group by cust_id
12 having count(cust_years) >= 3;
13
14 for crec in ( select cust_id from cust3year)
15 loop
16 insert into sales3year
17 select s.cust_id,p.prod_category, sum(co.unit_price * s.quantity_sold)
18 from sh.sales s
19 join sh.products p on p.prod_id = s.prod_id
20 join sh.costs co on co.prod_id = s.prod_id
21 and co.time_id = s.time_id
22 join sh.customers cu on cu.cust_id = s.cust_id
23 where s.cust_id = crec.cust_id
24 group by s.cust_id, p.prod_category;
25 end loop;
26 end;
27 /
PL/SQL procedure successfully completed.
Elapsed: 00:01:17.48
SQL> break on report
SQL> compute sum of total_sale on report
SQL> select c3.cust_id, c.cust_last_name, c.cust_first_name, s.prod_category, s.total_sale
2 from cust3year c3
3 join sales3year s on s.cust_id = c3.cust_id
4 join sh.customers c on c.cust_id = c3.cust_id
5 order by 1,4;
CUST ID LAST NAME FIRST NAME PRODUCT CATEGORY TOTAL SALE
--------- --------------- --------------- ------------------------------ ---------------
6 Charles Harriett Electronics 2,838.57
6 Charles Harriett Hardware 19,535.38
...
50833 Gravel Grover Photo 15,469.64
50833 Gravel Grover Software/Other 9,028.87
---------------
sum 167,085,605.71
16018 rows selected.
The code in Listing 10-10 is fairly succinct and it only takes 1:17 minutes to run. This isn’t too bad, is it? Although this is a nice little chunk of PL/SQL, take another look at it and think in terms of subfactored subqueries. The section that determines the correct customer IDs can be captured in a WITH clause fairly easily. Once the customers are identified, it is then a fairly easy job to use the results of the subquery to look up the needed sales, product, and customer information to create the report.
Listing 10-11 has a single SQL statement that captures what is done with the PL/SQL routine from Listing 10-10—without the need to create temporary tables manually or use PL/SQL loops. Should the use of temporary tables make for a more efficient query, Oracle does so automatically, or you can choose how Oracle preserves the subquery results via the INLINE and MATERIALIZE hints. It is more efficient, too, with an elapsed time of 6.13 seconds.
Listing 10-11. Use of the WITH Clause to Generate the Customer Report
1 with custyear as (
2 select cust_id, extract(year from time_id) sales_year
3 from sh.sales
4 where extract(year from time_id) between 1998 and 2002
5 group by cust_id, extract(year from time_id)
6 ),
7 custselect as (
8 select distinct cust_id
9 from (
10 select cust_id, count(*) over ( partition by cust_id) year_count
11 from custyear
12 )
13 where year_count >= 3 -- 3 or more years as a customer during period
14 )
15 select cu.cust_id, cu.cust_last_name, cu.cust_first_name, p.prod_category, sum(co.unit_price * s.quantity_sold) total_sale
16 from custselect cs
17 join sh.sales s on s.cust_id = cs.cust_id
18 join sh.products p on p.prod_id = s.prod_id
19 join sh.costs co on co.prod_id = s.prod_id
20 and co.time_id = s.time_id
21 join sh.customers cu on cu.cust_id = cs.cust_id
22 group by cu.cust_id, cu.cust_last_name, cu.cust_first_name, p.prod_category
23 order by cu.cust_id;
CUST ID LAST NAME FIRST NAME PRODUCT CATEGORY TOTAL SALE
--------- --------------- --------------- ------------------------------ ---------------
6 Charles Harriett Electronics 2,838.57
6 Charles Harriett Hardware 19,535.38
...
50833 Gravel Grover Photo 15,469.64
50833 Gravel Grover Software/Other 9,028.87
---------------
sum 167,085,605.71
16018 rows selected.
Elapsed: 00:00:06.13
The WITH clause in Listing 10-11 actually uses two subqueries that can be combined into a single query, but I thought it easier to read when they are broken into two queries. Notice the use of the EXTRACT() function;it simplifies comparing years by extracting the year from a date and converting it to an integer.
The SQL examples in this section of the chapter are not meant to be tuning exercises, but merely demonstrations that show how subquery factoring may be used. When refactoring legacy SQL to take advantage of the WITH clause, be sure to test the results. Subquery factoring can be used to organize some queries better, and in some cases can even be used as an optimization tool. Learning to use it adds another tool to your Oracle toolbox.
EXPERIMENT WITH SUBQUERY FACTORING
Included in this chapter are two scripts in the Exercises folder that you may want to experiment with. These scripts both run against the SH demo schema.
Run these scripts with both the MATERIALIZE and INLINE hints to compare performance. In the tsales subquery, a WHERE clause limits the data returned to a single year. Comment out the WHERE clause and run the queries again. How does the efficiency of the two hints compare now? Would you feel comfortable using these hints when the size of the data set is set at runtime by user input?
Beginning in Oracle 11.2, recursive subquery factoring (RSF for the remainder of this chapter) was added. As you can probably guess, the ANSI name for this feature is recursive common table expression. Regardless of what you call it, Oracle has had a similar feature for a very long time in the form of the CONNECT BY 6 clause of the SELECT statement. This feature has been enhanced in Oracle 11gR2.
Let’s begin by looking at a traditional CONNECT BY query such as that in Listing 10-12. The emp inline view is used to join the employee and department tables, and then the single dataset is presented to the SELECT . . . CONNECT BY statement. The PRIOR operator is used to match the current EMPLOYEE_ID to rows where this value is in the MANAGER_ID column. Doing so iteratively creates a recursive query.
Listing 10-12 contains a number of extra columns in the output to help explain how the PRIOR operator works. Let’s take a look at the output beginning with the row for Lex De Haan. You can see that the employee_id for Lex is 102. The PRIOR operator finds all rows for which the manager_id is 102 and includes them under the hierarchy for Lex De Haan. The only row that meets these criteria is the one for Alexander Hunold, with an employee_id of 103. The process is then repeated for Alexander Hunold: Are there any rows for which the manager_id is 103? There are four rows found with a manager_id of 103—those for employees Valli Pattaballa, Diana Lorentz, Bruce Ernst, and David Austin—so these are included in the output below Alexander Hunold. Because there are no rows for which any of the employee_id values for these four employees appears as a manager_id, Oracle moves back up to a level for which the rows have not yet been processed (in this case, for Alberto Errazuriz) and continues on to the end until all rows have been processed.
Listing 10-12. Basic CONNECT BY
1 select lpad(' ', level*2-1,' ') || emp.emp_last_name emp_last_name
2 , emp.emp_first_name
3 , emp.employee_id
4 , emp.mgr_last_name, emp.mgr_first_name
5 , emp.manager_id
6 , department_name
7 from (
8 select /*+ inline gather_plan_statistics */
9 e.last_name emp_last_name, e.first_name emp_first_name
10 , e.employee_id, d.department_id
11 , e.manager_id, d.department_name
12 , es.last_name mgr_last_name, es.first_name mgr_first_name
13 from hr.employees e
14 left outer join hr.departments d on d.department_id = e.department_id
15 left outer join hr.employees es on es.employee_id = e.manager_id
16 ) emp
17 connect by prior emp.employee_id = emp.manager_id
18 start with emp.manager_id is null
19 order siblings by emp.emp_last_name;
EMP_LAST_NAME EMP_FIRST_NAME EMP ID MGR_LAST_NAME MGR_FIRST_NAME MGR ID DEPARTMENT
---------------- --------------- ------ ---------------- --------------- ------ ------------
King Steven 100 Executive
Cambrault Gerald 148 King Steven 100 Sales
Bates Elizabeth 172 Cambrault Gerald 148 Sales
Bloom Harrison 169 Cambrault Gerald 148 Sales
Fox Tayler 170 Cambrault Gerald 148 Sales
Kumar Sundita 173 Cambrault Gerald 148 Sales
Ozer Lisa 168 Cambrault Gerald 148 Sales
Smith William 171 Cambrault Gerald 148 Sales
De Haan Lex 102 King Steven 100 Executive
Hunold Alexander 103 De Haan Lex 102 IT
Austin David 105 Hunold Alexander 103 IT
Ernst Bruce 104 Hunold Alexander 103 IT
Lorentz Diana 107 Hunold Alexander 103 IT
Pataballa Valli 106 Hunold Alexander 103 IT
Errazuriz Alberto 147 King Steven 100 Sales
Ande Sundar 166 Errazuriz Alberto 147 Sales
Banda Amit 167 Errazuriz Alberto 147 Sales
...
107 rows selected.
The START WITH clause is instructed to begin with a value for which manager_id is null. Because this is an organizational hierarchy with a single person at the top of the hierarchy, this causes the query to start with Stephen King. As the chief executive officer, King does not have a manager, so the manager_id column is set to NULL for his row.
The level pseudocolumn holds the value for the depth of the recursion, allowing for a simple method to indent the output so that the organizational hierarchy is visible.
The Example Using an RSF
The example query on the employees table has been rewritten in Listing 10-13 to use RSF, in which the main subquery is emp_recurse. The anchor member in this case simply selects the topmost row in the hierarchy by selecting the only row where manager_id is null. This is equivalent to start with emp.manager_id is null in Listing 10-12. The recursive member references the defining query emp_recurse by joining it to emp query. This join is used to locate the row corresponding to each employee’s manager, which is equivalent to connect by prior emp.employee_id = emp.manager_id in Listing 10-12. The results in Listing 10-13 are identical to those in Listing 10-12.
Listing 10-13. Basic Recursive Subquery Factoring
1 with emp as (
2 select /*+ inline gather_plan_statistics */
3 e.last_name, e.first_name, e.employee_id, e.manager_id, d.department_name
4 from hr.employees e
5 left outer join hr.departments d on d.department_id = e.department_id
6 ),
7 emp_recurse (last_name,first_name,employee_id,manager_id,department_name,lvl) as (
8 select e.last_name, e.first_name
9 , e.employee_id, e.manager_id
10 , e.department_name, 1 as lvl
11 from emp e where e.manager_id is null
12 union all
13 select emp.last_name, emp.first_name
14 , emp.employee_id, emp.manager_id
15 ,emp.department_name, empr.lvl + 1 as lvl
16 from emp
17 join emp_recurse empr on empr.employee_id = emp.manager_id
18 )
19 search depth first by last_name set order1
20 select lpad(' ', lvl*2-1,' ') || er.last_name last_name
21 , er.first_name
22 , er.department_name
23 from emp_recurse er;
LAST_NAME FIRST_NAME DEPARTMENT
------------------------- -------------------- ------------
King Steven Executive
Cambrault Gerald Sales
Bates Elizabeth Sales
Bloom Harrison Sales
Fox Tayler Sales
Kumar Sundita Sales
Ozer Lisa Sales
Smith William Sales
De Haan Lex Executive
Hunold Alexander IT
Austin David IT
Ernst Bruce IT
Lorentz Diana IT
Pataballa Valli IT
Errazuriz Alberto Sales
Ande Sundar Sales
Banda Amit Sales
...
107 rows selected.
Although the new RSF method may at first appear verbose, the basis of how it works is simpler to understand than CONNECT BY, and it allows for more complex queries. The recursive WITH clause requires two query blocks: the anchor member and the recursive member. These two query blocks must be combined with the UNION ALL set operator. The anchor member is the query prior to the UNION ALL, whereas the recursive member is the query that follows. The recursive member must reference the defining subquery; in doing so, it is recursive.
Restrictions on RSF
As you might imagine, the use of RSF is quite a bit more flexible than CONNECT BY. There are some restrictions on its use, however. Per the Oracle documentation for the SELECT statement, the following elements cannot be used in the recursive member of an RSF:
Differences from CONNECT BY
There are several differences when using RSF compared with CONNECT BY, and some of them are apparent in Listing 10-13. You may have wondered what happened to the level pseudocolumn because it is missing in this query, replaced by the lvl column. I explain this a little later on. Also, notice that the columns returned by an RSF query must be specified in the query definition, as seen in line 7 of Listing 10-13. One more new feature is SEARCH DEPTH FIRST seen on line 19. The default search is BREADTH FIRST, which is not usually the output you want from a hierarchical query. Listing 10-14 shows the output when the SEARCH clause is not used or it is set to BREADTH FIRST. This search returns rows of all siblings at each level before returning any child rows. Specifying SEARCH DEPTH FIRST returns the rows in hierarchical order. The SET ORDER1 portion of the SEARCH clause sets the value of the order1 pseudocolumn to the value of the order in which the rows are returned, similar to what you might see with ROWNUM, but you get to name the column. This is also used in later examples.
Listing 10-14. Default BREADTH FIRST Search
...
search breadth first by last_name set order1
select lpad(' ', lvl*2-1,' ') || er.last_name last_name
...
LAST_NAME FIRST_NAME DEPARTMENT_NAME
------------------------- -------------------- -----------------
King Steven Executive
Cambrault Gerald Sales
De Haan Lex Executive
Errazuriz Alberto Sales
Fripp Adam Shipping
Hartstein Michael Marketing
Kaufling Payam Shipping
Kochhar Neena Executive
Mourgos Kevin Shipping
Partners Karen Sales
Raphaely Den Purchasing
Russell John Sales
Vollman Shanta Shipping
Weiss Matthew Shipping
Zlotkey Eleni Sales
Abel Ellen Sales
Ande Sundar Sales
...
Notice that the SEARCH clause, as it is used in Listings 10-13 and 10-14, specifies that the search be by last_name. This could also be by first_name, or by a column list, such as last_name,first_name. Doing so controls the order of the rows within each level. The SEARCH clause ends with SET ORDER1, which effectively adds the order1 pseudocolumn to the column list returned by the recursive subquery. You will see it used more in some of the following examples.
Duplicating CONNECT BY Functionality
As Oracle Database has progressed through several versions, the functionality of the CONNECT BY clause has progressed as well. There are a number of hierarchical query operators, pseudocolumns, and one function available to CONNECT BY that are not natively available to RSF. The functionality these provide, however, can be duplicated in RSF. The functionality may not mimic exactly what occurs when CONNECT BY is used, but it can likely be made to do what you need. The trick to getting what you want from RSF sometimes requires stepping away from the keyboard and thinking about the results you want to achieve, rather than thinking about how you are going to code it. It is amazing how the change in perspective helps you achieve the desired output easily from the SQL you write.
The operators and pseudocolumns for CONNECT BY are listed in Table 10-1. I go through each of these as needed, showing example usages for CONNECT BY, and then duplicating that functionality with RSF. Keep in mind that RSF is quite versatile, so TMTOWTDI (There’s More Than One Way to Do It) is definitely in force. Feel free to experiment and find other methods to achieve the same results.
Table 10-1. CONNECT BY Functions, Operators, and Pseudocolumns
I also cover the SEARCH clause of RSF because it is instrumental in solving some problems.
The level Pseudocolumn
Let’s start with the level pseudocolumn, which is used frequently in hierarchical queries to indent the output, creating a visual representation of the hierarchy. Listing 10-15 contains a simple example showing how level is generated. As the hierarchy increases in depth, level is incremented. Likewise, level is decremented when the hierarchy goes back a level.
Listing 10-15. The level Pseudocolumn
1 select lpad(' ', level*2-1,' ') || e.last_name last_name, level
2 from hr.employees e
3 connect by prior e.employee_id = e.manager_id
4 start with e.manager_id is null
5 order siblings by e.last_name;
LAST_NAME LEVEL
------------------------- ----------
King 1
Cambrault 2
Bates 3
Bloom 3
Fox 3
Kumar 3
Ozer 3
Smith 3
De Haan 2
...
107 rows selected.
This can also be accomplished in RSF (see Listing 10-16), although it does require a little effort on your part. It may be somewhat surprising to see that this actually works. The value for lvl is never decremented, only incremented. Recall that the default search method for RSF is BREADTH FIRST. It is apparent that Oracle is processing the rows in sibling order, with the top of the hierarchy (King), followed by the child rows at the next level, continuing until the last row is reached. This behavior allows you to solve some other problems as well.
Listing 10-16. Create a lvl Column
1 with emp_recurse(employee_id,manager_id,last_name, lvl) as (
2 select e.employee_id, null, e.last_name, 1 as lvl
3 from hr.employees e
4 where e.manager_id is null
5 union all
6 select e1.employee_id, e1.manager_id, e1.last_name, e2.lvl + 1 as lvl
7 from hr.employees e1
8 join emp_recurse e2 on e2.employee_id= e1.manager_id
9 )
10 search depth first by last_name set last_name_order
11 select lpad(' ', r.lvl*2-1,' ') || r.last_name last_name, r.lvl
12 from emp_recurse r
13 order by last_name_order;
LAST_NAME LVL
------------------------- ----------
King 1
Cambrault 2
Bates 3
Bloom 3
Fox 3
Kumar 3
Ozer 3
Smith 3
De Haan 2
...
107 rows selected.
The SYS_CONNECT_BY_PATH Function
The SYS_CONNECT_BY_PATH function is used to return the values that comprise the hierarchy up to the current row. It’s best explained with an example, such as the one in Listing 10-17, In which the SYS_CONNECT_BY_PATH function is used to build a colon-delimited list of the hierarchy, complete from root to node.
Listing 10-17. SYS_CONNECT_BY_PATH
1 select lpad(' ',2*(level-1)) || e.last_name last_name
2 , sys_connect_by_path(last_name,':') path
3 from hr.employees e
4 start with e.manager_id is null
5 connect by prior e.employee_id = e.manager_id
6 order siblings by e.last_name;
LAST_NAME PATH
------------------------- -----------------------
King :King
Cambrault :King:Cambrault
Bates :King:Cambrault:Bates
Bloom :King:Cambrault:Bloom
Fox :King:Cambrault:Fox
Kumar :King:Cambrault:Kumar
Ozer :King:Cambrault:Ozer
Smith :King:Cambrault:Smith
De Haan :King:De Haan
...
107 rows selected.
Although the SYS_CONNECT_BY_PATH function is not available to RSF queries, this function can be duplicated using much the same method that was used to reproduce the level pseudocolumn. Rather than incrementing a counter, however, you now append to a string value. Listing 10-18 shows how this is done.
Listing 10-18. Build Your Own SYS_CONNECT_BY_PATH
1 with emp_recurse(employee_id,manager_id,last_name,lvl, path) as (
2 select e.employee_id, null, e.last_name
3 , 1 as lvl
4 ,':' || to_char(e.last_name) as path
5 from hr.employees e
6 where e.manager_id is null
7 union all
8 select e1.employee_id, e1.manager_id, e1.last_name
9 ,e2.lvl + 1 as lvl
10 ,e2.path || ':' || e1.last_name as path
11 from hr.employees e1
12 join emp_recurse e2 on e2.employee_id= e1.manager_id
13 )
14 search depth first by last_name set last_name_order
15 select lpad(' ', r.lvl*2-1,' ') || r.last_name last_name, r.path
16 from emp_recurse r
17 order by last_name_order;
LAST_NAME PATH
------------------------- ----------------------
King :King
Cambrault :King:Cambrault
Bates :King:Cambrault:Bates
Bloom :King:Cambrault:Bloom
Fox :King:Cambrault:Fox
Kumar :King:Cambrault:Kumar
Ozer :King:Cambrault:Ozer
Smith :King:Cambrault:Smith
De Haan :King:De Haan
...
107 rows selected.
The output of the SYS_CONNECT_BY_PATH as seen in Listing 10-17 is duplicated by the roll-your-own version using RSF in Listing 10-18. Take another look at this SQL. You may notice that there’s something here that SYS_CONNECT_BY_PATH cannot do. Consider, for instance, if you want the hierarchy to be displayed as a comma-delimited list. This is accomplished simply enough by changing “:” to “,”. The problem with SYS_CONNECT_BY_PATH is that the first character in the output is always a comma.
Using the RSF method, you can simply remove the delimiter in the anchor member and then change the delimiter in the recursive member to a comma. This is shown in Listing 10-19, along with a sample of the output. Should you feel inclined, the first character of the path could remain a colon, with the values delimited by commas.
Listing 10-19. Comma-Delimited PATH
1 with emp_recurse(employee_id,manager_id,last_name,lvl,path) as (
2 select e.employee_id, null, e.last_name
3 , 1 as lvl
4 ,e.last_name as path
5 from hr.employees e
6 where e.manager_id is null
7 union all
8 select e1.employee_id, e1.manager_id, e1.last_name
9 ,e2.lvl + 1 as lvl
10 ,e2.path || ',' || e1.last_name as path
11 from hr.employees e1
12 join emp_recurse e2 on e2.employee_id= e1.manager_id
13 )
14 search depth first by last_name set last_name_order
15 select lpad(' ', r.lvl*2-1,' ') || r.last_name last_name, r.path
16 from emp_recurse r
17 order by last_name_order;
LAST_NAME PATH
------------------------- ----------------------
King King
Cambrault King,Cambrault
Bates King,Cambrault,Bates
Bloom King,Cambrault,Bloom
Fox King,Cambrault,Fox
Kumar King,Cambrault,Kumar
Ozer King,Cambrault,Ozer
Smith King,Cambrault,Smith
De Haan King,De Haan
...
107 rows selected.
The CONNECT_BY_ROOT Operator
The CONNECT_BY_ROOT operator enhances the CONNECT BY syntax by returning the root node of the current row. In the example of the hr.employees table, all rows return King as the root. You can change it up a bit, however, by modifying the row temporarily for Neena Kochhar, putting her on the same level as the company president, Steven King. Then, the hierarchy can be shown for Kochhar by using the CONNECT_BY_ROOT operator to restrict the output. The results are shown in Listing 10-20.
Listing 10-20. CONNECT_BY_ROOT
1* update hr.employees set manager_id= null where last_name ='Kochhar';
1 row updated.
1 select /*+ inline gather_plan_statistics */
2 level
3 , lpad(' ',2*(level-1)) || last_name last_name
4 , first_name
5 , CONNECT_BY_ROOT last_name as root
6 , sys_connect_by_path(last_name,':') path
7 from hr.employees
8 where connect_by_root last_name = 'Kochhar'
9 connect by prior employee_id = manager_id
10 start with manager_id is null;
LEVEL LAST_NAME FIRST_NAME ROOT PATH
----- ------------ ------------ ------------ ------------------------------
1 Kochhar Neena Kochhar :Kochhar
2 Greenberg Nancy Kochhar :Kochhar:Greenberg
3 Faviet Daniel Kochhar :Kochhar:Greenberg:Faviet
3 Chen John Kochhar :Kochhar:Greenberg:Chen
3 Sciarra Ismael Kochhar :Kochhar:Greenberg:Sciarra
3 Urman Jose Manuel Kochhar :Kochhar:Greenberg:Urman
3 Popp Luis Kochhar :Kochhar:Greenberg:Popp
2 Whalen Jennifer Kochhar :Kochhar:Whalen
2 Mavris Susan Kochhar :Kochhar:Mavris
2 Baer Hermann Kochhar :Kochhar:Baer
2 Higgins Shelley Kochhar :Kochhar:Higgins
3 Gietz William Kochhar :Kochhar:Higgins:Gietz
12 rows selected.
1 rollback;
This functionality can be duplicated in RSF, but it does require a little more SQL. The code in Listing 10-21 is based on the SYS_CONNECT_BY_PATH example, with some minor changes and additions. The delimiting character is now prepended and appended to the value for PATH in the anchor member. In the recursive member, the delimiter is appended to the PATH, whereas previously it was prepended to the last_name column. Doing so ensures that the root records always have a delimiting character at the end of the value, allowing the SUBSTR() function in the emps subquery to parse the root correctly from the string when the path comes from the anchor member only, such as the rows for King and Kochar. This is probably better explained by examining the output from the query.
Listing 10-21. Duplicate CONNECT_BY_ROOT
1 update hr.employees set manager_id= null where last_name ='Kochhar';
1 row updated.
1 with emp_recurse(employee_id,manager_id,last_name,lvl,path) as (
2 select /*+ gather_plan_statistics */
3 e.employee_id
4 , null as manager_id
5 , e.last_name
6 , 1 as lvl
7 , ':' || e.last_name || ':' as path
8 from hr.employees e
9 where e.manager_id is null
10 union all
11 select
12 e.employee_id
13 , e.manager_id
14 , e.last_name
15 , er.lvl + 1 as lvl
16 , er .path || e.last_name || ':' as path
17 from hr.employees e
18 join emp_recurse er on er.employee_id = e.manager_id
19 join hr.employees e2 on e2.employee_id = e.manager_id
20 )
21 search depth first by last_name set order1 ,
22 emps as (
23 select lvl
24 , last_name
25 , path
26 , substr(path,2,instr(path,':',2)-2) root
27 from emp_recurse
28 )
29 select
30 lvl
31 , lpad(' ',2*(lvl-1)) || last_name last_name
32 , root
33 , path
34 from emps
35 where root = 'Kochhar';
LVL LAST_NAME ROOT PATH
---------- --------------- --------------- ------------------------------
1 Kochhar Kochhar :Kochhar:
2 Baer Kochhar :Kochhar:Baer:
2 Greenberg Kochhar :Kochhar:Greenberg:
3 Chen Kochhar :Kochhar:Greenberg:Chen:
3 Faviet Kochhar :Kochhar:Greenberg:Faviet:
3 Popp Kochhar :Kochhar:Greenberg:Popp:
3 Sciarra Kochhar :Kochhar:Greenberg:Sciarra:
3 Urman Kochhar :Kochhar:Greenberg:Urman:
2 Higgins Kochhar :Kochhar:Higgins:
3 Gietz Kochhar :Kochhar:Higgins:Gietz:
2 Mavris Kochhar :Kochhar:Mavris:
2 Whalen Kochhar :Kochhar:Whalen:
12 rows selected.
1* rollback;
This is not a perfect duplication of the CONNECT_BY_ROOT operator. In this case, it does exactly what is needed. The built-in operator, however, does allow some flexibility in specifying the level and returning the root at that level. The example given needs more modification to match this ability; however, you may find that this example works well for most cases.
The connect_by_iscycle Pseudocolumn and NOCYCLE Parameter
The connect_by_iscycle pseudocolumn makes it easy to detect loops in a hierarchy. This is illustrated by the SQL in Listing 10-22. There, an intentional error has been introduced by updating the hr.employees row for the president, assigning Smith as King’s manager, which causes an error in the CONNECT BY.
Listing 10-22. Cycle Error in CONNECT BY
1 update hr.employees set manager_id = 171 where employee_id = 100;
1 row updated.
Elapsed: 00:00:00.02
1 select lpad(' ',2*(level-1)) || last_name last_name
2 ,first_name, employee_id, level
3 from hr.employees
4 start with employee_id = 100
5* connect by prior employee_id = manager_id
LAST_NAME FIRST_NAME EMPLOYEE_ID LEVEL
------------------------- ------------ ----------- -----
King Steven 100 1
Kochhar Neena 101 2
Greenberg Nancy 108 3
...
Smith William 171 3
King Steven 100 4
...
ERROR:
ORA-01436: CONNECT BY loop in user data
187 rows selected.
1 rollback;
In the output, Smith appears as the manager of King, which you know to be incorrect. But, if you didn’t already know what the problem is, how would you find it? This is where the NOCYCLE parameter and CONNECT_BY_ISCYCLE operator come into play. These are used to detect a cycle in the hierarchy. The NOCYCLE parameter prevents the ORA-1436 error from occurring, allowing all rows to be output. The CONNECT_BY_ISCYCLE operator allows you to find the row causing the error easily.
As seen in Listing 10-23, the value of CONNECT_BY_ISCYCLE is 1, indicating that the row for Smith is somehow causing the error. The next query looks up the data for Smith, and all appears normal. Last, you query the table again, this time using Smith’s employee ID, to find all employees that he manages. The error becomes apparent—the president of the company does not have a manager—so the solution is to set the manager_id back to NULL for this row.
Listing 10-23. Detect the Cycle with CONNECT_BY_ISCYCLE
1* update hr.employees set manager_id = 171 where employee_id = 100
1 row updated.
1 select lpad(' ',2*(level-1)) || last_name last_name
2 ,first_name, employee_id, level
3 , connect_by_iscycle
4 from hr.employees
5 start with employee_id = 100
6 connect by nocycle prior employee_id = manager_id;
LAST_NAME FIRST_NAME EMPLOYEE_ID LEVEL CONNECT_BY_ISCYCLE
------------------------- ------------ ----------- ----- ------------------
King Steven 100 1 0
Kochhar Neena 101 2 0
...
Smith William 171 3 1
...
107 rows selected.
Elapsed: 00:00:00.03
1 select last_name, first_name, employee_id, manager_id
2 from hr.employees
3* where employee_id = 171
LAST_NAME FIRST_NAME EMPLOYEE_ID MANAGER_ID
------------------------- ------------ ----------- ----------
Smith William 171 148
1 select last_name, first_name, employee_id, manager_id
2 from hr.employees
3* where manager_id = 171
LAST_NAME FIRST_NAME EMPLOYEE_ID MANAGER_ID
------------------------- ------------ ----------- ----------
King Steven 100 171
1 rollback;
So, how do you do this with RSF? It’s really quite simple, because Oracle has provided the built-in CYCLE clause that makes short work of detecting cycles in recursive queries. It is somewhat more robust than the connect_by_iscycle pseudocolumn in that it lets you determine which values are used to indicate a cycle, as well as provides a column name at the same time. Listing 10-24 uses the same data error in Listing 10-23, but this time it uses a recursive subfactored query.
Listing 10-24. Detect Cycles in Recursive Queries
1 update hr.employees set manager_id = 171 where employee_id = 100;
1 row updated.
Elapsed: 00:00:00.00
1 with emp(employee_id,manager_id,last_name,first_name,lvl) as (
2 select e.employee_id
3 , null as manager_id
4 , e.last_name
5 , e.first_name
6 , 1 as lvl
7 from hr.employees e
8 where e.employee_id =100
9 union all
10 select e.employee_id
11 , e.manager_id
12 , e.last_name
13 , e.first_name
14 , emp.lvl + 1 as lvl
15 from hr.employees e
16 join emp on emp.employee_id = e.manager_id
17 )
18 search depth first by last_name set order1
19 CYCLE employee_id SET is_cycle TO '1' DEFAULT '0'
20 select lpad(' ',2*(lvl-1)) || last_name last_name
21 , first_name
22 , employee_id
23 , lvl
24 , is_cycle
25 from emp
26 order by order1;
LAST_NAME FIRST_NAME EMPLOYEE_ID LVL I
------------------------- ------------ ----------- ---------- -
King Steven 100 1 0
Cambrault Gerald 148 2 0
Bates Elizabeth 172 3 0
Bloom Harrison 169 3 0
Fox Tayler 170 3 0
Kumar Sundita 173 3 0
Ozer Lisa 168 3 0
Smith William 171 3 0
King Steven 100 4 1
...
108 rows selected.
Elapsed: 00:00:00.04
1 select last_name, first_name, employee_id, manager_id
2 from hr.employees
3 where employee_id = 100;
LAST_NAME FIRST_NAME EMPLOYEE_ID MANAGER_ID
------------------------- ------------ ----------- ----------
King Steven 100 171
1 row selected.
1 rollback;
Notice how the CYCLE clause lets you set the two possible values for the is_cycle column to 0 or 1. Only single-value characters are allowed here. The name of the column is also user defined and is set to is_cycle in this example. Examining the output, it appears that the CYCLE clause in RSF does a somewhat better job of identifying the row that causes the data cycle. The row with the error is identified clearly as that of King, so you can query that row and determine the error immediately.
The connect_by_isleaf Pseudocolumn
Last, there is the connect_by_isleaf pseudocolumn, which permits easy identification of leaf 7 nodes in hierarchical data. You can see that leaf nodes are identified in the output of Listing 10-25 when the value of connect_by_isleaf is 1.
Listing 10-25. CONNECT_BY_ISLEAF
1 select lpad(' ',2*(level-1)) || e.last_name last_name, connect_by_isleaf
2 from hr.employees e
3 start with e.manager_id is null
4 connect by prior e.employee_id = e.manager_id
5 order siblings by e.last_name;
LAST_NAME CONNECT_BY_ISLEAF
------------------------- -----------------
King 0
Cambrault 0
Bates 1
Bloom 1
Fox 1
Kumar 1
Ozer 1
Smith 1
De Haan 0
Hunold 0
Austin 1
Ernst 1
Lorentz 1
Pataballa 1
...
107 rows selected.
--------------------------------------------------------------------------
| Id | Operation | Name | E-Rows |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | |
|* 1 | CONNECT BY NO FILTERING WITH START-WITH | | |
| 2 | TABLE ACCESS FULL | EMPLOYEES | 107 |
-------------------------------------------------------------------------
Duplicating this in RSF is somewhat of a challenge. There are probably many methods that can be used to accomplish this, with some limitations. This is one of those problems that may require a little extra thought to solve, where “solve” means you get the output you desire, but you won’t necessarily duplicate the functionality of CONNECT_BY_ISLEAF completely.
In this case, you want to identify the leaf nodes in the employee hierarchy. By definition, none of the leaf nodes can be managers, so one way to accomplish this is to determine which rows are those of managers. All rows that are not those of managers are then leaf nodes.
Listing 10-26 uses this approach to solve the problem. The cost of solving it is two more extra scans of the hr.employees table and three index scans, but if RSF must be used, this is one way to get the desired results. The leaves subquery is used find the leaf nodes. This is then left outer joined to the employees table, and the value (or lack of a value) of the leaves.employee_id column indicates whether the current row is a leaf.
Listing 10-26. Finding Leaf Nodes in a Recursive Query
1 with leaves as (
2 select employee_id
3 from hr.employees
4 where employee_id not in (
5 select manager_id
6 from hr.employees
7 where manager_id is not null
8 )
9 ),
10 emp(manager_id,employee_id,last_name,lvl,isleaf) as (
11 select e.manager_id, e.employee_id, e.last_name, 1 as lvl, 0 as isleaf
12 from hr.employees e
13 where e.manager_id is null
14 union all
15 select e.manager_id,nvl(e.employee_id,null) employee_id,e.last_name,emp.lvl + 1 as lvl
16 , decode(l.employee_id,null,0,1) isleaf
17 from hr.employees e
18 join emp on emp.employee_id = e.manager_id
19 left outer join leaves l on l.employee_id = e.employee_id
20 )
21 search depth first by last_name set order1
22 select lpad(' ',2*(lvl-1)) || last_name last_name, isleaf
23 from emp;
LAST_NAME ISLEAF
------------------------- ----------
King 0
Cambrault 0
Bates 1
Bloom 1
Fox 1
Kumar 1
Ozer 1
Smith 1
De Haan 0
Hunold 0
Austin 1
Ernst 1
Lorentz 1
Pataballa 1
...
107 rows selected.
---------------------------------------------------------------------------
| Id | Operation | Name | E-Rows |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | |
| 1 | VIEW | | 7 |
| 2 | UNION ALL (RECURSIVE WITH) DEPTH FIRST| | |
|* 3 | TABLE ACCESS FULL | EMPLOYEES | 1 |
| 4 | NESTED LOOPS OUTER | | 6 |
| 5 | NESTED LOOPS | | 6 |
| 6 | RECURSIVE WITH PUMP | | |
| 7 | TABLE ACCESS BY INDEX ROWID | EMPLOYEES | 6 |
|* 8 | INDEX RANGE SCAN | EMP_MANAGER_IX | 6 |
| 9 | VIEW PUSHED PREDICATE | | 1 |
| 10 | NESTED LOOPS ANTI | | 1 |
|* 11 | INDEX UNIQUE SCAN | EMP_EMP_ID_PK | 1 |
|* 12 | INDEX RANGE SCAN | EMP_MANAGER_IX | 6 |
---------------------------------------------------------------------------
Another way to accomplish this is presented in Listing 10-27, in which the analytic function LEAD() uses the value of the lvl column to determine whether a row is a leaf node. Although it does avoid two of the index scans that were seen in Listing 10-26, determining correctly whether a row is a leaf node is dependent on the order of the output, as seen in line 16. The LEAD() function depends on the value of the last_name_order column that is set in the SEARCH clause.
Listing 10-27. Using LEAD() to Find Leaf Nodes
1 with emp(manager_id,employee_id,last_name,lvl) as (
2 select e.manager_id, e.employee_id, e.last_name, 1 as lvl
3 from hr.employees e
4 where e.manager_id is null
5 union all
6 select e.manager_id, nvl(e.employee_id,null) employee_id
7 , e.last_name, emp.lvl + 1 as lvl
8 from hr.employees e
9 join emp on emp.employee_id = e.manager_id
10 )
11 search depth first by last_name set last_name_order
12 select lpad(' ',2*(lvl-1)) || last_name last_name,
13 lvl,
14 lead(lvl) over (order by last_name_order) leadlvlorder,
15 case
16 when ( lvl - lead(lvl) over (order by last_name_order) ) < 0
17 then 0
18 else 1
19 end isleaf
20 from emp;
LAST_NAME LVL LEADLVLORDER ISLEAF
------------------------- ---------- ------------ ----------
King 1 2 0
Cambrault 2 3 0
Bates 3 3 1
Bloom 3 3 1
Fox 3 3 1
Kumar 3 3 1
Ozer 3 3 1
Smith 3 2 1
De Haan 2 3 0
Hunold 3 4 0
Austin 4 4 1
Ernst 4 4 1
Lorentz 4 4 1
Pataballa 4 2 1
...
107 rows selected.
----------------------------------------------------------------------------
| Id | Operation | Name | E-Rows |
----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | |
| 1 | WINDOW BUFFER | | 7 |
| 2 | VIEW | | 7 |
| 3 | UNION ALL (RECURSIVE WITH) DEPTH FIRST| | |
|* 4 | TABLE ACCESS FULL | EMPLOYEES | 1 |
| 5 | NESTED LOOPS | | |
| 6 | NESTED LOOPS | | 6 |
| 7 | RECURSIVE WITH PUMP | | |
|* 8 | INDEX RANGE SCAN | EMP_MANAGER_IX | 6 |
| 9 | TABLE ACCESS BY INDEX ROWID | EMPLOYEES | 6 |
----------------------------------------------------------------------------
What might happen if the SEARCH clause is changed from DEPTH FIRST to BREADTH FIRST? The results are shown in Listing 10-28. The use of the LEAD() function appears, at first, to be an elegant solution, but it is somewhat fragile in its dependency on the order of the data. The example in Listing 10-26 works regardless of the SEARCH clause parameters. It is readily apparent that the output in Listing 10-28 is incorrect.
Listing 10-28. LEAD() with BREADTH FIRST
1 with emp(manager_id,employee_id,last_name,lvl) as (
2 select e.manager_id, e.employee_id, e.last_name, 1 as lvl
3 from hr.employees e
4 where e.manager_id is null
5 union all
6 select e.manager_id, nvl(e.employee_id,null) employee_id
7 , e.last_name, emp.lvl + 1 as lvl
8 from hr.employees e
9 join emp on emp.employee_id = e.manager_id
10 )
11 search breadth first by last_name set last_name_order
12 select lpad(' ',2*(lvl-1)) || last_name last_name,
13 lvl,
14 lead(lvl) over (order by last_name_order) leadlvlorder,
15 case
16 when ( lvl - lead(lvl) over (order by last_name_order) ) < 0
17 then 0
18 else 1
19 end isleaf
20 from emp;
LAST_NAME LVL LEADLVLORDER ISLEAF
------------------------- ---------- ------------ ----------
King 1 2 0
Cambrault 2 2 1
De Haan 2 2 1
Errazuriz 2 2 1
Fripp 2 2 1
Hartstein 2 2 1
Kaufling 2 2 1
Summary
Although the functionality of CONNECT BY can be duplicated for most practical purposes in recursive subfactored queries, the question is, should you do so? In many cases, the CONNECT BY syntax is simpler to use, although the syntax does take some getting used to. Doing the same things in RSF requires quite a bit more SQL in most cases. In addition, CONNECT BY may produce better execution plans than RSF, especially for relatively simple queries. Keep in mind, however, that RSF is a new feature, and will likely improve in later versions of Oracle.
In addition, there may be good reasons not to use CONNECT BY. Perhaps you need to maintain ANSI compatibility in your application. Or perhaps the ability to write hierarchical queries that work in other databases that support recursive common table expressions simplifies the code for an application that runs on databases from different vendors. In this circumstance, RSF is quite useful. Whatever the need for hierarchical queries, with a little ingenuity you can write suitable queries on hierarchical data using recursive subfactored queries, and they will be capable of doing everything that can be done currently with CONNECT BY.
1 Although well known in the Oracle community for some time now, the INLINE and MATERIALIZE hints remain undocumented by Oracle.
2 If you run these examples on different versions of Oracle, the output may appear differently because the test data sometimes change with versions of Oracle.
3 The MATERIALIZE hint was used to ensure that the example works as expected, given that you may be testing on a different version or patch level of Oracle. On the test system I used, this was the default action by Oracle.
4 Initial executions are run after first flushing shared_pool and buffer_cache.
5 The actual table name was sys_temp_0fd9d66a2_453290, but was shortened in the listing for formatting purposes.
6CONNECT BY was first available in Oracle version 2 or, in others words, from the very beginning.
7 A leaf node is a node in the hierarchical tree that has no children.
18.223.107.85