Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

22. Counting Children in Trees

Kim Berg Hansen¹

(1)

Middelfart, Denmark

Sometimes you’d like to do aggregation where a row is included in multiple rows of the output, for example, being counted multiple times or having the value added multiple times. An example of this is hierarchical data, where you want for every row to find the count of all the children in the tree – not just immediate children but also grandchildren and their children and so on, all the way down to the leaves of the tree.

It means that a given row is counted in the result for the parent, but also counted again in the result of the grandparent, and so on. It can look similar to subtotals created with group by and rollup, but with the hierarchy, you don’t know how many levels down it goes, so you cannot simply use rollup.

One way I can solve this is using the after match skip to next row clause of match_recognize. Of course it could be used for other aggregates than count, but count is easy to understand, and once you know the technique, you can do the others easily.

Hierarchical tree of employees

The most classic table used to demonstrate hierarchical queries on Oracle is scott.emp table. Well, the Good Beer Trading Co also employs people, so my practical schema naturally has a table employees depicted in Figure 22-1.

../images/475066_1_En_22_Chapter/475066_1_En_22_Fig1_HTML.jpg — Figure 22-1
The employees table with a self-referencing foreign key

The column supervisor_id is a self-referencing foreign key that references the primary key id. Only one person has no supervisor – the boss of the company – for everyone else the supervisor_id contains the id of their immediate supervisor in the employee hierarchy. So I can show you the data of the table in a tree using Listing 22-1.

SQL> select

2 e.id

3 , lpad(' ', 2*(level-1)) || e.name as name

4 , e.title as title

5 , e.supervisor_id as super

6 from employees e

7 start with e.supervisor_id is null

8 connect by e.supervisor_id = prior e.id

9 order siblings by e.name;

Listing 22-1

A classic hierarchical query of employees

For a simple hierarchy like this, I tend to use the Oracle proprietary connect by query instead of the recursive subquery factoring I showed in Chapter 4. One of the things that are easier with connect by is, for example, the order siblings by I use here – that is more awkward to code with recursive subquery factoring.

So I start with the boss by specifying in line 7 to start with those with no supervisors. Then line 8 finds the immediate subordinates of the boss and then goes on recursively to find subordinates of those and so on:

ID NAME TITLE SUPER

142 Harold King Managing Director

144 Axel de Proef Product Director 142

151 Jim Kronzki Sales Manager 144

150 Laura Jensen Bulk Salesman 151

154 Simon Chang Retail Salesman 151

148 Maria Juarez Purchaser 144

147 Ursula Mwbesi Operations Chief 142

146 Lim Tok Lo Warehouse Manager 147

152 Evelyn Smith Forklift Operator 146

149 Kurt Zollman Forklift Operator 146

155 Susanne Hoff Janitor 146

143 Mogens Juel IT Manager 147

153 Dan Hoeffler IT Supporter 143

145 Zoe Thorston IT Developer 143

It’s different persons, but you’re likely to have seen a very similar output using scott.emp somewhere. And this query will form the basis for the rest of the SQL I’ll show in this chapter.

Counting subordinates of all levels

The task is now for each row to do a count of subordinates all the way down the tree – not just the immediate subordinates one level down. If you look at the organization diagram in Figure 22-2, I need to find that Harold King has 13 subordinates (all employees except himself), Ursula Mwbesi has 7 subordinates total (2 immediately below her plus 5 that are a level further down the tree), Lim Tok Lo has 3 subordinates total (all just 1 level below and they have no further subordinates), and so on.

../images/475066_1_En_22_Chapter/475066_1_En_22_Fig2_HTML.jpg — Figure 22-2
Organization diagram with some of the subtrees marked

A simple way to do this is using a scalar subquery as shown in Listing 22-2. The scalar subquery can find the relevant subtree in the hierarchy and count the nodes of the subtree.

SQL> select

2 e.id

3 , lpad(' ', 2*(level-1)) || e.name as name

4 , (

5 select count(*)

6 from employees sub

7 start with sub.supervisor_id = e.id

8 connect by sub.supervisor_id = prior sub.id

9 ) as subs

10 from employees e

11 start with e.supervisor_id is null

12 connect by e.supervisor_id = prior e.id

13 order siblings by e.name;

Listing 22-2

Counting the number of subordinates

The outer query is the same as Listing 22-1. The scalar subquery in lines 4–9 utilizes the same connect by query; only start with is not from the top of the tree, but instead start with in line 7 starts with those that are immediate subordinates of the current row in the outer query and searches the subtree from there and down:

ID NAME SUBS

142 Harold King 13

144 Axel de Proef 4

151 Jim Kronzki 2

150 Laura Jensen 0

154 Simon Chang 0

148 Maria Juarez 0

147 Ursula Mwbesi 7

146 Lim Tok Lo 3

152 Evelyn Smith 0

149 Kurt Zollman 0

155 Susanne Hoff 0

143 Mogens Juel 2

153 Dan Hoeffler 0

145 Zoe Thorston 0

The output is just what I’m after, but I’ve accessed the same rows of the tables multiple times – like Simon Chang that has been accessed four times: once in the scalar subquery for each of the three people above him in the tree and then once in the main query when it got to him in the tree. Also every time a leaf node in the tree was accessed, the database queried if there was anyone below him/her, so the four times Simon was accessed also incurred four lookups if he had subordinates.

All in all, it is a lot of repetitive work for the database. But luckily I have a way to reduce that amount of work.

Counting with row pattern matching

Using row pattern matching, I can create the query shown in Listing 22-3, which only needs to do the hierarchical query a single time and then do all the necessary counts on the retrieved tree without accessing the tables over and over again.

SQL> with hierarchy as (

2 select

3 lvl, id, name, rownum as rn

4 from (

5 select

6 level as lvl, e.id, e.name

7 from employees e

8 start with e.supervisor_id is null

9 connect by e.supervisor_id = prior e.id

10 order siblings by e.name

11 )

12 )

13 select

14 id

15 , lpad(' ', (lvl-1)*2) || name as name

16 , subs

17 from hierarchy

18 match_recognize (

19 order by rn

20 measures

21 strt.rn as rn

22 , strt.lvl as lvl

23 , strt.id as id

24 , strt.name as name

25 , count(higher.lvl) as subs

26 one row per match

27 after match skip to next row

28 pattern (

29 strt higher*

30 )

31 define

32 higher as higher.lvl > strt.lvl

33 )

34 order by rn;

Listing 22-3

Counting subordinates with match_recognize

The output of Listing 22-3 is exactly the same as the output of Listing 22-2. I’ll tell you how it works:

I’m using a with clause for clarity as I taught you in Chapter 3.
Inside the with clause lines 5–10 is an inline view containing the basic hierarchical query I’ve already shown you in the previous listings. Notice the order by in line 10 is inside the inline view.
I place it in an inline view so that I can use rownum in line 3 (outside the inline view) and save it as column alias rn. I need to preserve the hierarchical ordering created by the inline view when I do my row pattern matching – this allows me to do so.
Building my match_recognize clause, I start by defining in line 32 that a row that has a higher level than the starting row of the pattern is classified as higher – meaning that when it has a higher level, then it is a child/grandchild/greatgrandchild/… of the starting row (i.e., a subordinate).
Of course, not everybody in the entire row set with a higher level is a subordinate – only those consecutive rows with a higher level that follow the row itself. Once I reach someone with the same level (or lower), then I am no longer within the subtree I want. I solve this in the pattern in line 29 by looking for a strt row (which is undefined and therefore can be any row) followed by zero or more higher rows – when a row is reached that is no longer classified higher, the match stops.
In line 26, I’ve specified one row per match, and the employee I’m interested in outputting data from is the strt row, so I’m using strt columns in the measures in lines 21–24.
In line 25, I’m doing a count on how many higher rows were in the match. If I just did a plain count(*), I’d be including the strt row, but on that row anything I qualify with higher will be null, so counting higher.lvl gives me a count only of the higher rows, which is the count of subordinates that I want.
With after match skip to next row in line 27, I’m specifying that once it has finished with a match of one strt row and zero or more higher rows, it should move to the next row that follows after the strt row. This is the part that makes rows be counted more than once – I’ll explain in detail shortly.

That’s all clear, right? Well, I’ll dive a little more into the details to clarify why it works.

Note

A few words on why you’d consider using the long and somewhat convoluted Listing 22-3 instead of the short and clear Listing 22-2.

I tested this on an employee table where I had 14001 employees in it.

The scalar subquery method used about 11 seconds, nearly half a million consistent gets, and over 37000 sorts, due to a full table scan and many, many index range scans for the connect by processing.

The match_recognize method used less than half a second, 55 consistent gets, and four (four!) sorts, with just a single full table scan.

Your mileage will vary, of course, so test it yourself.

The details of each match

As I’ve mentioned before, very often a good way to see what happens is to inspect the detailed output using all rows per match. So this is what I do in Listing 22-4.

SQL> with hierarchy as (

...

12 )

13 select

14 mn

15 , rn

16 , lvl

17 , lpad(' ', (lvl-1)*2)

18 || substr(name, 1, instr(name, ' ') - 1) as name

19 , roll

20 , subs

21 , cls

22 , substr(stname, 1, instr(stname, ' ') - 1) as stname

23 , substr(hiname, 1, instr(hiname, ' ') - 1) as hiname

24 from hierarchy

25 match_recognize (

26 order by rn

27 measures

28 match_number() as mn

29 , classifier() as cls

30 , strt.name as stname

31 , higher.name as hiname

32 , count(higher.lvl) as roll

33 , final count(higher.lvl) as subs

34 all rows per match

35 after match skip to next row

36 pattern (

37 strt higher*

38 )

39 define

40 higher as higher.lvl > strt.lvl

41 )

42 order by mn, rn;

Listing 22-4

Inspecting the details with all rows per match

The with clause subquery is unchanged, as are the after match, pattern, and define clauses. I’ve changed the one row to all rows per match in line 34 and then created some different measures in lines 28–33:

The match_number() function in line 28 is a consecutive numbering of the matches found. Without it, I couldn’t tell which rows in the output belongs together as part of each match.
The classifier() function shows what the row has been classified as according to the pattern and define clauses – in this case showing whether a row is strt or higher.
When column names are not qualified, values of the current row in the match are used, no matter what classifier they have. When I qualify the column names with the classifier strt and higher in lines 30 and 31, I get the values from the last of the rows with that classifier.
Aggregate functions like count in lines 32 and 33 can be running or final. In Listing 22-3, it did not matter, since I used one row per match, but here it does matter, so I output both to show the difference. Line 32 defaults to running (aka rolling count) which gives a result similar to an analytic function with a window of rows between unbounded preceding and current row, while line 33 with final keyword works similar to rows between unbounded preceding and unbounded following.

The output of Listing 22-4 has far more rows than are in the table, but I have 14 matches (one for each row in the table) identified by 1–14 in the mn column. So if I step through the output, here’s the rows for the first match:

MN RN LVL NAME ROLL SUBS CLS STNAME HINAME

1 1 1 Harold 0 13 STRT Harold

1 2 2 Axel 1 13 HIGHER Harold Axel

1 3 3 Jim 2 13 HIGHER Harold Jim

1 4 4 Laura 3 13 HIGHER Harold Laura

1 5 4 Simon 4 13 HIGHER Harold Simon

1 6 3 Maria 5 13 HIGHER Harold Maria

1 7 2 Ursula 6 13 HIGHER Harold Ursula

1 8 3 Lim 7 13 HIGHER Harold Lim

1 9 4 Evelyn 8 13 HIGHER Harold Evelyn

1 10 4 Kurt 9 13 HIGHER Harold Kurt

1 11 4 Susanne 10 13 HIGHER Harold Susanne

1 12 3 Mogens 11 13 HIGHER Harold Mogens

1 13 4 Dan 12 13 HIGHER Harold Dan

1 14 4 Zoe 13 13 HIGHER Harold Zoe

As my pattern matching is ordered by rn, it starts at rn = 1 (Harold) and classifies him strt (since any row can match strt) and then repeatedly checks if the next row has a lvl greater than the lvl of the strt row, which is true for all of the remaining 13 rows, as everybody else has a lvl greater than 1. That means that the first match does not stop until it reaches the end of the rows.

Match number 1 has now been found, containing 1 strt row and 13 higher rows as shown in the cls column. In the strt row, no higher rows have been found yet, so when I qualify a column with higher (and I am not using the final keyword), the result is null, as you can see in column hiname. This also means that when I do the total (final) count of higher in column subs, the strt row is not counted, and the result is the desired 13 subordinates.

You can also see in the output how the running total goes in column roll and that strt.name in column stname keeps the value of the last (in this case only) strt row.

So when the first match is finished, I specified after match skip to next row, which in this case is rn = 2 (Axel). He’ll be the strt row of match mn = 2 in the continued output:

2 2 2 Axel 0 4 STRT Axel

2 3 3 Jim 1 4 HIGHER Axel Jim

2 4 4 Laura 2 4 HIGHER Axel Laura

2 5 4 Simon 3 4 HIGHER Axel Simon

2 6 3 Maria 4 4 HIGHER Axel Maria

After Axel as strt, this match finds four higher rows, because row rn = 7 (Ursula) has lvl = 2, which is not higher than Axel (it is the same), and therefore the match stops with Maria. The counting of subordinates works just like before – even though there are five rows in the match, there are only four that are classified higher and are counted. These rows were also included in the count of Harold King’s subordinates in match number 1, but because of skipping back up to rn = 2 to find the next match, these rows are included once more.

The next row after Axel is Jim, who’ll be the strt row of match mn = 3 that is the next in the output:

3 3 3 Jim 0 2 STRT Jim

3 4 4 Laura 1 2 HIGHER Jim Laura

3 5 4 Simon 2 2 HIGHER Jim Simon

Match number 3 ends up with one strt row and just 2 higher rows, since Maria (who follows Simon in the rn order) does not have a lvl higher than Jim. So Laura and Simon are counted as Jim’s subordinates – just as they also were counted under Axel and under Harold.

The output moves on to match number 4, which starts with Laura classifying her as a strt row. After her comes Simon, but he has the same lvl as Laura. Therefore, he cannot be a higher row, and the match becomes a match containing only a single strt row and no higher rows, leading to a subordinate count of 0 in the output:

4 4 4 Laura 0 0 STRT Laura

And so it goes on and on, until at the end, the match number mn = 14 is found, containing just Zoe:

...

14 14 4 Zoe 0 0 STRT Zoe

The details of this long output are good to learn how the different pieces of match_recognize work for this solution. But I can also take just some of the columns of the all rows per match output and use pivot in Listing 22-5 to visualize the rows that are part of each match.

SQL> with hierarchy as (

...

12 )

13 select

14 name

15 , "1", "2", "3", "4", "5", "6", "7"

16 , "8", "9", "10", "11", "12", "13", "14"

17 from (

18 select

19 mn

20 , rn

21 , lpad(' ', (lvl-1)*2)

22 || substr(name, 1, instr(name, ' ') - 1) as name

23 from hierarchy

24 match_recognize (

25 order by rn

26 measures

27 match_number() as mn

28 all rows per match

29 after match skip to next row

30 pattern (

31 strt higher*

32 )

33 define

34 higher as higher.lvl > strt.lvl

35 )

36 ) pivot (

37 max('X')

38 for mn in (

39 1,2,3,4,5,6,7,8,9,10,11,12,13,14

40 )

41 )

42 order by rn;

Listing 22-5

Pivoting to show which rows are in which match

The only measure I am using is match_number() in line 27, and then in lines 19–22, I select just mn, rn, and the name. This allows me to do a pivot for mn in line 38 specifying the 14 match numbers in line 39, thereby getting rn, name, and 14 columns named 1–14 (these column names must be enclosed in double quotes, as they do not start with a letter).

The value of the 14 match number columns is the literal X if the rn of the row is included in the match, otherwise null. So I can select the mn column and the Xs and just use rn for ordering the output:

NAME 1 2 3 4 5 6 7 8 9 10 11 12 13 14

Harold X

Axel X X

Jim X X X

Laura X X X X

Simon X X X X

Maria X X X

Ursula X X

Lim X X X

Evelyn X X X X

Kurt X X X X

Susanne X X X X

Mogens X X X

Dan X X X X

Zoe X X X X

In this pivoted output, it is easy to use the Xs to check that all rows are included in match number 1, the rows from Axel to Maria are included in match number 2, and so on.

Fiddling with the output

Having examined the detailed output, I’ll return to the one row per match version to fiddle a bit more and show you a couple of things.

First, I’d like to make it clear that although Listing 22-3 with one row per match only has a single aggregate measure, and so far I’ve only shown multiple aggregate measures in Listing 22-4 using all rows per match, it is perfectly legitimate to use multiple aggregates or uses of functions like first and last together with one row. Take a look at Listing 22-6.

SQL> with hierarchy as (

...

12 )

13 select

14 lpad(' ', (lvl-1)*2) || name as name

15 , subs

16 , hifrom

17 , hito

18 , himax

19 from hierarchy

20 match_recognize (

21 order by rn

22 measures

23 strt.rn as rn

24 , strt.lvl as lvl

25 , strt.name as name

26 , count(higher.lvl) as subs

27 , first(higher.name) as hifrom

28 , last(higher.name) as hito

29 , max(higher.lvl) as himax

30 one row per match

31 after match skip to next row

32 pattern (

33 strt higher*

34 )

35 define

36 higher as higher.lvl > strt.lvl

37 )

38 order by rn;

Listing 22-6

Adding multiple measures when doing one row per match

In lines 26–29, I am using both navigational functions and aggregates. Remember that when I use one row per match, it makes no difference if I use running or final for the aggregates, so even if I didn’t specify final, I get the same result:

NAME SUBS HIFROM HITO HIMAX

Harold King 13 Axel de Proef Zoe Thorston 4

Axel de Proef 4 Jim Kronzki Maria Juarez 4

Jim Kronzki 2 Laura Jensen Simon Chang 4

Laura Jensen 0

Simon Chang 0

Maria Juarez 0

Ursula Mwbesi 7 Lim Tok Lo Zoe Thorston 4

Lim Tok Lo 3 Evelyn Smith Susanne Hoff 4

Evelyn Smith 0

Kurt Zollman 0

Susanne Hoff 0

Mogens Juel 2 Dan Hoeffler Zoe Thorston 4

Dan Hoeffler 0

Zoe Thorston 0

So my first point was the use of multiple measures for whatever output I want in the various columns. Can I fiddle with the rows in the output as well? Say, for example, I want to output only those employees that actually have subordinates (or in other words are not leaf nodes in the tree).

Sure, I could put the entire query in an inline view and then use a where clause to filter on subs > 0 and that way not get any leaf nodes in the output. It would work fine, but my second point to show you is a better alternative that filters away the non-leaf nodes earlier in the processing.

In Listing 22-3 line 29, I’m using a pattern of strt higher* which is a pattern that by design will be matched by any row that will be classified strt – it is just a question of how many higher rows will follow after that strt row. So Listing 22-3 will by the nature of the pattern output all rows of the table.

Let me in Listing 22-7 change just one character – otherwise, it is identical to Listing 22-3.

...

29 strt higher+

...

Listing 22-7

Filtering matches with the pattern definition

I have changed * to + which means that any given strt row will only cause a match if it is followed by at least one higher row. So the leaf nodes, which are not followed by any higher row, will not cause a match – instead the database simply moves one row along and checks if it can find a match using the next row as strt row. This leads to only supervisors being output:

ID NAME SUBS

142 Harold King 13

144 Axel de Proef 4

151 Jim Kronzki 2

147 Ursula Mwbesi 7

146 Lim Tok Lo 3

143 Mogens Juel 2

Doing it this way allows the database to discard the unwanted rows immediately as it works its way through the pattern matching process – rather than the inline view that lets the database build a result set of all rows and then afterward removes the unwanted ones again.

Lessons learned

In this chapter I’ve demonstrated that with a suitable ordering, the after match skip to next row clause can very efficiently allow match_recognize to process the same rows multiple times in different groupings without accessing them in the table multiple times. In the demos I covered

Preparing the source query by creating an ordering column that allows match_recognize to work in the hierarchical order
Setting a pattern that for each row finds the group of rows that are in the subtree below
Using after match skip to next row to use the pattern search on every row, even if it was included in previous matches
Changing the pattern to ignore those rows not having a subtree below

These methods you can use on any hierarchical data. They can also be useful on other data with an ordering that is nontrivial, where you can set up a more complex query to prepare the data and preserve the ordering, before you process the data with match_recognize.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for 22. Counting Children in Trees

Create new playlist

Sign In

Sign Up

22. Counting Children in Trees

Hierarchical tree of employees

Counting subordinates of all levels

Counting with row pattern matching

The details of each match

Fiddling with the output

Lessons learned

Table of Contents for
22. Counting Children in Trees