© Kim Berg Hansen 2020
K. Berg HansenPractical Oracle SQLhttps://doi.org/10.1007/978-1-4842-5617-6_22

22. Counting Children in Trees

Kim Berg Hansen1 
(1)
Middelfart, Denmark
 

Sometimes you’d like to do aggregation where a row is included in multiple rows of the output, for example, being counted multiple times or having the value added multiple times. An example of this is hierarchical data, where you want for every row to find the count of all the children in the tree – not just immediate children but also grandchildren and their children and so on, all the way down to the leaves of the tree.

It means that a given row is counted in the result for the parent, but also counted again in the result of the grandparent, and so on. It can look similar to subtotals created with group by and rollup, but with the hierarchy, you don’t know how many levels down it goes, so you cannot simply use rollup.

One way I can solve this is using the after match skip to next row clause of match_recognize. Of course it could be used for other aggregates than count, but count is easy to understand, and once you know the technique, you can do the others easily.

Hierarchical tree of employees

The most classic table used to demonstrate hierarchical queries on Oracle is scott.emp table. Well, the Good Beer Trading Co also employs people, so my practical schema naturally has a table employees depicted in Figure 22-1.
../images/475066_1_En_22_Chapter/475066_1_En_22_Fig1_HTML.jpg
Figure 22-1

The employees table with a self-referencing foreign key

The column supervisor_id is a self-referencing foreign key that references the primary key id. Only one person has no supervisor – the boss of the company – for everyone else the supervisor_id contains the id of their immediate supervisor in the employee hierarchy. So I can show you the data of the table in a tree using Listing 22-1.
SQL> select
  2     e.id
  3   , lpad(' ', 2*(level-1)) || e.name as name
  4   , e.title as title
  5   , e.supervisor_id as super
  6  from employees e
  7  start with e.supervisor_id is null
  8  connect by e.supervisor_id = prior e.id
  9  order siblings by e.name;
Listing 22-1

A classic hierarchical query of employees

For a simple hierarchy like this, I tend to use the Oracle proprietary connect by query instead of the recursive subquery factoring I showed in Chapter 4. One of the things that are easier with connect by is, for example, the order siblings by I use here – that is more awkward to code with recursive subquery factoring.

So I start with the boss by specifying in line 7 to start with those with no supervisors. Then line 8 finds the immediate subordinates of the boss and then goes on recursively to find subordinates of those and so on:
ID   NAME                TITLE              SUPER
142  Harold King         Managing Director
144    Axel de Proef     Product Director   142
151      Jim Kronzki     Sales Manager      144
150        Laura Jensen  Bulk Salesman      151
154        Simon Chang   Retail Salesman    151
148      Maria Juarez    Purchaser          144
147    Ursula Mwbesi     Operations Chief   142
146      Lim Tok Lo      Warehouse Manager  147
152        Evelyn Smith  Forklift Operator  146
149        Kurt Zollman  Forklift Operator  146
155        Susanne Hoff  Janitor            146
143      Mogens Juel     IT Manager         147
153        Dan Hoeffler  IT Supporter       143
145        Zoe Thorston  IT Developer       143

It’s different persons, but you’re likely to have seen a very similar output using scott.emp somewhere. And this query will form the basis for the rest of the SQL I’ll show in this chapter.

Counting subordinates of all levels

The task is now for each row to do a count of subordinates all the way down the tree – not just the immediate subordinates one level down. If you look at the organization diagram in Figure 22-2, I need to find that Harold King has 13 subordinates (all employees except himself), Ursula Mwbesi has 7 subordinates total (2 immediately below her plus 5 that are a level further down the tree), Lim Tok Lo has 3 subordinates total (all just 1 level below and they have no further subordinates), and so on.
../images/475066_1_En_22_Chapter/475066_1_En_22_Fig2_HTML.jpg
Figure 22-2

Organization diagram with some of the subtrees marked

A simple way to do this is using a scalar subquery as shown in Listing 22-2. The scalar subquery can find the relevant subtree in the hierarchy and count the nodes of the subtree.
SQL> select
  2     e.id
  3   , lpad(' ', 2*(level-1)) || e.name as name
  4   , (
  5        select count(*)
  6        from employees sub
  7        start with sub.supervisor_id = e.id
  8        connect by sub.supervisor_id = prior sub.id
  9     ) as subs
 10  from employees e
 11  start with e.supervisor_id is null
 12  connect by e.supervisor_id = prior e.id
 13  order siblings by e.name;
Listing 22-2

Counting the number of subordinates

The outer query is the same as Listing 22-1. The scalar subquery in lines 4–9 utilizes the same connect by query; only start with is not from the top of the tree, but instead start with in line 7 starts with those that are immediate subordinates of the current row in the outer query and searches the subtree from there and down:
ID   NAME                SUBS
142  Harold King         13
144    Axel de Proef     4
151      Jim Kronzki     2
150        Laura Jensen  0
154        Simon Chang   0
148      Maria Juarez    0
147    Ursula Mwbesi     7
146      Lim Tok Lo      3
152        Evelyn Smith  0
149        Kurt Zollman  0
155        Susanne Hoff  0
143      Mogens Juel     2
153        Dan Hoeffler  0
145        Zoe Thorston  0

The output is just what I’m after, but I’ve accessed the same rows of the tables multiple times – like Simon Chang that has been accessed four times: once in the scalar subquery for each of the three people above him in the tree and then once in the main query when it got to him in the tree. Also every time a leaf node in the tree was accessed, the database queried if there was anyone below him/her, so the four times Simon was accessed also incurred four lookups if he had subordinates.

All in all, it is a lot of repetitive work for the database. But luckily I have a way to reduce that amount of work.

Counting with row pattern matching

Using row pattern matching, I can create the query shown in Listing 22-3, which only needs to do the hierarchical query a single time and then do all the necessary counts on the retrieved tree without accessing the tables over and over again.
SQL> with hierarchy as (
  2     select
  3        lvl, id, name, rownum as rn
  4     from (
  5        select
  6           level as lvl, e.id, e.name
  7        from employees e
  8        start with e.supervisor_id is null
  9        connect by e.supervisor_id = prior e.id
 10        order siblings by e.name
 11     )
 12  )
 13  select
 14     id
 15   , lpad(' ', (lvl-1)*2) || name as name
 16   , subs
 17  from hierarchy
 18  match_recognize (
 19     order by rn
 20     measures
 21        strt.rn           as rn
 22      , strt.lvl          as lvl
 23      , strt.id           as id
 24      , strt.name         as name
 25      , count(higher.lvl) as subs
 26     one row per match
 27     after match skip to next row
 28     pattern (
 29        strt higher*
 30     )
 31     define
 32        higher as higher.lvl > strt.lvl
 33  )
 34  order by rn;
Listing 22-3

Counting subordinates with match_recognize

The output of Listing 22-3 is exactly the same as the output of Listing 22-2. I’ll tell you how it works:
  • I’m using a with clause for clarity as I taught you in Chapter 3.

  • Inside the with clause lines 5–10 is an inline view containing the basic hierarchical query I’ve already shown you in the previous listings. Notice the order by in line 10 is inside the inline view.

  • I place it in an inline view so that I can use rownum in line 3 (outside the inline view) and save it as column alias rn. I need to preserve the hierarchical ordering created by the inline view when I do my row pattern matching – this allows me to do so.

  • Building my match_recognize clause, I start by defining in line 32 that a row that has a higher level than the starting row of the pattern is classified as higher – meaning that when it has a higher level, then it is a child/grandchild/greatgrandchild/… of the starting row (i.e., a subordinate).

  • Of course, not everybody in the entire row set with a higher level is a subordinate – only those consecutive rows with a higher level that follow the row itself. Once I reach someone with the same level (or lower), then I am no longer within the subtree I want. I solve this in the pattern in line 29 by looking for a strt row (which is undefined and therefore can be any row) followed by zero or more higher rows – when a row is reached that is no longer classified higher, the match stops.

  • In line 26, I’ve specified one row per match, and the employee I’m interested in outputting data from is the strt row, so I’m using strt columns in the measures in lines 21–24.

  • In line 25, I’m doing a count on how many higher rows were in the match. If I just did a plain count(*), I’d be including the strt row, but on that row anything I qualify with higher will be null, so counting higher.lvl gives me a count only of the higher rows, which is the count of subordinates that I want.

  • With after match skip to next row in line 27, I’m specifying that once it has finished with a match of one strt row and zero or more higher rows, it should move to the next row that follows after the strt row. This is the part that makes rows be counted more than once – I’ll explain in detail shortly.

That’s all clear, right? Well, I’ll dive a little more into the details to clarify why it works.

Note

A few words on why you’d consider using the long and somewhat convoluted Listing 22-3 instead of the short and clear Listing 22-2.

I tested this on an employee table where I had 14001 employees in it.

The scalar subquery method used about 11 seconds, nearly half a million consistent gets, and over 37000 sorts, due to a full table scan and many, many index range scans for the connect by processing.

The match_recognize method used less than half a second, 55 consistent gets, and four (four!) sorts, with just a single full table scan.

Your mileage will vary, of course, so test it yourself.

The details of each match

As I’ve mentioned before, very often a good way to see what happens is to inspect the detailed output using all rows per match. So this is what I do in Listing 22-4.
SQL> with hierarchy as (
...
 12  )
 13  select
 14     mn
 15   , rn
 16   , lvl
 17   , lpad(' ', (lvl-1)*2)
 18      || substr(name, 1, instr(name, ' ') - 1) as name
 19   , roll
 20   , subs
 21   , cls
 22   , substr(stname, 1, instr(stname, ' ') - 1) as stname
 23   , substr(hiname, 1, instr(hiname, ' ') - 1) as hiname
 24  from hierarchy
 25  match_recognize (
 26     order by rn
 27     measures
 28        match_number()    as mn
 29      , classifier()      as cls
 30      , strt.name         as stname
 31      , higher.name       as hiname
 32      , count(higher.lvl) as roll
 33      , final count(higher.lvl) as subs
 34     all rows per match
 35     after match skip to next row
 36     pattern (
 37        strt higher*
 38     )
 39     define
 40        higher as higher.lvl > strt.lvl
 41  )
 42  order by mn, rn;
Listing 22-4

Inspecting the details with all rows per match

The with clause subquery is unchanged, as are the after match, pattern, and define clauses. I’ve changed the one row to all rows per match in line 34 and then created some different measures in lines 28–33:
  • The match_number() function in line 28 is a consecutive numbering of the matches found. Without it, I couldn’t tell which rows in the output belongs together as part of each match.

  • The classifier() function shows what the row has been classified as according to the pattern and define clauses – in this case showing whether a row is strt or higher.

  • When column names are not qualified, values of the current row in the match are used, no matter what classifier they have. When I qualify the column names with the classifier strt and higher in lines 30 and 31, I get the values from the last of the rows with that classifier.

  • Aggregate functions like count in lines 32 and 33 can be running or final. In Listing 22-3, it did not matter, since I used one row per match, but here it does matter, so I output both to show the difference. Line 32 defaults to running (aka rolling count) which gives a result similar to an analytic function with a window of rows between unbounded preceding and current row, while line 33 with final keyword works similar to rows between unbounded preceding and unbounded following.

The output of Listing 22-4 has far more rows than are in the table, but I have 14 matches (one for each row in the table) identified by 1–14 in the mn column. So if I step through the output, here’s the rows for the first match:
MN  RN  LVL  NAME           ROLL  SUBS  CLS     STNAME   HINAME
1   1   1    Harold         0     13    STRT    Harold
1   2   2      Axel         1     13    HIGHER  Harold   Axel
1   3   3        Jim        2     13    HIGHER  Harold   Jim
1   4   4          Laura    3     13    HIGHER  Harold   Laura
1   5   4          Simon    4     13    HIGHER  Harold   Simon
1   6   3        Maria      5     13    HIGHER  Harold   Maria
1   7   2      Ursula       6     13    HIGHER  Harold   Ursula
1   8   3        Lim        7     13    HIGHER  Harold   Lim
1   9   4          Evelyn   8     13    HIGHER  Harold   Evelyn
1   10  4          Kurt     9     13    HIGHER  Harold   Kurt
1   11  4          Susanne  10    13    HIGHER  Harold   Susanne
1   12  3        Mogens     11    13    HIGHER  Harold   Mogens
1   13  4          Dan      12    13    HIGHER  Harold   Dan
1   14  4          Zoe      13    13    HIGHER  Harold   Zoe

As my pattern matching is ordered by rn, it starts at rn = 1 (Harold) and classifies him strt (since any row can match strt) and then repeatedly checks if the next row has a lvl greater than the lvl of the strt row, which is true for all of the remaining 13 rows, as everybody else has a lvl greater than 1. That means that the first match does not stop until it reaches the end of the rows.

Match number 1 has now been found, containing 1 strt row and 13 higher rows as shown in the cls column. In the strt row, no higher rows have been found yet, so when I qualify a column with higher (and I am not using the final keyword), the result is null, as you can see in column hiname. This also means that when I do the total (final) count of higher in column subs, the strt row is not counted, and the result is the desired 13 subordinates.

You can also see in the output how the running total goes in column roll and that strt.name in column stname keeps the value of the last (in this case only) strt row.

So when the first match is finished, I specified after match skip to next row, which in this case is rn = 2 (Axel). He’ll be the strt row of match mn = 2 in the continued output:
2   2   2      Axel         0     4     STRT    Axel
2   3   3        Jim        1     4     HIGHER  Axel     Jim
2   4   4          Laura    2     4     HIGHER  Axel     Laura
2   5   4          Simon    3     4     HIGHER  Axel     Simon
2   6   3        Maria      4     4     HIGHER  Axel     Maria

After Axel as strt, this match finds four higher rows, because row rn = 7 (Ursula) has lvl = 2, which is not higher than Axel (it is the same), and therefore the match stops with Maria. The counting of subordinates works just like before – even though there are five rows in the match, there are only four that are classified higher and are counted. These rows were also included in the count of Harold King’s subordinates in match number 1, but because of skipping back up to rn = 2 to find the next match, these rows are included once more.

The next row after Axel is Jim, who’ll be the strt row of match mn = 3 that is the next in the output:
3   3   3        Jim        0     2     STRT    Jim
3   4   4          Laura    1     2     HIGHER  Jim      Laura
3   5   4          Simon    2     2     HIGHER  Jim      Simon

Match number 3 ends up with one strt row and just 2 higher rows, since Maria (who follows Simon in the rn order) does not have a lvl higher than Jim. So Laura and Simon are counted as Jim’s subordinates – just as they also were counted under Axel and under Harold.

The output moves on to match number 4, which starts with Laura classifying her as a strt row. After her comes Simon, but he has the same lvl as Laura. Therefore, he cannot be a higher row, and the match becomes a match containing only a single strt row and no higher rows, leading to a subordinate count of 0 in the output:
4   4   4          Laura    0     0     STRT    Laura
And so it goes on and on, until at the end, the match number mn = 14 is found, containing just Zoe:
...
14  14  4          Zoe      0     0     STRT    Zoe
The details of this long output are good to learn how the different pieces of match_recognize work for this solution. But I can also take just some of the columns of the all rows per match output and use pivot in Listing 22-5 to visualize the rows that are part of each match.
SQL> with hierarchy as (
...
 12  )
 13  select
 14     name
 15   , "1", "2", "3", "4", "5", "6", "7"
 16   , "8", "9", "10", "11", "12", "13", "14"
 17  from (
 18     select
 19        mn
 20      , rn
 21      , lpad(' ', (lvl-1)*2)
 22         || substr(name, 1, instr(name, ' ') - 1) as name
 23     from hierarchy
 24     match_recognize (
 25        order by rn
 26        measures
 27           match_number()    as mn
 28        all rows per match
 29        after match skip to next row
 30        pattern (
 31           strt higher*
 32        )
 33        define
 34           higher as higher.lvl > strt.lvl
 35     )
 36  ) pivot (
 37     max('X')
 38     for mn in (
 39        1,2,3,4,5,6,7,8,9,10,11,12,13,14
 40     )
 41  )
 42  order by rn;
Listing 22-5

Pivoting to show which rows are in which match

The only measure I am using is match_number() in line 27, and then in lines 19–22, I select just mn, rn, and the name. This allows me to do a pivot for mn in line 38 specifying the 14 match numbers in line 39, thereby getting rn, name, and 14 columns named 1–14 (these column names must be enclosed in double quotes, as they do not start with a letter).

The value of the 14 match number columns is the literal X if the rn of the row is included in the match, otherwise null. So I can select the mn column and the Xs and just use rn for ordering the output:
NAME           1  2  3  4  5  6  7  8  9  10  11  12  13  14
Harold         X
  Axel         X  X
    Jim        X  X  X
      Laura    X  X  X  X
      Simon    X  X  X     X
    Maria      X  X           X
  Ursula       X                 X
    Lim        X                 X  X
      Evelyn   X                 X  X  X
      Kurt     X                 X  X     X
      Susanne  X                 X  X         X
    Mogens     X                 X                X
      Dan      X                 X                X   X
      Zoe      X                 X                X       X

In this pivoted output, it is easy to use the Xs to check that all rows are included in match number 1, the rows from Axel to Maria are included in match number 2, and so on.

Fiddling with the output

Having examined the detailed output, I’ll return to the one row per match version to fiddle a bit more and show you a couple of things.

First, I’d like to make it clear that although Listing 22-3 with one row per match only has a single aggregate measure, and so far I’ve only shown multiple aggregate measures in Listing 22-4 using all rows per match, it is perfectly legitimate to use multiple aggregates or uses of functions like first and last together with one row. Take a look at Listing 22-6.
SQL> with hierarchy as (
...
 12  )
 13  select
 14     lpad(' ', (lvl-1)*2) || name as name
 15   , subs
 16   , hifrom
 17   , hito
 18   , himax
 19  from hierarchy
 20  match_recognize (
 21     order by rn
 22     measures
 23        strt.rn            as rn
 24      , strt.lvl           as lvl
 25      , strt.name          as name
 26      , count(higher.lvl)  as subs
 27      , first(higher.name) as hifrom
 28      , last(higher.name)  as hito
 29      , max(higher.lvl)    as himax
 30     one row per match
 31     after match skip to next row
 32     pattern (
 33        strt higher*
 34     )
 35     define
 36        higher as higher.lvl > strt.lvl
 37  )
 38  order by rn;
Listing 22-6

Adding multiple measures when doing one row per match

In lines 26–29, I am using both navigational functions and aggregates. Remember that when I use one row per match, it makes no difference if I use running or final for the aggregates, so even if I didn’t specify final, I get the same result:
NAME                SUBS  HIFROM         HITO          HIMAX
Harold King         13    Axel de Proef  Zoe Thorston  4
  Axel de Proef     4     Jim Kronzki    Maria Juarez  4
    Jim Kronzki     2     Laura Jensen   Simon Chang   4
      Laura Jensen  0
      Simon Chang   0
    Maria Juarez    0
  Ursula Mwbesi     7     Lim Tok Lo     Zoe Thorston  4
    Lim Tok Lo      3     Evelyn Smith   Susanne Hoff  4
      Evelyn Smith  0
      Kurt Zollman  0
      Susanne Hoff  0
    Mogens Juel     2     Dan Hoeffler   Zoe Thorston  4
      Dan Hoeffler  0
      Zoe Thorston  0

So my first point was the use of multiple measures for whatever output I want in the various columns. Can I fiddle with the rows in the output as well? Say, for example, I want to output only those employees that actually have subordinates (or in other words are not leaf nodes in the tree).

Sure, I could put the entire query in an inline view and then use a where clause to filter on subs > 0 and that way not get any leaf nodes in the output. It would work fine, but my second point to show you is a better alternative that filters away the non-leaf nodes earlier in the processing.

In Listing 22-3 line 29, I’m using a pattern of strt higher* which is a pattern that by design will be matched by any row that will be classified strt – it is just a question of how many higher rows will follow after that strt row. So Listing 22-3 will by the nature of the pattern output all rows of the table.

Let me in Listing 22-7 change just one character – otherwise, it is identical to Listing 22-3.
...
 29        strt higher+
...
Listing 22-7

Filtering matches with the pattern definition

I have changed * to + which means that any given strt row will only cause a match if it is followed by at least one higher row. So the leaf nodes, which are not followed by any higher row, will not cause a match – instead the database simply moves one row along and checks if it can find a match using the next row as strt row. This leads to only supervisors being output:
ID   NAME             SUBS
142  Harold King      13
144    Axel de Proef  4
151      Jim Kronzki  2
147    Ursula Mwbesi  7
146      Lim Tok Lo   3
143      Mogens Juel  2

Doing it this way allows the database to discard the unwanted rows immediately as it works its way through the pattern matching process – rather than the inline view that lets the database build a result set of all rows and then afterward removes the unwanted ones again.

Lessons learned

In this chapter I’ve demonstrated that with a suitable ordering, the after match skip to next row clause can very efficiently allow match_recognize to process the same rows multiple times in different groupings without accessing them in the table multiple times. In the demos I covered
  • Preparing the source query by creating an ordering column that allows match_recognize to work in the hierarchical order

  • Setting a pattern that for each row finds the group of rows that are in the subtree below

  • Using after match skip to next row to use the pattern search on every row, even if it was included in previous matches

  • Changing the pattern to ignore those rows not having a subtree below

These methods you can use on any hierarchical data. They can also be useful on other data with an ordering that is nontrivial, where you can set up a more complex query to prepare the data and preserve the ordering, before you process the data with match_recognize.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.19.27.178