© Kim Berg Hansen 2020
K. Berg HansenPractical Oracle SQLhttps://doi.org/10.1007/978-1-4842-5617-6_19

19. Merging Date Ranges

Kim Berg Hansen1 
(1)
Middelfart, Denmark
 

Lots of data have a date range for validity – when is or was the event or price or whatever active. Schedules, prices, discounts, versioning, audit trails, the list is endless.

It’s common to want to merge rows (at least in report output) where the date ranges are right after one another or even overlapping. For example, you may have a production schedule for your assembly line having three rows with adjoining date ranges – producing the same product for three different sales orders. For production planning, you may want to output this as a single row with the total date range and the sum of the quantities you need to produce.

There can be many other examples of this – in this chapter I’ll show you an example of merging job hire periods with the match_recognize clause.

Job hire periods

As an example of a table with date ranges, I’ll be using the emp_hire_periods table shown in Figure 19-1, which has a foreign key relation to the employees table.
../images/475066_1_En_19_Chapter/475066_1_En_19_Fig1_HTML.jpg
Figure 19-1

The table of periods that employees have been hired for a given job

A given employee can be hired in different periods for different job functions (indicated by the title column). The date ranges I have in the table follow these rules:
  • A null value in end_date means the employee currently works at that function.

  • When an employee stops working for Good Beer Trading Co, end_date is filled.

  • If the employee is rehired, a new row is inserted.

  • By promotion or change in job function, end_date is filled, and a new row is inserted with the new title.

  • An employee can have more than one function at the same time, so the date ranges may overlap.

  • The start_date is included in the date range and the end_date is excluded from the date range – often written as a [start_date, end_date[ half-open interval.

You may find the last rule less than intuitive, but I’ll get back shortly with an explanation of why this is a good idea.

Note

A closed interval [start, end] is start <= x <= end, while an open interval ]start, end[ is start < x < end. The half-open interval is then either ]start, end] or (as in this case) [start, end[.

All of the logic I’ll be showing in this chapter is in principle valid just by working with the emp_hire_periods table alone, but to make it easier to see who is whom, I create a view in Listing 19-1 so that I retrieve the employee name too.
SQL> create or replace view emp_hire_periods_with_name
  2  as
  3  select
  4     ehp.emp_id
  5   , e.name
  6   , ehp.start_date
  7   , ehp.end_date
  8   , ehp.title
  9  from emp_hire_periods ehp
 10  join employees e
 11     on e.id = ehp.emp_id;
View EMP_HIRE_PERIODS_WITH_NAME created.
Listing 19-1

View joining the hire periods with the employees

Querying the emp_hire_periods_with_name view in Listing 19-2, I can show you the data I have.
SQL> select
  2     ehp.emp_id
  3   , ehp.name
  4   , ehp.start_date
  5   , ehp.end_date
  6   , ehp.title
  7  from emp_hire_periods_with_name ehp
  8  order by ehp.emp_id, ehp.start_date;
Listing 19-2

The hire periods data

In the interest of saving a little space, I have not filled the table with data for all 14 employees, just a selection of 6:
EMP_ID  NAME           START_DATE  END_DATE    TITLE
142     Harold King    2010-07-01  2012-04-01  Product Director
142     Harold King    2012-04-01              Managing Director
143     Mogens Juel    2010-07-01  2014-01-01  IT Technician
143     Mogens Juel    2014-01-01  2016-06-01  Sys Admin
143     Mogens Juel    2014-04-01  2015-10-01  Code Tester
143     Mogens Juel    2016-06-01              IT Manager
144     Axel de Proef  2010-07-01  2013-07-01  Sales Manager
144     Axel de Proef  2012-04-01              Product Director
145     Zoe Thorston   2014-02-01              IT Developer
145     Zoe Thorston   2019-02-01              Scrum Master
146     Lim Tok Lo     2014-10-01  2016-02-01  Forklift Operator
146     Lim Tok Lo     2017-03-01              Warehouse Manager
147     Ursula Mwbesi  2014-10-01  2015-05-01  Delivery Manager
147     Ursula Mwbesi  2016-05-01  2017-03-01  Warehouse Manager
147     Ursula Mwbesi  2016-11-01              Operations Chief
When I visualize the same data in Figure 19-2, it’s easy to see who has changed jobs along the way, who has been away from the company and returned in a different job, and who has had double jobs for periods of time.
../images/475066_1_En_19_Chapter/475066_1_En_19_Fig2_HTML.jpg
Figure 19-2

Visualizing the data helps see the overlaps

You’ll notice that because I use the half-open interval I mentioned before, employees changing jobs have a start_date on the new job that is equal to the end_date of the old job. Why didn’t I use closed intervals instead, so Harold King was product director from 2010-07-01 to 2012-03-31 – both dates included?

It might seem easier to use closed intervals, so you can simplify your code a little by using between instead of >= and < – but there’s a problem. The date datatype can contain not only whole dates but also hours, minutes, and seconds. That means that with a closed interval end_date of 2012-03-31, Harold King would not be hired anymore at 1 second past midnight, and the entire day of March 31st, he would be out of a job until rehired April 1st at midnight.

“Easy,” you say, “just put an end_date of 2012-03-31 23:59:59, and all is well.” But is it? Possibly it’ll be OK, but what if you need to switch to a timestamp datatype in the future and support fractional seconds? (Probably not the case for hire periods, but you can easily imagine other use cases for this.)

By using half-open intervals instead for your date ranges, you will never have the problem that Harold King in principle is not hired for a short time (a day, a second, a microsecond – no matter how small, with the closed interval, there will always be a piece of time that is not covered by the ranges).

When working with half-open intervals, it can help to think of both dates as from dates:
  • The start_date is the exact moment from which the row starts being active.

  • The end_date is the exact moment from which the row is no longer active (i.e., it ends being active immediately before that moment).

This thought process might have been helped by choosing column names like active_from and inactive_from, but the notion of start and end is just so commonly used that I’m doing the same.

Oracle itself has realized the usefulness of half-open intervals when they introduced temporal validity in version 12.1. So let me use this as a good opportunity for a brief detour and show you how temporal validity works. Afterward I’ll get back to the date range merging.

Temporal validity

In Listing 19-3, you’ll see the create table statement I used for creating the emp_hire_periods table.
SQL> create table emp_hire_periods (
  2     emp_id         not null constraint emp_hire_periods_emp_fk
  3                       references employees
  4   , start_date     date not null
  5   , end_date       date
  6   , title          varchar2(20 char) not null
  7   , constraint emp_hire_periods_pk primary key (emp_id, start_date)
  8   , period for employed_in (start_date, end_date)
  9  );
Listing 19-3

Table defined with temporal validity

The interesting bit is line 8, which is the period for clause for defining temporal validity on the table.

In the parentheses, I’ve specified the two columns that contain the start and end point of the half-open interval. (These can be date or timestamp columns.) Both columns are allowed to be nullable; it is just for this use case I have set start_date to be not null as a job period will always have a specific starting point, whereas end_date allows nulls, because this means the job is still current.

Tip

If you do not specify the two columns, the database auto-creates two hidden columns to contain the interval. Normally I prefer to create the columns myself and specify them, but it might be handy if you have a use case where those who query are not interested in the actual interval, just whether the row is valid at a specific point in time or not.

Right after period for, you must name the period (give it an identifier), and I have carefully chosen employed_in. It is a good idea to give the name some thought, as a good name will be helpful in queries that use temporal validity, as I show it in Listing 19-4.
SQL> select
  2     ehp.emp_id
  3   , e.name
  4   , ehp.start_date
  5   , ehp.end_date
  6   , ehp.title
  7  from emp_hire_periods
  8          as of period for employed_in date '2010-07-01'
  9       ehp
 10  join employees e
 11     on e.id = ehp.emp_id
 12  order by ehp.emp_id, ehp.start_date;
Listing 19-4

Querying hire periods table as of a specific date

In the from clause lines 7–9, I can use an as of syntax very similar to flashback queries, with the table in line 7, the as of specification in line 8, and the table alias in line 9.

When using flashback, I specify as of timestamp or as of scn, but with temporal validity, I specify as of period for and then the name of the period. This means that the name employed_in in line 8 helps self-document that I’m querying those that were employed in 2010-07-01, which was the start of the company, and there were only three people:
EMP_ID  NAME           START_DATE  END_DATE    TITLE
142     Harold King    2010-07-01  2012-04-01  Product Director
143     Mogens Juel    2010-07-01  2014-01-01  IT Technician
144     Axel de Proef  2010-07-01  2013-07-01  Sales Manager
If I want to find those that were employed 6 years later, I just change the date value in line 8:
...
  8          as of period for employed_in date '2016-07-01'
...
And here I have five people (some of whom are the same, just with new titles):
EMP_ID  NAME           START_DATE  END_DATE    TITLE
142     Harold King    2012-04-01              Managing Director
143     Mogens Juel    2016-06-01              IT Manager
144     Axel de Proef  2012-04-01              Product Director
145     Zoe Thorston   2014-02-01              IT Developer
147     Ursula Mwbesi  2016-05-01  2017-03-01  Warehouse Manager

The query with as of is internally rewritten by the database into a regular where clause with suitable >= and < predicates; it is just easier to get it right with as of. Also the database treats it as a type of constraint – it will not let you insert data with an end_date that is before start_date.

This little aside showed you briefly how temporal validity can make things easier, and if you do use temporal validity, you’ll also automatically get the benefits of the half-open intervals. Now I’ll get back to the range merging, which you can do with or without temporal validity.

Merging overlapping ranges

What I want to do now is to take the data in Figure 19-2, find all places where hire periods of the same employee either adjoin or overlap, and merge those into single aggregate rows showing how many jobs (either successively or concurrently) the employee has had in that aggregated period. The result I want is shown in Figure 19-3.
../images/475066_1_En_19_Chapter/475066_1_En_19_Fig3_HTML.jpg
Figure 19-3

Expected results after merging overlapping and adjoining date ranges

I am now going to attempt solving this with match_recognize. To demonstrate trying out different approaches and changing the logic along the way, I will first show some attempts that do not quite work, leading up to a working solution in the end.

Attempts comparing to the previous row

In quite a few scenarios using match_recognize , it is typical to compare a value from the current row to a value from the previous row in order to make a row classification. So I’ll try that first in Listing 19-5.
SQL> select
  2     emp_id
  3   , name
  4   , start_date
  5   , end_date
  6   , jobs
  7  from emp_hire_periods_with_name
  8  match_recognize (
  9     partition by emp_id
 10     order by start_date, end_date
 11     measures
 12        max(name)         as name
 13      , first(start_date) as start_date
 14      , last(end_date)    as end_date
 15      , count(*)          as jobs
 16     pattern (
 17        strt adjoin_or_overlap*
 18     )
 19     define
 20        adjoin_or_overlap as
 21           start_date <= prev(end_date)
 22  )
 23  order by emp_id, start_date;
Listing 19-5

Comparing start_date to end_date of the previous row

My simple definition in line 21 states that a row is overlapping or adjoining if the start_date is smaller than or equal to the end_date of the previous row. A match is then found by the pattern in line 17 of any row followed by zero or more adjoining or overlapping rows.

And sure enough, this rule does indeed merge some of the date ranges in this output:
EMP_ID  NAME           START_DATE  END_DATE    JOBS
142     Harold King    2010-07-01              2
143     Mogens Juel    2010-07-01  2015-10-01  3
143     Mogens Juel    2016-06-01              1
144     Axel de Proef  2010-07-01              2
145     Zoe Thorston   2014-02-01              1
145     Zoe Thorston   2019-02-01              1
146     Lim Tok Lo     2014-10-01  2016-02-01  1
146     Lim Tok Lo     2017-03-01              1
147     Ursula Mwbesi  2014-10-01  2015-05-01  1
147     Ursula Mwbesi  2016-05-01              2

But the output of, for example, Mogens Juel is not completely merged; there should have been a single row only for him with four jobs. The problem is that when I order his rows by start_date, the Code Tester and IT Manager rows are compared and not overlapping. A comparison like this to the previous row fails to discover that both rows are adjoining or overlapping to Sys Admin.

Thinking about it, I figured that maybe it would help simply to change the ordering in line 10 to order by end_date first:
...
 10     order by end_date, start_date
...
The output has changed, but Mogens Juel still wrongly is shown twice:
EMP_ID  NAME           START_DATE  END_DATE    JOBS
142     Harold King    2010-07-01              2
143     Mogens Juel    2010-07-01  2014-01-01  1
143     Mogens Juel    2014-04-01              3
144     Axel de Proef  2010-07-01              2
145     Zoe Thorston   2014-02-01              1
145     Zoe Thorston   2019-02-01              1
146     Lim Tok Lo     2014-10-01  2016-02-01  1
146     Lim Tok Lo     2017-03-01              1
147     Ursula Mwbesi  2014-10-01  2015-05-01  1
147     Ursula Mwbesi  2016-05-01              2

With the changed ordering, the first attempt at finding a match for Mogens Juel will try to compare the IT Technician row with the Code Tester row and fail to find an overlap.

No matter which ordering I choose, I cannot get all the overlaps in a single match by simply comparing a row to the previous row. I need a different way to handle this.

Better comparing to the maximum end date

Looking more closely on the rows of Mogens Juel in Figure 19-2, I decide that a better approach would be to compare the start_date of a row with the highest end_date that I have found so far in the match.

A first attempt at this approach could look like this, but it would not work:
...
  8  match_recognize (
  9     partition by emp_id
 10     order by start_date, end_date
 11     measures
 12        max(name)         as name
 13      , first(start_date) as start_date
 14      , max(end_date)     as end_date
 15      , count(*)          as jobs
 16     pattern (
 17        strt adjoin_or_overlap*
 18     )
 19     define
 20        adjoin_or_overlap as
 21           start_date <= max(end_date)
 22  )
 ...

The reason it does not work is that when a definition condition like line 21 is evaluated, the row is first assumed to be classified adjoin_or_overlap, and then the condition is tested if it is true. Therefore the result of max(end_date) is calculated of all rows of the match so far plus the current row, which does not make sense.

In fact it makes so little sense that when I tested this first attempt, the query gave me either ORA-03113: end-of-file on communication channel or java.lang.NullPointerException depending on database version and which client I use. The database connection was then broken.

So do not use this first attempt. Instead you should try my second attempt, which is shown in Listing 19-6.
...
  8  match_recognize (
  9     partition by emp_id
 10     order by start_date, end_date
 11     measures
 12        max(name)         as name
 13      , first(start_date) as start_date
 14      , max(end_date)     as end_date
 15      , count(*)          as jobs
 16     pattern (
 17        adjoin_or_overlap* last_row
 18     )
 19     define
 20        adjoin_or_overlap as
 21           next(start_date) <= max(end_date)
 22  )
 23  order by emp_id, start_date;
Listing 19-6

Comparing start_date of next row to highest end_date seen so far

In Listing 19-6, I reverse the logic. Instead of comparing the current row with the previous row, I compare it with the next row:
  • I go back to ordering by start_date in line 10.

  • In line 21, I check if the start_date of the next row is less than or equal to the highest end_date seen so far in the match – including the current row, because the max call will assume the current row is part of the match when it is evaluated. That means that when a row is classified as adjoin_or_overlap, that row should be merged with the next row.

  • The pattern in line 17 looks for zero or more adjoin_or_overlap rows followed by one single row classified last_row. As that classification is undefined, any row can match it – but since the row before last_row was classified adjoin_or_overlap, I know that the last_row should be merged too.

  • If I find no adjoin_or_overlap rows, the row will become classified last_row because of the * in line 17 that says that zero adjoin_or_overlap rows are acceptable in the pattern. This means that when a row is not overlapping with any other rows, it will become a match of a single row classified as last_row and thus unmerged be part of the output.

  • The measure end_date in line 14 is calculated as the largest end_date of the match. Since I am not qualifying the end_date in the max call with either adjoin_or_overlap or last_row, max is applied to all rows of the match no matter what classification the rows got.

This is a somewhat tricky match_recognize clause to understand. When I do conference presentations on this topic, I usually draw the date ranges on a whiteboard and step through the evaluation of the row classification row by row. As I cannot do an animated drawing in a book, I am going to simulate it using a series of figures from Figure 19-4 to Figure 19-8, going through the steps of finding a match for Mogens Juel.
../images/475066_1_En_19_Chapter/475066_1_En_19_Fig4_HTML.jpg
Figure 19-4

Can first row be classified as adjoin_or_overlap?

In Figure 19-4, I start by evaluating if the first row of Mogens Juel can be classified adjoin_or_overlap or not. Since I start by assuming it can, the max(end_date) in line 21 of Listing 19-6 evaluates to the end of the first row. The next(start_date) evaluates to the start_date of the second row. The two are equal, therefore adjoining, so the condition in line 21 is true, and the first row is classified adjoin_or_overlap.
../images/475066_1_En_19_Chapter/475066_1_En_19_Fig5_HTML.jpg
Figure 19-5

Can second row be classified as adjoin_or_overlap?

Having classified the first row, Figure 19-5 evaluates if the second row can be classified adjoin_or_overlap or not. The max(end_date) evaluates to the end_date of the second row, while the next(start_date) is the start_date of the third row. The latter is less than the former, therefore overlapping, and the second row is classified adjoin_or_overlap.
../images/475066_1_En_19_Chapter/475066_1_En_19_Fig6_HTML.jpg
Figure 19-6

Can third row be classified as adjoin_or_overlap?

The pattern is still fulfilled, so in Figure 19-6, the classification evaluation is performed for the third row. In this case the max(end_date) does not move; it is still the end_date of the second row. The next(start_date) is the start_date of the fourth row. They are equal, so the fourth row is adjoining to the match found so far, and therefore the third row is adjoin_or_overlap.
../images/475066_1_En_19_Chapter/475066_1_En_19_Fig7_HTML.jpg
Figure 19-7

Can fourth row be classified as adjoin_or_overlap?

The match continues, and Figure 19-7 evaluates the fourth row. This time max(end_date) should be infinity as shown in the figure, because the fourth row has null in end_date. I am not yet handling this situation (more on this shortly), so in actual fact, max(end_date) would wrongly evaluate to the end_date of the second row. But since there are no more rows, next(start_date) evaluates to null, which makes the condition evaluate to Boolean unknown. Therefore the fourth row is not classified as adjoin_or_overlap.
../images/475066_1_En_19_Chapter/475066_1_En_19_Fig8_HTML.jpg
Figure 19-8

Fourth row classified as last_row and a match has been found

When the fourth row is not adjoin_or_overlap, the pattern in line 17 of Listing 19-6 states that it should be a last_row in order to complete the match. So Figure 19-8 evaluates if the fourth row can be classified last_row or not. As last_row is an undefined classification, it always evaluates to true, and the fourth row is therefore classified as last_row, and the match has been completed.

This step-by-step evaluation of the row classification of Mogens Juel leads to the output of Listing 19-6, where the four hire periods of Mogens Juel have correctly been merged into a single row showing four jobs:
EMP_ID  NAME           START_DATE  END_DATE    JOBS
142     Harold King    2010-07-01  2012-04-01  2
143     Mogens Juel    2010-07-01  2016-06-01  4
144     Axel de Proef  2010-07-01  2013-07-01  2
145     Zoe Thorston   2014-02-01              1
145     Zoe Thorston   2019-02-01              1
146     Lim Tok Lo     2014-10-01  2016-02-01  1
146     Lim Tok Lo     2017-03-01              1
147     Ursula Mwbesi  2014-10-01  2015-05-01  1
147     Ursula Mwbesi  2016-05-01  2017-03-01  2

But I still have a couple of problems with this output.

Firstly several of the employees (including Mogens Juel) have a wrong value in the measure end_date. Those that are still employed should have null (blank) in the end_date column, and in this output that is only true for those with just a single hire period. For those that have had more than one job, the highest non-null end_date is wrongly displayed.

Secondly I notice that Zoe Thorston also has overlapping rows – the problem here is just that the end_date of both rows are null, meaning both rows are current and she has both job functions. With the null values, the simple comparison in line 21 of Listing 19-6 will not be true.

Both of these problems are because I am not handling the null values in end_date. This I will do now.

Handling the null dates

To handle these null values , I change a little bit more in Listing 19-7.
...
  8  match_recognize (
  9     partition by emp_id
 10     order by start_date nulls first, end_date nulls last
 11     measures
 12        max(name)         as name
 13      , first(start_date) as start_date
 14      , nullif(
 15           max(nvl(end_date, date '9999-12-31'))
 16         , date '9999-12-31'
 17        )                 as end_date
 18      , count(*)          as jobs
 19     pattern (
 20        adjoin_or_overlap* last_row
 21     )
 22     define
 23        adjoin_or_overlap as
 24           nvl(next(start_date), date '-4712-01-01')
 25              <= max(nvl(end_date, date '9999-12-31'))
 26  )
 27  order by emp_id, start_date;
Listing 19-7

Handling null=infinity for both start and end

Even though this particular case only has null values in the end_date, for demonstration purposes, I have made the changes necessary to handle if there were null values in the start_date as well:
  • In line 10, I make the order by a bit more explicit. If there had been null values in start_date, these would be considered earlier than any other start_date, so I use nulls first to make those rows come first. Similarly null values in end_date are considered later than any other end_date, so I use nulls last to make those rows come last.

  • In comparisons I cannot simply use a nulls first to consider a null in start_date to be less than any other date, so in line 24, I turn a null into the smallest date possible in the Oracle date datatype.

  • The aggregate function max ignores null values, so in line 25, I turn a null in end_date into the largest date possible in a date.

  • To get a correct result in the end_date measure, I do the same nvl inside the max function in line 15. Then if the max results in the largest date, I use nullif in lines 14 and 16 to turn that back into null for output.

With these expanded rules, I get the final output where the rows of Zoe Thorston also are merged into one:
EMP_ID  NAME           START_DATE  END_DATE    JOBS
142     Harold King    2010-07-01              2
143     Mogens Juel    2010-07-01              4
144     Axel de Proef  2010-07-01              2
145     Zoe Thorston   2014-02-01              2
146     Lim Tok Lo     2014-10-01  2016-02-01  1
146     Lim Tok Lo     2017-03-01              1
147     Ursula Mwbesi  2014-10-01  2015-05-01  1
147     Ursula Mwbesi  2016-05-01              2

This output matches Figure 19-3, the result that I wanted.

Now I cannot merge any further – the rows of this output are all neither overlapping nor adjoining.

Lessons learned

This is just a single example of merging rows with date ranges in a report on employee job history, but it serves as inspiration and lesson to enable you to go ahead and do the same for other data.

In the course of the chapter, I’ve been explaining about
  • The advantages of using half-open intervals for date ranges and how temporal validity can make it easier to query data with such intervals

  • Using match_recognize to compare maximum values with next row to find overlapping or adjoining ranges and merge them into aggregate rows

  • Expanding the rules to also handle situations where null indicates infinity

You’ll likely find many places you can use these methods.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.133.141.6