Optimizing temporal queries

The problem with temporal queries is that when reading from a table, SQL Server can use only one index, successfully eliminate rows that are not candidates for the result from one side only, and then scan the rest of the data. For example, you need to find all intervals in the table that overlap with a given interval. Remember, two intervals overlap when the beginning of the first one is lower than or equal to the end of the second one, and the beginning of the second one is lower than or equal to the end of the first one, or mathematically when (b1 ≤ e2) AND (b2 ≤ e1).

The following query searches for all of the intervals that overlap with the interval (10, 30). Note that the second condition (b2 ≤ e1) is turned around to (e≥ b2) for simpler reading (the beginning and the end of intervals from the table are always on the left side of the condition). The given or searched interval is at the beginning of the timeline for all intervals in the table:

SET STATISTICS IO ON; 
DECLARE @b AS INT = 10, 
 @e AS INT = 30; 
SELECT id, b, e 
FROM dbo.Intervals 
WHERE b <= @e 
  AND e >= @b 
OPTION (RECOMPILE); 
GO 

The query used 36 logical reads. If you check the execution plan, you can see that the query used the index seek in the idx_b index with the seek predicate [tempdb].[dbo].[Intervals].b <= Scalar Operator((30)) and then scanned the rows and selected the resulting rows using the residual predicate [tempdb].[dbo].[Intervals].[e]>=(10). Because the searched interval is at the beginning of the timeline, the seek predicate successfully eliminated a majority of the rows; only a few intervals in the table have a beginning point lower than or equal to 30.

You would get a similarly efficient query if the searched interval was at the end of the timeline; it's just that SQL Server would use the idx_e index for the seek. However, what happens if the searched interval is in the middle of the timeline, like the following query shows:

DECLARE @b AS INT = 570, 
 @e AS INT = 590; 
SELECT id, b, e 
FROM dbo.Intervals 
WHERE b <= @e 
  AND e >= @b 
OPTION (RECOMPILE); 
GO 

This time, the query used 111 logical reads. With a bigger table, the difference with the first query would be even bigger. If you check the execution plan, you can find out that SQL Server used the idx_e index with the [tempdb].[dbo].[Intervals].e >= Scalar Operator((570)) seek predicate and the [tempdb].[dbo].[Intervals].[b]<=(590) residual predicate. The seek predicate excludes approximately half of the rows from one side, while half of the rows from the other side are scanned and the resulting rows extracted with the residual predicate.

There is a solution that uses that index to eliminate rows from both sides of the searched interval by using a single index. The following figure shows this logic:

Optimizing a temporal query

The intervals in the figure are sorted by the lower boundary, representing SQL Server's usage of the idx_b index. Eliminating intervals from the right side of the given (searched) interval is simple: just eliminate all intervals where the beginning is at least one unit bigger (more to the right) of the end of the given interval. You can see this boundary in the figure denoted with the rightmost dotted line. However, eliminating from the left is more complex. In order to use the same index, the idx_b index for eliminating from the left, I need to use the beginning of the intervals in the table in the WHERE clause of the query. I have to go to the left side, away from the beginning of the given (searched) interval at least for the length of the longest interval in the table, which is marked with a callout in the figure. The intervals that begin before the left-hand yellow line cannot overlap with the given (blue/dark gray) interval.

Since I already know that the length of the longest interval is 20, I can write an enhanced query in a quite simple way:

DECLARE @b AS INT = 570, 
 @e AS INT = 590; 
DECLARE @max AS INT = 20; 
SELECT id, b, e 
FROM dbo.Intervals 
WHERE b <= @e AND b >= @b - @max 
  AND e >= @b AND e <= @e + @max 
OPTION (RECOMPILE); 

This query retrieves the same rows as the previous one with 20 logical reads only. If you check the execution plan, you can see that the idx_b was used, with the seek predicate Seek Keys[1]: Start: [tempdb].[dbo].[Intervals].b >= Scalar Operator((550)), End: [tempdb].[dbo].[Intervals].b <= Scalar Operator((590)), which successfully eliminated rows from both sides of the timeline, and then the residual predicate [tempdb].[dbo].[Intervals].[e]>=(570) AND [tempdb].[dbo].[Intervals].[e]<=(610) was used to select rows from a very limited partial scan.

Of course, the figure could be turned around to cover cases where the idx_e index would be more useful. With this index, elimination from the left is simple—eliminate all of the intervals that end at least one unit before the beginning of the given interval. This time, elimination from the right is more complex—the end of the intervals in the table cannot be more to the right than the end of the given interval plus the maximum length of all intervals in the table.

Please note that this performance is the consequence of the specific data in the table.

The maximum length of an interval is 20. This way, SQL Server can very efficiently eliminate intervals from both sides. However, if there were to be only one long interval in the table, the code would become much less efficient, because SQL Server would not be able to eliminate a lot of rows from one side, either left or right, depending on which index it would use. Anyway, in real life, interval length often does not vary a lot, so this optimization technique might be very useful, especially because it is simple.

After you finish with the temporal queries in this section, you can clean up your tempdb database with the following code:

DROP TABLE dbo.DateNums; 
DROP TABLE dbo.Intervals; 
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.145.103.154