Chapter 4. Set-based iteration, the third alternative

Hugo Kornelis

When reading SQL Server newsgroups or blogs, you could easily get the impression that there are two ways to manipulate data: declarative (set-based) or iterative (cursor-based). And that iterative code is always bad and should be avoided like the plague.

Those impressions are both wrong.

Iterative code isn’t always bad (though, in all honesty, it usually is). And there’s more to SQL Server than declarative or iterative—there are ways to combine them, adding their strengths and avoiding their weaknesses. This chapter is about one such method: set-based iteration.

The technique of set-based iteration can lead to efficient solutions for problems that don’t lend themselves to declarative solutions, because those would result in an amount of work that grows exponentially with the amount of data. In those cases, the trick is to find a declarative query that solves a part of the problem (as much as feasible), and that doesn’t have the exponential performance problem—then repeat that query until all work has been done. So instead of attempting a single set-based leap, or taking millions of single-row-sized miniature steps in a cursor, set-based iteration arrives at the destination by taking a few seven-mile leaps.

In this chapter, I’ll first explain the need for an extra alternative by discussing the weaknesses and limitations of purely iterative and purely declarative coding. I’ll then explain the technique of set-based iteration by presenting two examples: first a fairly simple one, and then a more advanced case.

The common methods and their shortcomings

Developing SQL Server code can be challenging. You have so many ways to achieve the same result that the challenge isn’t coming up with working code, but picking the “best” working code from a bunch of alternatives. So what’s the point of adding yet another technique, other than making an already tough choice even harder? The answer is that there are cases (admittedly, not many) where none of the existing options yield acceptable performance, and set-based iteration does.

Declarative (set-based) code

Declarative coding is, without any doubt, the most-used way to manipulate data in SQL Server. And for good reason, because in most cases it’s the fastest possible code.

The basic principle of declarative code is that you don’t tell the computer how to process the data in order to create the required results, but instead declare the results you want and leave it to the DBMS to figure out how to get those results. Declarative code is also called set-based code because the declared required results aren’t based on individual rows of data, but on the entire set of data.

For example, if you need to find out which employees earn more than their manager, the declarative answer would involve one single query, specifying all the tables that hold the source data in its FROM clause, all the required output columns in its SELECT clause, and using a WHERE clause to filter out only those employees that meet the salary requirement.
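
A minimal sketch of such a query, assuming a hypothetical dbo.Employees table with EmployeeID, ManagerID, and Salary columns, might look like this:

-- Hypothetical table: dbo.Employees(EmployeeID, ManagerID, Salary)
SELECT     emp.EmployeeID, emp.Salary,
           mgr.EmployeeID AS ManagerID, mgr.Salary AS ManagerSalary
FROM       dbo.Employees AS emp
INNER JOIN dbo.Employees AS mgr
      ON   mgr.EmployeeID = emp.ManagerID
WHERE      emp.Salary > mgr.Salary;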

Benefits

The main benefit of declarative coding is its raw performance. For one thing, SQL Server has been heavily optimized toward processing declarative code. But also, the query optimizer—the SQL Server component that selects how to process each query—can use all the elements in your database (including indexes, constraints, and statistics on data distribution) to find the most efficient way to process your request, and even adapt the execution plan when indexes are added or statistics indicate a major change in data distribution.

Another benefit is that declarative code is often much shorter and (once you get the hang of it) easier to read and maintain than iterative code. Shorter, easier-to-read code directly translates into a reduction of development cost, and an even larger reduction of future maintenance cost.

Drawbacks

Aside from the learning curve for people with a background in iterative coding, there’s only one problem with the set-based approach. Because you have to declare the results in terms of the original input, you can’t take shortcuts by specifying end results in terms of intermediate results. In some cases, this results in queries that are awkward to write and hard to read. In other cases, it may result in queries that force SQL Server to do more work than would otherwise be required.

Running totals is an example of this. There’s no way to tell SQL Server to calculate the running total of each row as the total of the previous row plus the value of the current row, because the running total of the previous row isn’t available in the input, and partial query results (even though SQL Server does know them) can’t be specified in the language.

The only way to calculate running totals in a set-based fashion is to specify each running total as the sum of the values in all preceding rows. That implies that a lot more summation is done than would be required if intermediate results were available. This results in performance that degrades quadratically rather than linearly with the amount of data, so even if you have no problems in your test environment, you will have problems in your 100-million-row production database!


Running totals in the OVER clause

The full ANSI standard specification of the OVER clause includes windowing extensions that allow for simple specification of running totals. This would result in short queries with probably very good performance—if SQL Server had implemented them. Unfortunately, these extensions aren’t available in any current version of SQL Server, so we still have to code the running totals ourselves.
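
For reference, the ANSI windowing syntax for a per-customer running total would look roughly like this (using column names from AdventureWorks’ Sales.SalesOrderHeader):

-- ANSI-standard windowing syntax; not supported by the SQL Server
-- versions this chapter targets.
SELECT   CustomerID, OrderDate, SalesOrderID, SubTotal,
         SUM(SubTotal) OVER (PARTITION BY CustomerID
                             ORDER BY OrderDate, SalesOrderID
                             ROWS BETWEEN UNBOUNDED PRECEDING
                                      AND CURRENT ROW) AS RunningTotal
FROM     Sales.SalesOrderHeader;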


Iterative (cursor-based) code

The basic principle of iterative coding is to write T-SQL as if it were just another third-generation programming language, such as C#, VB.NET, COBOL, or Pascal. In those languages, the only way to process a set of data (such as a sequentially organized file) is to iterate over the data, reading one record at a time, processing that record, and then moving to the next record until the end of the file has been reached. SQL Server has cursors as a built-in mechanism for this iteration, hence the term cursor-based code as an alternative to the more generic iterative code.

Most iterative code encountered “in the wild” is written for one of two reasons: either because the developer was used to this way of coding and didn’t know how (or why!) to write set-based code instead; or because the developer was unable to find a good-performing set-based approach and had to fall back to iterative code to get acceptable performance.

Benefits

A perceived benefit of iterative code might be that developers with a background in third-generation languages can start coding right away, instead of having to learn a radically different way to do their work. But that argument would be like someone from the last century suggesting that we hitch horses to our cars so that drivers don’t have to learn how to start the engine and operate the steering wheel.

Iterative code also has a real benefit—but only in a few cases. Because the coder has to specify each step SQL Server has to take to get to the end result, it’s easy to store an intermediate result and reuse it later. In some cases (such as the running totals already mentioned), this can result in faster-running code.

Drawbacks

By writing iterative code, you’re crippling SQL Server’s performance in two ways at the same time. You not only work around all the optimizations SQL Server has for fast set-based processing, you also effectively prevent the query optimizer from coming up with a faster way to achieve the same results. Tell SQL Server to read employees, and for each employee read the details of his or her department, and that’s exactly what’ll happen. But tell SQL Server that you want results of employees and departments combined, and that’s only one of the options for the query optimizer to consider.

Set-based iteration

An aspect that’s often overlooked in the “set-based or cursor” discussion is that they represent two extremes, and there’s plenty of room for alternate solutions in between. Iterative algorithms typically use one iteration for each row in the table or query that the iteration is based on, so the number of iterations is always equal to the number of rows, and the amount of work done by a single execution of the body of the iteration equates to processing a single row. Set-based code goes to the other extreme: processing all rows at once, in a single execution of the code. Why limit ourselves to choosing either one execution that processes N rows, or N executions that process one row each?

The most basic form

The most basic form of set-based iteration isn’t used to prevent exponential performance scaling, but to keep locking short and to prevent the transaction log from overflowing. This technique is often recommended in newsgroups when UPDATE or DELETE statements that affect a large number of rows have to be run. To prevent long-lasting locks, lock escalation, and transaction log overflow, the TOP clause is used (or SET ROWCOUNT on versions older than SQL Server 2005) to limit the number of rows processed in a single iteration, and the statement is repeated until no more rows are affected. An example is shown in listing 1, where transaction history predating the year 2005 is removed in chunks of 10,000 rows. (Note that this example, like all other examples in this chapter, should run on all versions from SQL Server 2005 upward.)

Listing 1. Set-based iteration with the TOP clause
SET NOCOUNT ON;
DECLARE @BatchSize int, @RowCnt int;

SET @BatchSize = 10000;
SET @RowCnt = @BatchSize;

WHILE @RowCnt = @BatchSize
BEGIN;
  DELETE TOP (@BatchSize)
  FROM   TransactionHistory
  WHERE  TranDate < '20050101';

  SET @RowCnt = @@ROWCOUNT;
END;

This form of set-based iteration won’t increase performance of the code. It’s used to limit the impact of code on concurrency, but may make the code run slower.

This form of set-based iteration isn’t sophisticated enough to warrant much discussion. I merely wanted to include it for the sake of completeness. Using set-based iteration to increase performance of problematic code takes, unfortunately, more than just adding a TOP clause to the query.

Running totals

Adding running totals to a report is a common business requirement. It’s also one of the few situations where declarative code often (though not always) results in poor performance.

In this example, I’ll use the AdventureWorks sample database to report all sales, arranged by customer, ordered by date, and with a running total of all order amounts for a customer up to and including that date. Note that the Microsoft-supplied sample database is populated with more than 31,000 orders for over 19,000 customers, and that the highest number of orders for a single customer is 28.

Declarative Code

In current versions of SQL Server, the only way to calculate running totals in declarative code is to join the table to itself, matching each row to all preceding rows for the same customer, and then summing the order totals of all those joined rows to calculate the running total. The code for this is shown in listing 2.

Listing 2. Declarative code for calculating running totals
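
A minimal sketch of such a self-join, based on AdventureWorks’ Sales.SalesOrderHeader table and its CustomerID, OrderDate, SalesOrderID, and SubTotal columns (the column choice is an assumption), could look like this:

-- Assumed columns: CustomerID, OrderDate, SalesOrderID, SubTotal
SELECT     o.CustomerID, o.OrderDate, o.SalesOrderID, o.SubTotal,
           SUM(p.SubTotal) AS RunningTotal
FROM       Sales.SalesOrderHeader AS o
INNER JOIN Sales.SalesOrderHeader AS p
      ON   p.CustomerID = o.CustomerID
      AND (p.OrderDate < o.OrderDate
       OR (p.OrderDate = o.OrderDate AND p.SalesOrderID <= o.SalesOrderID))
GROUP BY   o.CustomerID, o.OrderDate, o.SalesOrderID, o.SubTotal
ORDER BY   o.CustomerID, o.OrderDate, o.SalesOrderID;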

The performance of this query depends on the average number of rows in the self-join. In this case, the average is less than 2, resulting in great performance: approximately 0.2 seconds on my laptop. But if you adapt the code to produce running totals per sales territory instead of per customer (by replacing all occurrences of the column name CustomerID with TerritoryID), you’re in for a nasty surprise: with only 10 different territories in the database, the average number of rows in the self-join is much higher. And because performance in this case degrades quadratically rather than linearly, the running time on my laptop went up to over 10 minutes (638 seconds, to be exact)!

Iterative Code

Because the declarative running totals code usually performs poorly, this problem is commonly solved with iterative code, using a server-side cursor. Listing 3 shows the code typically used for this.

Listing 3. Iterative code for calculating running totals
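
A sketch of this cursor-based approach, again using the Sales.SalesOrderHeader columns assumed above, could look like this:

SET NOCOUNT ON;
-- Table variable to collect the base data and the running totals;
-- the primary key doubles as the clustered index for the iteration.
DECLARE @Results table
        (CustomerID   int      NOT NULL,
         OrderDate    datetime NOT NULL,
         SalesOrderID int      NOT NULL,
         SubTotal     money    NOT NULL,
         RunningTotal money    NOT NULL,
         PRIMARY KEY (CustomerID, OrderDate, SalesOrderID));

DECLARE @CustomerID int, @PrevCustomerID int,
        @OrderDate datetime, @SalesOrderID int,
        @SubTotal money, @RunningTotal money;

DECLARE OrderCursor CURSOR LOCAL FAST_FORWARD FOR
   SELECT   CustomerID, OrderDate, SalesOrderID, SubTotal
   FROM     Sales.SalesOrderHeader
   ORDER BY CustomerID, OrderDate, SalesOrderID;

OPEN OrderCursor;
FETCH NEXT FROM OrderCursor
   INTO @CustomerID, @OrderDate, @SalesOrderID, @SubTotal;

WHILE @@FETCH_STATUS = 0
BEGIN
   -- Reset the running total when a new customer starts
   IF @PrevCustomerID IS NULL OR @CustomerID <> @PrevCustomerID
      SET @RunningTotal = 0;
   SET @RunningTotal   = @RunningTotal + @SubTotal;
   SET @PrevCustomerID = @CustomerID;

   INSERT INTO @Results
          (CustomerID, OrderDate, SalesOrderID, SubTotal, RunningTotal)
   VALUES (@CustomerID, @OrderDate, @SalesOrderID, @SubTotal, @RunningTotal);

   FETCH NEXT FROM OrderCursor
      INTO @CustomerID, @OrderDate, @SalesOrderID, @SubTotal;
END

CLOSE OrderCursor;
DEALLOCATE OrderCursor;

SELECT   CustomerID, OrderDate, SalesOrderID, SubTotal, RunningTotal
FROM     @Results
ORDER BY CustomerID, OrderDate, SalesOrderID;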

The code is pretty straightforward. In order to get all results as one result set, a table variable is used to store the base data and the calculated running totals. The primary key on the table variable is there primarily to create a good clustered index for the iteration, which explains why it includes more columns than the logical key (SalesOrderID alone). The only way to index a table variable is to add PRIMARY KEY or UNIQUE constraints to it.

A T-SQL cursor is then used to iterate over the rows. For each row, the variable holding the running total is incremented with the total of that order and then stored in the results table, after resetting the running total to 0 when the customer changes. The ORDER BY of the cursor ensures that the data is processed in the proper order, so that the calculated running totals will be correct.

On my laptop, this code takes 1.9 seconds. That’s slower than the declarative version presented earlier. But if I change the code to calculate running totals per territory, the running time remains stable at 1.9 seconds. This shows that, even though the declarative solution is faster when the average number of rows in the self-join is low, the iterative solution is faster at all other times, with the added benefit of stable and predictable performance. Almost all processing time is for fetching the order rows, so the performance will grow linearly with the amount of data.

Set-Based Iteration

For each customer, the running total of her first order is equal to the order total. The running total of the second order is then equal to the order total plus the first running total, and so on. This is the key to a solution that uses set-based iteration to determine the running total for the first orders of all customers, then calculate all second running totals, and so forth.

This algorithm, for which the code is shown in listing 4, needs as many iterations as the highest number of orders for a single customer—28 in this case. Each individual iteration will probably be slower than a single iteration of the iterative solution, but because the number of iterations is reduced from more than 30,000 to 28, the total execution time is faster.

Listing 4. Set-based iteration for calculating running totals
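
A sketch of this approach, again based on the same Sales.SalesOrderHeader columns, could look like this:

SET NOCOUNT ON;
DECLARE @Results table
        (CustomerID   int      NOT NULL,
         Rnk          int      NOT NULL,
         OrderDate    datetime NOT NULL,
         SalesOrderID int      NOT NULL,
         SubTotal     money    NOT NULL,
         RunningTotal money    NOT NULL,
         PRIMARY KEY (CustomerID, Rnk));

-- Store the base data plus the per-customer rank; the initial running
-- total equals the order total, which is only correct for first orders.
INSERT INTO @Results
       (CustomerID, Rnk, OrderDate, SalesOrderID, SubTotal, RunningTotal)
SELECT CustomerID,
       ROW_NUMBER() OVER (PARTITION BY CustomerID
                          ORDER BY OrderDate, SalesOrderID),
       OrderDate, SalesOrderID, SubTotal, SubTotal
FROM   Sales.SalesOrderHeader;

DECLARE @Rank int, @RowCount int;
SET @Rank     = 1;
SET @RowCount = 1;

-- One iteration per rank: running totals of rank n are computed from
-- the (already correct) running totals of rank n - 1.
WHILE @RowCount > 0
BEGIN
   SET @Rank = @Rank + 1;

   UPDATE nxt
   SET    RunningTotal = prv.RunningTotal + nxt.SubTotal
   FROM   @Results AS nxt
   INNER JOIN @Results AS prv
         ON  prv.CustomerID = nxt.CustomerID
         AND prv.Rnk        = @Rank - 1
   WHERE  nxt.Rnk = @Rank;

   SET @RowCount = @@ROWCOUNT;
END

SELECT   CustomerID, OrderDate, SalesOrderID, SubTotal, RunningTotal
FROM     @Results
ORDER BY CustomerID, OrderDate, SalesOrderID;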

Just as in the iterative code, a table variable is used to store the base data and the calculated running totals. In this case, that’s not only to enable all results to be returned at once and in the expected order, but also because we need to store intermediate results and reuse them later.

During the initial population of the results table, I calculate and store the rank of each order. This is more efficient than calculating it in each iteration, and it also allows me to base the clustered index on this rank. It’s possible to code this algorithm without materializing the rank in this table, but that makes the rest of the code more complex, and (most important) hurts performance in a big way!

While populating the table variable, I also set the running total for each order equal to its order total. This is, of course, incorrect for all except the first orders, but it saves the need for a separate UPDATE statement for the first orders, and the running totals for all other orders will eventually be replaced later in the code.

The core of this algorithm is the UPDATE statement that joins a selection of all orders with the next rank to those of the previous rank, so that the next running total can be set to the sum of the previous running total and the next order total.

On my laptop, this code runs in 0.4 seconds. This speed depends not only on the amount of data, but also on the required number of iterations. If I change the code to calculate running totals per territory rather than per customer, the number of iterations goes up to almost 7,000, causing the execution time to rise to approximately 0.9 seconds. And if I change the code to calculate overall running totals (forcing the number of iterations to be equal to the number of rows), the clock stops at 2 seconds.

The bottom line is that, even though declarative code runs slightly faster in cases with a very low iteration count and iterative code is slightly better for very high iteration counts, set-based iteration presents a good algorithm that’s the fastest in many situations and only slightly slower in the other cases.

Bin packing

The bin-packing problem describes a category of related problems. In its shortest form, it can be expressed as “given an unlimited supply of bins, all having the same capacity, and a collection of packages, find a way to combine all packages in the least number of bins.”

The bin-packing problem is sometimes thought to be mainly academic, of interest for mathematicians only. That’s a misconception, as there are many business situations that are a variation on the bin-packing problem:

  • Transport— You have to transport five packages from Amsterdam to Paris. The packages weigh two, three, four, five, and six tons. The maximum capacity of a single truck is 10 tons. You can, of course, place the first three packages in a single truck without exceeding the maximum weight, but then you’d need two extra trucks for the last two packages. With this small amount of data, it’s obvious that you can get them transported in two trucks if you place the packages of four and six tons in one truck and the other three in the second. But if there are 400 packages, it becomes too hard for a human to see how to spare one or two trucks, and computerized assistance becomes crucial.
  • Seating groups— Imagine a theatre with 40 rows of 30 seats each. If a group makes a reservation, they’ll expect to get adjacent seats on a single row. But if you randomly assign groups to rows, you have a high chance that you’ll end up with two or three empty seats in each row and a group of eight people who can’t get adjacent seats anymore. If you can find a more efficient way to assign seats to groups, you might free up eight adjacent seats on one row and sell an extra eight tickets.
  • Minimizing cut loss— Materials such as cable and fabric are usually produced on rolls of a given length. If a builder needs to use various lengths of cable, or a store gets orders for various lengths of fabric, they don’t want to be left with one or two meters from each roll and still have to use a new roll for the last required length of six meters.

According to mathematicians, you can only be 100 percent sure that you get the absolute minimum number of bins by trying every possible permutation. It’s obvious that, however you implement this, it’ll never scale, as the number of possible permutations grows exponentially with the number of packages. Most businesses will prefer an algorithm that produces a “very good” distribution in a few seconds over one that might save two or three bins by finding the “perfect” solution after running for a few days.

Declarative Code

I’ve never found a set-based approach to finding a “good enough” solution for the bin-packing problem. But I’ve found set-based code that finds the “perfect” solution. This code was originally posted by John Gilson; I’ve corrected and optimized this code and then published it to my blog (http://sqlblog.com/blogs/hugo_kornelis/archive/2008/10/27/bin-packing-part-4-the-set-based-disaster.aspx), but it’s too large to reproduce here. There’s no reason to, either, because this code can never be used in practice—not only because it couples already-bad performance with the ugliest exponential growth curve I’ve ever seen, but also because it requires extra columns in intermediate result sets and many extra lines of code as the bin size and the number of packages increase, such that for real-world problems, you’d need millions of lines of code (and a version of SQL Server that allows more than 4,096 columns per SELECT statement). And then you’ll still get execution times measured in days, if not years.

Iterative Code

Because a set-based solution for the bin-packing problem is way too slow, even in cases small enough that such a solution is possible at all, we need to investigate other options. And the most obvious alternative is an iterative solution. Of all the possible strategies I investigated (see my blog for the details), I found that the best combination of speed and packing efficiency is attained by an algorithm that stays close to how I’d pack a bunch of physical packages into physical bins: take a bin, keep adding packages to it until it overflows, then start with a new bin unless the overflowing package fits into one of the other already filled bins. Listing 5 shows the code to set up the tables and fill them with some randomly generated data, and listing 6 shows the T-SQL version of this algorithm.

Listing 5. Set up tables and generate random data for bin packing
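
The exact schema is an assumption here; a minimal setup could use a Packages table (with a nullable BinNo column to record the assignment), a Bins table that tracks the remaining capacity of each bin, a bin size of 100, and 100,000 randomly sized packages:

SET NOCOUNT ON;

-- Assumed schema: BinNo records the bin each package ends up in;
-- SpaceLeft tracks the remaining capacity of each bin. Bin size is 100.
CREATE TABLE dbo.Packages
      (PackageNo int NOT NULL PRIMARY KEY,
       Size      int NOT NULL,
       BinNo     int NULL);

CREATE TABLE dbo.Bins
      (BinNo     int NOT NULL PRIMARY KEY,
       SpaceLeft int NOT NULL);

-- Generate 100,000 packages with random sizes from 1 through 60,
-- so every package fits in an empty bin.
DECLARE @i int;
SET @i = 1;
WHILE @i <= 100000
BEGIN
   INSERT INTO dbo.Packages (PackageNo, Size)
   VALUES (@i, 1 + ABS(CHECKSUM(NEWID())) % 60);
   SET @i = @i + 1;
END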

Listing 6. Iterative code for bin packing
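
A sketch of this algorithm, using the assumed Packages and Bins tables from the setup sketch above, could look like this:

SET NOCOUNT ON;
DECLARE @BinSize      int,
        @CurBinNo     int,
        @CurSpaceLeft int,
        @OtherBinNo   int,
        @PackageNo    int,
        @Size         int;

SET @BinSize      = 100;
SET @CurBinNo     = 1;
SET @CurSpaceLeft = @BinSize;

DECLARE PackageCursor CURSOR LOCAL FAST_FORWARD FOR
   SELECT PackageNo, Size
   FROM   dbo.Packages;        -- no ORDER BY (see the discussion below)

OPEN PackageCursor;
FETCH NEXT FROM PackageCursor INTO @PackageNo, @Size;

WHILE @@FETCH_STATUS = 0
BEGIN
   IF @Size <= @CurSpaceLeft
   BEGIN
      -- The package fits in the current bin
      UPDATE dbo.Packages SET BinNo = @CurBinNo WHERE PackageNo = @PackageNo;
      SET @CurSpaceLeft = @CurSpaceLeft - @Size;
   END
   ELSE
   BEGIN
      -- Check the other, already written, partly filled bins
      SET @OtherBinNo = NULL;
      SELECT TOP (1) @OtherBinNo = BinNo
      FROM   dbo.Bins
      WHERE  SpaceLeft >= @Size;

      IF @OtherBinNo IS NOT NULL
      BEGIN
         UPDATE dbo.Bins     SET SpaceLeft = SpaceLeft - @Size WHERE BinNo = @OtherBinNo;
         UPDATE dbo.Packages SET BinNo = @OtherBinNo WHERE PackageNo = @PackageNo;
      END
      ELSE
      BEGIN
         -- Write the current bin to disk and start a new bin with this package
         INSERT INTO dbo.Bins (BinNo, SpaceLeft) VALUES (@CurBinNo, @CurSpaceLeft);
         SET @CurBinNo     = @CurBinNo + 1;
         SET @CurSpaceLeft = @BinSize - @Size;
         UPDATE dbo.Packages SET BinNo = @CurBinNo WHERE PackageNo = @PackageNo;
      END
   END

   FETCH NEXT FROM PackageCursor INTO @PackageNo, @Size;
END

CLOSE PackageCursor;
DEALLOCATE PackageCursor;

-- Write the last, still current, bin
INSERT INTO dbo.Bins (BinNo, SpaceLeft) VALUES (@CurBinNo, @CurSpaceLeft);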

The main logic is coded in the WHILE loop. For every package, I first check whether the current bin has enough room left. If not, I check whether the package would fit in one of the other already partly filled bins before creating a new bin for it. To save time, I don’t write the data for the current bin to disk after each package, but I pay for this by having to write it at two slightly less logical locations in the code: when a new bin is started, or (for the last bin) after the last package has been assigned.

This algorithm is fast, because adding several packages to the same bin right after each other saves on the overhead of switching between bins. It’s also efficient because, even if a large package forces me to start a new bin when the previous one is still half empty, that half-empty bin will still be reconsidered every time a package would overflow the current bin, so it should eventually fill up. There’s no ORDER BY specified in the cursor definition. Adding ORDER BY BinSize DESC will improve the packing efficiency by about 4 percent, but at the cost of a performance hit that starts at 5–10 percent for small amounts of test data (10,000–50,000 packages) and grows to more than 20 percent for 500,000 packages.

When I tested this code on my laptop, it was able to pack 100,000 packages in bins in approximately 143 seconds. The running time went up to 311 seconds for 200,000 packages, and to 769 seconds for 500,000 packages. The growth is much better than exponential, but worse than linear, probably due to the increasing cost of checking an ever-increasing number of partly filled bins when a package would overflow the current bin.

Some extrapolation of my test results indicates that a run with a million packages will probably take half an hour, and maybe 6 or 7 hours are needed to pack ten million packages. This sure beats packing the bins by hand, but it might not be fast enough in all situations.

Set-Based Iteration

In those situations where the iterative solution isn’t fast enough, we need to find something faster. The key here is that it’s easy to calculate an absolute minimum number of bins—if, for example, the combined size of all packages is 21,317 and the bin size is 100, then we can be sure that there will never be a solution with fewer than 214 bins—so why not start off with packing 214 bins at once?

I start by finding the 214 largest packages and putting one in each of the 214 available bins. After that, I rank the bins by space remaining, rank the packages (excluding those that are already too large for any bin) by size, match bins and packages by rank, and add packages that will still fit into their matching bins. I then repeat this step until there are no packages left that fit in the remaining space of an available bin (either because all packages are packed, or they’re all larger than the largest free space).

Ideally, all packages have now been catered for. In reality, there will often be cases where not all packages can be handled in a single pass—so I then repeat this process, by summing the total size of the remaining packages, dividing by the bin size (rounding up), assigning that number of new bins, and repeatedly putting packages into bins until no more packages that fit in a bin are left. This second pass is often the last. Sometimes a third pass can be required.

The code in listing 8 shows a SQL Server 2005–compatible implementation of this algorithm. (On SQL Server 2008, the UPDATE FROM statement can be replaced with MERGE for better ANSI compatibility, though at the cost of slightly slower performance.) This code uses a numbers table—a table I believe should exist in every database, as it can be used in many situations. Listing 7 shows how to make such a table and fill it with numbers 1 through 1,000,000. Note that creating and filling the numbers table is a one-time operation!

Listing 7. Creating the numbers table for use in the set-based bin-packing code
SET NOCOUNT ON;

CREATE TABLE dbo.Numbers
(Num int NOT NULL PRIMARY KEY);

WITH Digits(d) AS
  (SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL
   SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL
   SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9)
INSERT INTO dbo.Numbers(Num)
SELECT a.d + b.d * 10 + c.d * 100 + d.d * 1000
       + e.d * 10000 + f.d * 100000 + 1
FROM   Digits a, Digits b, Digits c, Digits d, Digits e, Digits f;

Listing 8. Set-based iteration for bin packing
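
A sketch of this algorithm, again using the assumed Packages and Bins tables and the numbers table from listing 7, could look like this:

SET NOCOUNT ON;
DECLARE @BinSize    int,
        @NumBins    int,
        @FirstBinNo int,
        @MaxSpace   int,
        @RowCount   int;

SET @BinSize = 100;

-- Outer loop: start a new batch of bins while unassigned packages remain
-- (this assumes every package fits in an empty bin).
WHILE EXISTS (SELECT * FROM dbo.Packages WHERE BinNo IS NULL)
BEGIN
   -- Minimum number of extra bins needed for the remaining packages
   SELECT @NumBins = CEILING(SUM(Size) / CAST(@BinSize AS float))
   FROM   dbo.Packages
   WHERE  BinNo IS NULL;

   SELECT @FirstBinNo = COALESCE(MAX(BinNo), 0) + 1
   FROM   dbo.Bins;

   -- Create the whole batch of empty bins at once, using the numbers table
   INSERT INTO dbo.Bins (BinNo, SpaceLeft)
   SELECT @FirstBinNo + Num - 1, @BinSize
   FROM   dbo.Numbers
   WHERE  Num <= @NumBins;

   -- Inner loop: repeatedly pair ranked packages with ranked bins
   SET @RowCount = 1;
   WHILE @RowCount > 0
   BEGIN
      -- Largest space still available in the current batch of bins
      SELECT @MaxSpace = MAX(SpaceLeft)
      FROM   dbo.Bins
      WHERE  BinNo >= @FirstBinNo;

      -- Rank bins by free space and packages by size, pair them by rank,
      -- and assign each package to its paired bin if it fits
      WITH RankedBins AS
       (SELECT BinNo, SpaceLeft,
               ROW_NUMBER() OVER (ORDER BY SpaceLeft DESC, BinNo) AS Rnk
        FROM   dbo.Bins
        WHERE  BinNo >= @FirstBinNo),
      RankedPackages AS
       (SELECT PackageNo, Size, BinNo,
               ROW_NUMBER() OVER (ORDER BY Size DESC, PackageNo) AS Rnk
        FROM   dbo.Packages
        WHERE  BinNo IS NULL
        AND    Size <= @MaxSpace)
      UPDATE p
      SET    BinNo = b.BinNo
      FROM   RankedPackages AS p
      INNER JOIN RankedBins AS b
            ON b.Rnk = p.Rnk
      WHERE  p.Size <= b.SpaceLeft;

      SET @RowCount = @@ROWCOUNT;

      -- Recalculate the remaining space of the bins in the current batch
      UPDATE b
      SET    SpaceLeft = @BinSize - COALESCE(
                (SELECT SUM(p.Size)
                 FROM   dbo.Packages AS p
                 WHERE  p.BinNo = b.BinNo), 0)
      FROM   dbo.Bins AS b
      WHERE  b.BinNo >= @FirstBinNo;
   END
END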

This code uses two nested WHILE loops. The outer loop generates as many rows in the Bins table as the minimum number of bins required for the packages that aren’t already in a bin and then hands control over to the inner loop for filling these bins. This inner loop first finds the largest available space in any of the current batch of bins, to avoid wasting time on packages that are larger than that. It then ranks the bins by available space, ranks the packages by size, and assigns packages to bins that have the same ranking, but only if they fit. After that, the remaining space is recalculated for each bin in the current batch.

The queries in the inner loop of this algorithm are quite complex and are bound to take some time. But, as the number of iterations of these loops is low (28 executions of the inner loop for the 100,000 row test set and 29 for the 200,000 row test set), the total execution time is very fast: only 12.5 seconds on my laptop for 100,000 rows, 25.8 seconds for 200,000 rows, and 47.5 seconds for 500,000 rows. These numbers also indicate that this algorithm scales better than the iterative algorithm. I expect that packing 10 million packages should take less than 15 minutes. And if that’s still too slow for you, then you can always use a real server instead of a laptop.

My tests also showed that the solution that uses set-based iteration tends to be slightly more efficient. The number of bins used varies from less than 0.5 percent more to almost 2 percent less, with an average of 0.8 percent less. If you recall that the iterative version can be improved by about 4 percent (at the cost of a big performance hit) by adding an ORDER BY clause, you’ll also understand that this modified iterative version will need about 3 percent fewer bins than the version based on set-based iteration. So if you need to find a solution with as few bins as possible and you can afford to wait long enough for the iterative version (with ORDER BY) to finish, use that one. Otherwise, save yourself lots of time by using set-based iteration.

Summary

In this chapter, I’ve shown that iterative code and set-based code aren’t the only options for T-SQL developers, but rather two extremes. Set-based iteration is a technique that sits in between these two extremes, combining a low number of iterations with a set-based query that doesn’t affect all rows at once, but does attempt to affect as many rows as possible without incurring exponential performance loss.

The technique of set-based iteration is neither easy to use, nor a panacea for all performance problems. There’s no simple recipe that you can follow to find an algorithm using set-based iteration for a problem. It requires creativity, imagination, out-of-the-box thinking, and lots of experience to see a promising approach, and then a lot of hard work to implement and test it. Even then, the set-based iteration may still turn out not to perform as well as expected.

I’ve attempted to use set-based iteration in more cases than described here. In many cases, the only result was a performance decrease, or a performance gain that was too small to warrant the extra complexity in the code. But there were also situations where the performance gain was impressive. For those situations, the technique of set-based iteration can be an invaluable tool in the T-SQL developer’s toolbox.

About the author

Hugo Kornelis is co-founder and R&D lead of perFact BV, a Dutch company that strives to improve analysis methods and develop computer-aided tools that will generate completely functional applications from the analysis deliverable. The chosen platform for this development is SQL Server.

In his spare time, Hugo likes to share and enhance his knowledge of SQL Server by frequenting newsgroups and forums, reading and writing books and blogs, and attending and speaking at conferences.
