EXPRESSION TRANSFORMATION

In this section, I want to take a slightly closer look at what the optimizer does. More specifically, I want to consider what’s involved in transforming some relational expression into another, logically equivalent, expression. Note: I mentioned this notion in the discussion of duplicates in Chapter 4, where I explained that such transformations are one of the things the optimizer does; in fact, such transformations constitute one of the two great ideas at the heart of relational optimization (the other, beyond the scope of this book, is the use of “database statistics” to do what’s called cost-based optimizing).[88]

I’ll start with a trivial example. Consider the following Tutorial D expression (the query is “Get suppliers who supply part P2, together with the corresponding quantities,” and I’ll ignore the SQL analog for simplicity):

     ( ( S JOIN SP ) WHERE PNO = 'P2' ) { ALL BUT PNO }

Suppose there are 1,000 suppliers and 1,000,000 shipments, of which 500 are for part P2. If the expression were simply evaluated by brute force (as it were), without any optimization at all, the sequence of events would be:

  1. Join S and SP: This step involves reading the 1,000 supplier tuples; reading the 1,000,000 shipment tuples 1,000 times each, once for each of the 1,000 suppliers; constructing an intermediate result consisting of 1,000,000 tuples; and writing those 1,000,000 tuples back out to the disk. (I’m assuming for simplicity that tuples are physically stored as such, and I’m also assuming I can take “number of tuple reads and writes” as a reasonable measure of performance. Neither of these assumptions is very realistic, but this fact doesn’t materially affect my argument.)

  2. Restrict the result of Step 1: This step involves reading 1,000,000 tuples but produces a result containing only 500 tuples, which I’ll assume can be kept in main memory. (By contrast, I was assuming for the sake of the example in Step 1, realistically or otherwise, that the 1,000,000 intermediate result tuples couldn’t be kept in main memory.)

  3. Project the result of Step 2: This step involves no tuple reads or writes at all, so we can ignore it.

The following procedure is equivalent to the one just described, in the sense that it produces the same final result, but is obviously much more efficient:

  1. Restrict SP to just the tuples for part P2: This step involves reading 1,000,000 shipment tuples but produces a result containing only 500 tuples, which can be kept in main memory.

  2. Join S and the result of Step 1: This step involves reading 1,000 supplier tuples (once only, not once per P2 shipment, because all the P2 shipments are in memory). The result contains 500 tuples (still in main memory).

  3. Project the result of Step 2: Again we can ignore this step.

The first of these two procedures involves a total of 1,002,001,000 tuple reads and writes, whereas the second involves only 1,001,000; thus, it’s clear the second procedure is likely to be over 1,000 times faster than the first. It’s also clear we’d like the implementation to use the second rather than the first! If it does, then what it’s doing (in effect) is transforming the original expression

     ( S JOIN SP ) WHERE PNO = 'P2'

—I’m ignoring the final projection now, since it isn’t really relevant to the argument—into the expression

     S JOIN ( SP WHERE PNO = 'P2' )

These two expressions are logically equivalent, but they have very different performance characteristics, as we’ve seen. If the system is presented with the first expression, therefore, we’d like it to transform it into the second before evaluating it—and of course it can. The point is, the relational algebra, being a high level formalism, is subject to various formal transformation laws; for example, there’s a law that says, loosely, that a join followed by a restriction can always be transformed into a restriction followed by a join (this was the law I was using in the example). And a good optimizer will know those laws, and will apply them—because the performance of a query ideally shouldn’t depend on the specific syntax used to express that query in the first place. Note: Actually it’s an immediate consequence of the fact that not all of the algebraic operators are primitive that certain expressions can be transformed into others (for example, an expression involving intersect can be transformed into one involving join instead), but there’s much more to the issue than that, as I hope is obvious from the example.
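
By way of illustration, here’s a small Python sketch of the transformation just discussed. It’s only a sketch, of course, not a model of how a real optimizer works: relations are represented simply as lists of dictionaries, the names S, SP, SNO, PNO, and QTY echo the running suppliers-and-parts example, and the data values are invented.

     READS = 0   # crude "tuple reads" counter, standing in for I/O cost

     def scan(relation):
         """Yield each tuple of a relation, counting every read."""
         global READS
         for t in relation:
             READS += 1
             yield t

     def nested_loop_join(r1, r2):
         """Naive natural join: rescan r2 once per tuple of r1."""
         common = {a for t in r1 for a in t} & {a for t in r2 for a in t}
         result = []
         for t1 in scan(r1):
             for t2 in scan(r2):
                 if all(t1[a] == t2[a] for a in common):
                     result.append({**t1, **t2})
         return result

     def restrict(relation, pred):
         return [t for t in scan(relation) if pred(t)]

     def as_set(relation):
         return {frozenset(t.items()) for t in relation}

     S  = [{'SNO': f'S{i}', 'CITY': 'London'} for i in range(1, 4)]
     SP = [{'SNO': f'S{1 + i % 3}', 'PNO': f'P{1 + i % 5}', 'QTY': 100 * i}
           for i in range(30)]

     is_p2 = lambda t: t['PNO'] == 'P2'

     READS = 0
     brute  = restrict(nested_loop_join(S, SP), is_p2)    # join, then restrict
     brute_reads = READS

     READS = 0
     pushed = nested_loop_join(S, restrict(SP, is_p2))    # restrict early, then join
     pushed_reads = READS

     assert as_set(brute) == as_set(pushed)    # same final result ...
     print(brute_reads, pushed_reads)          # ... very different costs

On data this small the two counts differ only modestly, but the ratio grows rapidly with the size of the operands, exactly as the figures given earlier suggest.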

Now, there are many possible transformation laws, and this isn’t the place for an exhaustive discussion. All I want to do is highlight a few important cases and key points. First, the law mentioned in the previous paragraph is actually a special case of a more general law, called the distributive law. In general, the monadic operator f distributes over the dyadic operator g if and only if f(g(a,b)) = g(f(a),f(b)) for all a and b. In ordinary arithmetic, for example, SQRT (nonnegative square root) distributes over multiplication, because

     SQRT ( a * b ) = SQRT ( a ) * SQRT ( b )

for all nonnegative a and b (take f as SQRT and g as “*”); thus, a numeric expression optimizer can always replace either of these expressions by the other when doing numeric expression transformation. As a counterexample, SQRT does not distribute over addition, because the square root of a + b is not equal to the sum of the square roots of a and b, in general.
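
Here’s a quick numeric check of both claims in Python (the values 9 and 16 are arbitrary; note that both inputs are nonnegative, as the law requires):

     import math

     a, b = 9.0, 16.0
     # Distributes over "*": sqrt(144) = 12 = 3 * 4 = sqrt(9) * sqrt(16).
     assert math.isclose(math.sqrt(a * b), math.sqrt(a) * math.sqrt(b))
     # Does not distribute over "+": sqrt(25) = 5, but sqrt(9) + sqrt(16) = 7.
     assert not math.isclose(math.sqrt(a + b), math.sqrt(a) + math.sqrt(b))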

In relational algebra, restriction distributes over intersect, union, and difference. It also distributes over join, provided the restriction condition consists, at its most complex, of the AND of two separate restriction conditions, one for each of the two join operands. In the case of the example discussed above, this requirement was satisfied—in fact, the restriction condition was very simple and applied to just one of the operands—and so we were able to use the distributive law to replace the expression by a more efficient equivalent. The net effect was that we were able to “do the restriction early.” Doing restrictions early is almost always a good idea, because it serves, typically, (a) to reduce the number of tuples to be scanned in the next operation in sequence and (b) to reduce the number of tuples in the output from that operation as well.
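
To make the join case concrete, here’s a self-contained Python sketch in the same style as the earlier one. The condition has the form p AND q, where p mentions only attributes of S and q only attributes of SP, so the restriction can be pushed through the join (data values again invented):

     def natural_join(r1, r2):
         common = {a for t in r1 for a in t} & {a for t in r2 for a in t}
         return [{**t1, **t2} for t1 in r1 for t2 in r2
                 if all(t1[a] == t2[a] for a in common)]

     def restrict(rel, pred):
         return [t for t in rel if pred(t)]

     def as_set(rel):
         return {frozenset(t.items()) for t in rel}

     S  = [{'SNO': 'S1', 'CITY': 'London'}, {'SNO': 'S2', 'CITY': 'Paris'}]
     SP = [{'SNO': 'S1', 'PNO': 'P1', 'QTY': 300},
           {'SNO': 'S1', 'PNO': 'P2', 'QTY': 200},
           {'SNO': 'S2', 'PNO': 'P2', 'QTY': 400}]

     p = lambda t: t['CITY'] == 'London'   # mentions only attributes of S
     q = lambda t: t['QTY'] > 250          # mentions only attributes of SP

     lhs = restrict(natural_join(S, SP), lambda t: p(t) and q(t))
     rhs = natural_join(restrict(S, p), restrict(SP, q))
     assert as_set(lhs) == as_set(rhs)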

Here are some other specific cases of the distributive law, this time involving projection. First, projection distributes over union, though not over intersection and difference. Second, it also distributes over join, so long as all of the joining attributes are included in the projection. These laws can be used to “do projections early,” which again is usually a good idea, for reasons similar to those given above for restrictions.
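
The failure over difference is worth seeing concretely. In the following sketch (relations modeled as sets of attribute/value pairs, data invented), r1 and r2 have the same heading {SNO, PNO}; projecting over their union commutes, but projecting over their difference does not:

     def tup(**kv):
         return frozenset(kv.items())

     def project(rel, attrs):
         return {frozenset((a, v) for a, v in t if a in attrs) for t in rel}

     r1 = {tup(SNO='S1', PNO='P1'), tup(SNO='S1', PNO='P2')}
     r2 = {tup(SNO='S1', PNO='P2')}

     # Over union the law holds:
     assert project(r1 | r2, {'SNO'}) == project(r1, {'SNO'}) | project(r2, {'SNO'})

     # Over difference it fails: r1 MINUS r2 still contains a tuple for S1,
     # so projecting after the difference yields {S1}, whereas projecting
     # first yields {S1} MINUS {S1}, which is empty.
     assert project(r1 - r2, {'SNO'}) != project(r1, {'SNO'}) - project(r2, {'SNO'})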

Two more important general laws are the laws of commutativity and associativity:

  • The dyadic operator g is commutative if and only if g(a,b) = g(b,a) for all a and b. In ordinary arithmetic, for example, addition and multiplication are commutative, but subtraction and division aren’t. In relational algebra, intersect, union, and join are all commutative,[89] but difference isn’t. So, for example, if a query involves a join of two relations r1 and r2, the commutative law tells us it doesn’t matter which of r1 and r2 is taken as the “outer” relation and which the “inner.” The system is therefore free to choose (say) the smaller relation as the outer one in computing the join.

  • The dyadic operator g is associative if and only if g(a,g(b,c)) = g(g(a,b),c) for all a, b, c. In arithmetic, addition and multiplication are associative, but subtraction and division aren’t. In relational algebra, intersect, union, and join are all associative, but difference isn’t. So, for example, if a query involves a join of three relations r1, r2, and r3, the associative and commutative laws taken together tell us we can join the relations pairwise in any order we like. The system is thus free to decide which of the various possible sequences is most efficient.
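
Both laws are easy to check mechanically; here’s a small sketch in the same style as before (the relation names r1, r2, r3 and all data values are invented):

     def natural_join(r1, r2):
         common = {a for t in r1 for a in t} & {a for t in r2 for a in t}
         return [{**t1, **t2} for t1 in r1 for t2 in r2
                 if all(t1[a] == t2[a] for a in common)]

     def as_set(rel):
         return {frozenset(t.items()) for t in rel}

     r1 = [{'A': 1, 'B': 'x'}, {'A': 2, 'B': 'y'}]
     r2 = [{'B': 'x', 'C': 10}, {'B': 'y', 'C': 20}]
     r3 = [{'C': 10, 'D': 'p'}, {'C': 20, 'D': 'q'}]

     # Commutativity: r1 JOIN r2 = r2 JOIN r1.
     assert as_set(natural_join(r1, r2)) == as_set(natural_join(r2, r1))

     # Associativity: (r1 JOIN r2) JOIN r3 = r1 JOIN (r2 JOIN r3); together
     # with commutativity, any pairwise join order gives the same result.
     assert as_set(natural_join(natural_join(r1, r2), r3)) == \
            as_set(natural_join(r1, natural_join(r2, r3)))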

Note, incidentally, that all of these transformations can be performed without any regard for either actual data values or physical access paths (indexes and the like) in the database as physically stored. In other words, such transformations represent optimizations that are virtually guaranteed to be good, regardless of what the database looks like physically. Perhaps I should add, however, that while many such transformations are available for sets, not so many are available for bags (as indeed we saw in Chapter 4); and fewer still are available if column ordinal position has to be taken into account; and far fewer still are available if nulls and 3VL have to be taken into account as well. What do you conclude?



[88] Cost-based optimizing is beyond the scope of this book because it has to do with how the data is physically stored, which isn’t a relational issue by definition. But I should at least note that such optimizing is possible in the first place only because (as we saw in Chapter 1) the relational model insists on a sharp and rigid distinction between the logical and physical levels of the system, which has the effect among other things of keeping access strategies out of applications.

[89] Strictly speaking, the SQL analogs of these operators aren’t commutative, because, among other things, the left-to-right column order of the result depends on which operand is specified first. Indeed, the disciplines recommended in this book in connection with these operators are designed, in part, precisely to avoid such problems. More generally, the possibility of such problems occurring is one reason among many why you’re recommended never to write SQL code that relies on column positioning.
