AGGREGATE OPERATORS

In a sense this section is a bit of a digression, because the operators to be discussed aren’t relational but scalar—they return a scalar result.[98] But I do need to say something about them before I can get back to the main theme of the chapter.

An aggregate operator in the relational model is an operator that derives a single value from the “aggregate” (i.e., the bag or set) of values appearing within some attribute within some relation—or, in the case of COUNT, which is slightly special, from the “aggregate” that’s the entire relation. Here are two examples:

image with no caption

I’ll focus on the Tutorial D statements on the left first. Given our usual sample values, the first assigns the value 5 (the number of tuples in the current value of relvar S) to the variable X; the second assigns the value 3 (the number of tuples in the projection of the current value of relvar S on {STATUS}, which is to say the number of distinct STATUS values in that current value) to the variable Y.

In general, a Tutorial D aggregate operator invocation looks like this:

<agg op name> ( <relation exp> [, <exp> ] )

Legal <agg op name>s include COUNT, SUM, AVG, MAX, MIN, AND, OR, and XOR.[99] Within the <exp>, an <attribute ref> can appear wherever a literal would be allowed. That <exp> must be omitted if the <agg op name> is COUNT; otherwise, it can be omitted only if the <relation exp> denotes a relation of degree one, in which case an <exp> consisting of a reference to the sole attribute of that relation is assumed. Here are some examples:

  1. SUM ( SP , QTY )

    This expression denotes the sum of all quantities in relvar SP (given our usual sample values, the result is 3100).

  2. SUM ( SP { QTY } )

    This expression is shorthand for SUM(SP{QTY},QTY), and it denotes the sum of all distinct quantities in SP (i.e., 1000).

  3. AVG ( SP , 3 * QTY )

    This expression effectively asks what the average shipment quantity would be if quantities were all triple their current value (the answer is 775). More generally, the expression

    agg ( rx , x )

    (where x is some expression more complicated than a simple <attribute ref>) is essentially shorthand for the following:

    agg ( EXTEND rx : { y := x } , y )

    Now I turn to SQL. For convenience, let me first repeat the examples:

    image with no caption

Now, you might be surprised to hear me claim that SQL doesn’t really support aggregate operators at all! I say this knowing full well that most people would consider expressions like those on the right above to be, precisely, SQL aggregate operator invocations.[100] But they aren’t. Let me explain. As we know, the counts are 5 and 3, respectively. But those SQL expressions don’t evaluate to those counts as such, as true aggregate operator invocations would; rather, they evaluate to tables that contain those counts. More precisely, each yields a table with one row and one column, and the sole value in that row is the actual count:[101]

image with no caption

As you can see, therefore, the SELECT expressions really don’t represent aggregate operator invocations as such; at best, they represent only approximations to such invocations. In fact, aggregation is treated in SQL as if it were a special case of summarization. Of course, I haven’t discussed summarization yet; for present purposes, however, you can regard it as what’s represented in SQL by a SELECT expression with a GROUP BY clause. Now, the foregoing SQL expressions don’t have a GROUP BY clause—but they’re defined to be shorthand for the following, which do (and do therefore represent summarizations as claimed):

     SELECT COUNT ( * ) AS X
     FROM   S
     GROUP  BY ( )

     SELECT COUNT ( DISTINCT STATUS ) AS Y
     FROM   S
     GROUP  BY ( )

Note: In case these expressions look strange to you, I should explain that SQL does in fact allow both (a) GROUP BY clauses with an empty operand commalist and (b) GROUP BY clauses with the operand commalist enclosed in parentheses. What’s more, specifying a GROUP BY clause with an empty operand commalist (with or without parentheses) is equivalent to omitting the GROUP BY clause entirely. Here’s why:

  1. Such a GROUP BY clause effectively means “group by no columns.”

  2. Every row has the same value for no columns—namely, the 0-row (despite the fact that SQL doesn’t actually support the 0-row!).

  3. Every row in the table is thus part of the same group; in other words, the entire table is treated as a single group, and that’s effectively what happens when the GROUP BY clause is omitted entirely.

So SQL does support summarization—but it doesn’t support aggregation as such. Sadly, the two concepts are often confused, and perhaps you can begin to see why. What’s more, the picture is confused still further by the fact that, in SQL, it’s common in practice for the table that results from an “aggregation” to be coerced to the single row it contains, or even doubly coerced to the single value that row contains: two separate errors (of judgment, if nothing else) thus compounding to make the SQL-style “aggregation” look more like a true aggregation after all! Such double coercion occurs in particular when the SELECT expression is enclosed in parentheses to form a scalar subquery, as in the following SQL assignments:

     SET X = ( SELECT COUNT ( * ) FROM S ) ;

     SET Y = ( SELECT COUNT ( DISTINCT STATUS ) FROM S ) ;

But assignment as such is far from being the only context in which such coercions occur (see Chapter 2 and Chapter 12).

Aside: Actually there’s another oddity arising in connection with SQL-style aggregation (I include this observation here because this is where it logically belongs, but it does rely on a detailed understanding of SQL-style summarization, and you can skip it if you like):

  • In general, an expression of the form SELECT - FROM T - WHERE - GROUP BY - HAVING delivers a result containing exactly one row for each group in G, where G is the “grouped table” resulting from applying the WHERE, GROUP BY, and HAVING clauses to table T.

  • Omitting the WHERE and HAVING clauses, as in a “straightforward” SQL-style aggregation, is equivalent to specifying WHERE TRUE and HAVING TRUE, respectively. For present purposes, therefore, we need consider the effect of the GROUP BY clause, only, in determining the grouped table G.

  • Suppose table T has nT rows. Then arranging those rows into groups can produce at most nT groups; in other words, the grouped table G has nG groups for some nG (nGnT), and the overall result, obtained by applying the SELECT clause to G, thus has nG rows.

  • Now suppose nT is zero (i.e., table T is empty); then nG must clearly be zero as well (i.e., table G, and hence the result of the SELECT expression overall, must both be empty as well).

  • In particular, therefore, the expression

SELECT COUNT ( * ) AS X
FROM   S
GROUP  BY ( )

—which is, recall, the expanded form of SELECT COUNT(*) AS X FROM S—ought logically to produce the result shown on the left, not the one shown on the right, if table S happens to be empty:

image with no caption

In fact, however, it produces the result on the right. How? Answer: By special casing. Here’s a direct quote from the standard: “If there are no grouping columns, then the result of the <group by clause> is the grouped table consisting of T as its only group.” In other words, while grouping an empty table in SQL does indeed (as argued above) produce an empty set of groups in general, the case where the set of grouping columns is empty is special; in that case, it produces a set containing exactly one group, that group being identical to the empty table T. In the example, therefore, the COUNT operator is applied to an empty group, and thus “correctly” returns the value zero.

Now, you might be thinking the discrepancy here is hardly earth shattering; you might even be thinking the result on the right above is somehow “better” than the one on the left. But (to state the obvious) there’s a logical difference between the two, and—to quote Wittgenstein again—all logical differences are big differences. Logical mistakes like the one under discussion are simply unacceptable in a system that’s meant to be solidly based on logic, as relational systems are. End of aside.

Empty Arguments

The foregoing aside does raise another issue, however. Let agg be an aggregate operator. What should happen if agg is invoked on an empty argument? For example, given our usual sample data values, what value should the following statement assign to X?

     X := SUM ( SP WHERE SNO = 'S5' , QTY ) ;

The answer, of course, is zero; as explained in Chapter 6 under the discussion of n-adic join, zero is the identity value with respect to addition, and the sum of no numbers is therefore zero. More generally, in fact, if:

  1. An aggregate operator is invoked on an empty argument, and

  2. That invocation is essentially just shorthand for repeated invocation of some dyadic operator (e.g., the dyadic operator is “+” in the case of SUM), and

  3. An identity value exists for that dyadic operator, and

  4. The semantics of the aggregate operator in question do not require the result of an invocation to be a value that actually appears in the aggregate in question,

then the result of that invocation is that identity value. For the aggregate operators discussed in this section, identity values (and hence the result returned if the argument is empty) are as follows:[102]

  • AND: TRUE.

  • OR and XOR: FALSE.

  • COUNT and SUM: Zero. Note: The type of the result in these cases is INTEGER (for COUNT) and the type of the specified argument expression (for SUM). By way of example, if relvar P is currently empty, COUNT(P) returns 0 and SUM(P,WEIGHT) returns 0.0.

  • AVG: Since asking for the average of an empty set is effectively asking for zero to be divided by zero, the only reasonable response is to raise an exception (and careful coding might sometimes be called for, therefore).

  • MAX and MIN: By definition, asking for the maximum or minimum of some set of values is asking for some specific value from within that set. If the set in question happens to be empty, therefore, the only reasonable response is, again, to raise an exception (and careful coding might again sometimes be called for, therefore).

Note: For AVG, MAX, and MIN, we’re currently investigating the possibility of providing additional operators called (say) AVGX, MAXX, and MINX, respectively, each of which takes a further argument X in addition to the aggregate argument as such (the idea being that X denotes the value to be returned if the aggregate argument is empty). For the purposes of the present book, I’ll assume this scheme has indeed been implemented. Be aware, however, that the idea is still only a tentative one at this time (in particular, it might be better to provide a means of handling exceptions in general, instead of just a means of handling these special cases in particular).



[98] Nonscalar aggregate operators can be defined too, as we’ll see in the section GROUP, UNGROUP, AND RELATION VALUED ATTRIBUTES.

[99] AND, OR, and XOR apply to aggregates of boolean values specifically. AND in particular can be useful in connection with integrity constraints (see Chapter 8 for further discussion). Note: SQL’s counterparts to AND and OR are called EVERY and SOME, respectively (there’s no counterpart to XOR). SOME can alternatively be spelled ANY; likewise, in ALL or ANY comparisons (see Chapter 12), ANY can alternatively be spelled SOME. Oddly enough, however, the SQL set function EVERY can’t alternatively be spelled ALL, and in ALL or ANY comparisons ALL can’t alternatively be spelled EVERY.

[100] It might be claimed, somewhat more reasonably, that the COUNT invocations within those expressions are SQL aggregate operator invocations. But the whole point about such invocations is that they can’t appear as “stand alone” expressions in SQL; rather, they can only appear as part of some table expression, because they rely on that expression to identify the table over which the aggregation is to be done. For example, a statement like “SET X = COUNT(*);” would be meaningless in SQL, since it fails to mention the table whose rows are to be counted.

[101] The lack of doubly underlined columns in these tables is not an error.

[102] By contrast, as noted in Chapter 4, the SQL analogs of these operators all return null if their argument is empty (except for COUNT and COUNT(*), which do correctly return zero).

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.17.179.239