SUMMARIZATION

Definition: Let relations r1 and r2 be such that r2 has the same heading as some projection of r1, and let the attributes of r2 be A, B, ..., C. Then the summarization SUMMARIZE r1 PER (r2) : {X := summary} is a relation with (a) heading the heading of r2 extended with attribute X and (b) body the set of all tuples t such that t is a tuple of r2 extended with a value x for attribute X. That value x is computed by evaluating summary over all tuples of r1 that have the same value for attributes A, B, ..., C as tuple t does. Observe that the result has cardinality equal to that of r2 and degree equal to that of r2 plus one. The type of X in that result is the type of summary. Note: I assume for simplicity that relations r1 and r2 don’t already have an attribute named X.

Here’s an example (which I’ll label SX1—“SUMMARIZE Example 1”—for purposes of subsequent reference):

     SUMMARIZE SP PER ( S { SNO } ) : { PCT := COUNT ( PNO ) }

Given our usual sample values, the result looks like this:

SNO

PCT

S1

6

S2

2

S3

1

S4

3

S5

0

In other words, the result contains one tuple for each tuple in the PER relation—i.e., one tuple for each of the five supplier numbers, in the example—extended with the corresponding count.

Aside: Note carefully that the “summary” COUNT(PNO)—I deliberately call it a “summary” and not an expression, because it isn’t an expression (at least, not in the usual Tutorial D sense)—in the foregoing SUMMARIZE is not an invocation of the COUNT aggregate operator. That aggregate operator takes a relation as its argument. By contrast, the argument to COUNT in the foregoing SUMMARIZE is an attribute: an attribute of some relation, of course, but just which relation is specified only indirectly. In fact, the syntactic construct COUNT(PNO) is really very special—it has no meaning outside the context of an appropriate SUMMARIZE, and it can’t be used outside that context. (Note, therefore, that my earlier criticisms of COUNT and the rest in SQL, to the effect that they can’t appear “stand alone,” apply with just as much force to Tutorial D’s “summaries.”) All of which begins to make it look as if SUMMARIZE might be not quite respectable, in a way, and it might be nice if we could replace it by something better ... See the section SUMMARIZATION bis, later. End of aside.

As a shorthand, if relation r2 doesn’t merely have the same heading as some projection of relation r1 but actually is such a projection, the PER specification can be replaced by a BY specification, as in this example (“Example SX2”):

     SUMMARIZE SP BY { SNO } : { PCT := COUNT ( PNO ) }

Here’s the result:

SNO

PCT

S1

6

S2

2

S3

1

S4

3

As you can see, this result differs from the previous one in that it contains no tuple for supplier S5. That’s because BY {SNO} in the example is defined to be shorthand for PER (SP{SNO})—SP, because SP is what we want to summarize—and relvar SP doesn’t contain a tuple for supplier S5.

Now, Example SX2 can be expressed in SQL as follows:

     SELECT SNO , COUNT ( ALL PNO ) AS PCT
     FROM   SP
     GROUP  BY SNO

As this example suggests, summarizations—as opposed to “aggregations”—are typically formulated in SQL by means of a SELECT expression with an explicit GROUP BY clause (but see later!). Points arising:

  • You can think of such expressions as being evaluated as follows. First, the table specified by the FROM clause is partitioned into set of disjoint “groups”—actually tables—as specified by the grouping column(s) in the GROUP BY clause; result rows are then obtained, one for each group, by computing the specified summary (or summaries, plural) for that group and appending other items as specified by the SELECT item commalist. Note: The SQL analog of the term summary is “set function”; the term is doubly inappropriate, however, because (a) the argument to such a function isn’t a set but a bag, in general, and (b) the result isn’t a set either.

  • It’s safe to specify just SELECT, not SELECT DISTINCT, in the example because (a) the result table is guaranteed to contain just one row for each group, by definition, and (b) each group contains just one value for the grouping column(s), again by definition.

  • The ALL specification could be omitted from the COUNT invocation in this example, because for set functions ALL is the default. (In the example, in fact, it makes no difference whether ALL or DISTINCT is specified, because the combination of supplier number and part number is a key for table SP.)

  • The set function COUNT(*) is a special case—it applies, not to values in some column (as, e.g., SUM(...) does), but to rows in some table. (In the example, the specification COUNT(PNO) could be replaced by COUNT(*) without changing the result.)

    Now let’s get back to Example SX1. Here’s a possible SQL formulation of that example:

         SELECT S.SNO , ( SELECT COUNT ( PNO )
                          FROM   SP
                          WHERE  SP.SNO = S.SNO ) AS PCT
         FROM   S

The important point here is that the result now does contain a row for supplier S5, because by definition (thanks to the FROM clause, which takes the form FROM S) that result contains one row for each supplier number in table S, not table SP. As you can see, this formulation differs from the one given for Example SX2—the one that missed supplier S5—in that it doesn’t include a GROUP BY clause, and it doesn’t do any grouping (at least, not overtly).

Aside: By the way, there’s another trap for the unwary here. As you can see, the second item in the SELECT item commalist in the foregoing SQL expression—i.e., the subexpression (SELECT ... S.SNO) AS PCT—is of the form subquery AS name (and the subquery in question is in fact a scalar one). Now, if that very same text were to appear in a FROM clause, the “AS name” specification would be understood as defining a name for the table denoted by that subquery.[103] In the SELECT clause, however, that same “AS name” specification is understood as defining a name for the pertinent column of the overall result. It follows that the following SQL expression is not logically equivalent to the one shown above:

     SELECT S.SNO , ( SELECT COUNT ( PNO ) AS PCT
                      FROM   SP
                      WHERE  SP.SNO = S.SNO )
     FROM   S

With this formulation, the table t that’s returned by evaluation of the subquery has a column called PCT. That table t is then doubly coerced to the sole scalar value it contains, producing a column value in the overall result—but (believe it or not) that column in the overall result is not called PCT; instead, it has no name. End of aside.

To revert to the main thread of the discussion: As a matter of fact, Example SX2 could also be expressed in SQL without using GROUP BY, as follows:

     SELECT DISTINCT SPX.SNO , ( SELECT COUNT ( SPY.PNO )
                                 FROM   SP AS SPY
                                 WHERE  SPY.SNO = SPX.SNO ) AS PCT
     FROM   SP AS SPX

As these examples suggest, SQL’s GROUP BY clause is in fact logically redundant—any relational expression that can be represented with it can also be represented without it. Be that as it may, there’s another point that needs to be made here. Suppose Example SX1 had requested, not the count of part numbers, but the sum of quantities, for each supplier

     SUMMARIZE SP PER ( S { SNO } ) : { TOTQ := SUM ( QTY ) }

Given our usual sample values, the result looks like this:

SNO

TOTQ

S1

1300

S2

700

S3

200

S4

900

S5

0

By contrast, this SQL expression—

     SELECT S.SNO , ( SELECT SUM ( QTY )
                      FROM   SP
                      WHERE  SP.SNO = S.SNO ) AS TOTQ
     FROM   S

—gives a result in which the TOTQ value for supplier S5 is shown as null, not zero. That’s because (as mentioned earlier) if any SQL set function other than COUNT(*) or COUNT is invoked on an empty argument, the result is incorrectly defined to be null. To get the correct result, therefore, we need to use COALESCE, as follows:

     SELECT S.SNO , ( SELECT COALESCE ( SUM ( QTY ) , 0 )
                      FROM   SP
                      WHERE  SP.SNO = S.SNO ) AS TOTQ
     FROM   S

Suppose now that Example SX1 had asked for the sum of quantities for each supplier, but only where that sum is greater than 250:

     ( SUMMARIZE SP PER ( S { SNO } ) : { TOTQ := SUM ( QTY ) } )
                                                            WHERE TOTQ > 250

Result:

SNO

TOTQ

S1

1300

S2

700

S4

900

The “natural” SQL formulation of this query would be:

     SELECT SNO , SUM ( QTY ) AS TOTQ
     FROM   SP
     GROUP  BY SNO
     HAVING SUM ( QTY ) > 250  /* not TOTQ > 250 !!! */

But it could also be formulated like this:

     SELECT DISTINCT SPX.SNO , ( SELECT SUM ( SPY.QTY )
                                 FROM   SP AS SPY
                                 WHERE  SPY.SNO = SPX.SNO ) AS TOTQ
     FROM   SP AS SPX
     WHERE  ( SELECT SUM ( SPY.QTY )
              FROM   SP AS SPY
              WHERE  SPY.SNO = SPX.SNO ) > 250

As this example suggests, HAVING, like GROUP BY, is also logically redundant—any relational expression that can be represented with it can also be represented without it. So GROUP BY and HAVING could both be dropped from SQL without any loss of relational functionality! And while it might be true that the GROUP BY and HAVING versions of some query are often more succinct,[104] it’s also true that they sometimes deliver the wrong answer. For example, consider what would happen in the foregoing example if we had wanted the sum to be less than, instead of greater than, 250. Simply replacing “>” by “<” in the GROUP BY / HAVING formulation does not work. (Does it work in the other formulation?) Recommendations: If you do use GROUP BY or HAVING, make sure the table you’re summarizing is the one you really want to summarize (typically suppliers rather than shipments, in terms of the examples in this section). Also, be on the lookout for the possibility that some summarization is being done on an empty set, and use COALESCE wherever necessary.

There’s one more thing I need to say about GROUP BY and HAVING. Consider the following SQL expression:

     SELECT SNO , CITY , SUM ( QTY ) AS TOTQ
     FROM   S NATURAL JOIN SP
     GROUP  BY SNO

Observe that CITY appears in the SELECT item commalist here but isn’t one of the grouping columns. That appearance is legitimate, however, because table S is subject to a certain functional dependency—see Chapter 8—according to which each SNO value in that table has just one corresponding CITY value (again, in that table); what’s more, the SQL standard includes rules according to which the system will in fact be aware of that functional dependency. As a consequence, even though it isn’t a grouping column, CITY is still known to be single valued per group, and it can therefore indeed appear in the SELECT clause as shown (also in the HAVING clause, if there is one).

Of course, it’s not logically wrong—though there might be negative performance implications—to specify the column as a grouping column anyway, as here:

     SELECT SNO , CITY , SUM ( QTY ) AS TOTQ
     FROM   S NATURAL JOIN SP
     GROUP  BY SNO , CITY


[103] More accurately, it would be understood as defining a corresponding range variable (see Chapter 12).

[104] Here’s another test of your SQL knowledge: In the example under discussion, would it be possible to save some keystrokes by using WITH to introduce a name for the common subexpression “(SELECT SUM(SPY.QTY) ...)”?

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.148.102.142