12.10. SQL: Grouping

Sometimes we wish to partition the rows of a table into a number of groups and display properties that apply to each group as a whole. SQL provides a group by clause that may be included in the select statement to retrieve such grouped data. Each group is a bag of rows with the same value(s) for the grouping criteria listed in the group by clause. The final query result has at most one row for each group. Figure 12.56 pictures grouping by column a. Here there are three groups, one for each of the a values. The result may include some group property or function of a, depicted here asf(a).

Since the final result includes just one row for each group, the select list may include group properties only. In other words, for each group, each select item has only one value. To ensure this, each group property must be a grouping criterion (a column name used for grouping), a function call (e.g., count(*)), a constant, or an expression formed from these terms (e.g., 2 * groupcol).

Figure 12.56. Partitioning rows into groups with the same value for column a.


The basic syntax is as follows:

selectgroup-propertyl,...

from ...

[where...]

group by group-criterion1,...

Table 12.14 will be used to help explain most of these ideas. The relation scheme is Particle (particleName, family, charge, mass). To help you visualize the grouping into families, the rows have been sorted by family, and a blank line has been added between the groups. The table lists the main atomic particles belonging to the lepton (light), meson (middle), and baryon (heavy) families. The charge of each particle is given in elementary charges (elementary charge = charge of proton = 1.6 × 10-19 Coulomb). The mass of each particle is expressed as a multiple of the electron mass (9.1 × 10-31kg).

Table 12.14. Some atomic particles.


Consider the following query: List the families and the number of particles in each. This may be formulated in SQL as shown. For this section, the result tables are listed with the column headers underlined by hyphens (this is typical of most SQLs).

select family, count(*)familycount(*)-- function header
from Particle -------------------- may be omitted
group by family baryon8--in some SQLs
  lepton4 
  meson7 

Here the group by clause indicates that the particles are to be grouped into families (i.e., particles with the same family name are placed in the same group). The items in the select list are then applied to each group as a whole.

If count (*) is used, it returns the number of rows in the group. In this case we have eight baryons, four leptons, and seven mesons. If an expression other than a column name is used in the select list, some SQLs use the expression as a column header in the result (as shown here), whereas others, including SQL Server, omit the column header. It’s normally best to use an as clause after the expression, to provide your own header.

In many versions of SQL, the use of group by causes a default ordering equivalent to a matching order by clause (e.g., order by family in this case). However, this is not part of the SQL standard. The syntax of the group by clause is:

group by col-name[,...]

Members of the same group must have matching values for all of the columns named in the group by clause.

The next example places particles into groups having both the same family name and the same charge.

select family, charge, count(*) as tallyfamilychargetally
from Particle ---------------------------
group by family, charge baryon- 12
order by family, charge baryon04
  baryon12
  lepton-12
  lepton02
  meson- 12
  meson03
  meson12

If more than one column is used for grouping, the order in which these columns are specified is irrelevant to the content of the result. This is because the same set of groups must be formed. For example, if we group by attributes a, b, or b, a then the group with values a= al, b = bl is the same as the group with values b = bl,a= al, and so on. However, if the row order is based on the group order, the order of rows in the result may be different. For example, the following query returns the same rows as the previous query, in a different order. You can adjust the order by clause to change the order:

select family, charge, count(*) as tallyfamilychargetally
from Particle ---------------------------
group by charge, family baryon- 12
order by charge, family lepton- 12
  meson- 12
  baryon04
  lepton02
  meson03
  baryon12
  meson12

If a function call in the select list is used as an ordering criterion, you should introduce an alias for it so that you can reference it by name in the order by clause. For example, the following query lists each family and the mass of its lightest particle, ordering by that minimum mass. As an undesirable alternative, you can reference it by its position in the select list (e.g., order by 2).

select family, min (mass) as minMassfamilyminMass
from Particle ------------------------
group by family lepton0
order by minMass meson264
  baryon1836

Use of a group by clause implies that items in the select list are to be applied to each group as a whole, returning just one value for each group. So any columns mentioned in the select list must either be grouped columns or must appear there as an argument to a built-in function (e.g., family and mass in the aforementioned query). So when grouping is being performed it makes no sense, and is in fact illegal, to include a nongrouped column as a whole item in the select list. For example:

select family, mass from Particle — Error!

group by family

Since the same family may include several particles of different mass, this request is plain silly (which mass would we choose?). We can, however, obtain a single grouped value related to mass by using a built-in function, such as min (mass) or avg (mass), as we did with the previous query.

Although the select list must not include any nongrouped columns, it is legal to group by columns that are not mentioned in the select list. For example, the following query lists the minimum mass, but not the name, of each family.

select min (mass) as minMassminMass
from Particle ------
group by family 0
  264
  1836

Within the select statement, a where clause may be included to filter out unwanted rows before the groups are formed and functions are applied. Sensibly, the where clause must be placed before the group by clause. For example, the following query lists each family and the average mass of its uncharged particles. To cater for dialects such as SQL Server that truncate averages, the mass has been coerced to real by multiplying by 1.0, and the cast function is used to display just two digits after the decimal point. Note that it’s legal to apply a scalar function to a bag function, e.g., cast(avg(...)). It is, however, illegal to apply a bag function to a bag function, e.g., min(avg(...)).

select family, cast (avg (1.0*mass) as decimal (6,2)) as avgUnchargedMass   
from Particle   
where charge = 0familyavgUnchargedMass
group by family -----------------------------
  baryon2232.25
  lepton0.00
  meson770.65

Sometimes we wish to restrict our attention to only those groups that satisfy a certain condition. Just as a where clause may be used to impose a search condition on individual rows, a having clause may be used to impose a search condition on individual groups. In this case, the having clause must be placed straight after the group by clause to which it is logically connected. The syntax is:

group bycol-name [,...]

having group-condition

The usual rules for search conditions apply. Logical and comparison operators may be included in the condition. Each comparison compares some property of the group, such as min (mass), with a constant or another group property. (Although rare, it is possible to include a having clause without an associated group by clause: in this case the entire table is treated as one group.) As a simple example, the following query lists the families with more than four particles, and the number of particles for each. Although it makes sense to allow aliases (e.g., tally) in the having clause, this is not allowed in many SQLs (e.g., SQL Server).

select family, count(*) as tallyfamilytally
from Particle ------------
group by family baryon8
having count(*)> 4 meson7

The next example is harder. To help clarify how it works, the numbers 1 through 6 have been added to indicate the order in which the query lines are processed. These numbers are not part of the query. Trace through the steps yourself to make sure that you follow this order. With respect to the having clause, note that the charged leptons are included (uncharged leptons with zero mass are filtered out by the where clause) and that the charged baryons are excluded (their maximum mass exceeds 2 000).

Considering only the charged particles, form groups with the same family name, where the group’s lightest particle has a mass above 0 and the group’s heaviest particle has a mass below 2000. List the family name, number of particles, and largest mass for each group, showing groups with more members first.

5 select family, count (*) as chargedTally, max(mass) as maxChargedMass

1 from Particle

2 where charge <> 0

3 group by family

4 having min(mass) > Oand max (mass) < 2000

6 order by chargedTally desc

familychargedTallymaxChargedMass
 --------------------------------------
 meson4966
 lepton2207

One trivial use of grouping is simply to avoid duplicates. For example, the following query may be used as an alternative to select distinct family from Particle:

select family from Particlefamily
group by family ------
  baryon
  lepton
  meson

Now suppose we issue the following query. What will be the result?

select family from Particle

group by family

having count (*) = 4

Comparing this with the previous query, you may feel that the count of each family will be 1 and hence no rows will be returned (i.e., that the result is the null set). This is wrong! When the group by clause does its work, each group still has all its members (which of course all have the same group value). The having clause now filters out those groups that do not satisfy its search condition. Finally the select list determines which group properties are listed. With this query, the groups are first sorted into three families, two of these families are eliminated (the baryons and mesons), and finally just the family name “lepton” of the remaining group (the four leptons) is listed.

As a further application of grouping, let’s consider how simple cases of relational division can be handled. Figure 12.57 shows an SQL query for an example handled earlier in Figure 12.27 by relational division. Here the column name “language” is double quoted because it is a reserved word.

Figure 12.57. Grouping can be used to perform relational division.


Although SQL provides no operator for relational division, we can achieve the same effect by grouping as shown. The first subquery yields (‘English’, ‘French’), so the where clause filters out the Dutch and Spanish rows before the groups are formed. Because of the uniqueness constraint on Speaks, the number of Canadian languages spoken by a country is now the number of rows in the group for that country. So the left count (*) of the having condition returns the values 1, 1, 2, and 2 for Australia, Belgium, Canada, and Dominica, respectively. The second subquery evaluates to 2, so the overall query returns Canada and Dominica.

This approach can be adapted to handle cases when the relevant column association is not unique by using the distinct option. For example, consider the relation schemes Localization(productCode. “lanauage”) and LanguageUse(“language”, country. nrSpeakers). The query “Which products are localized into all languages spoken by a country where at least 10 million speak that language?” requires count (distinct (“language”)) instead of count (*) in the subquery of the having clause. As discussed later, this grouping approach can be extended to cater for queries that involve set comparisons other than those covered by relational division.

We have now covered most of the basic working of SQL’s select statement. The BNF syntax covered so far may be summarized as shown. Unlike the syntax shown here, the SQL standard actually allows a having clause without a group by clause (effectively meaning “group by no columns”) but the semantics are problematic so we’ll ignore this option.

select * | [all | distinct] expression[[as] column-alias][,...]
  from table [[as] alias]
     [,] cross join table[[as] alias]
           | natural [inner | [outer] {left | right | full}] join table[[as] alias]
      | [inner | [outer] {left | right | full}] join table[[as] alias]

               { on condition | using (col-list)}
| union join table [[as] alias]
                  [,...]]
[where col-condition]
[group by colname[,...]
    [having group-condition]]
[union| intersect | except... ]
[order by col[ase | desc] [,...] ]

Of the eight table operations of relational algebra, we have discussed the equivalent SQL formulation for the general case of five (projection, selection, Cartesian product, ©-join, and union) and for special cases of three (intersection, difference, and division). The general cases of intersection, difference, and division can be handled using joins, and also with correlated and existential subqueries, as discussed in the next section.

Exercise 12.10

  1. This question refers to the library database discussed in Exercise 12.9. The relational schema is repeated here.

    Assume ISBN is based on a character data type. Formulate SQL queries for the following requests.

    1. How many authors of each gender are there?

    2. Place authors into groups of the same gender and nationality, indicating how many there are in each group, with the larger groups shown first.

    3. Considering only the books published after 1980, list the publishers and the total number of copies of their books in the library.

    4. For each publisher having an average number of library copies per book above two, show the earliest publication year.

    5. List the ISBN and the number of authors for each book published before 1986 that has more than one author.

    6. Restricting attention to male authors of a book published by Hall or Possum, list the number of such authors for each nationality that has at least as many authors of this kind as there are copies in the library of the book with ISBN “444”.

  2. The relational schema for a dietary database is shown (for a sample population, see Exercise 12.8). Formulate the following queries in SQL.

    1. Who eats all the foods?

    2. Which foods are eaten by all eaters?

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.145.93.136