The summary statements shown so far calculate summary values
over all rows in the result set. For example, the following statement
determines the number of records in the mail
table, and thus the total number of
mail messages that have been sent:
mysql>SELECT COUNT(*) FROM mail;
+----------+
| COUNT(*) |
+----------+
| 16 |
+----------+
Sometimes it’s desirable to break a set of rows into subgroups
and summarize each group. Do this by using aggregate functions in
conjunction with a GROUP
BY
clause. To determine the number of
messages per sender, group the rows by sender name, count how many
times each name occurs, and display the names with the counts:
mysql>SELECT srcuser, COUNT(*) FROM mail
->GROUP BY srcuser;
+---------+----------+ | srcuser | COUNT(*) | +---------+----------+ | barb | 3 | | gene | 6 | | phil | 5 | | tricia | 2 | +---------+----------+
That query summarizes the same column that is used for grouping
(srcuser
), but that’s not always
necessary. Suppose that you want a quick characterization of the
mail
table, showing for each sender
listed in it the total amount of traffic sent (in bytes) and the
average number of bytes per message. In this case, you still use the
srcuser
column to place the rows in
groups, but the summary functions operate on the size
values:
mysql>SELECT srcuser,
->SUM(size) AS 'total bytes',
->AVG(size) AS 'bytes per message'
->FROM mail GROUP BY srcuser;
+---------+-------------+-------------------+ | srcuser | total bytes | bytes per message | +---------+-------------+-------------------+ | barb | 156696 | 52232.0000 | | gene | 1033108 | 172184.6667 | | phil | 18974 | 3794.8000 | | tricia | 2589407 | 1294703.5000 | +---------+-------------+-------------------+
Use as many grouping columns as necessary to achieve as fine-grained a summary as you require. The earlier query that shows the number of messages per sender is a coarse summary. To be more specific and find out how many messages each sender sent from each host, use two grouping columns. This produces a result with nested groups (groups within groups):
mysql>SELECT srcuser, srchost, COUNT(srcuser) FROM mail
->GROUP BY srcuser, srchost;
+---------+---------+----------------+ | srcuser | srchost | COUNT(srcuser) | +---------+---------+----------------+ | barb | saturn | 2 | | barb | venus | 1 | | gene | mars | 2 | | gene | saturn | 2 | | gene | venus | 2 | | phil | mars | 3 | | phil | venus | 2 | | tricia | mars | 1 | | tricia | saturn | 1 | +---------+---------+----------------+
The preceding examples in this section have used COUNT()
,
SUM()
,
and
AVG()
for
per-group summaries. You can use MIN()
or MAX()
,
too. With a GROUP
BY
clause, they will tell you the
smallest or largest value per group. The following query groups
mail
table rows by message sender,
displaying for each the size of the largest message sent and the date
of the most recent message:
mysql>SELECT srcuser, MAX(size), MAX(t) FROM mail GROUP BY srcuser;
+---------+-----------+---------------------+
| srcuser | MAX(size) | MAX(t) |
+---------+-----------+---------------------+
| barb | 98151 | 2006-05-14 14:42:21 |
| gene | 998532 | 2006-05-19 22:21:51 |
| phil | 10294 | 2006-05-17 12:49:23 |
| tricia | 2394482 | 2006-05-14 17:03:01 |
+---------+-----------+---------------------+
You can group by multiple columns and display a maximum for each
combination of values in those columns. This query finds the size of
the largest message sent between each pair of sender and recipient
values listed in the mail
table:
mysql>SELECT srcuser, dstuser, MAX(size) FROM mail GROUP BY srcuser, dstuser;
+---------+---------+-----------+
| srcuser | dstuser | MAX(size) |
+---------+---------+-----------+
| barb | barb | 98151 |
| barb | tricia | 58274 |
| gene | barb | 2291 |
| gene | gene | 23992 |
| gene | tricia | 998532 |
| phil | barb | 10294 |
| phil | phil | 1048 |
| phil | tricia | 5781 |
| tricia | gene | 194925 |
| tricia | phil | 2394482 |
+---------+---------+-----------+
When using aggregate functions to produce per-group summary
values, watch out for the following trap, which involves selecting
nonsummary table columns not related to the grouping columns. Suppose
that you want to know the longest trip per driver in the driver_log
table. That’s produced by this
query:
mysql>SELECT name, MAX(miles) AS 'longest trip'
->FROM driver_log GROUP BY name;
+-------+--------------+ | name | longest trip | +-------+--------------+ | Ben | 152 | | Henry | 300 | | Suzi | 502 | +-------+--------------+
But what if you also want to show the date on which each
driver’s longest trip occurred? Can you just add trav_date
to the output column list? Sorry,
that won’t work:
mysql>SELECT name, trav_date, MAX(miles) AS 'longest trip'
->FROM driver_log GROUP BY name;
+-------+------------+--------------+ | name | trav_date | longest trip | +-------+------------+--------------+ | Ben | 2006-08-30 | 152 | | Henry | 2006-08-29 | 300 | | Suzi | 2006-08-29 | 502 | +-------+------------+--------------+
The query does produce a result, but if you compare it to the full table (shown following), you’ll see that although the dates for Ben and Henry are correct, the date for Suzi is not:
+--------+-------+------------+-------+ | rec_id | name | trav_date | miles | +--------+-------+------------+-------+ | 1 | Ben | 2006-08-30 | 152 |← Ben’s longest trip | 2 | Suzi | 2006-08-29 | 391 | | 3 | Henry | 2006-08-29 | 300 | ← Henry’s longest trip | 4 | Henry | 2006-08-27 | 96 | | 5 | Ben | 2006-08-29 | 131 | | 6 | Henry | 2006-08-26 | 115 | | 7 | Suzi | 2006-09-02 | 502 | ← Suzi’s longest trip | 8 | Henry | 2006-09-01 | 197 | | 9 | Ben | 2006-09-02 | 79 | | 10 | Henry | 2006-08-30 | 203 | +--------+-------+------------+-------+
So what’s going on? Why does the summary statement produce
incorrect results? This happens because when you include a GROUP
BY
clause in a query, the only values that you can select are the grouped
columns or summary values calculated from the groups. If you display
additional table columns, they’re not tied to the grouped columns and
the values displayed for them are indeterminate. (For the statement
just shown, it appears that MySQL may simply be picking the first date
for each driver, regardless of whether it matches the driver’s maximum
mileage value.)
The general solution to the problem of displaying contents of rows associated with minimum or maximum group values involves a join. The technique is described in Chapter 12. For the problem at hand, the required results are produced as follows:
mysql>CREATE TABLE t
->SELECT name, MAX(miles) AS miles FROM driver_log GROUP BY name;
mysql>SELECT d.name, d.trav_date, d.miles AS 'longest trip'
->FROM driver_log AS d INNER JOIN t USING (name, miles) ORDER BY name;
+-------+------------+--------------+ | name | trav_date | longest trip | +-------+------------+--------------+ | Ben | 2006-08-30 | 152 | | Henry | 2006-08-29 | 300 | | Suzi | 2006-09-02 | 502 | +-------+------------+--------------+
18.224.30.19