The SQL construction GROUP BY
is a SELECT
statement clause that is designed to be used in conjunction with aggregation (discussed in Chapter 5) to group data of similar types. An aggregate function is one that extracts information—such as a COUNT
of rows or an average, minimum, or maximum—by operating on multiple rows. We first discuss using GROUP BY
on one column, and then on two columns. Then, we look at how to use GROUP BY
in conjunction with the ORDER BY
, HAVING
, and WHERE
clauses. Finally, we discuss aggregation with subqueries and complexities that nulls present in aggregate functions and
other queries. As we introduce the GROUP BY
and HAVING
, and expand on the ORDER BY
(which has been introduced earlier) in this chapter, we first present a SELECT
in modified BNF showing the GROUP BY
, HAVING
and ORDER BY
, before we start the rest of the discussion.
BNF, short for Backus Naur Form, is used to describe syntax rules. A general form (in modified BNF) of the SELECT
statement for SQL Server, with the FROM
, WHERE
, GROUP BY
, HAVING
and ORDER BY
would be:
SELECT result-set [FROM Tables] [WHERE row-filter] [GROUP BY column names] [HAVING after-filter on groups] [ORDER BY column names]
The [..]
notation means that the contained code is optional.
GROUP BY
is used in conjunction with aggregate functions to group data on the basis of the same values in a column. GROUP BY
returns one row for each value of the column(s) that is grouped. You can use GROUP BY
to group by one column or multiple columns.
As an example of how to group by one column, the following statement shows how you can use the aggregate COUNT
to extract the number of class
groups (number of students in each class) from the Student
table:
SELECT class, COUNT(*) AS [count] FROM Student GROUP BY class
This query produces the following five rows of output, which is grouped by one column, class
:
class count ----- ----------- NULL 10 1 11 2 10 3 7 4 10 (5 row(s) affected)
This type of statement gives you a new way to retrieve and organize aggregate data. Other aggregate functions would have a similar syntax.
If a GROUP BY
clause contains a two-column specification, the result is aggregated and grouped by two columns. For example, the following is COUNT
of class
and major
from the Student
table:
SELECT class, major, COUNT(*) AS [count] FROM Student GROUP BY class, major
This query produces the following output (24 rows), which is grouped by class
within major
:
class major count ----- ----- ----------- NULL NULL 3 2 ACCT 1 4 ACCT 4 3 ART 1 3 CHEM 1 4 CHEM 1 NULL COSC 1 1 COSC 4 2 COSC 2 4 COSC 3 NULL ENGL 1 1 ENGL 3 2 ENGL 2 3 ENGL 4 NULL MATH 2 2 MATH 3 3 MATH 1 4 MATH 1 NULL POLY 2 1 POLY 3 2 POLY 2 4 POLY 1 NULL UNKN 1 1 UNKN 1 (24 row(s) affected)
The sequence of the columns in a GROUP BY
clause has the effect of ordering the output. If we change the order of the GROUP BY
like this:
SELECT class, major, COUNT(*) AS [count] FROM Student GROUP BY major, class
our result will look like this:
class major count ----- ----- ----------- NULL NULL 3 NULL COSC 1 NULL ENGL 1 NULL MATH 2 NULL POLY 2 NULL UNKN 1 1 COSC 4 1 ENGL 3 1 POLY 3 1 UNKN 1 2 ACCT 1 2 COSC 2 2 ENGL 2 2 MATH 3 2 POLY 2 3 ART 1 3 CHEM 1 3 ENGL 4 3 MATH 1 4 ACCT 4 4 CHEM 1 4 COSC 3 4 MATH 1 4 POLY 1 (24 row(s) affected)
Here the output is grouped by major
within class
.
A statement like the following will cause a syntax error, because it says that you are to count both class
and major
, but GROUP BY class
only:
SELECT class, major, COUNT(*) FROM Student GROUP BY class
This query results in the following error message:
Msg 8120, Level 16, State 1, Line 1 Column 'Student.MAJOR' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
To be syntactically and logically correct, you must have all the non aggregate columns of the result set in the GROUP BY
clause. For example, let’s take a look at the data of Table 9-1.
The following query would be improper, because you must GROUP BY
"ohead
" to SUM
capacities for each ohead
value:
SELECT ohead, SUM(capacity) FROM Room
ohead
, an attribute in the Room
table (in our Student_Course
database), is short for rooms with overhead projectors.
This query would produce an error message similar to what we saw previously:
Msg 8120, Level 16, State 1, Line 1 Column 'Room.OHEAD' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
If you SELECT
columns and use an aggregate function, you must GROUP BY
the non aggregate attributes. The correct version of the last statement is as follows:
SELECT ohead, SUM(capacity) AS [sum] FROM Room GROUP BY ohead
which produces the following three rows of output:
ohead sum ----- ----------- NULL 100 N 110 Y 142 Warning: Null value is eliminated by an aggregate or other SET operation. (3 row(s) affected)
This is the sum of room capacities for rooms that have no overhead projectors (N
), rooms that have overhead projectors (Y
), and rooms in which the overhead projector capacity is unknown (null).
Observe that in the Room
table, some rooms have null values for ohead
, and the null rows are summed and grouped along with the non-null rows.
To enhance the display of a GROUP BY
clause, you can combine it with an ORDER BY
clause. Consider the following example:
SELECT class, major, COUNT(*) AS [count] FROM Student GROUP BY class, major
The output for this query was presented earlier in the chapter.
This result set can also be ordered by any other column from the result set using the ORDER BY
. For instance, the following example orders the output in descending order by COUNT(*)
:
SELECT class, major, COUNT(*) AS [count] FROM Student GROUP BY class, major ORDER BY COUNT(*) DESC
This query produces the following output (24 rows):
class major count ------ ----- ----------- 4 ACCT 4 1 COSC 4 3 ENGL 4 2 MATH 3 4 COSC 3 1 ENGL 3 NULL NULL 3 1 POLY 3 2 POLY 2 NULL POLY 2 2 COSC 2 2 ENGL 2 NULL MATH 2 3 MATH 1 4 MATH 1 NULL ENGL 1 2 ACCT 1 3 ART 1 3 CHEM 1 4 CHEM 1 NULL COSC 1 4 POLY 1 NULL UNKN 1 1 UNKN 1 (24 row(s) affected)
When a SELECT
clause includes all the columns specified in a GROUP BY
clause, the use of the DISTINCT
function is unnecessary and inefficient, because the GROUP BY
clause groups rows in such a way that the column(s) that are grouped will not have duplicate values.
The GROUP BY
and HAVING
clauses are used together. The HAVING
clause is used as a final filter (rather than as a conditional filter) on the aggregate column values in the result set of a SELECT
statement. In other words, the query has to be grouped before the HAVING
clause can be applied. For example, consider the following statement, which displays the count of students in various classes (classes of students = 1, 2, 3, 4, corresponding to freshman, sophomore, and so on):
SELECT class, COUNT(*) AS [count] FROM Student GROUP BY class
This query produces the following output:
class count ----- ----------- NULL 10 1 11 2 10 3 7 4 10 (5 row(s) affected)
If you are interested only in classes that have more than a certain number of students in them, you could use the following statement:
SELECT class, COUNT(*) AS [count] FROM Student GROUP BY class HAVING COUNT(*) > 9
which produces the following four rows of output:
class count ----- ----------- NULL 10 1 11 2 10 4 10 (4 row(s) affected)
Whereas HAVING
is a final filter in a SELECT
statement, the WHERE
clause, which excludes rows from a result set, is a conditional filter. HAVING
is used to filter based on aggregate values, WHERE
cannot do that. Consider the following two queries:
SELECT class, COUNT(*) AS [count] FROM Student GROUP BY class HAVING class = 3 SELECT class, COUNT(*) AS [count] FROM Student WHERE class = 3 GROUP BY class
Both queries produce the following output:
class count ----- ----------- 3 7 (1 row(s) affected)
In a typical implementation, the first of these two queries is less efficient because the query engine has to complete the query before removing rows WHERE class = 3
from the result. In the second version, the rows WHERE class = 3
are removed before the grouping takes place. WHERE
is not always a substitute for HAVING
, but when it can be used instead of HAVING
, it should be. Notice that in the example:
SELECT class, COUNT(*) AS [count] FROM Student GROUP BY class HAVING COUNT(*) > 9
HAVING
and WHERE
are not interchangeable because the grouping has to take place before the HAVING
could have an effect. You cannot know in advance what the counts for each class are until they are counted.
Consider the following query, its meaning, and the processing that is required to finalize the result set:
SELECT class, major, COUNT(*) AS [count] FROM Student WHERE major = 'COSC' GROUP BY class, major HAVING COUNT(*) > 2
This query produces the following output:
class major count ----- ----- ----------- 1 COSC 4 4 COSC 3 (2 row(s) affected)
In this example, all computer science (COSC
) majors (per the WHERE
clause) will be grouped and COUNT
ed and then displayed only if COUNT(*) > 2
. The query might erroneously be interpreted as “Group and count all COSC majors by class, but only if there are more than two in a class.” This interpretation is wrong, because SQL applies the WHERE
, then applies the GROUP BY
, and, finally, filters with the HAVING
criterion.
A “usual” GROUP BY
has an aggregate and a column that are grouped like this:
SELECT COUNT(stno) AS [count of student no], class FROM Student GROUP BY class
This produces a result set of 5 rows of counts by class
:
count of student no class ------------------- ----- 10 NULL 11 1 10 2 7 3 10 4 (5 row(s) affected)
Although you must have class
or some other attribute in the GROUP BY
, you do not have to have the class
in the result set. Consider the following query, which generates the same numeric information as the previous query, but does not report the class
in the result:
SELECT COUNT(stno) AS [count of student no] FROM Student GROUP BY class
This query produces the following five rows of output:
count of student no ------------------- 10 11 10 7 10 (5 row(s) affected)
This previous example may seem contradictory to the preceding discussion, but it is not. You must have all the non aggregate columns from the result set in the GROUP BY
, but you do not have to have the columns in the result set that you are grouping. That example may prove useful when a grouped result is needed in a filter. For example, how would you find the class with the most students?
SQL Server 2005 will not allow you to handle aggregation and grouping by nesting
aggregates. For example, suppose you want to find the class
with the minimum number of students. You might try the following query:
SELECT MIN(COUNT(stno)) FROM Student GROUP BY class
Though it may seem logical, this query will not work in SQL Server 2005. It will produce the following error message:
Msg 130, Level 15, State 1, Line 1 Cannot perform an aggregate function on an expression containing an aggregate or a subquery.
The MIN
function is an aggregate function, and aggregate functions operate on rows within tables. In this case, the query is asking MIN
to operate on a table of counted classes that have not yet been calculated. The point is that SQL Server 2005 does not handle this mismatch of aggregation and grouping.
To handle this mismatch of aggregation and grouping in SQL Server 2005, you can use derived structures such as temporary tables, inline views, or regular views (derived structures are covered in Chapter 6). Using either a temporary table or an inline view is the most logical way to solve this problem, so only these two choices are described here.
This section shows how we can handle the mismatch of aggregation and grouping (described earlier) using a global temporary table.
The following steps describe how to use a global temporary table to find the class
with the minimum number of students:
Display the counts of classes, grouped by class
:
SELECT COUNT(stno) AS [count of students] FROM Student GROUP BY class
This query produces the following five rows of output:
count class ----------- ------ 10 NULL 11 1 10 2 7 3 10 4 (5 row(s) affected)
To find the minimum number of students in a class, count the students (you could use stno
for student number) grouped by class
, and put this result in ##Temp1
(a global temporary table)--shown by the first query following, and then find the minimum number of students in a class from the global temporary table, ##Temp1
, with SELECT MIN(count) AS [MINIMUM COUNT] FROM ##Temp1
, and then use this information in a subquery with a HAVING
clause as follows: First type the query:
SELECT (COUNT([stno])) AS [count], class INTO ##Temp1 FROM Student GROUP BY [class]
After executing the previous query, type:
SELECT COUNT(stno) AS [count of stno], class FROM Student GROUP BY class HAVING COUNT(stno) = (SELECT MIN(count) AS [Minimum count] FROM ##Temp1)
This query produces the desired output (the class with the minimum number of students):
count of stno class ------------- ----- 7 3 (1 row(s) affected)
As described in Chapter 6, you can put a query in the FROM
clause of a SELECT
statement to create an inline view. An inline view exists only during the execution of a query.
The following steps describe how to use an inline view to find the class
with the minimum number of students:
Count the stno
in the FROM
clause of the SELECT
statement as follows:
SELECT "Min of Count" = MIN(c) FROM (SELECT c = COUNT(stno) FROM Student GROUP BY class) AS in_view
Because SQL Server 2005 cannot directly find aggregates of aggregates, in the previous query, we give a name to the COUNT
in the inline view, c
, to temporarily store the aggregate result in the inline view, in_view
. We then operate on the inline view as though it were a table and find the minimum value for c.
The previous query produces the following output:
Min of Count ------------ 7 (1 row(s) affected)
To find out which class has the minimum count, you can write the final query using the previous query as a subquery with a HAVING
clause in the outer part of the final query, as follows:
SELECT class, "Count of Class" = COUNT(*) FROM Student GROUP BY class HAVING COUNT(*) = (SELECT MIN(c) FROM (SELECT COUNT(stno) AS [c] FROM Student GROUP BY class) AS in_view)
This query produces the desired output:
class Count of Class ----- -------------- 3 7 (1 row(s) affected)
So, although SQL Server 2005 does not handle a mismatch of aggregation and HAVING
, you can use your knowledge of temporary tables and inline views to work around the problem. This problem may also be solved using regular views. It is also noteworthy to see the process of query development in that some problems require using small queries and building from
them to a final result.
Once again, Chapter 6 covers the advantages and disadvantages of using each one of the derived structures.
In this section, we consider a potential problem of using aggregation with subqueries. As with Cartesian products and joins, aggregation hides details and should always be audited. The two tables that follow will be used to illustrate this problem.
Table 9-2 is similar to the Grade_report
table and contains a student section identifier (ssec
), grades (gd
), and student names (sname
).
Tables 9-2 and 9-3 (GG
and SS
) have not been created for you. You have to create them (and insert the records shown) and then run the queries that follow.
Table 9-3 is similar to the Section
table and contains a section identifier (sec
) and an instructor name (iname
).
Now suppose that you want to find out how many As each instructor awarded. You might start with a join of the GG
and SS
tables. A normal equi-join would be as follows:
SELECT * FROM GG, SS WHERE GG.ssec = SS.sec
This query would produce the following output (nine rows):
ssec gd sname sec iname ----------- ---- ------------ ----------- ------------ 100 A Brenda 100 Jones 110 B Brenda 110 Smith 120 A Brenda 120 Jones 200 A Brenda 200 Adams 210 A Brenda 210 Jones 100 A Richard 100 Jones 100 B Doug 100 Jones 200 A Richard 200 Adams 110 B Morris 110 Smith (9 row(s) affected)
In addition, the following query tells you that there are six As in the GG
table:
SELECT COUNT(*) AS [Count of As] FROM GG WHERE gd = 'A'
giving:
Count of As ------------ 6 (1 row(s) affected)
Now, if you want to find out which instructor gave the As, you would type this query:
SELECT SS.iname FROM SS, GG WHERE SS.sec = GG.ssec AND GG.gd = 'A'
You get the following six rows of output:
iname ------------ Jones Jones Adams Jones Jones Adams (6 row(s) affected)
Now, to find “how many” As each instructor gave, include a COUNT
and GROUP BY
as follows:
SELECT SS.iname AS [iname], COUNT(*) AS [count] FROM SS, GG WHERE SS.sec = GG.ssec AND GG.gd = 'A' GROUP BY SS.iname
This query produces the following output:
iname count ------------ ----------- Adams 2 Jones 4 (2 row(s) affected)
This shows that instructor Adams gave two As and instructor Jones gave four As. So far, so good. You should note that the final count/grouping has the same number of As as the original tables—the sum of the counts equals 6. Now, if you had devised a COUNT
query with a sub-SELECT
, you could get an answer that looked correct but in fact was not. For example, consider the following subquery version of the preceding join query:
SELECT SS.iname AS [iname], COUNT(*) AS [count] FROM SS WHERE SS.sec IN (SELECT GG.ssec FROM GG WHERE GG.gd = 'A') GROUP BY SS.iname
This query produces the following output:
iname count ------------ ----------- Adams 1 Jones 3 (2 row(s) affected)
The reason that you get this output is that the second query is counting names of instructors and whether an A is present in the set of courses that this instructor teaches—not how many As are in the set, just whether any exist. The previous join query gives you all the As in the joined table and hence gives the correct answer to the question “How many As did each instructor award?” The sub-SELECT
ed query answers a different question: “In how many sections did the instructor award an A?”
The point in this example is that if you are SELECT
ing and COUNT
ing, it is a very good idea to audit your results often. If you want to COUNT
the number of As by instructor, begin by first counting how many As there are. Then, you can construct a query to join and count. You should be able to total and reconcile the number of As to the number of As by instructor. The fact that the result makes sense is very useful in determining (albeit not proving) correctness.
Nulls present a complication with regard to aggregate functions and other queries, because nulls are never equal to, less than, greater than, or not equal to any value. Using aggregates by themselves on columns that contain nulls will ignore the null values. For example, suppose you have the following Table 9-4 called Sal
.
Table 9-4 (Sal
) has not been created for you. You have to create it to run the queries that follow.
Now consider the following query:
SELECT COUNT(*) AS [count], AVG(salary) AS [average], SUM(salary) AS [sum], MAX(salary) AS [max], MIN(salary) AS [min] FROM Sal
which produces the following output:
count average sum max min ----------- ----------- ----------- ----------- ----------- 4 2000.00 6000.00 3000.00 1000.00 Warning: Null value is eliminated by an aggregate or other SET operation. (1 row(s) affected)
COUNT (*)
counts all the rows. But, the AVERAGE
, SUM
, MAX
, and MIN
functions ignore the nulled salary row in computing the aggregates. Counting columns also indicates the presence of nulls. If you count by using the following query:
SELECT COUNT(name) AS [Count of Names] FROM Sal
you get:
Count of Names -------------- 4 (1 row(s) affected)
If you use the “salary” column, you get:
SELECT COUNT(salary) AS [Count of salary] FROM Sal
which produces:
Count of salary --------------- 3 Warning: Null value is eliminated by an aggregate or other SET operation. (1 row(s) affected)
This result indicates that you have a null salary. If you want to include nulls in the aggregate and have a rational value to substitute for a value that is not known (a big assumption), you can use the ISNULL
function.
The ISNULL
function was introduced and discussed in Chapter 5.
ISNULL
returns a value if the value is null. ISNULL
has the form ISNULL(
column name
,
value if null
), which is used in place of the column name. For example, if you type the following:
SELECT name, ISNULL(salary, 0) AS [salary] FROM Sal
you get the following output:
name salary ------------ ----------- Joe 1000.00 Sam 2000.00 Bill 3000.00 Dave 0.00 (4 row(s) affected)
If you type the following:
SELECT COUNT(ISNULL(salary,0)) AS [Count of salary] FROM Sal
you get:
Count of salary --------------- 4 (1 row(s) affected)
The “Count of salary” is now 4 instead of the 3 that you received earlier when the ISNULL
function was not used.
If you type the following:
SELECT AVG(ISNULL(salary, 0)) AS [Average of salary] FROM Sal
you get:
Average of salary ----------------- 1500.00 (1 row(s) affected)
The “Average of salary” is now 1500.00, instead of the 2000.00 that you had received earlier because the zero value for the null was used in the calculation. What seems almost contradictory to these examples is that when grouping is added to the query, nulls in the grouped column are included in the result set. So, if the Sal
table had another column like this:
Name salary job ------------ ----------- -------------------- Joe 1000.00 Programmer Sam 2000.00 NULL Bill 3000.00 Plumber Dave NULL Programmer
And if you ran a query like this:
SELECT SUM(salary) AS [Sum of salary], job FROM Sal GROUP BY job
You would get the following output:
Sum of salary Job ------------- -------------------- 2000.00 NULL 3000.00 Plumber 1000.00 Programmer Warning: Null value is eliminated by an aggregate or other SET operation. (3 row(s) affected)
The aggregate will ignore values that are null, but grouping will compute a value for the nulled column value.
In this chapter we not only introduced the GROUP BY
and HAVING
clauses, but we also discussed what would and would not work and some efficiency issues. We discussed how aggregates and grouping can be handled in SQL Server 2005 and how it is always important to audit your queries and the results for correctness.
What do aggregate functions do?
How does the GROUP BY
clause work?
What is the difference between a GROUP BY
and ORDER BY
?
What is the HAVING
clause used for?
Can the WHERE
clause always be considered a substitute for the HAVING
clause? Why or why not?
Do functions of functions have to be handled in a special way in Server SQL 2005?
Will nulls in grouped columns be included in a result set?
How do aggregate functions treat nulls?
Does the sequence of the columns in a GROUP BY
clause have an effect on the end result?
When would it not make sense to use the GROUP BY
and DISTINCT
functions together?
Is GROUP BY
affected by nulls?
Which comes first in a SELECT
statement, an ORDER BY
or GROUP BY
? Why?
The GROUP BY
and ________________ clauses are used together.
Unless specified otherwise, use the Student_course
database to answer the following questions. Also, use appropriate column headings when displaying your output.
Display a list of courses (course names) that have prerequisites and the number of prerequisites for each course. Order the list by the number of prerequisites.
How many juniors (class = 3) are there in the Student
table?
Group and count all MATH
majors by class and display the count if there are two or more in a class. (Remember that class here refers to freshman, sophomore, and so on and is recorded as 1, 2, and so on.)
Print the counts of As, Bs, and so on from the Grade_report
table.
Using temporary tables (local or global), print the minimum counts of the grades (that is, if there were 20 As, 25 Bs, and 18 Cs, you should print the minimum count of grades as C) from the Grade_report
table.
Using inline views, print the maximum counts of the grades (that is, if there were 20 As, 25 Bs, and 18 Cs, you should print the maximum count of grades as B) from the Grade_report
table.
Why would you not want to use views for this problem?
Print the counts of course numbers offered in descending order by count. Use the Section
table only.
Create a table with names and number-of-children (NOC
). Populate the table with five or six rows. Use COUNT
, SUM
, AVG
, MIN
, and MAX
on the NOC
attribute in one query and confirm that the numbers you get are what you expect.
Create a table of names, salaries and job locations. Populate the table with at least 10 rows and no fewer than three job locations. (There will be several employees at each location.) Find the average salary for each job location with one SELECT
.
Print an ordered list of instructors and the number of As they assigned to students. Order the output by number of As (lowest to greatest). You can (and probably will) ignore instructors that assign no As.
Create a table called Employees
with a name, a salary and job title. Include exactly six rows. Make the salary null in one row, the job title null in another, and both the salary and the job title in another. Use this data:
Name |
Salary |
Title |
Mary |
1000 |
Programmer |
Brenda |
3000 |
|
Stephanie |
|
Artist |
Alice |
|
|
Lindsay |
2000 |
Artist |
Christina |
500 |
Programmer |
Display the table.
Display count, sum, maximum, minimum, and average salary.
Display count, sum, maximum, minimum, and average salary, counting salary as 0 if no salary is listed.
Display the average salary grouped by job title on the table as is.
Display the average salary grouped by job title when null salary is counted as 0.
Display the average salary grouped by job title when salary is counted as 0 if it is null and include a value for “no job title.”
Find the instructor and the section where the maximum number of As were awarded.
Find the COUNT
of the number of students by class who are taking classes offered by the computer science (COSC
) department. Perform the query in two ways: once using a condition in the WHERE
clause and once filtering with a HAVING
clause. (Hint: These queries need a five-table join.)
Delete (DROP
) all of your “scratch” tables (the ones you created just for this exercise: Employees, NOC
, and any others you may have created).
18.217.199.122