• Why you would want to group data
• The GROUP BY
clause
• Group value functions
• The how and why of group functions
• Grouping by columns
• GROUP BY
versus ORDER BY
• The HAVING
clause
You have learned how to query the database and return data in an organized fashion. You have also learned how to sort data from a query. During this hour, you learn how to break returned data from a query into groups for improved readability.
Grouping data is the process of combining columns with duplicate values in a logical order. For example, a database might contain information about employees; many employees live in different cities, but some employees live in the same city. You might want to execute a query that shows employee information for each particular city. You are grouping employee information by city and creating a summarized report.
Or perhaps you want to figure the average salary paid to employees according to each city. You can do this by using the aggregate function AVG
on the SALARY
column, as you learned in the previous hour, and by using the GROUP BY
clause to group the output by city.
Grouping data is accomplished through the use of the GROUP BY
clause of a SELECT
statement (query). In Hour 9, “Summarizing Data Results from a Query,” you learned how to use aggregate functions. In this lesson, you see how to use aggregate functions in conjunction with the GROUP BY
clause to display results more effectively.
The GROUP BY
clause is used in collaboration with the SELECT
statement to arrange identical data into groups. This clause follows the WHERE
clause in a SELECT
statement and precedes the ORDER BY
clause.
The position of the GROUP BY
clause in a query is as follows:
SELECT
FROM
WHERE
GROUP BY
ORDER BY
The following is the SELECT
statement’s syntax, including the GROUP BY
clause:
SELECT COLUMN1, COLUMN2
FROM TABLE1, TABLE2
WHERE CONDITIONS
GROUP BY COLUMN1, COLUMN2
ORDER BY COLUMN1, COLUMN2
This ordering normally takes a little getting used to when writing your first queries with the GROUP BY
clause; however, it is logical. The GROUP BY
clause is normally a much more CPU-intensive operation, and if we do not constrain the rows provided to it we are grouping unnecessary data that would later be discarded. So we intentionally reduce the data set with the WHERE
clause so that we perform our grouping only on the rows we need.
You can use the ORDER BY
statement, but normally the relational database management system (RDBMS) also orders the results by the column ordering in the GROUP BY
clause, which is discussed more in depth later in this hour. So unless you need to order the values in a different pattern than the GROUP BY
clause, the ORDER BY
clause is redundant. However, sometimes it is provided because you are using aggregate functions in the SELECT
statement that are not in the GROUP BY
clause or because your particular RDBMS functions slightly differently from the standard.
The following sections give examples and explanations of the GROUP BY
clause’s use in a variety of situations.
Typical group functions—those that the GROUP BY
clause uses to arrange data in groups—include AVG
, MAX
, MIN
, SUM
, and COUNT
. These are the aggregate functions that you learned about in Hour 9. Remember that the aggregate functions were used for single values in Hour 9; now you use the aggregate functions for group values.
Grouping data is simple. The selected columns (the column list following the SELECT
keyword in a query) are the columns you can reference in the GROUP BY
clause. If a column is not in the SELECT
statement, you cannot use it in the GROUP BY
clause. How can you group data on a report if the data is not displayed?
If the column name has been qualified, the qualified name must go into the GROUP BY
clause. The column name can also be represented by a number, which is discussed later in the “Representing Column Names with Numbers” section. When grouping the data, the order of columns grouped does not have to match the column order in the SELECT
clause.
The SELECT
clause has conditions that must be met when using GROUP BY
. Specifically, whatever columns are selected must appear in the GROUP BY
clause, except for any aggregate values. The columns in the GROUP BY
clause do not necessarily have to be in the same order as they appear in the SELECT
clause. Should the columns in the SELECT
clause be qualified, the qualified names of the columns must be used in the GROUP BY
clause. Some examples of syntax for the GROUP BY
clause are shown next.
The following SQL statement selects the EMP_ID
and the CITY
from the EMPLOYEE_TBL
and groups the data returned by CITY
and then EMP_ID
:
SELECT EMP_ID, CITY
FROM EMPLOYEE_TBL
GROUP BY CITY, EMP_ID;
This SQL statement returns the EMP_ID
and the total of the SALARY
column. Then it groups the results by both the salaries and employee IDs:
SELECT EMP_ID, SUM(SALARY)
FROM EMPLOYEE_PAY_TBL
GROUP BY SALARY, EMP_ID;
This SQL statement returns the total of all the salaries from EMPLOYEE_PAY_TBL
:
SELECT SUM(SALARY) AS TOTAL_SALARY
FROM EMPLOYEE_PAY_TBL;
TOTAL_SALARY
90000.00
1 row selected
This SQL statement returns the totals for the different groups of salaries:
SELECT SUM(SALARY)
FROM EMPLOYEE_PAY_TBL
GROUP BY SALARY;
SUM(SALARY)
(null)
20000.00
30000.00
40000.00
4 rows selected
Practical examples using real data follow. In this first example, you can see three distinct cities in the EMPLOYEE_TBL
table:
SELECT CITY
FROM EMPLOYEE_TBL;
Note the order of the columns selected, versus the order of the columns in the GROUP BY
clause.
CITY
------------
GREENWOOD
INDIANAPOLIS
WHITELAND
INDIANAPOLIS
INDIANAPOLIS
INDIANAPOLIS
6 rows selected.
In the following example, you select the city and a count of all records for each city. You see a count on each of the three distinct cities because you are using a GROUP BY
clause:
SELECT CITY, COUNT(*)
FROM EMPLOYEE_TBL
GROUP BY CITY;
CITY COUNT(*)
-------------- --------
GREENWOOD 1
INDIANAPOLIS 4
WHITELAND 1
3 rows selected.
The following is a query from a temporary table created based on EMPLOYEE_TBL
and EMPLOYEE_PAY_TBL
. You soon learn how to join two tables for a query:
SELECT *
FROM EMP_PAY_TMP;
CITY LAST_NAM FIRST_NA PAY_RATE SALARY
------------ -------- --------- -------- ------
GREENWOOD STEPHENS TINA 30000
INDIANAPOLIS PLEW LINDA 14.75
WHITELAND GLASS BRANDON 40000
INDIANAPOLIS GLASS JACOB 20000
INDIANAPOLIS WALLACE MARIAH 11
INDIANAPOLIS SPURGEON TIFFANY 15
6 rows selected.
In the following example, you retrieve the average pay rate and salary on each distinct city using the aggregate function AVG
. There is no average pay rate for GREENWOOD
or WHITELAND
because no employees living in those cities are paid hourly:
SELECT CITY, AVG(PAY_RATE), AVG(SALARY)
FROM EMP_PAY_TMP
GROUP BY CITY;
CITY AVG(PAY_RATE) AVG(SALARY)
------------ ------------- -----------
GREENWOOD 30000
INDIANAPOLIS 13.5833333 20000
WHITELAND 40000
3 rows selected.
In the next example, you combine the use of multiple components in a query to return grouped data. You still want to see the average pay rate and salary, but only for INDIANAPOLIS
and WHITELAND
. You group the data by CITY
—you have no choice because you are using aggregate functions on the other columns. Lastly, you want to order the report by 2 and then 3, which are the average pay rate and then average salary, respectively. Study the following details and output:
SELECT CITY, AVG(PAY_RATE), AVG(SALARY)
FROM EMP_PAY_TMP
WHERE CITY IN ('INDIANAPOLIS','WHITELAND')
GROUP BY CITY
ORDER BY 2,3;
CITY AVG(PAY_RATE) AVG(SALARY)
------------ ------------- -----------
INDIANAPOLIS 13.5833333 20000
WHITELAND 40000
Values are sorted before NULL
values; therefore, the record for INDIANAPOLIS
is displayed first. GREENWOOD
is not selected, but if it was, its record would have been displayed before the WHITELAND
record because the average salary for GREENWOOD
is $30,000. (The second sort in the ORDER BY
clause was on average salary.)
The last example in this section shows the use of the MAX
and MIN
aggregate functions with the GROUP BY
clause:
SELECT CITY, MAX(PAY_RATE), MIN(SALARY)
FROM EMP_PAY_TMP
GROUP BY CITY;
CITY MAX(PAY_RATE) MIN(SALARY)
------------ ------------- -----------
GREENWOOD 30000
INDIANAPOLIS 15 20000
WHITELAND 40000
3 rows selected.
Like the ORDER BY
clause, you can order the GROUP BY
clause by using an integer to represent the column name. The following is an example of representing column names with numbers:
SELECT YEAR(DATE_HIRE) as YEAR_HIRED, SUM(SALARY)
FROM EMPLOYEE_PAY_TBL
GROUP BY 1;
YEAR_HIRED SUM(SALARY)
----------- -------------
1999 40000.00
2000
2001
2004 30000.00
2006
2007 20000.00
6 rows selected.
This SQL statement returns the SUM
of the employee salaries grouped by the year in which the employees were hired. The GROUP BY
clause is performed on the entire result set. The order for the groupings is 1, representing EMP_ID
.
You should understand that the GROUP BY
clause works the same as the ORDER BY
clause in that both sort data. Specifically, you use the ORDER BY
clause to sort data from a query. The GROUP BY
clause also sorts data from a query to properly group the data.
However, there are some differences and disadvantages of using GROUP BY
instead of ORDER BY
for sorting operations:
• All nonaggregate columns selected must be listed in the GROUP BY
clause.
• The GROUP BY
clause is generally not necessary unless you’re using aggregate functions.
An example of performing sort operations utilizing the GROUP BY
clause in place of the ORDER BY
clause is shown next:
SELECT LAST_NAME, FIRST_NAME, CITY
FROM EMPLOYEE_TBL
GROUP BY LAST_NAME;
SELECT LAST_NAME, CITY
*
ERROR at line 1:
ORA-00979: not a GROUP BY expression
Different SQL implementations return errors in different formats.
In this example, an Oracle database server received an error stating that FIRST_NAME
is not a GROUP BY
expression. Remember that all columns and expressions in the SELECT
statement must be listed in the GROUP BY
clause, with the exception of aggregate columns (those columns targeted by an aggregate function).
In the next example, the previous problem is solved by adding all the expressions in the SELECT
statement to the GROUP BY
clause:
SELECT LAST_NAME, FIRST_NAME, CITY
FROM EMPLOYEE_TBL
GROUP BY LAST_NAME, FIRST_NAME, CITY;
LAST_NAME FIRST_NAME CITY
--------- ----------- ------------
GLASS BRANDON WHITELAND
GLASS JACOB INDIANAPOLIS
PLEW LINDA INDIANAPOLIS
SPURGEON TIFFANY INDIANAPOLIS
STEPHENS TINA GREENWOOD
WALLACE MARIAH INDIANAPOLIS
6 rows selected.
In this example, the same columns were selected from the same table, but all columns in the GROUP BY
clause are listed as they appeared after the SELECT
keyword. The results are ordered by LAST_NAME
first, FIRST_NAME
second, and CITY
third. These results could have been accomplished easier with the ORDER BY
clause; however, it might help you better understand how the GROUP BY
clause works if you can visualize how it must first sort data to group data results.
The following example shows a SELECT
statement from EMPLOYEE_TBL
and uses the GROUP BY
clause to order by CITY
:
SELECT CITY, LAST_NAME
FROM EMPLOYEE_TBL
GROUP BY CITY, LAST_NAME;
CITY LAST_NAME
------------ ----------
GREENWOOD STEPHENS
INDIANAPOLIS GLASS
INDIANAPOLIS PLEW
INDIANAPOLIS SPURGEON
INDIANAPOLIS WALLACE
WHITELAND GLASS
6 rows selected.
Notice the order of data in the previous results, as well as the LAST_NAME
of the individual for each CITY
. In the following example, all employee records in the EMPLOYEE_TBL
table are now counted, and the results are grouped by CITY
but ordered by the count on each city first:
SELECT CITY, COUNT(*)
FROM EMPLOYEE_TBL
GROUP BY CITY
ORDER BY 2,1;
CITY COUNT(*)
-------------- --------
GREENWOOD 1
WHITELAND 1
INDIANAPOLIS 4
3 rows selected.
Check out the order of the results. The results were first sorted by the count on each city (1–4) and then sorted by city. The count for the first two cities in the output is 1. Because the count is the same, which is the first expression in the ORDER BY
clause, the city is then sorted; GREENWOOD
is placed before WHITELAND
.
Although GROUP BY
and ORDER BY
perform a similar function, there is one major difference. The GROUP BY
clause is designed to group identical data, whereas the ORDER BY
clause is designed merely to put data into a specific order. You can use GROUP BY
and ORDER BY
in the same SELECT
statement, but you must follow a specific order.
You can use the GROUP BY
clause in the CREATE VIEW
statement to sort data, but the ORDER BY
clause is not allowed in the CREATE VIEW
statement. The CREATE VIEW
statement is discussed in depth in Hour 20, “Creating and Using Views and Synonyms.”
Sometimes it is advantageous to get summary totals within a certain group. For instance, you might want to have a breakdown of the SUM
of sales per year, country, and product type but also want to see the totals in each year and country. Luckily, the ANSI SQL standard provides for such functionality using the CUBE
and ROLLUP
expressions.
The ROLLUP
expression is used to get subtotals, or what is commonly referred to as super-aggregate rows, along with a grand total row. The ANSI syntax is as follows:
GROUP BY ROLLUP(ordered column list of grouping sets)
The way the ROLLUP
expression works is that, for every change in the LAST
column provided for the grouping set, an additional row is inserted into the result set with a NULL
value for that column and the subtotal of the values in the set. Additionally, a row is inserted at the end of the result set with NULL
values for each of the group columns and a grand total for the aggregate information. Both Microsoft SQL Server and Oracle follow the ANSI-compliant format, but MySQL follows the following slightly different format:
GROUP BY order column list of grouping sets WITH ROLLUP
Let’s first examine a result set of a simple GROUP BY
statement in which we examine average employee pay by CITY
and ZIP
:
SELECT CITY,ZIP, AVG(PAY_RATE), AVG(SALARY)
FROM EMPLOYEE_TBL E
INNER JOIN EMPLOYEE_PAY_TBL P
ON E.EMP_ID=P.EMP_ID
GROUP BY CITY,ZIP
ORDER BY CITY,ZIP;
CITY ZIP AVG(PAY_RATE) AVG(SALARY)
------------- -------- ----------------- ------------
GREENWOOD 47890 NULL 40000
INDIANAPOLIS 45734 NULL 20000
INDIANAPOLIS 46224 14.75 NULL
INDIANAPOLIS 46234 15.00 NULL
INDIANAPOLIS 46741 11.00 NULL
WHITELAND 47885 NULL 30000
6 rows selected.
The following is an example of using the ROLLUP
expression to get subtotals of sales:
SELECT CITY,ZIP, AVG(PAY_RATE), AVG(SALARY)
FROM EMPLOYEE_TBL E
INNER JOIN EMPLOYEE_PAY_TBL P
ON E.EMP_ID=P.EMP_ID
GROUP BY ROLLUP(CITY,ZIP);
CITY ZIP AVG(PAY_RATE) AVG(SALARY)
------------- --------- --------------- ------------
GREENWOOD 47890 NULL 40000
GREENWOOD NULL NULL 40000
INDIANAPOLIS 45734 NULL 20000
INDIANAPOLIS 46224 14.75 NULL
INDIANAPOLIS 46234 15.00 NULL
INDIANAPOLIS 46741 11.00 NULL
INDIANAPOLIS NULL 13.58 20000
WHITELAND 47885 NULL 30000
WHITELAND NULL NULL 30000
NULL NULL 13.58 30000
10 rows selected.
Notice how we now get an average super-aggregate row for each one of the cities and an overall average for the entire set as the last row.
The CUBE
expression is different. It returns a single row of data with every combination of the columns in the column list along with a row for the grand total of the whole set. The syntax for the CUBE
expression is as follows:
GROUP BY CUBE(column list of grouping sets)
CUBE
is often used to create crosstab reports due to its unique nature. For instance, if we want to have sales use the following columns in the GROUP BY CUBE
expression list, CITY
, STATE
, REGION
, we receive rows for each of the following:
CITY
CITY, STATE
CITY, REGION
CITY, STATE, REGION
REGION
STATE,REGION
STATE
<grand total row>
This expression is supported in both Microsoft SQL Server and Oracle, but as of the time of this writing it is not available in MySQL. The following statement shows an example of using the CUBE
expression:
SELECT CITY,ZIP, AVG(PAY_RATE), AVG(SALARY)
FROM EMPLOYEE_TBL E
INNER JOIN EMPLOYEE_PAY_TBL P
ON E.EMP_ID=P.EMP_ID
GROUP BY CUBE(CITY,ZIP);
CITY ZIP AVG(PAY_RATE) AVG(SALARY)
------------- ------- -------------- ------------
INDIANAPOLIS 45734 NULL 20000
NULL 45734 NULL 20000
INDIANAPOLIS 46224 14.75 NULL
NULL 46224 14.75 NULL
INDIANAPOLIS 46234 15.00 NULL
NULL 46234 15.00 NULL
INDIANAPOLIS 46741 11.00 NULL
NULL 46741 11.00 NULL
WHITELAND 47885 NULL 30000
NULL 47885 NULL 30000
GREENWOOD 47890 NULL 40000
NULL 47890 NULL 40000
GREENWOOD NULL NULL 40000
INDIANAPOLIS NULL 13.58 20000
WHITELAND NULL NULL 30000
NULL NULL 13.58 30000
16 rows selected.
Now you can see that with the CUBE
expression, there are even more rows because the statement needs to return each combination of columns within the column set that we provided.
The HAVING
clause, when used in conjunction with the GROUP BY
clause in a SELECT
statement, tells GROUP BY
which groups to include in the output. HAVING
is to GROUP BY
as WHERE
is to SELECT
. In other words, the WHERE
clause places conditions on the selected columns, and the HAVING
clause places conditions on groups created by the GROUP BY
clause. Therefore, when you use the HAVING
clause, you are effectively including or excluding, as the case might be, whole groups of data from the query results.
The following is the position of the HAVING
clause in a query:
SELECT
FROM
WHERE
GROUP BY
HAVING
ORDER BY
The following is the syntax of the SELECT
statement, including the HAVING
clause:
SELECT COLUMN1, COLUMN2
FROM TABLE1, TABLE2
WHERE CONDITIONS
GROUP BY COLUMN1, COLUMN2
HAVING CONDITIONS
ORDER BY COLUMN1, COLUMN2
In the following example, you select the average pay rate and salary for all cities except GREENWOOD
. You group the output by CITY
, but you only want to display those groups (cities) that have an average salary greater than $20,000
. You sort the results by average salary for each city:
SELECT CITY, AVG(PAY_RATE), AVG(SALARY)
FROM EMP_PAY_TMP
WHERE CITY <> 'GREENWOOD'
GROUP BY CITY
HAVING AVG(SALARY) > 20000
ORDER BY 3;
CITY AVG(PAY_RATE) AVG(SALARY)
----------- -------------- ------------
WHITELAND 40000
1 row selected.
Why was only one row returned by this query?
• The city GREENWOOD
was eliminated from the WHERE
clause.
• INDIANAPOLIS
was deducted from the output because the average salary was 20000
, which is not greater than 20000
.
You have learned how to group the results of a query using the GROUP BY
clause. The GROUP BY
clause is primarily used with aggregate SQL functions, such as SUM
, AVG
, MAX
, MIN
, and COUNT
. The nature of GROUP BY
is like that of ORDER BY
in that both sort query results. The GROUP BY
clause must sort data to group results logically, but you can also use it exclusively to sort data. However, an ORDER BY
clause is much simpler for this purpose.
The HAVING
clause, an extension to the GROUP BY
clause, places conditions on the established groups of a query. The WHERE
clause places conditions on a query’s SELECT
clause. During the next hour, you learn a new arsenal of functions that enable you to further manipulate query results.
Q. Is using the GROUP BY
clause mandatory when using the ORDER BY
clause in a SELECT
statement?
A. No. Using the GROUP BY
clause is strictly optional, but it can be helpful when used with ORDER BY
.
A. Take the CITY
column from the EMPLOYEE_TBL
. If you select the employee’s name and city and then group the output by city, all the cities that are identical are arranged together.
Q. Must a column appear in the SELECT
statement to use a GROUP BY
clause on it?
A. Yes, a column must be in the SELECT
statement to use a GROUP BY
clause on it.
The following workshop is composed of a series of quiz questions and practical exercises. The quiz questions are designed to test your overall understanding of the current material. The practical exercises are intended to afford you the opportunity to apply the concepts discussed during the current hour, as well as build upon the knowledge acquired in previous hours of study. Please take time to complete the quiz questions and exercises before continuing. Refer to Appendix C, “Answers to Quizzes and Exercises,” for answers.
1. Will the following SQL statements work?
a. SELECT SUM(SALARY), EMP_ID
FROM EMPLOYEE_PAY_TBL
GROUP BY 1 and 2;
b. SELECT EMP_ID, MAX(SALARY)
FROM EMPLOYEE_PAY_TBL
GROUP BY SALARY, EMP_ID;
c. SELECT EMP_ID, COUNT(SALARY)
FROM EMPLOYEE_PAY_TBL
ORDER BY EMP_ID
GROUP BY SALARY;
d. SELECT YEAR(DATE_HIRE) AS YEAR_HIRED,SUM(SALARY)
FROM EMPLOYEE_PAY_TBL
GROUP BY 1
HAVING SUM(SALARY)>20000;
2. True or false: You must also use the GROUP BY
clause when using the HAVING
clause.
3. True or false: The following SQL statement returns a total of the salaries by groups:
SELECT SUM(SALARY)
FROM EMPLOYEE_PAY_TBL;
4. True or false: The columns selected must appear in the GROUP BY
clause in the same order.
5. True or false: The HAVING
clause tells the GROUP BY
which groups to include.
1. Invoke the database and enter the following query to show all cities in EMPLOYEE_TBL
:
SELECT CITY
FROM EMPLOYEE_TBL;
2. Enter the following query and compare the results to the query in Exercise 2:
SELECT CITY, COUNT(*)
FROM EMPLOYEE_TBL
GROUP BY CITY;
3. The HAVING
clause works like the WHERE
clause in that it enables the user to specify conditions on data returned. The WHERE
clause is the main filter on the query, and the HAVING
clause is the filter used after groups of data have been established using the GROUP BY
clause. Enter the following query to see how the HAVING
clause works:
SELECT CITY, COUNT(*)
FROM EMPLOYEE_TBL
GROUP BY CITY
HAVING COUNT(*) > 1;
4. Modify the query in Exercise 3 to order the results in descending order, from highest count to lowest.
5. Write a query to list the average pay rate and salary by position from the EMPLOYEE_PAY_TBL
table.
6. Write a query to list the average salary by position from the EMPLOYEE_PAY_TBL
table where the average salary is greater than 20000
.
18.117.187.113