Hour 10. Sorting and Grouping Data


What You’ll Learn in This Hour:

Why you would want to group data

The GROUP BY clause

Group value functions

The how and why of group functions

Grouping by columns

GROUP BY versus ORDER BY

The HAVING clause


You have learned how to query the database and return data in an organized fashion. You have also learned how to sort data from a query. During this hour, you learn how to break returned data from a query into groups for improved readability.

Why Group Data?

Grouping data is the process of combining columns with duplicate values in a logical order. For example, a database might contain information about employees; many employees live in different cities, but some employees live in the same city. You might want to execute a query that shows employee information for each particular city. You are grouping employee information by city and creating a summarized report.

Or perhaps you want to figure the average salary paid to employees according to each city. You can do this by using the aggregate function AVG on the SALARY column, as you learned in the previous hour, and by using the GROUP BY clause to group the output by city.

Grouping data is accomplished through the use of the GROUP BY clause of a SELECT statement (query). In Hour 9, “Summarizing Data Results from a Query,” you learned how to use aggregate functions. In this lesson, you see how to use aggregate functions in conjunction with the GROUP BY clause to display results more effectively.

The GROUP BY Clause

The GROUP BY clause is used in collaboration with the SELECT statement to arrange identical data into groups. This clause follows the WHERE clause in a SELECT statement and precedes the ORDER BY clause.

The position of the GROUP BY clause in a query is as follows:

SELECT
FROM
WHERE
GROUP BY
ORDER BY

The following is the SELECT statement’s syntax, including the GROUP BY clause:

SELECT COLUMN1, COLUMN2
FROM TABLE1, TABLE2
WHERE CONDITIONS
GROUP BY COLUMN1, COLUMN2
ORDER BY COLUMN1, COLUMN2

This ordering normally takes a little getting used to when writing your first queries with the GROUP BY clause; however, it is logical. The GROUP BY clause is normally a much more CPU-intensive operation, and if we do not constrain the rows provided to it we are grouping unnecessary data that would later be discarded. So we intentionally reduce the data set with the WHERE clause so that we perform our grouping only on the rows we need.

You can use the ORDER BY statement, but normally the relational database management system (RDBMS) also orders the results by the column ordering in the GROUP BY clause, which is discussed more in depth later in this hour. So unless you need to order the values in a different pattern than the GROUP BY clause, the ORDER BY clause is redundant. However, sometimes it is provided because you are using aggregate functions in the SELECT statement that are not in the GROUP BY clause or because your particular RDBMS functions slightly differently from the standard.

The following sections give examples and explanations of the GROUP BY clause’s use in a variety of situations.

Group Functions

Typical group functions—those that the GROUP BY clause uses to arrange data in groups—include AVG, MAX, MIN, SUM, and COUNT. These are the aggregate functions that you learned about in Hour 9. Remember that the aggregate functions were used for single values in Hour 9; now you use the aggregate functions for group values.

Grouping Selected Data

Grouping data is simple. The selected columns (the column list following the SELECT keyword in a query) are the columns you can reference in the GROUP BY clause. If a column is not in the SELECT statement, you cannot use it in the GROUP BY clause. How can you group data on a report if the data is not displayed?

If the column name has been qualified, the qualified name must go into the GROUP BY clause. The column name can also be represented by a number, which is discussed later in the “Representing Column Names with Numbers” section. When grouping the data, the order of columns grouped does not have to match the column order in the SELECT clause.

Creating Groups and Using Aggregate Functions

The SELECT clause has conditions that must be met when using GROUP BY. Specifically, whatever columns are selected must appear in the GROUP BY clause, except for any aggregate values. The columns in the GROUP BY clause do not necessarily have to be in the same order as they appear in the SELECT clause. Should the columns in the SELECT clause be qualified, the qualified names of the columns must be used in the GROUP BY clause. Some examples of syntax for the GROUP BY clause are shown next.

The following SQL statement selects the EMP_ID and the CITY from the EMPLOYEE_TBL and groups the data returned by CITY and then EMP_ID:

SELECT EMP_ID, CITY

FROM EMPLOYEE_TBL
GROUP BY CITY, EMP_ID;

This SQL statement returns the EMP_ID and the total of the SALARY column. Then it groups the results by both the salaries and employee IDs:

SELECT EMP_ID, SUM(SALARY)
FROM EMPLOYEE_PAY_TBL
GROUP BY SALARY, EMP_ID;

This SQL statement returns the total of all the salaries from EMPLOYEE_PAY_TBL:

SELECT SUM(SALARY) AS TOTAL_SALARY
FROM EMPLOYEE_PAY_TBL;

TOTAL_SALARY
90000.00

1 row selected

This SQL statement returns the totals for the different groups of salaries:

SELECT SUM(SALARY)
FROM EMPLOYEE_PAY_TBL
GROUP BY SALARY;

SUM(SALARY)
(null)
20000.00
30000.00
40000.00

4 rows selected

Practical examples using real data follow. In this first example, you can see three distinct cities in the EMPLOYEE_TBL table:

SELECT CITY
FROM EMPLOYEE_TBL;


By the Way: Column Ordering in the Group By Statement Matters

Note the order of the columns selected, versus the order of the columns in the GROUP BY clause.


CITY
------------
GREENWOOD
INDIANAPOLIS
WHITELAND

INDIANAPOLIS
INDIANAPOLIS
INDIANAPOLIS

6 rows selected.

In the following example, you select the city and a count of all records for each city. You see a count on each of the three distinct cities because you are using a GROUP BY clause:

SELECT CITY, COUNT(*)
FROM EMPLOYEE_TBL
GROUP BY CITY;

CITY                   COUNT(*)
--------------         --------
GREENWOOD              1
INDIANAPOLIS           4
WHITELAND              1

3 rows selected.

The following is a query from a temporary table created based on EMPLOYEE_TBL and EMPLOYEE_PAY_TBL. You soon learn how to join two tables for a query:

SELECT *
FROM EMP_PAY_TMP;

CITY              LAST_NAM           FIRST_NA        PAY_RATE      SALARY
------------      --------           ---------       --------      ------
GREENWOOD         STEPHENS           TINA                          30000
INDIANAPOLIS      PLEW               LINDA           14.75
WHITELAND         GLASS              BRANDON                       40000
INDIANAPOLIS      GLASS              JACOB                         20000
INDIANAPOLIS      WALLACE            MARIAH          11
INDIANAPOLIS      SPURGEON           TIFFANY         15

6 rows selected.

In the following example, you retrieve the average pay rate and salary on each distinct city using the aggregate function AVG. There is no average pay rate for GREENWOOD or WHITELAND because no employees living in those cities are paid hourly:

SELECT CITY, AVG(PAY_RATE), AVG(SALARY)
FROM EMP_PAY_TMP
GROUP BY CITY;

CITY                  AVG(PAY_RATE)         AVG(SALARY)
------------          -------------         -----------
GREENWOOD                                    30000
INDIANAPOLIS           13.5833333            20000

WHITELAND                                    40000

3 rows selected.

In the next example, you combine the use of multiple components in a query to return grouped data. You still want to see the average pay rate and salary, but only for INDIANAPOLIS and WHITELAND. You group the data by CITY—you have no choice because you are using aggregate functions on the other columns. Lastly, you want to order the report by 2 and then 3, which are the average pay rate and then average salary, respectively. Study the following details and output:

SELECT CITY, AVG(PAY_RATE), AVG(SALARY)
FROM EMP_PAY_TMP
WHERE CITY IN ('INDIANAPOLIS','WHITELAND')
GROUP BY CITY
ORDER BY 2,3;

CITY                   AVG(PAY_RATE)         AVG(SALARY)
------------           -------------         -----------
INDIANAPOLIS           13.5833333            20000
WHITELAND                                    40000

Values are sorted before NULL values; therefore, the record for INDIANAPOLIS is displayed first. GREENWOOD is not selected, but if it was, its record would have been displayed before the WHITELAND record because the average salary for GREENWOOD is $30,000. (The second sort in the ORDER BY clause was on average salary.)

The last example in this section shows the use of the MAX and MIN aggregate functions with the GROUP BY clause:

SELECT CITY, MAX(PAY_RATE), MIN(SALARY)
FROM EMP_PAY_TMP
GROUP BY CITY;

CITY                   MAX(PAY_RATE)   MIN(SALARY)
------------           -------------   -----------
GREENWOOD                              30000
INDIANAPOLIS           15              20000
WHITELAND                              40000

3 rows selected.

Representing Column Names with Numbers

Like the ORDER BY clause, you can order the GROUP BY clause by using an integer to represent the column name. The following is an example of representing column names with numbers:

SELECT YEAR(DATE_HIRE) as YEAR_HIRED, SUM(SALARY)
FROM EMPLOYEE_PAY_TBL
GROUP BY 1;

YEAR_HIRED       SUM(SALARY)
-----------      -------------
1999             40000.00
2000
2001
2004             30000.00
2006
2007             20000.00

6 rows selected.

This SQL statement returns the SUM of the employee salaries grouped by the year in which the employees were hired. The GROUP BY clause is performed on the entire result set. The order for the groupings is 1, representing EMP_ID.

GROUP BY Versus ORDER BY

You should understand that the GROUP BY clause works the same as the ORDER BY clause in that both sort data. Specifically, you use the ORDER BY clause to sort data from a query. The GROUP BY clause also sorts data from a query to properly group the data.

However, there are some differences and disadvantages of using GROUP BY instead of ORDER BY for sorting operations:

• All nonaggregate columns selected must be listed in the GROUP BY clause.

• The GROUP BY clause is generally not necessary unless you’re using aggregate functions.

An example of performing sort operations utilizing the GROUP BY clause in place of the ORDER BY clause is shown next:

SELECT LAST_NAME, FIRST_NAME, CITY
FROM EMPLOYEE_TBL
GROUP BY LAST_NAME;

SELECT LAST_NAME, CITY
                  *
ERROR at line 1:
ORA-00979: not a GROUP BY expression


By the Way: Error Messages Differ

Different SQL implementations return errors in different formats.


In this example, an Oracle database server received an error stating that FIRST_NAME is not a GROUP BY expression. Remember that all columns and expressions in the SELECT statement must be listed in the GROUP BY clause, with the exception of aggregate columns (those columns targeted by an aggregate function).

In the next example, the previous problem is solved by adding all the expressions in the SELECT statement to the GROUP BY clause:

SELECT LAST_NAME, FIRST_NAME, CITY
FROM EMPLOYEE_TBL
GROUP BY LAST_NAME, FIRST_NAME, CITY;

LAST_NAME             FIRST_NAME             CITY
---------             -----------            ------------
GLASS                 BRANDON                WHITELAND
GLASS                 JACOB                  INDIANAPOLIS
PLEW                  LINDA                  INDIANAPOLIS
SPURGEON              TIFFANY                INDIANAPOLIS
STEPHENS              TINA                   GREENWOOD
WALLACE               MARIAH                 INDIANAPOLIS

6 rows selected.

In this example, the same columns were selected from the same table, but all columns in the GROUP BY clause are listed as they appeared after the SELECT keyword. The results are ordered by LAST_NAME first, FIRST_NAME second, and CITY third. These results could have been accomplished easier with the ORDER BY clause; however, it might help you better understand how the GROUP BY clause works if you can visualize how it must first sort data to group data results.

The following example shows a SELECT statement from EMPLOYEE_TBL and uses the GROUP BY clause to order by CITY:

SELECT CITY, LAST_NAME
FROM EMPLOYEE_TBL
GROUP BY CITY, LAST_NAME;

CITY                  LAST_NAME
------------          ----------
GREENWOOD             STEPHENS
INDIANAPOLIS          GLASS
INDIANAPOLIS          PLEW
INDIANAPOLIS          SPURGEON

INDIANAPOLIS          WALLACE
WHITELAND             GLASS

6 rows selected.

Notice the order of data in the previous results, as well as the LAST_NAME of the individual for each CITY. In the following example, all employee records in the EMPLOYEE_TBL table are now counted, and the results are grouped by CITY but ordered by the count on each city first:

SELECT CITY, COUNT(*)
FROM EMPLOYEE_TBL
GROUP BY CITY
ORDER BY 2,1;

CITY                  COUNT(*)
--------------        --------
GREENWOOD             1
WHITELAND             1
INDIANAPOLIS          4

3 rows selected.

Check out the order of the results. The results were first sorted by the count on each city (1–4) and then sorted by city. The count for the first two cities in the output is 1. Because the count is the same, which is the first expression in the ORDER BY clause, the city is then sorted; GREENWOOD is placed before WHITELAND.

Although GROUP BY and ORDER BY perform a similar function, there is one major difference. The GROUP BY clause is designed to group identical data, whereas the ORDER BY clause is designed merely to put data into a specific order. You can use GROUP BY and ORDER BY in the same SELECT statement, but you must follow a specific order.


Did You Know?: You Can’t Use the ORDER BY Clause in a View

You can use the GROUP BY clause in the CREATE VIEW statement to sort data, but the ORDER BY clause is not allowed in the CREATE VIEW statement. The CREATE VIEW statement is discussed in depth in Hour 20, “Creating and Using Views and Synonyms.”


CUBE and ROLLUP Expressions

Sometimes it is advantageous to get summary totals within a certain group. For instance, you might want to have a breakdown of the SUM of sales per year, country, and product type but also want to see the totals in each year and country. Luckily, the ANSI SQL standard provides for such functionality using the CUBE and ROLLUP expressions.

The ROLLUP expression is used to get subtotals, or what is commonly referred to as super-aggregate rows, along with a grand total row. The ANSI syntax is as follows:

GROUP BY ROLLUP(ordered column list of grouping sets)

The way the ROLLUP expression works is that, for every change in the LAST column provided for the grouping set, an additional row is inserted into the result set with a NULL value for that column and the subtotal of the values in the set. Additionally, a row is inserted at the end of the result set with NULL values for each of the group columns and a grand total for the aggregate information. Both Microsoft SQL Server and Oracle follow the ANSI-compliant format, but MySQL follows the following slightly different format:

GROUP BY order column list of grouping sets WITH ROLLUP

Let’s first examine a result set of a simple GROUP BY statement in which we examine average employee pay by CITY and ZIP:

SELECT CITY,ZIP, AVG(PAY_RATE), AVG(SALARY)
FROM EMPLOYEE_TBL E
INNER JOIN EMPLOYEE_PAY_TBL P
ON E.EMP_ID=P.EMP_ID
GROUP BY CITY,ZIP
ORDER BY CITY,ZIP;

CITY                 ZIP           AVG(PAY_RATE)               AVG(SALARY)
-------------        --------      -----------------            ------------
GREENWOOD            47890         NULL                        40000
INDIANAPOLIS         45734         NULL                        20000
INDIANAPOLIS         46224         14.75                       NULL
INDIANAPOLIS         46234         15.00                       NULL
INDIANAPOLIS         46741         11.00                       NULL
WHITELAND            47885         NULL                        30000

6 rows selected.

The following is an example of using the ROLLUP expression to get subtotals of sales:

SELECT CITY,ZIP, AVG(PAY_RATE), AVG(SALARY)
FROM EMPLOYEE_TBL E
INNER JOIN EMPLOYEE_PAY_TBL P
ON E.EMP_ID=P.EMP_ID
GROUP BY ROLLUP(CITY,ZIP);


CITY                ZIP            AVG(PAY_RATE)       AVG(SALARY)
-------------       ---------      ---------------     ------------
GREENWOOD           47890          NULL                40000
GREENWOOD           NULL           NULL                40000
INDIANAPOLIS        45734          NULL                20000
INDIANAPOLIS        46224          14.75               NULL
INDIANAPOLIS        46234          15.00               NULL
INDIANAPOLIS        46741          11.00               NULL
INDIANAPOLIS        NULL           13.58               20000
WHITELAND           47885          NULL                30000
WHITELAND           NULL           NULL                30000
NULL                NULL           13.58               30000

10 rows selected.

Notice how we now get an average super-aggregate row for each one of the cities and an overall average for the entire set as the last row.

The CUBE expression is different. It returns a single row of data with every combination of the columns in the column list along with a row for the grand total of the whole set. The syntax for the CUBE expression is as follows:

GROUP BY CUBE(column list of grouping sets)

CUBE is often used to create crosstab reports due to its unique nature. For instance, if we want to have sales use the following columns in the GROUP BY CUBE expression list, CITY, STATE, REGION, we receive rows for each of the following:

CITY
CITY, STATE
CITY, REGION
CITY, STATE, REGION
REGION
STATE,REGION
STATE
<grand total row>

This expression is supported in both Microsoft SQL Server and Oracle, but as of the time of this writing it is not available in MySQL. The following statement shows an example of using the CUBE expression:

SELECT CITY,ZIP, AVG(PAY_RATE), AVG(SALARY)
FROM EMPLOYEE_TBL E
INNER JOIN EMPLOYEE_PAY_TBL P
ON E.EMP_ID=P.EMP_ID
GROUP BY CUBE(CITY,ZIP);


CITY                ZIP            AVG(PAY_RATE)       AVG(SALARY)
-------------       -------        --------------      ------------
INDIANAPOLIS        45734          NULL                20000
NULL                45734          NULL                20000
INDIANAPOLIS        46224          14.75               NULL
NULL                46224          14.75               NULL
INDIANAPOLIS        46234          15.00               NULL
NULL                46234          15.00               NULL
INDIANAPOLIS        46741          11.00               NULL
NULL                46741          11.00               NULL
WHITELAND           47885          NULL                30000
NULL                47885          NULL                30000
GREENWOOD           47890          NULL                40000
NULL                47890          NULL                40000
GREENWOOD           NULL           NULL                40000
INDIANAPOLIS        NULL           13.58               20000
WHITELAND           NULL           NULL                30000
NULL                NULL           13.58               30000

16 rows selected.

Now you can see that with the CUBE expression, there are even more rows because the statement needs to return each combination of columns within the column set that we provided.

The HAVING Clause

The HAVING clause, when used in conjunction with the GROUP BY clause in a SELECT statement, tells GROUP BY which groups to include in the output. HAVING is to GROUP BY as WHERE is to SELECT. In other words, the WHERE clause places conditions on the selected columns, and the HAVING clause places conditions on groups created by the GROUP BY clause. Therefore, when you use the HAVING clause, you are effectively including or excluding, as the case might be, whole groups of data from the query results.

The following is the position of the HAVING clause in a query:

SELECT
FROM
WHERE
GROUP BY
HAVING
ORDER BY

The following is the syntax of the SELECT statement, including the HAVING clause:

SELECT COLUMN1, COLUMN2
FROM TABLE1, TABLE2

WHERE CONDITIONS
GROUP BY COLUMN1, COLUMN2
HAVING CONDITIONS
ORDER BY COLUMN1, COLUMN2

In the following example, you select the average pay rate and salary for all cities except GREENWOOD. You group the output by CITY, but you only want to display those groups (cities) that have an average salary greater than $20,000. You sort the results by average salary for each city:

SELECT CITY, AVG(PAY_RATE), AVG(SALARY)
FROM EMP_PAY_TMP
WHERE CITY <> 'GREENWOOD'
GROUP BY CITY
HAVING AVG(SALARY) > 20000
ORDER BY 3;

CITY                AVG(PAY_RATE)    AVG(SALARY)
-----------         --------------   ------------
WHITELAND                            40000

1 row selected.

Why was only one row returned by this query?

• The city GREENWOOD was eliminated from the WHERE clause.

INDIANAPOLIS was deducted from the output because the average salary was 20000, which is not greater than 20000.

Summary

You have learned how to group the results of a query using the GROUP BY clause. The GROUP BY clause is primarily used with aggregate SQL functions, such as SUM, AVG, MAX, MIN, and COUNT. The nature of GROUP BY is like that of ORDER BY in that both sort query results. The GROUP BY clause must sort data to group results logically, but you can also use it exclusively to sort data. However, an ORDER BY clause is much simpler for this purpose.

The HAVING clause, an extension to the GROUP BY clause, places conditions on the established groups of a query. The WHERE clause places conditions on a query’s SELECT clause. During the next hour, you learn a new arsenal of functions that enable you to further manipulate query results.

Q&A

Q. Is using the GROUP BY clause mandatory when using the ORDER BY clause in a SELECT statement?

A. No. Using the GROUP BY clause is strictly optional, but it can be helpful when used with ORDER BY.

Q. What is a group value?

A. Take the CITY column from the EMPLOYEE_TBL. If you select the employee’s name and city and then group the output by city, all the cities that are identical are arranged together.

Q. Must a column appear in the SELECT statement to use a GROUP BY clause on it?

A. Yes, a column must be in the SELECT statement to use a GROUP BY clause on it.

Workshop

The following workshop is composed of a series of quiz questions and practical exercises. The quiz questions are designed to test your overall understanding of the current material. The practical exercises are intended to afford you the opportunity to apply the concepts discussed during the current hour, as well as build upon the knowledge acquired in previous hours of study. Please take time to complete the quiz questions and exercises before continuing. Refer to Appendix C, “Answers to Quizzes and Exercises,” for answers.

Quiz

1. Will the following SQL statements work?

a. SELECT SUM(SALARY), EMP_ID
FROM EMPLOYEE_PAY_TBL
GROUP BY 1 and 2;

b. SELECT EMP_ID, MAX(SALARY)
FROM EMPLOYEE_PAY_TBL
GROUP BY SALARY, EMP_ID;

c. SELECT EMP_ID, COUNT(SALARY)
FROM EMPLOYEE_PAY_TBL
ORDER BY EMP_ID
GROUP BY SALARY;

d. SELECT YEAR(DATE_HIRE) AS YEAR_HIRED,SUM(SALARY)
FROM EMPLOYEE_PAY_TBL
GROUP BY 1
HAVING SUM(SALARY)>20000;

2. True or false: You must also use the GROUP BY clause when using the HAVING clause.

3. True or false: The following SQL statement returns a total of the salaries by groups:

SELECT SUM(SALARY)
FROM EMPLOYEE_PAY_TBL;

4. True or false: The columns selected must appear in the GROUP BY clause in the same order.

5. True or false: The HAVING clause tells the GROUP BY which groups to include.

Exercises

1. Invoke the database and enter the following query to show all cities in EMPLOYEE_TBL:

SELECT CITY
FROM EMPLOYEE_TBL;

2. Enter the following query and compare the results to the query in Exercise 2:

SELECT CITY, COUNT(*)
FROM EMPLOYEE_TBL
GROUP BY CITY;

3. The HAVING clause works like the WHERE clause in that it enables the user to specify conditions on data returned. The WHERE clause is the main filter on the query, and the HAVING clause is the filter used after groups of data have been established using the GROUP BY clause. Enter the following query to see how the HAVING clause works:

SELECT CITY, COUNT(*)
FROM EMPLOYEE_TBL
GROUP BY CITY
HAVING COUNT(*) > 1;

4. Modify the query in Exercise 3 to order the results in descending order, from highest count to lowest.

5. Write a query to list the average pay rate and salary by position from the EMPLOYEE_PAY_TBL table.

6. Write a query to list the average salary by position from the EMPLOYEE_PAY_TBL table where the average salary is greater than 20000.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.117.72.165