Summarizing and Grouping Data

Overview

Instead of just listing individual rows, you can use a summary function (also called an aggregate function) to produce a statistical summary of data in a table. For example, in the SELECT clause in the following query, the AVG function calculates the average (or mean) miles traveled by frequent-flyer club members. The GROUP BY clause tells PROC SQL to calculate and display the average for each membership group (MemberType).

proc sql;
   select membertype, 
          avg(milestraveled)
          as AvgMilesTraveled
      from sasuser.frequentflyers
      group by membertype;

You should already be familiar with the list of summary functions that can be used in a PROC SQL query.

PROC SQL calculates summary functions and outputs results in different ways depending on a combination of factors. Four key factors are

whether the summary function specifies one or multiple columns as arguments
whether the query contains a GROUP BY clause
if the summary function is specified in a SELECT clause, whether there are additional columns listed that are outside of a summary function
whether the WHERE clause, if there is one, contains only columns that are specified in the SELECT clause.

To ensure that your PROC SQL queries produce the intended output, it is important to understand how the factors listed above affect the processing of summary functions. Consider an overview of all the factors, followed by a detailed example that illustrates each factor.

Number of Arguments and Summary Function Processing

Summary functions specify one or more arguments in parentheses. In the examples shown in this chapter, the arguments are always columns in the table being queried.

Note: The ANSI-standard summary functions, such as AVG and COUNT, can be used only with a single argument. The SAS summary functions, such as MEAN and N, can be used with either single or multiple arguments.

The following chart shows how the number of columns specified as arguments affects how PROC SQL calculates a summary function.

If a summary function...	Then the calculation is...	Example
specifies one column as argument	performed down the column	proc sql; select avg(salary)as AvgSalary from sasuser.payrollmaster;
specifies multiple columns as arguments	performed across columns for each row	proc sql outobs=10; select sum(boarded,transferred,nonrevenue) as Total from sasuser.marchflights;

Groups and Summary Function Processing

Summary functions perform calculations on groups of data. When PROC SQL processes a summary function, it looks for a GROUP BY clause:

If a GROUP BY clause...	Then PROC SQL...	Example
is not present in the query	applies the function to the entire table	proc sql outobs=10; select jobcode, avg(salary) as AvgSalary from sasuser.payrollmaster;
is present in the query	applies the function to each group specified in the GROUP BY clause	proc sql outobs=10; select jobcode, avg(salary) as AvgSalary from sasuser.payrollmaster group by jobcode; If a query contains a GROUP BY clause, all columns in the SELECT clause that do not contain a summary function should be listed in the GROUP BY clause or unexpected results might be returned.

SELECT Clause Columns and Summary Function Processing

A SELECT clause that contains a summary function can also list additional columns that are not specified in the summary function. The presence of these additional columns in the SELECT clause list causes PROC SQL to display the output differently.

If a SELECT clause...	Then PROC SQL...	Example
contains summary function(s) and no columns outside of summary functions	calculates a single value by using the summary function for the entire table or, if groups are specified in the GROUP BY clause, for each group combines or rolls up the information into a single row of output for the entire table or, if groups are specified, for each group	proc sql; select avg(salary) as AvgSalary from sasuser.payrollmaster;
contains summary function(s) and additional columns outside of summary functions	calculates a single value for the entire table or, if groups are specified, for each group, and displays all rows of output with the single or grouped value(s) repeated	proc sql; select jobcode, gender, avg(salary) as AvgSalary from sasuser.payrollmaster group by jobcode,gender;

Note: WHERE clause columns also affect summary function processing. If there is a WHERE clause that references only columns that are specified in the SELECT clause, PROC SQL combines information into a single row of output. However, this condition is not covered in this chapter. For more information, see the SAS documentation for the SQL procedure.

In the next few sections, look more closely at the query examples shown above to see how the first three factors impact summary function processing.

Compare two PROC SQL queries that contain a summary function: one with a single argument and the other with multiple arguments. To keep things simple, these queries do not contain a GROUP BY clause.

Using a Summary Function with a Single Argument (Column)

Below is a PROC SQL query that displays the average salary of all employees listed in the table Sasuser.Payrollmaster:

 proc sql;
    select avg(salary) as AvgSalary
       from sasuser.payrollmaster;

The SELECT statement contains the summary function AVG with Salary as its argument. Because there is only one column as an argument, the function calculates the statistic down the Salary column to display a single value: the average salary for all employees. The output is shown here.

Using a Summary Function with Multiple Arguments (Columns)

Consider a PROC SQL query that contains a summary function with multiple columns as arguments. This query calculates the total number of passengers for each flight in March by adding the number of boarded, transferred, and nonrevenue passengers:

 proc sql outobs=10;
    select sum(boarded,transferred,nonrevenue)
           as Total
       from sasuser.marchflights;

The SELECT clause contains the summary function SUM with three columns as arguments. Because the function contains multiple arguments, the statistic is calculated across the three columns for each row to produce the following output.

Note: Without the OUTOBS= option, all rows in the table would be displayed in the output.

Consider how a PROC SQL query with a summary function is affected by including a GROUP BY clause and including columns outside of a summary function.

Using a Summary Function without a GROUP BY Clause

Once again, here is the PROC SQL query that displays the average salary of all employees listed in the table Sasuser.Payrollmaster. This query contains a summary function but, since the goal is to display the average across all employees, there is no GROUP BY clause.

 proc sql outobs=20;
    select avg(salary) as AvgSalary
       from sasuser.payrollmaster;

Note that the SELECT clause lists only one column: a new column that is defined by a summary function calculation. There are no columns listed outside of the summary function.

Here is the query output.

Using a Summary Function with Columns outside of the Function

Suppose you calculate an average for each job group and group the results by job code. Your first step is to add an existing column (JobCode) to the SELECT clause list. The modified query is shown here:

proc sql outobs=20;
   select jobcode, avg(salary) as AvgSalary
      from sasuser.payrollmaster;

Consider what the query output looks like now that the SELECT statement contains a column (JobCode) that is not a summary function argument.

Note: Remember that this PROC SQL query uses the OUTOBS= option to limit the output to 20 rows. Without this limitation, the output of this query would display all 148 rows in the table.

As this result shows, adding a column to the SELECT clause that is not within a summary function causes PROC SQL to output all rows instead of a single value. To generate this output, PROC SQL

calculated the average salary down the column as a single value (54079.62)
displayed all rows in the output, because JobCode is not specified in a summary function.

Therefore, the single value for AvgSalary is repeated for each row.

Note: When this query is submitted, the SAS log displays a message indicating that data remerging has occurred. Data remerging is explained later in this chapter.

This result is interesting, but you have not yet reached your goal: grouping the data by JobCode. The next step is to add the GROUP BY clause.

Using a Summary Function with a GROUP BY Clause

Below is the PROC SQL query from the previous page, to which has been added a GROUP BY clause that specifies the column JobCode. (In the SELECT clause, JobCode is specified but is not used as a summary function argument.) Other changes to the query include removing the OUTOBS= option (it is unnecessary) and specifying a format for the AvgSalary column.

 proc sql;
    select jobcode, 
           avg(salary) as AvgSalary format=dollar11.2
       from sasuser.payrollmaster
       group by jobcode;

Consider how the addition of the GROUP BY clause affects the output.

Success! The summary function has been calculated for each JobCode group, and the results are grouped by JobCode.

Counting Values By Using the COUNT Summary Function

Sometimes you want to count the number of rows in an entire table or in groups of rows. In PROC SQL, you can use the COUNT summary function to count the number of rows that have nonmissing values. There are three main ways to use the COUNT function.

Using this form of COUNT...	Returns...	Example
COUNT(*)	the total number of rows in a group or in a table	select count(*) as Count
COUNT(column)	the total number of rows in a group or in a table for which there is a nonmissing value in the selected column	select count(jobcode) as Count
COUNT(DISTINCT column)	the total number of unique values in a column	select count(distinct jobcode) as Count

CAUTION:

The COUNT summary function counts only the nonmissing values; missing values are ignored. Many other summary functions also ignore missing values. For example, the AVG function returns the average of the nonmissing values only. When you use a summary function with data that contains missing values, the results might not provide the information that you expect. It is a good idea to familiarize yourself with the data before you use summary functions in queries.

Tip

To count the number of missing values, use the NMISS function. For more information about the NMISS function, see the SAS documentation.

Consider the three ways of using the COUNT function.

Counting All Rows

Suppose you want to know how many employees are listed in the table Sasuser. Payrollmaster. This table contains a separate row for each employee, so counting the number of rows in the table gives you the number of employees. The following PROC SQL query accomplishes this task:

 proc sql;
    select count(*) as Count
       from sasuser.payrollmaster;

Note: The COUNT summary function is the only function that enables you to use an asterisk (*) as an argument.

You can also use COUNT(*) to count rows within groups of data. To do this, you specify the groups in the GROUP BY clause. Consider a more complex PROC SQL query that uses COUNT(*) with grouping. This time, the goal is to find the total number of employees within each job category, using the same table that is used above.

proc sql;
   select substr(jobcode,1,2)
          label='Job Category',
          count(*) as Count
      from sasuser.payrollmaster
      group by 1;

This query defines two new columns in the SELECT clause. The first column that is labeled JobCategory, is created by using the SAS function SUBSTR to extract the two-character job category from the existing JobCode field. The second column, Count, is created by using the COUNT function. The GROUP BY clause specifies that the results are to be grouped by the first defined column (referenced by 1 because the column was not assigned a name).

CAUTION:

When a column contains missing values, PROC SQL treats the missing values as a single group. This can sometimes produce unexpected results.

Counting All Non-Missing Values in a Column

Suppose you want to count all of the nonmissing values in a specific column instead of in the entire table. To do this, you specify the name of the column as an argument of the COUNT function. For example, the following PROC SQL query counts all nonmissing values in the column JobCode:

 proc sql;
    select count(JobCode) as Count
       from sasuser.payrollmaster;

Because the table has no missing data, you get the same output with this query as you would by using COUNT(*). JobCode has a nonmissing value for each row in the table. However, if the JobCode column contained missing values, this query would produce a lower value of Count than the previous query. For example, if JobCode contained three missing values, the value of Count would be 145.

Counting All Unique Values in a Column

To count all unique values in a column, add the keyword DISTINCT before the name of the column that is used as an argument. For example, here is the previous query modified to count only the unique values:

proc sql;
   select count(distinct jobcode) as Count
      from sasuser.payrollmaster;

This query counts 16 unique values for JobCode.

To display the unique JobCode values, you can apply the method of eliminating duplicates, which was discussed earlier. The following query lists only the unique values for JobCode.

proc sql;
   select distinct jobcode
      from sasuser.payrollmaster;

There are 16 job codes, so the output contains 16 rows.

Selecting Groups By Using the HAVING Clause

You have seen how to use the GROUP BY clause to group data. For example, the following query calculates the average salary within each job-code group, and displays the average for each job code:

 proc sql;
    select jobcode, 
           avg(salary) as AvgSalary
           format=dollar11.2
       from sasuser.payrollmaster
       group by jobcode;

There are 16 job codes in the table, so the output displays 16 rows.

Now, suppose you want to select only a subset of groups for your query output. You can use a HAVING clause, following a GROUP BY clause, to select (or filter) the groups to be displayed. The way a HAVING clause affects groups is similar to how a WHERE clause affects individual rows. As in a WHERE clause, the HAVING clause contains an expression that is used to subset the data. Any valid SAS expression can be used. When you use a HAVING clause, PROC SQL displays only the groups that satisfy the HAVING expression.

Note: You can use summary functions in a HAVING clause but not in a WHERE clause, because a HAVING clause is used with groups, but a WHERE clause can be used only with individual rows.

Modify the query shown above so that it selects only the JobCode groups with an average salary of more than $56,000. The HAVING clause has been added at the end of the query.

 proc sql;
    select jobcode, 
           avg(salary) as AvgSalary
           format=dollar11.2
       from sasuser.payrollmaster
       group by jobcode
       having avg(salary) > 56000;

Tip

Alternatively, because the average salary is already calculated in the SELECT clause, the HAVING clause could specify the column alias AvgSalary:

having AvgSalary > 56000

Note that you do not have to specify the keyword CALCULATED in a HAVING clause; you would have to specify it in a WHERE clause.

The query output is shown below. This output is smaller than the previous output, because only a subset of the job-code groups is displayed.

Average salaries over $56,000 for each JobCode

If you omit the GROUP BY clause in a query that contains a HAVING clause, then the HAVING clause and summary functions (if any are specified) treat the entire table as one group. Without a GROUP BY clause, the HAVING clause in the example shown above calculates the average salary for the table as a whole (all jobs in the company), not for each group (each job code). The output contains either all the rows in the table (if the average salary for the entire table is greater than $56,000) or none of the rows in the table (if the average salary for the entire table is less than $56,000).

Understanding Data Remerging

Sometimes, when you use a summary function in a SELECT clause or a HAVING clause, PROC SQL must remerge data (make two passes through the table). Remerging requires additional processing time and is often unavoidable. However, there are some situations in which you might be able to modify your query to avoid remerging. Understanding how and when remerging occurs increases your ability to write efficient queries.

Consider a PROC SQL query that requires remerging. This query calculates each navigator's salary as a percentage of all navigators' salaries:

proc sql;
   select empid, salary,
          (salary/sum(salary)) as Percent
          format=percent8.2
      from sasuser.payrollmaster
      where jobcode contains 'NA';

When you submit this query, the SAS log displays the following message.

Table 2.4 SAS Log
`NOTE: The query requires remerging summary statistics back with the original data.`

Remerging occurs whenever any of the following conditions exist:

The values returned by a summary function are used in a calculation.
The SELECT clause specifies a column that contains a summary function and other column(s) that are not listed in a GROUP BY clause.
The HAVING clause specifies one or more columns or column expressions that are not included in a subquery or a GROUP BY clause.

During remerging, PROC SQL makes two passes through the table:

PROC SQL calculates and returns the value of summary functions. PROC SQL also groups data according to the GROUP BY clause.
PROC SQL retrieves any additional columns and rows that it needs to display in the output, and uses the result from the summary function to calculate any arithmetic expressions in which the summary function participates.

Example

Consider how PROC SQL remerges data when it processes the following query:

proc sql;
   select empid, salary,
          (salary/sum(salary)) as Percent
          format=percent8.2
      from sasuser.payrollmaster
      where jobcode contains 'NA';

In the first pass, for each row in which the jobcode contains 'NA', PROC SQL calculates and returns the value of the SUM function (specified in the SELECT clause).

In the second pass, PROC SQL retrieves the additional columns and rows that it needs to display in output (EmpID, Salary) and the rows in which JobCode contains 'NA'. PROC SQL also uses the result from the SUM function to calculate the arithmetic expression (salary/sum(salary)).

CAUTION:

Some implementations of SQL do not support remerging and would consider the preceding example to be in error.

Tip

You can obtain the same results by using a subquery. Subqueries are discussed later in this chapter.

Table of Contents for Summarizing and Grouping Data

Create new playlist

Sign In

Sign Up

Summarizing and Grouping Data

Overview

Number of Arguments and Summary Function Processing

Groups and Summary Function Processing

SELECT Clause Columns and Summary Function Processing

Using a Summary Function with a Single Argument (Column)

Using a Summary Function with Multiple Arguments (Columns)

Using a Summary Function without a GROUP BY Clause

Using a Summary Function with Columns outside of the Function

Using a Summary Function with a GROUP BY Clause

Counting Values By Using the COUNT Summary Function

Counting All Rows

Counting All Non-Missing Values in a Column

Counting All Unique Values in a Column

Selecting Groups By Using the HAVING Clause

Understanding Data Remerging

Example

Table of Contents for
Summarizing and Grouping Data