Comparing Tools for Summarizing Data

Overview

SAS provides a variety of tools for summarizing data. These summarization tools generate similar but not identical output, and they vary in efficiency.
Note: Throughout this section, all references to the MEANS procedure apply also to the SUMMARY procedure.
Tool
Description
MEANS procedure or SUMMARY procedure
  • computes descriptive statistics for numeric variables
  • can produce a printed report and create an output data set
TABULATE procedure
  • produces descriptive statistics in a tabular format
  • can produce 1-, 2-, or 3-dimensional tables with descriptive statistics
  • can also create an output data set
REPORT procedure
  • combines features of the PRINT, MEANS, and TABULATE procedures with features of the DATA step in a single report-writing tool that can produce a variety of reports
  • can also create an output data set
SQL procedure
  • computes descriptive statistics for one or more SAS data sets or DBMS tables
  • can produce a printed report or create a SAS data set
DATA step
  • can produce a printed report
  • can also create an output data set
Note: You can also use the FREQ and UNIVARIATE procedures to generate summary data reports, but these procedures are not covered in this chapter. For more information about any of these summarization tools, see the SAS documentation for PROC FREQ and PROC UNIVARIATE in the Base SAS Procedures Guide.
Any of these tools can summarize the entire data set or by any combination of one or more class variables.
To group data, PROC MEANS, PROC SUMMARY, and PROC TABULATE use a CLASS statement. PROC SQL uses a GROUP BY clause. Every tool except PROC SQL accepts a BY statement.

Comparing Resource Usage across Summarization Tools

When summarizing data for one or more class variables, the tools in each of the following groups are similar in resource usage:
  • PROC MEANS (or PROC SUMMARY), PROC REPORT, and PROC TABULATE
  • PROC SQL and the DATA step with PROC SORT
However, the relative efficiency of the two groups of tools can vary based on the number of values of the CLASS variables. You need to test the techniques with your data.

Comparative Example: Displaying Summary Statistics for One Class Variable

Overview

Suppose you want to summarize the data set Retail.Orders by calculating the average quantity of products sold for each value of the class variable Order_Type. You can use several techniques to produce the summary report.
The following programs compare five techniques. You can use these samples as models for creating benchmark programs in your own environment. Your results might vary depending on the structure of your data, your operating environment, and the resources that are available at your site.

Programming Techniques

1 PROC MEANS
This PROC MEANS step creates a report that displays the mean of the analysis variable Quantity for each value of the class variable Order_Type.
proc means data=retail.orders 
     (keep=order_type quantity)
     mean maxdec=2;
   class order_type;
   var quantity;
run;
2 PROC REPORT
This PROC REPORT step creates a report that displays the mean of the analysis variable Quantity for each value of the class variable Order_Type.
proc report data=retail.orders
     (keep=order_type quantity);
   column order_type quantity;
   define order_type / group width=13
          'Order Type';
   define quantity / mean format=5.2
          'Average Quantity'
          width=8;
run; 
3 PROC SORT and a DATA Step
This program uses a PROC SORT step and a DATA step to create a report. The PROC SORT step sorts the data by the values of the variable Order_Type, keeps only the necessary variables, and generates the temporary output data set Orders. The DATA step calculates the mean of the analysis variable Quantity for each value of the BY variable Order_Type and displays these values in a report.
proc sort data=retail.orders 
     (keep=order_type quantity)
     out=orders;
   by order_type;
run;

data _null_;
   set orders;
   by order_type;
   format average_order 5.2;
   if first.order_type then do;
      num=0;
      sum=0;
   end;
   num+1;
   sum+quantity;
   if last.order_type then do;
      average_order=sum / num;
      file print;
      put @5 'Order Type' @20 'Average Order';
      put;
      put @13 Order_type 1. @27 Average_order 5.2;
   end;
run;
4 PROC SQL
This PROC SQL step creates a report that displays the mean of the analysis variable Quantity for each value of the group variable Order_Type.
proc sql;
   select order_type,
     avg(quantity) label='Average Order'
                   format=5.2
     from retail.orders
     group by order_type;
quit; 
5 PROC TABULATE
This PROC TABULATE step creates a report that displays the mean of the analysis variable Quantity for each value of the class variable Order_Type.
proc tabulate data=retail.orders
     (keep=order_type quantity)
     format=comma8.2;
   class order_type;
   var quantity;
   table order_type, quantity*mean;
run; 

General Recommendations

  • When summarizing data for one class variable, test your data to determine which summarization tools are most efficient.

Using PROC MEANS to Display Summary Statistics for Combinations of Class Variables

To produce summary statistics for combinations of class variables, you can use PROC MEANS in the following ways. These techniques differ in resource usage.
Combinations of Class Variables
Technique
Example
all possible combinations:
a
b
c
a * b
a * c
b * c
a * b * c
basic PROC MEANS step
proc means data=lib.dataset mean;
   class a b c;
   var salary;
   output out=summary1 
          mean=average;
run;
specific combinations:
a * b and a * c
TYPES statement in PROC MEANS
proc means data=lib.dataset mean;
   class a b c;
   var salary;
   types a*b a*c;
   output out=summary2 
          mean=average;
run;
specific combinations:
a * b and a * c
NWAY option in multiple PROC MEANS steps
proc means data=lib.dataset nway;
   class a b;
   var salary;
   output out=summary3a 
          mean=average;
run;

proc means data=lib.dataset nway;
   class a c;
   var salary;
   output out=summary3b
          mean=average;
run;
specific combinations:
a * b and a * c
WHERE= option in the OUTPUT statement in PROC MEANS
proc means data=lib.dataset;
   class a b c;
   var salary;
   output out=summary4
          (where=(_type_ in (5,3)))
          n=employees
          mean=average;
run;

Comparing Resource Usage across Three Techniques for Using PROC MEANS

The three techniques for summarizing data for specific combinations of class variables (all but the basic PROC MEANS step) differ in resource usage as follows:
  • The TYPES statement in a PROC MEANS step uses the fewest resources.
  • A program that contains the NWAY option in multiple PROC MEANS steps uses the most resources because SAS must read the data set separately for each PROC MEANS step. The NWAY option in a single PROC MEANS step is efficient.
  • The WHERE= data set option in a PROC MEANS step uses more resources than the TYPES statement in PROC MEANS because SAS must calculate all possible combinations of class variables before subsetting. However, the WHERE= data set option in PROC MEANS uses fewer resources than the NWAY option in multiple PROC MEANS steps.
We learn how to use a basic PROC MEANS step and the three other techniques that are listed above.

Using a Basic PROC MEANS Step to Combine All Class Variables

PROC MEANS (or PROC SUMMARY) creates the following:
  • An output report that groups data and displays summary statistics for the combination of all class variables. This is the default action.
  • If an OUTPUT statement appears, PROC MEANS creates an output data set with summary statistics for all possible combinations of the n class variables (from 1-way to n-way), as well as for the entire data set

Example: Displaying Summary Statistics for All Combinations of the Class Variables

Suppose you want to calculate average employee salaries and group results for the combination of the three class variables Employee_Country, Department, and Employee_Gender.
The following PROC MEANS program creates both a report data set and a SAS data set:
proc means data=company.organization_dim mean;
   class employee_country department 
         employee_gender;
   var salary;
   output out=summary mean=average;
run;
The report displays summary statistics for every combination of the three class variables. A partial report is shown below:
Analysis Variable: Salary
The output data set contains summary statistics for the following:
  • all possible combinations (1-way, 2-way, and 3-way) of the three class variables:
    • Employee_Gender
    • Department
    • Employee_Country
    • Department and Employee_Gender
    • Employee_Country and Employee_Gender
    • Employee_Country and Department
    • Employee_Country and Department and Employee_Gender
  • the entire data set
A partial view of the output data set is shown below:
partial output

Understanding Types

Each combination of class variables that is used to calculate and group statistics for PROC MEANS is called a type.
For example, the following basic PROC MEANS step specifies the three class variables a, b, and c:
proc means data=lib.dataset mean;
   class a b c;
   var salary;
   output out=summary1 
          mean=average;
run;
This PROC MEANS step generates seven possible types (combinations of the three variables):
Variable Combined
Dimension
a
1-way
b
1-way
c
1-way
b * c
2-way
a * b
2-way
a * c
2-way
a * b * c
3-way
The _TYPE_ variable has a unique value for each combination of class variables, based on their order in the CLASS statement. For example, for each of the seven types (seven possible combinations of three class variables) shown above, SAS assigns a value to _TYPE_ as follows:
_TYPE_ Value
Description of Combination
Variables Combined
Dimension
1
rightmost variable only
c
1-way
2
middle variable only
b
1-way
3
rightmost variable and middle variable
b
*
c
2-way
4
leftmost variable
a
1-way
5
leftmost variable and rightmost variable
a
*
c
2-way
6
leftmost variable and middle variable
a
*
b
2-way
7
rightmost variable and middle variable and leftmost variable
a
*
b
*
c
3-way
As the number of class variables increases, so does the number of types. However, the highest _TYPE_ (7, in this example) always indicates the combination of all class variables.
SAS includes the _TYPE_ variable in the output data set generated by PROC MEANS. Observations are generated in order of increasing values of the _TYPE_ variable:
observations listed in order of increasing values
The first observation in the output data set has a _TYPE_ value of 0, which indicates that the statistics are generated for the entire data set.
By default, the output data set generated by PROC MEANS contains a separate observation for each unique combination of class variable values within each type. Each unique combination of values within a type is called a level of that type. In the output data set linked above, there are 17 levels for type 2. Therefore, 17 observations have a _TYPE_ value of 2.
The report generated by PROC MEANS contains only the combinations of all class variables, _TYPE_=7. _TYPE_ is not displayed in the report.

Using the TYPES Statement in PROC MEANS to Combine Class Variables

The TYPES statement specifies the desired combinations of class variables. The CLASS statement is required with the TYPES statement.
General form, TYPES statement:
TYPES request(s);
Here is an explanation of the syntax:
request(s)
specifies the desired combination or combinations of class variables. A request includes one of the following:
  • one class variable name
  • several class variable names separated by asterisks
  • ( ) to request overall results (_TYPE_=0)
To request combinations of class variables more concisely, you can use a grouping syntax by placing parentheses around several variables and joining other variables or variable combinations. The following examples of TYPES statements illustrate the use of grouping syntax:
Example with Grouping Syntax
Equivalent Example without Grouping Syntax
types a*(b c);
types a*b a*c;
types (a b)*(c d);
types a*c a*d b*c b*d;
types (a b c)*d;
types a*d b*d c*d;
types () a*(b c);
types a*b*c a*b a*c;

Example: Using the TYPES Statement in PROC MEANS

Suppose you want to calculate average employee salaries, as in the previous example. This time, you want the two combinations of class variables shown below:
  • Employee_Country and Department
  • Employee_Country and Employee_Gender
To do this, you can add a TYPES statement to the PROC MEANS step:
proc means data=company.organization_dim mean;
   class employee_country department 
         employee_gender;
   var salary;
   types employee_country*department
         employee_country*employee_gender;
   output out=summary mean=average;
run;
This PROC MEANS step generates both a report data set and an output data set. The report, shown below, has a separate table for each of the two combinations specified in the TYPES statement:
Analysis Variable: Salary Annual Salary
The output data set includes only the combinations that are specified in the TYPES statement. A partial view of the output data set is shown below:
partial output

Using the NWAY Option in PROC MEANS to Combine Class Variables

Another way to specify a combination of class variables is to use the NWAY option in PROC MEANS:
General form, NWAY option in the PROC MEANS statement:
PROC MEANS NWAY;
Here is an explanation of the syntax:
NWAY
specifies that the output data set contains statistics for the combination of all specified class variables (only observations with the highest _TYPE_ value).
The NWAY option generates summary statistics for every combination of all class variables. Therefore, to generate statistics for different combinations of class variables, you can specify a separate PROC MEANS step with the NWAY option for each combination.

Example: Using the NWAY Option in Multiple PROC MEANS Steps

Suppose you want to calculate average employee salaries and to group results for the following combinations of class variables:
  • Employee_Country and Department
  • Employee_Country and Employee_Gender
You can use two PROC MEANS steps, each containing the NWAY option, as shown below. The first PROC MEANS step generates statistics for the first combination of class variables, and the second PROC MEANS step generates statistics for the second combination of class variables.
proc means data=company.organization_dim nway;
   class employee_country department;
   var salary;
   output out=summary1
          n=employees
          mean=average;
run;

proc means data=company.organization_dim nway;
   class employee_country  employee_gender;
   var salary;
   output out=summary2
          n=employees
          mean=average;
run;
When processing this program, SAS must read the data set once for each PROC MEANS step. This processing is not efficient.
This program generates two reports and two output data sets. The report, shown in part below, has a separate table for each PROC MEANS step:
Analysis variable: salary annual salary
partial output
A partial view of each output data set is shown below:
Figure 23.2 SAS Data Set Work.Summary1
data set Work.Summary1
Figure 23.3 SAS Data Set Work.Summary2
SAS data set Work.Summary2

Using the WHERE= Output Data Set Option in PROC MEANS to Select Desired Types

Yet another way to select desired types is to use the WHERE= output data set option in the OUTPUT statement:
General form, WHERE= output data set option in a basic OUTPUT statement:
OUTPUT <OUT=SAS-data-set> (WHERE=
(where-expression-1 <logical-operator where-expression-n>));
Here is an explanation of the syntax:
SAS-data-set
specifies the output data set as a 1-level or 2-level name.
where-expression
is an arithmetic or logical expression that consists of a sequence of operators, operands, and SAS functions. The expression must be enclosed in parentheses.
logical-operator
can be AND, AND NOT, OR, or OR NOT.
When you use the WHERE= output data set option in the OUTPUT statement, SAS must calculate all possible combinations of class variables. Subsetting does not occur until the results are written to the output data set.

Example: Using the WHERE= Output Data Set Option in PROC MEANS

Suppose you want to calculate average employee salaries and select results for two 2-way combinations of the three class variables Employee_Country, Department, and Employee_Gender. All possible combinations of these variables are listed below:
_TYPE_ Value
Variables Combined
Dimension
1
Employee_Gender
1-way
2
Department
1-way
3
Department * Employee_Gender
2-way
4
Employee_Country
1-way
5
Employee_Country * Employee_Gender
2-way
6
Employee_Country * Department
2-way
7
Employee_Country * Department * Employee_Gender
3-way
To specify the types by _TYPE_ value, you can use the WHERE= output data set option in the OUTPUT statement as shown below:
proc means data=company.organization_dim;
   class employee_country department 
         employee_gender;
   var salary;
   output out=summary 
          (where=(_type_ in (5,6)))
          n=employees
          mean=average;
run;
A partial view of the report is shown below. The PROC MEANS report represents the highest type, the NWAY combination, type 7, which was not requested.
partial output
A partial view of the output data set Work.Summary is shown below. The output data set includes types 5 and 6, as requested.
Work.Summary
Next, compare the resources used by these summarization techniques:
  • the TYPES statement in PROC MEANS
  • the NWAY option in multiple PROC MEANS steps
  • the WHERE= output data set option in PROC MEANS

Comparative Example: Displaying Summary Statistics for Combinations of Class Variables

Overview

Suppose you want to summarize the data set Retail.Organization by calculating average employee salaries for two 3-way combinations of four class variables:
  • Employee_Country, Department, and Employee_Gender
  • Department, Section, and Employee_Gender
You can use several techniques to produce a report and one or more output data sets.
The following programs compare three techniques. You can use these samples as models for creating benchmark programs in your own environment. Your results might vary depending on the structure of your data, your operating environment, and the resources that are available at your site.

Programming Techniques

1 TYPES Statement in PROC MEANS
This program calculates the average employee salary for two 3-way combinations of the class variables Employee_Country, Department, Employee_Gender, and Section. The TYPES statement requests the two combinations. The program generates a report data set and an output data set named Summary.
proc means data=retail.organization mean;
   class employee_country department
         employee_gender section;
   var salary;
   types employee_country*department*employee_gender
         department*section*employee_gender;
   output out=summary
          n=employees
          mean=average;
run;
2 NWAY Option in Two PROC MEANS Steps
Each of the two PROC MEANS steps in this program calculates the average employee salary for a combination of three of the four class variables Employee_Country, Department, Employee_Gender, and Section. In each step, the NWAY option specifies that all three variables that are specified in the CLASS statement should be combined. The program generates two reports and two output data sets named Summary1 and Summary2.
proc means data=retail.organization nway;
   class employee_country department
         employee_gender;
   var salary;
   output out=summary1
          n=employees
          mean=average;
run;
	 
proc means data=retail.organization nway;
   class department section
         employee_gender;
   var salary;
   output out=summary2
          n=employees
          mean=average;
run;
3 WHERE= Option in PROC MEANS
This program calculates the average employee salary for two 3-way combinations of the class variables Employee_Country, Department, Employee_Gender, and Section. The WHERE= data set option in the OUTPUT statement specifies the two combinations by their _TYPE_ values. The program generates a report and an output data set named Summary3.
proc means data=retail.organization;
   class employee_country department
         employee_gender section;
   var salary;
   output out=summary3 (where=(_type_ in (7,14)))
          n=employees
          mean=average;
run;

General Recommendations

  • To summarize data for particular combinations of class variables, use the TYPES statement in PROC MEANS.

Additional Features

The WAYS statement in PROC MEANS provides yet another way to display summary statistics for combinations of class variables. In the WAYS statement, you specify one or more integers that define the number of class variables to combine in order to form all the unique combinations of class variables.
For example, the following program uses the WAYS statement to create summary statistics for the following combinations of the three class variables Employee_Country, Department, and Employee_Gender:
  • each individual variable (all 1-way combinations)
  • all 2-way combinations (Employee_Country and Department, Employee_Country and Employee_Gender, and Employee_Gender and Department)
proc means data=company.organization mean;
   class employee_country department
         employee_gender;
   var salary;
   ways 1 2;
   output out=summary
          mean=average;
run;
The WAYS statement can be used instead of or in addition to the TYPES statement.
Note: For more information about the WAYS statement, see the SAS documentation.
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.118.31.67