Example 5.3 Computing Summary Data and Combining Them with Detail Data

Goal

Summarize several variables in a data set. Add these summarizations to each observation in a data set. Compute new variables by combining summary and detail variables.

Example Features

Featured StepDATA step
Featured Step Options and StatementsMultiple SET statements SET statement with END= option IF _N_=1 statement DO UNTIL loop
Related TechniquePROC SQL, remerging summary statistics
A Closer LookSelecting a Method to Combine Summary and Detail Data

Input Data Set

Data set SESSIONS contains the number of attendees from four groups at each of three training sessions.

                      SESSIONS Data Set

  Obs    session    group_a    group_b    group_c    group_d
   1     March          8         29         29         34
   2     June          19          3         27          8
   3     August         9         32          2         15

Resulting Data Set

Output 5.3 ATTENDANCE_ FIGURES Data Set

                    Example 5.3 ATTENDANCE_FIGURES Data Set Created with DATA Step

                                                                                 t      p
                                                                                 o      c
                                                                                 t      t
                                                                                 _      _
       s       g    g    g    g    t    t    t    t                        o     s      s
       e       r    r    r    r    o    o    o    o                        v     e      e
       s       o    o    o    o    t    t    t    t    p    p    p    p    e     s      s
       s       u    u    u    u    a    a    a    a    c    c    c    c    r     s      s
 O     i       p    p    p    p    l    l    l    l    t    t    t    t    a     i      i
 b     o       _    _    _    _    _    _    _    _    _    _    _    _    l     o      o
 s     n       a    b    c    d    a    b    c    d    a    b    c    d    l     n      n

 1   March     8   29   29   34   36   64   58   57   22   45   50   60   215   100    47
 2   June     19    3   27    8   36   64   58   57   53    5   47   14   215    57    27
 3   August    9   32    2   15   36   64   58   57   25   50    3   26   215    58    27


Example Overview

Similar to Example 5.1, this program adds summary data to each observation in a data set. It computes summary data and adds them to each observation in a data set in the same DATA step. It does not start with a PROC MEANS step as in Example 5.1. Instead it reads the data set twice, first to calculate the summary statistics and second to add the summary data to each observation in the data set.

The DATA step accesses the same data set twice with two SET statements. A DO UNTIL loop at the top of the DATA step executes once on the first iteration of the DATA step. It reads every observation in the data set that is specified by the SET statement in the loop and computes summary statistics in this loop. The summary statistics that are computed in the DO UNTIL loop remain in the Program Data Vector (PDV) throughout execution of the DATA step.

The second SET statement reads the data set again. This time the DATA step computes the percentage statistics by using the observation's values and the summary values that are retained in the PDV throughout execution of the DATA step.

Data set SESSIONS tracks the number of attendees from four groups (A, B, C, and D) at a training session held in March, June, and August. The goal is to calculate attendance totals by group and monthly attendance percentages by group. The output variables TOTAL_A, TOTAL_B, TOTAL_C, and TOTAL_D record the total number of attendees in each group over the three months. These variables are used as the denominator for the calculation of the percentage variables, PCT_A, PCT_B, PCT_C, and PCT_D. The PCT variables are the percentage that a month's group session contributed to the total attendance of the group.

The statistics that are computed by the DATA step are relatively simple to code. If your data are more complicated or if you have more variables to analyze, you might want to use a procedure to compute the statistics for you. The "A Closer Look" section at the end of this example discusses the selection of methods to combine summary and detail data sets.

Program

Create data set ATTENDANCE_FIGURES. Specify variable attributes and arrays for the DATA step. Define the TOTALS array, which will contain the total number of participants in each group. Define the PCTS array, which will contain the percentage of total attendees for a group at a session.

Process a DO UNTIL loop only on the first iteration of the DATA step. Set the DO UNTIL condition to be the value of variable EOF, which is defined by the END= option in the following SET statement. Read data set SESSIONS during the first iteration. Define EOF so that it can be determined when SAS reaches the end of SESSIONS. During the first iteration of the DATA step, tally in array TOTALS the total number of attendees in each group and tally in variable OVERALL the total number of attendees in all groups.

Read data set SESSIONS again. Initialize variable TOT_SESSION, which will contain the total number of attendees from all groups at one session. Compute the percentage of total attendees by group at a session. Use as the denominator the totals for the groups that were computed in the DO UNTIL loop on the first iteration of the DATA step.

Tally the number of attendees from all groups at one session.

Compute the percentage of total attendees for all groups at a session.

data attendance_figures;

  attrib session length=$8 label='Session Attended';

  array groups{*} group_a group_b group_c group_d;
  array totals{*} total_a total_b total_c total_d;
  array pcts{*} pct_a pct_b pct_c pct_d;


  drop i;
  if _n_=1 then do until(eof);




    set sessions end=eof;


    do i=1 to dim(totals);
      totals{i}+groups{i};
      overall+groups{i};
    end;

  end;
  set sessions;
  tot_session=0;


  do i=1 to dim(pcts);
    pcts{i}=round(100*groups{i}/totals{i},1);

    tot_session+groups{i};

  end;
  pct_session=round(100*tot_session/overall,1);

run;

Related Technique

The following PROC SQL step creates a table equivalent to the data set that was created by the DATA step in the main example. The SELECT query combines the detail information from table SESSIONS with the summary statistics it produces by using the SUM aggregate function. This process is called remerging, and when a query remerges data, PROC SQL displays a note in the SAS log that this has occurred.

Create table ATTENDANCE_FIGURES. Select all columns from SESSIONS. Compute the percentage of total attendees for a group at each session. Use the SUM aggregate function to obtain the total number of attendees in each group. Specify aliases for the four new columns. Find the total number of attendees from all groups at one session. Specify an alias for the calculated column. Use the SUM aggregate function to find the total number of attendees from all groups at all sessions. Specify an alias for the calculated column. Compute the percentage of total attendees for all groups at each session. Place the CALCULATED keyword before each of the two column alias references, which were computed earlier in the SELECT query.

proc sql;
  create table attendance_figures as

    select *,
     round(group_a/sum(group_a)*100,1) as pct_a,
     round(group_b/sum(group_b)*100,1) as pct_b,
     round(group_c/sum(group_c)*100,1) as pct_c,
     round(group_d/sum(group_d)*100,1) as pct_d,


     group_a+group_b+group_c+group_d as tot_session,

     sum(group_a)+sum(group_b)+sum(group_c)+
                     sum(group_d) as overall,

     round( calculated tot_session/
               calculated overall*100,1) as pct_session



  from sessions;
quit;

A Closer Look

Selecting a Method to Combine Summary and Detail Data

Having a general knowledge of the multiple methods for combining summary and detail data will help you select the method to combine your summary and detail data. In making this determination, you need to consider SAS coding requirements and have an understanding of your input data sets and the kind of output you need to produce.

The preceding example processed data set SESSIONS twice within a DATA step. It computed percentages for different cuts of the data and saved them in output data set ATTENDANCE_FIGURES. The DATA step included a simple tally of observations in the four different groups. This tally could have been computed instead by a PROC MEANS step similar to Example 5.1. The output data set could then have been combined with data set SESSIONS. Input data set SESSIONS is small, so either way the processing proceeds quickly. When your input data sets are large and your combinations are complex, you might want to select the method that reads your data set the fewest times or is the easiest to code.

The following program revises the main example to include the PROC MEANS step that was described in the previous paragraph.

proc means data=sessions;
  var group_a group_b group_c group_d;
  output out=groupsums sum=total_a total_b total_c total_d;
run;
data attendance_figures;
  attrib session length=$8 label='Session Attended';

  if _n_=1 then do;
    set groupsums(keep=total_a total_b total_c total_d);
    overall+total_a+total_b+total_c+total_d;
  end;

  array groups{*} group_a group_b group_c group_d;
  array totals{*} total_a total_b total_c total_d;
  array pcts{*} pct_a pct_b pct_c pct_d;

  drop i;

  set sessions;
  tot_session=0;
  do i=1 to dim(pcts);
    pcts{i}=round(100*groups{i}/totals{i},1);
    tot_session+groups{i};
  end;
  pct_session=round(100*tot_session/overall,1);
run;

The goal of this example and the other in this chapter is to produce an output data set. If your goal is to produce a report instead, most likely you will want to choose a procedure such as TABULATE or REPORT to compute and present the percentages.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.147.8.91