Example 8.12 Collapsing Observations within a BY Group into a Single Observation

Goal

Rearrange a data set by changing selected variables in a group of observations to a group of variables in one observation. Reshape data by collapsing observations within a BY group into a single observation in order to simplify data analysis and report generation.

The example uses the same input data set as Example 8.9.

Example Features

Featured StepDATA step
Featured Step Options and StatementsBY statement

FIRST.variable and LAST.variable temporary variables

OUTPUT statement
Related TechniquePROC TRANSPOSE, PREFIX= option, BY, ID, RETAIN, and VAR statements DATA step, match-merging by value and merging a data set with itself

Input Data Set

Data set WEIGHT_BMI contains weight, BMI, and session information for four study participants over a four-week period. Only ID S003 had data recorded for all four weeks.

                    WEIGHT_BMI

 Obs     id     week    weight     bmi    session
   1    S001      1       231     29.7      AM1
   2    S001      2       223     28.6      AM2
   3    S002      1       187     28.4      AM1
   4    S002      3       176     26.8      AM2
   5    S003      1       154     27.3      PM1
   6    S003      2       151     26.7      AM1
   7    S003      3       148     26.2      PM1
   8    S003      4       142     25.2      PM1
   9    S004      1       134     25.3      PM3
  10    S004      2       133     25.1      PM3
  11    S004      4       129     24.4      PM3

Resulting Data Set

Output 8.12a WTRESULTS Data Set

                Example 8.12 WTRESULTS Data Set Created with PROC TRANSPOSE and DATA Step

                                            bmi_  bmi_  bmi_  bmi_  session_ session_ session_  session_
Obs  id wt_week1 wt_week2 wt_week3 wt_week4 week1 week2 week3 week4   week1    week2    week3       week4

 1  S001   231      223        .        .    29.7  28.6    .     .    AM1       AM2
 2  S002   187        .      176        .    28.4    .   26.8    .    AM1                AM2
 3  S003   154      151      148      142    27.3  26.7  26.2  25.2   PM1       AM1      PM1         PM1
 4  S004   134      133        .      129    25.3  25.1    .   24.4   PM3       PM3                  PM3


Example Features

This example takes a data set with multiple observations per BY group and collapses each BY group's observations into one observation. The DATA step changes the shape of the input data from "long and narrow" to "short and wide." The output data set will have fewer observations and more variables.

The DATA step copies the values for several variables to arrays of new variables where each element of the array is a specific variable's value in one observation in the BY group.

The DATA step uses FIRST. and LAST. temporary variables to determine when it is processing the first observation and last observation in a BY group. When processing the first observation in a BY group, the DATA step initializes to missing all the elements in the arrays of new variables. A RETAIN statement retains the values of these new variables across observations. The DATA step outputs an observation to the data set only when processing the last observation in the BY group. When processing reaches the last observation in a BY group, all rearrangement of variables within a BY group is completed.

Data set WEIGHT_BMI contains up to four observations for each of four study participants over a four-week period. Each observation records weight, BMI, and session attendance for a subject for one week.

The goal of the DATA step is to rearrange the observations in WEIGHT_BMI so that there is only one observation per study participant. The values for weight, BMI, and session attendance are defined in data set WEIGHT_BMI as the three variables WEIGHT, BMI, and SESSION. The values of these three variables are to be moved to new variables. Variable WEEK tracks the week in the period for which the measurements were obtained for the study participant.

The DATA step defines an array for each of the three variables, WEIGHT, BMI, and SESSION. Their input data set values will be moved to new output data set variables that are defined by these arrays. Because it is known that measurements are recorded a maximum of four times per participant, the ARRAY statements specify four variables with each element's position identifying the week for which a measurement was obtained. The DATA step points to the correct array element by using variable WEEK as the array index value.

Because it is necessary to process WEIGHT_BMI in BY groups that are defined by the values of ID, data set WEIGHT_BMI is sorted by ID prior to the DATA step.

Program

Sort the observations in WEIGHT_BMI by the variable that defines the BY groups. Create data set WTRESULTS. Read the observations in WEIGHT_BMI. Process WEIGHT_BMI in BY groups. Define three arrays to hold the weekly measurements for each participant. Define the elements as serial variables from 1 to 4. Drop WEIGHT, BMI, and SESSION because their values are moved to the elements of the three arrays that were defined earlier. Drop WEEK because the suffix on the new variables identifies the week. Retain the values of the new variables across observations. When processing the first observation in a BY group, initialize all the new variables to missing.

When the value of week is inclusively between the expected values of 1 and 4, copy the current values of WEIGHT, BMI, and SESSION to the array elements that are pointed to by the value of WEEK.

When the value of WEEK is outside its expected range, write an error message. Output an observation only when processing the last observation in a BY group, which is after all rearrangement of the values of input variables WEIGHT, BMI, and SESSION.

proc sort data=weight_bmi;
  by id;
run;

data wtresults;
  set weight_bmi;

  by id;
  array wts{*} wt_week1-wt_week4;
  array bmis{*} bmi_week1-bmi_week4;
  array sessions{*} $ 3 session_week1-session_week4;

  drop i week weight bmi session;




  retain wt_week1-wt_week4 bmi_week1-bmi_week4
         session_week1-session_week4;
  if first.id then do;
    do i=1 to 4;
      wts{i}=.;
      bmis{i}=.;
      sessions{i}=' ';
    end;
  end;
  if 1 le week le 4 then do;
    wts{week}=weight;
    bmis{week}=bmi;
    sessions{week}=session;
  end;

else putlog "ERROR: Week value " week
            "out of range from 1 to 4 for id=" id;
if last.id then output;



run;

Related Technique

The following program uses PROC TRANSPOSE and the DATA step to produce a data set equivalent to the one that was produced by the DATA step in the main example. It uses one PROC TRANSPOSE step to transpose the two numeric variables, WEIGHT and BMI, and a second PROC TRANSPOSE step to transpose the character variable, SESSION.

If you transpose both character and numeric variables in the same PROC TRANSPOSE step, SAS converts the numeric variable values to character values. By submitting one step for the numeric variables and a second for the character variable, the transposed values of WEIGHT and BMI remain numeric.

PROC TRANSPOSE uses variable WEEK to determine how to name the transposed variables. However, if you had missing data for all observations for one week for one of the variables, a variable would not be created for that missing week. You should understand the limitations of your data and possibly specify ATTRIB, LENGTH, or ARRAY statements in the final DATA step to make sure all elements are defined.

For more information about creating variables with PROC TRANSPOSE, see "Naming the Variables in the Transposed Data Set" in Example 3.12 in the "A Closer Look" section.

The first PROC TRANSPOSE step saves two observations per value of the BY variable ID. One observation contains the transposed values of WEIGHT and the other observation contains the transposed values of BMI. These values are saved in the same four variables. To distinguish between the two observations for each ID, the variable _NAME_ that was created by PROC TRANSPOSE saves the name of the variable that created the transposed variables.

A DATA step match-merges the two data sets that PROC TRANSPOSE produced by ID. The MERGE statement lists data set TRANSWTBMI twice and data set TRANSSESSION once. The two occurrences of TRANSWTBMI in the MERGE statement cause the data set to be merged with itself. Its first placement in the MERGE statement applies a WHERE statement to select observations where the value of the _NAME_ variable is "WEIGHT". Its second placement applies a WHERE statement to select observations where the value of the _NAME_ variable is "BMI".

When you use PROC TRANSPOSE as in this example, you do not need to know how many occurrences of observations there are per BY group. This is an advantage over using the DATA step in the main example. However, when running the DATA step that match-merges the PROC TRANSPOSE data sets, you do need to know the maximum number of measurements made because it is necessary to rename the numeric transposed variables. A way around the renaming requirement is to execute PROC TRANSPOSE once for each variable, ensuring that the PREFIX= values are unique.

Output 8.12b shows intermediate data set TRANSWTBMI. Note the two observations per value of ID, each with a different value for _NAME_.

Output 8.12b TRANSWTBMI Data Set

                           TRANSWTBMI

       Obs     id     _NAME_    week1    week2    week3     week4

        1     S001    weight    231.0    223.0       .         .
        2     S001    bmi        29.7     28.6       .         .
        3     S002    weight    187.0       .     176.0        .
        4     S002    bmi        28.4       .      26.8        .
        5     S003    weight    154.0    151.0    148.0     142.0
        6     S003    bmi        27.3     26.7     26.2      25.2
        7     S004    weight    134.0    133.0       .      129.0
        8     S004    bmi        25.3     25.1       .       24.4


Output 8.12c shows intermediate data set TRANSSESSION.

Output 8.12c TRANSSESSION Data Set

                             TRANSSESSION

                    session_    session_    session_    session_
     Obs     id      week1       week2       week3       week4
      1     S001      AM1         AM2
      2     S002      AM1                     AM2
      3     S003      PM1         AM1         PM1         PM1
      4     S004      PM3         PM3                     PM3


Transpose data set WEIGHT_BMI. Save the transposed observations in TRANSWTBMI. Specify a prefix to use in constructing names for transposed variables in the output data set. Create one observation per BY value for each variable that is transposed. Transpose two variables. Specify that the values of WEEK name the transposed variables in TRANSWTBMI.

Transpose the character variable. Save the transposed data in TRANSSESSION. Specify a different PREFIX= value than in the previous PROC TRANSPOSE step.

Create data set WTRESULTS. Merge data set TRANSWTBMI with itself and with TRANSSESSION. On the first placement of TRANSWTBMI, select observations where the uppercased value of _NAME_ is "WEIGHT" and rename the four WEEK variables to reflect the source of the four values.

Select observations where the uppercased value of _NAME_ is "BMI" and rename the four WEEK variables to reflect the source of the four values.

Match-merge the three data sets.

Drop variable _NAME_ since it is no longer needed to identify the source of any of the variable values.

proc transpose data=weight_bmi out=transwtbmi


               prefix=week;

  by id;

  var weight bmi;
  id week;

run;
proc transpose data=weight_bmi out=transsession
               prefix=session_week;
  by id;
  var session;
  id week;
run;
data wtresults;
  merge transwtbmi(where=(upcase(_name_)='WEIGHT')
                   rename=(week1=wt_week1 week2=wt_week2
                              week3=wt_week3
                              week4=wt_week4))
        transwtbmi(where=(upcase(_name_)='BMI')
                    rename=(week1=bmi_week1
                              week2=bmi_week2
                              week3=bmi_week3
                              week4=bmi_week4))

        transsession;

    by id;

  drop _name_;


run;

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.225.57.223