Rearrange a data set by changing selected variables in a group of observations to a group of variables in one observation. Reshape data by collapsing observations within a BY group into a single observation in order to simplify data analysis and report generation.
The example uses the same input data set as Example 8.9.
Featured Step | DATA step |
Featured Step Options and Statements | BY statement
FIRST.variable and LAST.variable temporary variables OUTPUT statement |
Related Technique | PROC TRANSPOSE, PREFIX= option, BY, ID, RETAIN, and VAR statements DATA step, match-merging by value and merging a data set with itself |
Data set WEIGHT_BMI contains weight, BMI, and session information for four study participants over a four-week period. Only ID S003 had data recorded for all four weeks.
WEIGHT_BMI Obs id week weight bmi session 1 S001 1 231 29.7 AM1 2 S001 2 223 28.6 AM2 3 S002 1 187 28.4 AM1 4 S002 3 176 26.8 AM2 5 S003 1 154 27.3 PM1 6 S003 2 151 26.7 AM1 7 S003 3 148 26.2 PM1 8 S003 4 142 25.2 PM1 9 S004 1 134 25.3 PM3 10 S004 2 133 25.1 PM3 11 S004 4 129 24.4 PM3
Output 8.12a WTRESULTS Data SetExample 8.12 WTRESULTS Data Set Created with PROC TRANSPOSE and DATA Step bmi_ bmi_ bmi_ bmi_ session_ session_ session_ session_ Obs id wt_week1 wt_week2 wt_week3 wt_week4 week1 week2 week3 week4 week1 week2 week3 week4 1 S001 231 223 . . 29.7 28.6 . . AM1 AM2 2 S002 187 . 176 . 28.4 . 26.8 . AM1 AM2 3 S003 154 151 148 142 27.3 26.7 26.2 25.2 PM1 AM1 PM1 PM1 4 S004 134 133 . 129 25.3 25.1 . 24.4 PM3 PM3 PM3 |
This example takes a data set with multiple observations per BY group and collapses each BY group's observations into one observation. The DATA step changes the shape of the input data from "long and narrow" to "short and wide." The output data set will have fewer observations and more variables.
The DATA step copies the values for several variables to arrays of new variables where each element of the array is a specific variable's value in one observation in the BY group.
The DATA step uses FIRST. and LAST. temporary variables to determine when it is processing the first observation and last observation in a BY group. When processing the first observation in a BY group, the DATA step initializes to missing all the elements in the arrays of new variables. A RETAIN statement retains the values of these new variables across observations. The DATA step outputs an observation to the data set only when processing the last observation in the BY group. When processing reaches the last observation in a BY group, all rearrangement of variables within a BY group is completed.
Data set WEIGHT_BMI contains up to four observations for each of four study participants over a four-week period. Each observation records weight, BMI, and session attendance for a subject for one week.
The goal of the DATA step is to rearrange the observations in WEIGHT_BMI so that there is only one observation per study participant. The values for weight, BMI, and session attendance are defined in data set WEIGHT_BMI as the three variables WEIGHT, BMI, and SESSION. The values of these three variables are to be moved to new variables. Variable WEEK tracks the week in the period for which the measurements were obtained for the study participant.
The DATA step defines an array for each of the three variables, WEIGHT, BMI, and SESSION. Their input data set values will be moved to new output data set variables that are defined by these arrays. Because it is known that measurements are recorded a maximum of four times per participant, the ARRAY statements specify four variables with each element's position identifying the week for which a measurement was obtained. The DATA step points to the correct array element by using variable WEEK as the array index value.
Because it is necessary to process WEIGHT_BMI in BY groups that are defined by the values of ID, data set WEIGHT_BMI is sorted by ID prior to the DATA step.
Sort the observations in WEIGHT_BMI by the variable that defines the BY groups. Create data set WTRESULTS. Read the observations in WEIGHT_BMI. Process WEIGHT_BMI in BY groups. Define three arrays to hold the weekly measurements for each participant. Define the elements as serial variables from 1 to 4. Drop WEIGHT, BMI, and SESSION because their values are moved to the elements of the three arrays that were defined earlier. Drop WEEK because the suffix on the new variables identifies the week. Retain the values of the new variables across observations. When processing the first observation in a BY group, initialize all the new variables to missing.
When the value of week is inclusively between the expected values of 1 and 4, copy the current values of WEIGHT, BMI, and SESSION to the array elements that are pointed to by the value of WEEK.
When the value of WEEK is outside its expected range, write an error message. Output an observation only when processing the last observation in a BY group, which is after all rearrangement of the values of input variables WEIGHT, BMI, and SESSION.
proc sort data=weight_bmi; by id; run; data wtresults; set weight_bmi; by id; array wts{*} wt_week1-wt_week4; array bmis{*} bmi_week1-bmi_week4; array sessions{*} $ 3 session_week1-session_week4; drop i week weight bmi session; retain wt_week1-wt_week4 bmi_week1-bmi_week4 session_week1-session_week4; if first.id then do; do i=1 to 4; wts{i}=.; bmis{i}=.; sessions{i}=' '; end; end; if 1 le week le 4 then do; wts{week}=weight; bmis{week}=bmi; sessions{week}=session; end;
else putlog "ERROR: Week value " week "out of range from 1 to 4 for id=" id; if last.id then output; run;
The following program uses PROC TRANSPOSE and the DATA step to produce a data set equivalent to the one that was produced by the DATA step in the main example. It uses one PROC TRANSPOSE step to transpose the two numeric variables, WEIGHT and BMI, and a second PROC TRANSPOSE step to transpose the character variable, SESSION.
If you transpose both character and numeric variables in the same PROC TRANSPOSE step, SAS converts the numeric variable values to character values. By submitting one step for the numeric variables and a second for the character variable, the transposed values of WEIGHT and BMI remain numeric.
PROC TRANSPOSE uses variable WEEK to determine how to name the transposed variables. However, if you had missing data for all observations for one week for one of the variables, a variable would not be created for that missing week. You should understand the limitations of your data and possibly specify ATTRIB, LENGTH, or ARRAY statements in the final DATA step to make sure all elements are defined.
For more information about creating variables with PROC TRANSPOSE, see "Naming the Variables in the Transposed Data Set" in Example 3.12 in the "A Closer Look" section.
The first PROC TRANSPOSE step saves two observations per value of the BY variable ID. One observation contains the transposed values of WEIGHT and the other observation contains the transposed values of BMI. These values are saved in the same four variables. To distinguish between the two observations for each ID, the variable _NAME_ that was created by PROC TRANSPOSE saves the name of the variable that created the transposed variables.
A DATA step match-merges the two data sets that PROC TRANSPOSE produced by ID. The MERGE statement lists data set TRANSWTBMI twice and data set TRANSSESSION once. The two occurrences of TRANSWTBMI in the MERGE statement cause the data set to be merged with itself. Its first placement in the MERGE statement applies a WHERE statement to select observations where the value of the _NAME_ variable is "WEIGHT". Its second placement applies a WHERE statement to select observations where the value of the _NAME_ variable is "BMI".
When you use PROC TRANSPOSE as in this example, you do not need to know how many occurrences of observations there are per BY group. This is an advantage over using the DATA step in the main example. However, when running the DATA step that match-merges the PROC TRANSPOSE data sets, you do need to know the maximum number of measurements made because it is necessary to rename the numeric transposed variables. A way around the renaming requirement is to execute PROC TRANSPOSE once for each variable, ensuring that the PREFIX= values are unique.
Output 8.12b shows intermediate data set TRANSWTBMI. Note the two observations per value of ID, each with a different value for _NAME_.
Output 8.12b TRANSWTBMI Data SetTRANSWTBMI Obs id _NAME_ week1 week2 week3 week4 1 S001 weight 231.0 223.0 . . 2 S001 bmi 29.7 28.6 . . 3 S002 weight 187.0 . 176.0 . 4 S002 bmi 28.4 . 26.8 . 5 S003 weight 154.0 151.0 148.0 142.0 6 S003 bmi 27.3 26.7 26.2 25.2 7 S004 weight 134.0 133.0 . 129.0 8 S004 bmi 25.3 25.1 . 24.4 |
Output 8.12c shows intermediate data set TRANSSESSION.
Output 8.12c TRANSSESSION Data SetTRANSSESSION session_ session_ session_ session_ Obs id week1 week2 week3 week4 1 S001 AM1 AM2 2 S002 AM1 AM2 3 S003 PM1 AM1 PM1 PM1 4 S004 PM3 PM3 PM3 |
Transpose data set WEIGHT_BMI. Save the transposed observations in TRANSWTBMI. Specify a prefix to use in constructing names for transposed variables in the output data set. Create one observation per BY value for each variable that is transposed. Transpose two variables. Specify that the values of WEEK name the transposed variables in TRANSWTBMI.
Transpose the character variable. Save the transposed data in TRANSSESSION. Specify a different PREFIX= value than in the previous PROC TRANSPOSE step.
Create data set WTRESULTS. Merge data set TRANSWTBMI with itself and with TRANSSESSION. On the first placement of TRANSWTBMI, select observations where the uppercased value of _NAME_ is "WEIGHT" and rename the four WEEK variables to reflect the source of the four values.
Select observations where the uppercased value of _NAME_ is "BMI" and rename the four WEEK variables to reflect the source of the four values.
Match-merge the three data sets.
Drop variable _NAME_ since it is no longer needed to identify the source of any of the variable values.
proc transpose data=weight_bmi out=transwtbmi prefix=week; by id; var weight bmi; id week; run; proc transpose data=weight_bmi out=transsession prefix=session_week; by id; var session; id week; run; data wtresults; merge transwtbmi(where=(upcase(_name_)='WEIGHT') rename=(week1=wt_week1 week2=wt_week2 week3=wt_week3 week4=wt_week4)) transwtbmi(where=(upcase(_name_)='BMI') rename=(week1=bmi_week1 week2=bmi_week2 week3=bmi_week3 week4=bmi_week4)) transsession; by id; drop _name_; run;
3.14.248.69