The use of data-manipulation and data-subsetting statements is illustrated here with reference to the fictitious study described in the preceding chapter. In that chapter, you were asked to imagine that you had developed a 7-item questionnaire dealing with volunteerism, as shown in the following example.
Volunteerism Survey Please indicate the extent to which you agree or |
Assume that you administer this survey to a number of participants and you also obtain information concerning sex, IQ scores, GRE verbal test scores, and GRE math test scores for each participant. Once the data are entered, you might want to write a SAS program that includes some data-manipulation or data-subsetting statements to transform the raw data. But where within the SAS program should these statements appear?
In general, these statements should appear only within the DATA step. Remember that the DATA step begins with the DATA statement and ends as soon as SAS encounters a step boundary, such as a RUN statement, another DATA statement, or a procedure (PROC) statement. This means that if you prepare the DATA step, end it with a procedure, and then place a manipulation or subsetting statement immediately after the procedure, you will receive an error.
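The placement problem described above can be sketched with a small program. The two-variable layout and the variable IQ_DEV are hypothetical simplifications, not part of the volunteerism program. The misplaced statement is left commented out so that the sketch runs; if it were active, SAS would typically log an error stating that the statement is not valid or is used out of proper order.

```sas
/* Hypothetical, simplified layout: two numeric variables read with list input */
DATA D1;
   INPUT AGE IQ;
DATALINES;
22 98
20 105
19 107
;
RUN;

/* By this point the DATA step that created D1 has ended */
PROC MEANS DATA=D1;
RUN;

/* Misplaced data-manipulation statement (hypothetical variable IQ_DEV). */
/* It sits outside any DATA step, so SAS would reject it if uncommented. */
/* IQ_DEV = IQ - 100; */
```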
To avoid this error (and keep things simple), place your data-manipulation and data-subsetting statements in one of two locations within a SAS program:
immediately following the INPUT statement;
or immediately following the creation of a new dataset.
The first of the two preceding guidelines indicates that the statements may be placed immediately following the INPUT statement. This guideline is illustrated again by referring to the volunteerism study. Assume that you prepare the following SAS program to analyze data obtained in your study. In the following program, lines 11 and 12 indicate where you can place data-manipulation or data-subsetting statements in that program. (To conserve space, only some of the data lines are reproduced in the program.)
 1     DATA D1;
 2        INPUT   #1   @1   Q1-Q7       1.
 3                     @9   AGE         2.
 4                     @12  IQ          3.
 5                     @16  NUMBER      2.
 6                     @19  SEX        $1.
 7                #2   @1   GREVERBAL   3.
 8                     @5   GREMATH     3.   ;
 9
10
11        place data-manipulation statements and
12        data-subsetting statements here
13
14     DATALINES;
15     2234243 22  98  1 M
16     520 490
17     3424325 20 105  2 M
18     440 410
19     .
20     .
21
22     5433224 19 107 10 F
23     640 590
24     ;
25     RUN;
26
27     PROC MEANS DATA=D1;
28     RUN;
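To make the placeholder concrete, here is a minimal sketch with one manipulation statement and one subsetting statement placed after the INPUT statement and before DATALINES. The variable GRETOTAL and the age cutoff are hypothetical, chosen only for illustration. The Q1-Q7 list is written in the parenthesized form, (Q1-Q7) (1.), which is one documented way to apply an informat to a variable list. Only the three participants shown in the listing are included.

```sas
DATA D1;
   INPUT   #1   @1   (Q1-Q7)     (1.)
                @9   AGE         2.
                @12  IQ          3.
                @16  NUMBER      2.
                @19  SEX        $1.
           #2   @1   GREVERBAL   3.
                @5   GREMATH     3.   ;

   /* Data manipulation (hypothetical): combined GRE score */
   GRETOTAL = GREVERBAL + GREMATH;

   /* Data subsetting (hypothetical): keep participants aged 20 or older */
   IF AGE >= 20;

DATALINES;
2234243 22  98  1 M
520 490
3424325 20 105  2 M
440 410
5433224 19 107 10 F
640 590
;
RUN;

PROC MEANS DATA=D1;
RUN;
```

Because both statements sit inside the DATA step, PROC MEANS sees the new variable GRETOTAL and only the observations that pass the subsetting IF.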
The second guideline provides another option: data-manipulation or data-subsetting statements can also be placed immediately following program statements that create a new dataset. A new dataset can be created at virtually any point in a SAS program (even after procedures are requested).
At times, you might want to create a new dataset so that, initially, it is identical to an existing dataset (perhaps the one created with a preceding INPUT statement). If data-manipulation or data-subsetting statements follow the creation of this new dataset, the new dataset reflects the modifications requested by those statements.
To create a new dataset that is identical to an existing dataset, the general form is
   DATA new-dataset-name;
      SET existing-dataset-name;
For example, to create a dataset named D2 that is identical to an existing dataset named D1, use the following statements:
   DATA D2;
      SET D1;
These lines tell SAS to create a new dataset called D2 and to make this new dataset identical to D1. Now that a new dataset has been created, you can write as many manipulation and subsetting statements as you like. However, once you write a procedure, that effectively ends the DATA step and you cannot write any more manipulation or subsetting statements beyond that point (unless you create another dataset later in the program).
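The point about creating another dataset later in the program can be sketched as follows. The two-variable layout and the indicator IQ_HIGH are hypothetical simplifications. The sketch shows that a procedure does not prevent further data manipulation, provided a new DATA step is opened first.

```sas
/* Hypothetical, simplified layout: two numeric variables read with list input */
DATA D1;
   INPUT AGE IQ;
DATALINES;
22 98
20 105
19 107
;
RUN;

/* A procedure runs; the DATA step for D1 is already closed */
PROC MEANS DATA=D1;
RUN;

/* Opening a new DATA step makes manipulation statements valid again */
DATA D2;
   SET D1;
   /* Hypothetical 0/1 indicator: 1 when IQ is 100 or higher, 0 otherwise */
   IQ_HIGH = (IQ >= 100);
RUN;

PROC FREQ DATA=D2;
   TABLES IQ_HIGH;
RUN;
```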
The following is an example of how you might write your program so that the manipulation and subsetting statements follow the creation of the new dataset:
 1     DATA D1;
 2        INPUT   #1   @1   Q1-Q7       1.
 3                     @9   AGE         2.
 4                     @12  IQ          3.
 5                     @16  NUMBER      2.
 6                     @19  SEX        $1.
 7                #2   @1   GREVERBAL   3.
 8                     @5   GREMATH     3.   ;
 9
10     DATALINES;
11     2234243 22  98  1 M
12     520 490
13     3424325 20 105  2 M
14     440 410
15     .
16     .
17
18     5433224 19 107 10 F
19     640 590
20     ;
21     RUN;
22
23     DATA D2;
24        SET D1;
25
26        place data manipulation statements and
27        data subsetting statements here
28
29     PROC MEANS DATA=D2;
30     RUN;
The preceding program creates two datasets: D1 contains the original data, and D2 is identical to D1 except for the modifications requested by the data-manipulation and data-subsetting statements.
Notice that the MEANS procedure in line 29 requests the computation of some simple descriptive statistics. It is clear that these statistics are computed on the data from dataset D2 because DATA=D2 appears in the PROC MEANS statement. If the statement had instead specified DATA=D1, the analyses would have been performed on the original dataset.
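The difference between analyzing D1 and D2 can be seen by running the procedure on each. In this sketch, the variable GRETOTAL and the restriction to male participants are hypothetical, the Q1-Q7 list uses the parenthesized form (Q1-Q7) (1.), and only the three participants shown in the listing are included.

```sas
DATA D1;
   INPUT   #1   @1   (Q1-Q7)     (1.)
                @9   AGE         2.
                @12  IQ          3.
                @16  NUMBER      2.
                @19  SEX        $1.
           #2   @1   GREVERBAL   3.
                @5   GREMATH     3.   ;
DATALINES;
2234243 22  98  1 M
520 490
3424325 20 105  2 M
440 410
5433224 19 107 10 F
640 590
;
RUN;

DATA D2;
   SET D1;
   /* Hypothetical manipulation: combined GRE score */
   GRETOTAL = GREVERBAL + GREMATH;
   /* Hypothetical subsetting: keep male participants only */
   IF SEX = 'M';
RUN;

/* Statistics on the original data: all participants, no GRETOTAL */
PROC MEANS DATA=D1;
RUN;

/* Statistics on the modified data: male participants, including GRETOTAL */
PROC MEANS DATA=D2;
RUN;
```

Keeping D1 untouched and placing the modifications in D2 means the original data remain available for later analyses.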
The preceding program illustrates the use of the DATALINES statement rather than the INFILE statement. The guidelines regarding the placement of data-modifying statements are the same regardless of which approach is followed: the data-manipulation or data-subsetting statements should immediately follow either the INPUT statement or the creation of a new dataset. When a program is written using the INFILE statement rather than the DATALINES statement, data-manipulation and data-subsetting statements should appear after the INPUT statement but before the first procedure. For example, if your data are stored in an external file called VOLUNTEER.DAT, you can write the following program. (Notice where the manipulation and subsetting statements are placed.)
 1     DATA D1;
 2        INFILE 'A:/VOLUNTEER.DAT';
 3        INPUT   #1   @1   Q1-Q7       1.
 4                     @9   AGE         2.
 5                     @12  IQ          3.
 6                     @16  NUMBER      2.
 7                     @19  SEX        $1.
 8                #2   @1   GREVERBAL   3.
 9                     @5   GREMATH     3.   ;
10
11        place data manipulation statements and
12        data subsetting statements here
13
14     PROC MEANS DATA=D1;
15     RUN;
In the preceding program, the data-modifying statements again come immediately after the INPUT statement but before the first procedure, consistent with earlier recommendations.
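The same placement can be tried without an actual external file. In the sketch below, a temporary fileref named VOLDATA (an assumption made for illustration) stands in for VOLUNTEER.DAT, and a preliminary DATA _NULL_ step writes three participants' records to it. The variable GRETOTAL and the restriction to female participants are also hypothetical, and the Q1-Q7 list uses the parenthesized form (Q1-Q7) (1.).

```sas
/* Hypothetical stand-in for the external data file */
FILENAME VOLDATA TEMP;

/* Write two lines per participant to the temporary file */
DATA _NULL_;
   FILE VOLDATA;
   PUT '2234243 22  98  1 M';
   PUT '520 490';
   PUT '3424325 20 105  2 M';
   PUT '440 410';
   PUT '5433224 19 107 10 F';
   PUT '640 590';
RUN;

DATA D1;
   INFILE VOLDATA;
   INPUT   #1   @1   (Q1-Q7)     (1.)
                @9   AGE         2.
                @12  IQ          3.
                @16  NUMBER      2.
                @19  SEX        $1.
           #2   @1   GREVERBAL   3.
                @5   GREMATH     3.   ;

   /* Hypothetical manipulation: combined GRE score */
   GRETOTAL = GREVERBAL + GREMATH;

   /* Hypothetical subsetting: keep female participants only */
   IF SEX = 'F';
RUN;

PROC MEANS DATA=D1;
RUN;
```

With a real external file, the INFILE statement would name the file's path in quotation marks instead of the fileref; the placement of the manipulation and subsetting statements stays the same.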