Data Subsetting

Using a Simple Subsetting Statement

Often, it is necessary to perform an analysis on only a subset of the participants in the dataset. For example, you might want to review the mean survey responses provided by just the female participants. A subsetting IF statement can be used to accomplish this, and the general form is presented here:

DATA  new-dataset-name;
   SET  existing-dataset-name;
   IF  comparison;

PROC  name-of-desired-statistical-procedure   DATA=new-dataset-name;
RUN;

The comparison described in the preceding statements generally includes an existing variable and at least one comparison operator. The following statements enable you to calculate the mean survey responses for only the female participants.

      .
      .

5433224 19 107 10 F
640 590
;
RUN;

DATA D2;
   SET D1;

   IF SEX = 'F';

PROC MEANS   DATA=D2;
RUN;

The preceding statements tell SAS to create a new dataset called D2 from D1, keeping a participant’s data only if SEX has a value of F. The program then executes the MEANS procedure on the data that are retained.
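
One way to confirm that the subset contains only the intended participants is to tabulate SEX in the new dataset. The following is a minimal sketch rather than part of the original program; it assumes the D2 dataset created above and uses the FREQ procedure.

/* Tabulate SEX in the subset; only the value F should be listed */
PROC FREQ   DATA=D2;
   TABLES SEX;
RUN;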

Using Comparison Operators

All of the comparison operators previously described can be used in a subsetting IF statement. For example, consider the following:

DATA D2;
   SET D1;

   IF SEX = 'F' AND AGE GE 65;

PROC MEANS   DATA=D2;
RUN;

The preceding statements analyze only data from women who are 65 or older.
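
Comparisons can also be combined with OR and grouped with parentheses. The following sketch is illustrative only: the dataset name D3 and the age cutoffs are assumptions chosen for the example. Note that SAS treats a missing numeric value as smaller than any number, so a condition such as AGE LT 30 is also true for participants whose AGE is missing; the AGE NE . comparison guards against this.

DATA D3;
   SET D1;

   /* Keep women younger than 30 or aged 65 and older,    */
   /* excluding anyone whose AGE is missing               */
   IF SEX = 'F' AND AGE NE . AND (AGE LT 30 OR AGE GE 65);

PROC MEANS   DATA=D3;
RUN;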

Eliminating Observations with Missing Data for Some Variables

One of the most common difficulties encountered by researchers in the social sciences is the problem of missing data. Briefly, missing data involves not having scores for all variables for all participants in a dataset. This section discusses the problem of missing data, and shows how a subsetting IF statement can be used to deal with it.

Assume that you administer your volunteerism survey to 100 participants, and you use their scores to calculate a single volunteerism score for each. You also obtain a number of additional variables regarding the participants. The SAS names for the study’s variables and their descriptions are as follows:

VOLUNTEER: participant scores on the volunteerism questionnaire, where higher scores reveal greater likelihood to engage in unpaid prosocial activities;
GREVERBAL: participant scores on the verbal subtest of the Graduate Record Examinations;
GREMATH: participant scores on the math subtest of the Graduate Record Examinations;
IQ: participant scores on a standard intelligence test (intelligence quotient).

Assume further that you obtained scores for VOLUNTEER, GREVERBAL, and GREMATH for all 100 participants. Due to a recordkeeping error, however, you were able to obtain IQ scores for only 75 participants.

You now want to analyze your data using a procedure called multiple regression. (This procedure is covered in a later chapter; you do not need to understand multiple regression to understand the points to be made here.) In Analysis #1, VOLUNTEER is the criterion variable, and GREVERBAL and GREMATH are the predictor variables. The multiple regression equation for Analysis #1 is represented in the following PROC REG statement:

Analysis #1:

PROC REG   DATA=D1;
   MODEL VOLUNTEER  =  GREVERBAL   GREMATH ;
RUN;

When you review the analysis results, note that the analysis is based on 100 participants. This makes sense because you had complete data on all of the variables included in this analysis.

In Analysis #2, VOLUNTEER is again the criterion variable; this time, however, the predictor variables include GREVERBAL and GREMATH as well as IQ. The equation for Analysis #2 is as follows:

Analysis #2:

PROC REG   DATA=D1;
   MODEL VOLUNTEER  =  GREVERBAL   GREMATH   IQ;
RUN;

When you review the results of Analysis #2, you see that you have encountered a problem. The SAS output indicates that the analysis is based on only 75 participants. At first you might not understand this because you know that there are 100 participants in the dataset. But then you remember that you had values for the IQ variable for only 75 participants. The REG procedure (and many other SAS procedures) includes in the analysis only those participants who have complete data for all of the variables analyzed with that procedure. For Analysis #2, this means that any participant with missing data for IQ is eliminated from the analysis. Twenty-five participants had missing data for IQ and were therefore eliminated.

Why were these 25 participants not eliminated from Analysis #1? Because that analysis did not involve the IQ variable. It involved only VOLUNTEER, GREVERBAL, and GREMATH; and all 100 participants had complete data for each of these three variables.
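
Before running the regressions, it can help to see how many valid and missing values each variable has. The following is a sketch, not part of the original program: it asks the MEANS procedure for the N and NMISS statistics on the four variables in D1. Under the scenario described here, it should report 100 valid values for VOLUNTEER, GREVERBAL, and GREMATH, and 75 valid and 25 missing values for IQ.

/* N = number of nonmissing values; NMISS = number of missing values */
PROC MEANS   DATA=D1   N   NMISS;
   VAR VOLUNTEER   GREVERBAL   GREMATH   IQ;
RUN;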

In a situation such as this, you have a number of options with respect to how you might perform these analyses and summarize the results. One option is to retain the results described previously. You could report that you performed one analysis on all 100 participants and a second analysis on just the 75 who had complete data for the IQ variable.

This approach might leave you open to criticism, however. The beginning of your research paper probably reported demographic characteristics for all 100 participants (e.g., how many were female, mean age). However, you might not have a section providing demographics for the subgroup of 75. This might lead readers to wonder if the subgroup differed in some important way from the aggregate group.

There are statistical reasons why this approach might cause problems as well. For example, you might wish to test the significance of the difference between the squared multiple correlation (R²) value obtained from Analysis #1 and the R² value obtained from Analysis #2. (This test is described in Chapter 14, “Multiple Regression.”) When performing this test, it is important that both R² values be based on exactly the same participants in both analyses. This is obviously not the case in your study as 25 of the participants used in Analysis #1 were not used in Analysis #2.

In situations such as this, you are usually better advised to ensure that every analysis is performed on exactly the same sample. This means that, in general, any participant who has missing data for variables to be included in any (reported) analysis should be deleted before the analyses are performed. In this instance, therefore, it is best to see to it that both Analysis #1 and Analysis #2 are performed on only those 75 participants who had complete data for all four variables (i.e., VOLUNTEER, GREVERBAL, GREMATH, and IQ). Fortunately, this can easily be done using a subsetting IF statement.

Recall that with SAS, a missing value is represented with a period (“.”). You can take advantage of this to eliminate any participant with missing data for any analyzed variable. For example, consider the following subsetting IF statement:

DATA D2;
   SET D1;

   IF VOLUNTEER NE . AND GREVERBAL NE . AND
      GREMATH   NE . AND IQ   NE . ;

The preceding statements tell the system to do the following:

  1. Create a new dataset named D2, and make it an exact copy of D1.

  2. Retain a participant in this new dataset only if (for that participant):

    • VOLUNTEER is not equal to missing;

    • GREVERBAL is not equal to missing;

    • GREMATH is not equal to missing;

    • IQ is not equal to missing.

In other words, the system creates a new dataset named D2; this new dataset contains only the 75 participants who have complete data for all four variables of interest. You can now specify DATA=D2 in all SAS procedures, with the result that all analyses will be performed on exactly the same 75 participants.
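
When many variables are involved, the chain of NE comparisons becomes long. One possible alternative, offered here as a sketch and not as the approach used in this chapter, is the NMISS function, which returns the number of its numeric arguments that are missing. It applies here on the assumption that all four variables are numeric; character variables would require a different function, such as CMISS.

DATA D2;
   SET D1;

   /* Keep a participant only if none of the four variables is missing */
   IF NMISS(VOLUNTEER, GREVERBAL, GREMATH, IQ) = 0;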

The following SAS program shows where these statements should be placed:

      .
      .
      .

5433224 19 107 10 F
640 590
;
RUN;

DATA D2;
   SET D1;

   IF VOLUNTEER NE . AND GREVERBAL NE . AND
      GREMATH   NE . AND IQ   NE . ;

PROC REG   DATA=D2;
   MODEL VOLUNTEER =  GREVERBAL  GREMATH ;
RUN;

PROC REG   DATA=D2;
   MODEL VOLUNTEER =  GREVERBAL  GREMATH  IQ ;
RUN;

As evident above, the subsetting IF statement must appear in the program before the procedures that request the modified dataset (dataset D2, in this case).

How should I enter missing data?

If you are entering data and come to a participant with a missing value for some variable, you do not need to record a “.” to represent the missing data. So long as your data are being input using the DATALINES statement and the conventions discussed here, it is acceptable to simply leave that column (or those columns) blank by hitting the space bar on your keyboard. SAS will internally assign that participant a missing data value (“.”) for that variable. In some cases, however, it might be useful to enter a “.” for variables with missing data, as this can make it easier to keep your place when entering information.
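
As a small illustration, the following sketch reads three made-up participants using column input, which is an assumption about the input style; the dataset name DEMO, the column positions, and the data values are hypothetical. The second participant’s IQ columns are left blank and the third participant’s IQ is entered as a period; SAS should store both as missing.

/* VOLUNTEER is read from columns 1-2 and IQ from columns 4-6 */
DATA DEMO;
   INPUT VOLUNTEER 1-2   IQ 4-6;
DATALINES;
34 105
28
41 .
;
RUN;

/* Missing IQ values are displayed as a period in the listing */
PROC PRINT   DATA=DEMO;
RUN;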


When using a subsetting IF statement to eliminate participants with missing data, exactly which variables should be included in that statement? In most cases, it should be those variables, and only those variables, that are ultimately discussed. This means that you might not know exactly which variables to include until you actually begin analyzing the data. For example, imagine that you conduct your study and obtain data for the following number of participants for each of the following variables:

Variable      Number of Participants with Valid Data for This Variable
VOLUNTEER     100
GREVERBAL     100
GREMATH       100
IQ            75
AGE           10

As before, you obtained complete data for all 100 participants for VOLUNTEER, GREVERBAL and GREMATH, and you obtained data for 75 participants on IQ. But notice the last variable. You obtained information regarding age for only 10 participants. What would happen if you included the variable AGE in the subsetting IF statement, as shown here?

IF VOLUNTEER NE . AND GREVERBAL NE . AND
   GREMATH   NE . AND IQ   NE . AND  AGE  NE  . ;

This IF statement causes the system to eliminate from the sample anyone who does not have complete data for all five variables. Since only 10 participants have values for the AGE variable, you know that the resulting dataset includes just these 10 participants. However, this sample is too small for virtually all statistical procedures. At this point, you have to decide whether to gather more data or forget about doing any analyses with the AGE variable.
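
Before committing to a subsetting IF statement, you can check how many participants each candidate set of variables would retain. The following sketch is one way to do this under the assumption that all five variables are numeric; the dataset name CHECK and the flag variables COMPLETE4 and COMPLETE5 are hypothetical names introduced for illustration.

DATA CHECK;
   SET D1;

   /* A comparison evaluates to 1 when true and 0 when false */
   COMPLETE4 = (NMISS(VOLUNTEER, GREVERBAL, GREMATH, IQ) = 0);
   COMPLETE5 = (NMISS(VOLUNTEER, GREVERBAL, GREMATH, IQ, AGE) = 0);

/* The sum of each flag is the number of participants who would be retained */
PROC MEANS   DATA=CHECK   SUM;
   VAR COMPLETE4   COMPLETE5;
RUN;

Comparing the two sums shows how much of the sample would be lost by adding AGE to the subsetting IF statement.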

In summary, one approach for identifying those variables to be included in the subsetting IF statement is to do the following:

  • perform some initial analyses;

  • decide which variables will be included in the final analyses (for your study);

  • include all of those variables in the subsetting IF statement;

  • perform all analyses on this reduced dataset so that all analyses reported in the study are performed on exactly the same sample.

Of course, there are circumstances in which it is neither necessary nor desirable that all analyses be performed on exactly the same group of participants. The purpose of the research, along with other considerations, should determine when this is appropriate.
