Additional Guidelines

Inputting String Variables with the Same Prefix and Different Numeric Suffixes

In this section, prefix refers to the first part of a variable’s name, while suffix refers to the last part. For example, consider the variables Q1, Q2, Q3, Q4, Q5, Q6, and Q7. These are multiple variables with the same prefix (Q) and different numeric suffixes (i.e., 1, 2, 3, 4, 5, 6, and 7). Variables such as these are sometimes referred to as string variables; the term here describes the numbered naming pattern and should not be confused with character variables, which are discussed later in this section. Earlier, this chapter provided one way of inputting these variables; the original INPUT statement is repeated here:

INPUT   #1   @1   Q1      1.
             @2   Q2      1.
             @3   Q3      1.
             @4   Q4      1.
             @5   Q5      1.
             @6   Q6      1.
             @7   Q7      1.
             @9   AGE     2.
             @12  IQ      3.
             @16  NUMBER  2.  ;

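However, with string variables named in this way, there is an easier way of writing the INPUT statement. You could have written it this way, enclosing the variable list and its column width command in parentheses so that the width applies to every variable in the list:

INPUT   #1   @1   (Q1-Q7) (1.)
             @9   AGE     2.
             @12  IQ      3.
             @16  NUMBER  2.  ;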

The first line of this INPUT statement gives SAS the following directions: “Go to line #1. Once there, go to column 1. Beginning in column 1, you find variables Q1 through Q7. Each of these numeric variables is one column wide.” With this second INPUT statement, SAS reads the data in exactly the same way that it would have using the original INPUT statement.
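One way to check that the longer and shorter INPUT statements read the data identically is to build a dataset with each and compare the results. The following is a minimal sketch rather than part of the original program. The dataset names D_LONG and D_SHORT are hypothetical, and the three data lines are the values for participants 1 through 3 from this chapter. The abbreviated statement is written in the grouped form, (variable-list) (informat-list), which the SAS documentation describes for applying one informat to every variable in a list.

```sas
/* Version 1: one entry per variable (hypothetical dataset name D_LONG). */
DATA D_LONG;
   INPUT   #1   @1   Q1      1.
                @2   Q2      1.
                @3   Q3      1.
                @4   Q4      1.
                @5   Q5      1.
                @6   Q6      1.
                @7   Q7      1.
                @9   AGE     2.
                @12  IQ      3.
                @16  NUMBER  2.  ;
DATALINES;
2234243 22  98  1
3424325 20 105  2
3242424 32  90  3
;
RUN;

/* Version 2: Q1 through Q7 read as a numbered list (hypothetical name D_SHORT). */
DATA D_SHORT;
   INPUT   #1   @1   (Q1-Q7) (1.)
                @9   AGE     2.
                @12  IQ      3.
                @16  NUMBER  2.  ;
DATALINES;
2234243 22  98  1
3424325 20 105  2
3242424 32  90  3
;
RUN;

/* Compare the two datasets variable by variable. */
PROC COMPARE   BASE=D_LONG   COMPARE=D_SHORT;
RUN;
```

If the two INPUT statements are equivalent, PROC COMPARE should report no unequal values.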

As an additional example, imagine you had a 50-item survey instead of a 7-item survey. You called your variables Q1, Q2, Q3, and so forth. You entered your data in the following way:

Column    Variable Name    Explanation
1-50      Q1-Q50           Responses to survey questions 1-50
51        blank
52-53     AGE              Participant’s age in years
54        blank
55-57     IQ               Participant’s IQ score
58        blank
59-60     NUMBER           Participant’s number

You could use the following INPUT to read these data:

INPUT   #1   @1   (Q1-Q50) (1.)
             @52  AGE     2.
             @55  IQ      3.
             @59  NUMBER  2.  ;
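As a rough check on a layout like this one, the sketch below reads a single made-up data line and prints a few of the resulting variables. The data values and the dataset name D50 are fictitious and serve only to illustrate the column positions. The list Q1-Q50 is written in the grouped form, (variable-list) (informat-list), described in the SAS documentation.

```sas
/* Hypothetical dataset D50: one fictitious participant, 50 survey items. */
DATA D50;
   INPUT   #1   @1   (Q1-Q50) (1.)
                @52  AGE      2.
                @55  IQ       3.
                @59  NUMBER   2.  ;
DATALINES;
23424234242342423424234242342423424234242342423424 22  98  1
;
RUN;

/* Print the first and last survey items plus the remaining variables */
/* to confirm that each one was read from the intended columns.       */
PROC PRINT   DATA=D50;
   VAR Q1 Q50 AGE IQ NUMBER;
RUN;
```

Printing only Q1 and Q50 keeps the listing narrow while still confirming both ends of the numbered list.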

Inputting Character Variables

This text deals with two types of basic variables: numeric and character variables. A numeric variable consists entirely of numbers, and it contains no letters. For example, all of your variables from the preceding dataset were numeric variables: Q1 could assume only the values of 1, 2, 3, 4, or 5. Similarly, AGE could take on only numeric values. On the other hand, a character variable can consist of either numbers, alphabetic characters (letters), or both.

Remember that responses to the seven questions of the Volunteerism Survey are entered in columns 1 to 7 in this dataset, AGE is entered in columns 9 to 10, IQ is entered in columns 12 to 14, and participant number is in columns 16 to 17.

You could include the sex of each participant and create a new variable called SEX. If a participant is male, SEX would assume the value “M.” If a participant is female, SEX would assume the value “F.” In the following, the new SEX variable appears in column 19 (the last column):

2234243 22  98  1 M
3424325 20 105  2 M
3242424 32  90  3 F
3242323  9 119  4 F
3232143  8 101  5 F
3242242 24 104  6 M
4343525 16 110  7 F
3232324 12  95  8 M
1322424 41  85  9 M
5433224 19 107 10 F

You can see that participants 1 and 2 are males whereas participants 3, 4, and 5 are females, and so forth.

You must use a special command within the INPUT statement to input a character variable. Specifically, in the column width region for the character variable, precede the column width with a dollar sign ($). For the preceding dataset, you would use the following INPUT statement. Note the dollar sign in the column width region for the SEX variable:


INPUT   #1   @1   (Q1-Q7)  (1.)
             @9   AGE      2.
             @12  IQ       3.
             @16  NUMBER   2.
             @19  SEX     $1.  ;
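To see the character variable in use, the INPUT statement can be placed in a complete program. This is a minimal sketch under a few assumptions: the dataset name D1 follows the other programs in this chapter, the data lines are the ten participants listed above, and Q1 through Q7 are read with the grouped form, (variable-list) (informat-list), described in the SAS documentation.

```sas
DATA D1;
   INPUT   #1   @1   (Q1-Q7)  (1.)
                @9   AGE      2.
                @12  IQ       3.
                @16  NUMBER   2.
                @19  SEX     $1.  ;   /* dollar sign marks a character variable */
DATALINES;
2234243 22  98  1 M
3424325 20 105  2 M
3242424 32  90  3 F
3242323  9 119  4 F
3232143  8 101  5 F
3242242 24 104  6 M
4343525 16 110  7 F
3232324 12  95  8 M
1322424 41  85  9 M
5433224 19 107 10 F
;
RUN;

/* Count how many participants fall into each category of SEX. */
PROC FREQ   DATA=D1;
   TABLES SEX;
RUN;
```

With these data, the frequency table should show five participants coded F and five coded M.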

Using Multiple Lines of Data for Each Participant

Often, a researcher obtains so much data from each participant that it is impractical to enter all data on just one line. For example, imagine that you administer a 100-item questionnaire to a sample, and that you plan to enter responses to question 1 in column 1, responses to question 2 in column 2, and so forth. Following this process, you are likely to run into difficulty because you will need 100 columns to enter all responses from a given participant. Many computer monitors, however, allow no more than 79 columns. If you continue entering data past column 79, your data are likely to wrap around or appear in some way that makes it difficult to verify that you are entering a given value in the correct column.

In situations in which you require a large number of columns for your data, it is often best to divide each participant’s data so that they appear on more than one line. (In other words, it is best to have multiple lines of data for each participant.) To do this, it is necessary to modify your INPUT statement.

To illustrate, assume that you obtained two additional variables for each participant in your study: their GRE verbal test scores and GRE math test scores. You decide to enter your data so that there are two lines of data for each participant. On line 1 for a given participant, you enter Q1 through Q7, AGE, IQ, NUMBER, and SEX (as above). On line 2 for that participant, you enter GREVERBAL (the GRE verbal test score) in columns 1 through 3, and you enter GREMATH (the GRE math test score) in columns 5 through 7:

2234243 22  98  1 M
520 490
3424325 20 105  2 M
440 410
3242424 32  90  3 F
390 420
3242323  9 119  4 F

3232143  8 101  5 F

3242242 24 104  6 M
330 340
4343525 16 110  7 F

3232324 12  95  8 M

1322424 41  85  9 M
380 410
5433224 19 107 10 F
640 590

The GREVERBAL score for participant 1 is 520, and the GREMATH score is 490.

When a participant has no data for a variable that would normally appear on a given line, your dataset must still include a line for that participant, even if it is blank. For example, participant 4 is only 9 years old, so she has not yet taken the GRE and does not have GRE scores. Nonetheless, you still need to include a second line for participant 4 even though it is blank. Notice that blank lines also appear for participants 5, 7, and 8, who are also too young to take the GRE.


Be warned that, with some text editors, it is necessary to create these blank lines by pressing the ENTER key, thus creating a hard carriage return. With these editors, using the directional arrows on the keypad might not create the necessary hard return. Problems in reading the data are also likely to occur if tabs are used; it is generally best to avoid the use of tabs or other hidden codes when entering data.

The following coding guide tells us where each variable appears. Notice that this guide indicates the line on which a variable is located, as well as the column where it is located.

Line    Column    Variable Name    Explanation
1       1-7       Q1-Q7            Survey questions 1-7
        8         blank
        9-10      AGE              Participant’s age in years
        11        blank
        12-14     IQ               Participant’s IQ score
        15        blank
        16-17     NUMBER           Participant’s number
        18        blank
        19        SEX              Participant’s sex
2       1-3       GREVERBAL        GRE-Verbal test score
        5-7       GREMATH          GRE-Math test score

When there are multiple lines of data for each participant, the INPUT statement must indicate on which line a given variable is located. This is done with the line number command (#) that was introduced earlier. You could use the following INPUT statement to read the preceding dataset:


INPUT   #1   @1   (Q1-Q7)      (1.)
             @9   AGE           2.
             @12  IQ            3.
             @16  NUMBER        2.
             @19  SEX          $1.
        #2   @1   GREVERBAL     3.
             @5   GREMATH       3.  ;

This INPUT statement tells SAS to begin at line #1 for a given participant, to go to column 1, and find variables Q1 through Q7. It continues to tell SAS where it will find each of the other variables located on line #1. After reading the SEX variable, SAS is told to move to line #2. There, it is to go to column 1 and find the variable GREVERBAL that is three columns wide. The variable GREMATH begins in column 5 and is also three columns wide. In theory, it is possible to have any number of lines of data for each participant so long as you use the line number command correctly.
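The two-line layout can be tried out in a complete program. The following is a sketch under stated assumptions: the dataset name D1 follows the other programs in this chapter, the data lines are those shown above (including the blank second lines for participants 4, 5, 7, and 8), and Q1 through Q7 are read with the grouped form, (variable-list) (informat-list), described in the SAS documentation.

```sas
DATA D1;
   INPUT   #1   @1   (Q1-Q7)      (1.)
                @9   AGE           2.
                @12  IQ            3.
                @16  NUMBER        2.
                @19  SEX          $1.
           #2   @1   GREVERBAL     3.    /* second line for each participant */
                @5   GREMATH       3.  ;
DATALINES;
2234243 22  98  1 M
520 490
3424325 20 105  2 M
440 410
3242424 32  90  3 F
390 420
3242323  9 119  4 F

3232143  8 101  5 F

3242242 24 104  6 M
330 340
4343525 16 110  7 F

3232324 12  95  8 M

1322424 41  85  9 M
380 410
5433224 19 107 10 F
640 590
;
RUN;

/* List each participant with the GRE scores read from line 2. */
PROC PRINT   DATA=D1;
   VAR NUMBER AGE GREVERBAL GREMATH;
RUN;
```

If the blank lines are read as intended, the listing should contain ten observations, with missing values (shown as periods) for GREVERBAL and GREMATH for participants 4, 5, 7, and 8.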

Creating Decimal Places for Numeric Variables

Assume that you have obtained the high school grade point averages (GPAs) for a sample of five participants. You could create a SAS dataset containing these GPAs using the following program:

DATA D1;
   INPUT   #1   @1   GPA   4.  ;
DATALINES;
3.56
2.20
2.11
3.25
4.00
;
RUN;

PROC MEANS   DATA=D1;
RUN;

The INPUT statement tells SAS to go to line 1, column 1, to find a variable called GPA that is four columns wide. Within the dataset itself, values of GPA were entered using a period as a decimal point, with two digits to the right of the decimal point.

This same dataset could have been entered in a slightly different way. For example, what if the data had been entered without a decimal point, as follows?

356
220
211
325
400

It is still possible to have SAS insert a decimal point where it belongs, in front of the last two digits in each number. You do this in the column width command of the INPUT statement. With this column width command, you indicate how many columns the variable occupies, enter a period, and then indicate how many columns of data should appear to the right of the decimal place. In the present example, the GPA variable is three columns wide and two columns of data should appear to the right of the decimal place. So you would modify the SAS program in the following way. Notice the column width command:

DATA D1;
   INPUT   #1   @1   GPA   3.2  ;
DATALINES;
356
220
211
325
400
;
RUN;

PROC MEANS   DATA=D1;
RUN;
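A related point is worth a quick experiment. According to the SAS documentation for the w.d informat, the number of decimal places given in the column width command is applied only when the data value itself contains no decimal point. The sketch below mixes both styles of entry in one fictitious dataset; the dataset name D_GPA and the choice of a four-column width are assumptions made for illustration.

```sas
/* Hypothetical dataset D_GPA: the same GPAs entered with and without */
/* an explicit decimal point, all read with a 4.2 column width.       */
DATA D_GPA;
   INPUT   #1   @1   GPA   4.2  ;
DATALINES;
3.56
356
2.20
220
;
RUN;

PROC PRINT   DATA=D_GPA;
RUN;
```

Under the documented behavior, the first two lines should both be stored as 3.56 and the last two as 2.20, because the explicit decimal point in the data takes precedence over the decimal specification in the column width command.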

Inputting “Check All That Apply” Questions as Multiple Variables

A “check all that apply” question is a special type of questionnaire item that is often used in social science research. These items generate data that must be input in a special way. The following is an example of a “check all that apply” item that could have appeared on your volunteerism survey:

Below is a list of activities.  Please place a check mark next to
any activity in which you have engaged in the past six months.

Check here
-----

_____ 1. Did volunteer work at a shelter for the homeless.
_____ 2. Did volunteer work at a shelter for battered women.
_____ 3. Did volunteer work at a hospital or hospice.
_____ 4. Did volunteer work for any other community agency or
         organization.
_____ 5. Donated money to the United Way.
_____ 6. Donated money to a congregation-sponsored charity.
_____ 7. Donated money to any other charitable cause.


An inexperienced researcher might think of the preceding as a single question with seven possible responses and try to enter the data in a single column in the dataset (e.g., in column 1). But this would lead to big problems. What would you enter in column 1 if a participant checked more than one category?

One way around this difficulty is to treat the seven possible responses as seven different questions. When entered, each of these questions is treated as a separate variable and appears in a separate column. For example, whether or not a participant checked activity 1 can be coded in column 1, whether the participant checked activity 2 can be coded in column 2, and so forth.

Researchers can code these variables by placing any values they like in these columns, but we recommend that you enter a two (2) if the participant did not check that activity and a one (1) if the participant did check it. Why code the variables using 1s and 2s? The reason is that this makes it easier to perform some types of analyses that you might later want to perform. A variable that can assume only two values is called a dichotomous variable, and the process of assigning numeric codes to the two categories of a dichotomous variable is known as dummy coding. Dummy coding is conventionally done with 0s and 1s; this text uses 1s and 2s instead, because we recommend that you do not use zeros, to avoid the possibility that they might be confused with missing values.

Once a dichotomous variable is dummy coded, it can be analyzed with a variety of SAS procedures. One example is PROC REG, which performs multiple regression, a procedure that allows you to assess the nature of the relationship between a single criterion variable and multiple predictor variables. If a dichotomous variable has been dummy coded properly, it can be used as a predictor variable in a multiple regression analysis. For these and other reasons, it is good practice to code dichotomous variables using 1s and 2s.

The following coding guide summarizes how you could enter responses to the preceding question:

Line    Column    Variable Name    Explanation
1       1-7       ACT1-ACT7        Responses regarding activities 1 through 7.
                                   For each activity, a 2 was recorded if the
                                   participant did not check the activity, and
                                   a 1 was recorded if the participant did
                                   check the activity.

When participants have responded to a “check all that apply” item, it is often best to analyze the resulting data with the FREQ (frequency) procedure. PROC FREQ indicates the actual number of people who appear in each category. In this case, PROC FREQ indicates the number of people who did not check a given activity versus the number who did. It also indicates the percentage of people who appear in each category, along with some additional information.

The following program inputs some fictitious data and requests frequency tables for each activity using PROC FREQ:

 1        DATA D1;
 2           INPUT   #1   @1   (ACT1-ACT7)   (1.)  ;
 3
 4        DATALINES;
 5        2212222
 6        1211111
 7        2221221
 8        2212222
 9        1122222
10        ;
11        RUN;
12
13        PROC FREQ    DATA=D1;
14           TABLES ACT1-ACT7;
15        RUN;

Data for the first participant appear on line 5 of the program. Notice that a 1 is entered in column 3 for this participant, indicating that he or she did perform activity 3 (“did volunteer work at a hospital or hospice”), and that 2s are recorded for the remaining six activities, meaning that the participant did not perform those activities. The data entered for participant 2 on line 6 show that this participant performed all of the activities except for activity 2.
