Data Manipulation

Data manipulation involves performing a transformation on one or more variables in the DATA step. This section discusses several types of transformations that are frequently required in social science research. These include creation of duplicate variables with new variable names, creation of new variables from existing variables, recoding reversed items, and using IF-THEN/ELSE statements as well as other related procedures.

Creating Duplicate Variables with New Variable Names

Suppose that you give a variable a certain name when it is inputted, but then you want the variable to have a different, perhaps a more meaningful, name when it appears later in the SAS program or in the SAS output. This can easily be accomplished with a statement written according to the following syntax:

new-variable-name  =  existing-variable-name;

For example, in the preceding dataset, the first seven questions are given variable names of Q1 through Q7. Item 1 in the questionnaire reads, “I feel a personal responsibility to help needy people in my community.” In the INPUT statement, this item was given a SAS variable name Q1, which is not very meaningful. RESNEEDY, which stands for “responsible for the needy,” is a more meaningful name. Similarly, RESHOME is more meaningful than Q2, and NORES is more meaningful than Q3.

One way to rename an existing variable is to create a new variable that is identical to the existing variable and assign a new, more meaningful name to this variable. The following program renames Q1, Q2, and Q3 in this way.

This and later examples show only a portion of the entire program. However, enough of the program appears to illustrate where the remaining statements should be placed.


15        .
16        .
17
18      5433224 19 107 10 F
19      640 590
20      ;
21      RUN;
21
22      DATA D2;
23         SET D1;
24
25      RESNEEDY = Q1;
26      RESHOME  = Q2;
27      NORES    = Q3;
28
29      PROC MEANS   DATA=D2;
30      RUN;

Line 25 tells SAS to create a new variable called RESNEEDY and for it to be identical to the existing variable, Q1. Variables RESNEEDY and Q1 now have identical data but RESNEEDY has a more meaningful name to facilitate the reading of printouts when statistical analyses are later performed.

When creating a new variable name, conform to the rules for naming SAS variables discussed in Chapter 3, “Data Input” (e.g., begins with a letter). Also, note that each statement that creates a duplicate of an existing variable must end with a semicolon.


Duplicating Variables versus Renaming Variables

Technically, the previous program did not really rename variables Q1, Q2, and Q3. Rather, the program created duplicates of these variables and assigned new names to these duplicate variables. Therefore, the resulting dataset contains both the original variables under their old names (Q1, Q2, and Q3) as well as the duplicate variables under their new names (RESNEEDY, RESHOME, and NORES). If, for some reason, you literally need to rename the existing variables so that the old variable names no longer exist in the dataset, consider using the RENAME statement.

Creating New Variables from Existing Variables

It is often necessary to perform mathematical operations on existing variables and use the results to create a new variable. With SAS, the following symbols can be used in arithmetic operations:

+(addition)
-(subtraction)
*(multiplication)
/(division)
=(equals)

When writing formulae, you should make extensive use of parentheses. Remember that operations enclosed within parentheses are performed first, and operations outside of the parentheses are performed later. To create a new variable by performing a mathematical operation on an existing variable, use the following general form:

new-variable-name  =  formula-including-existing-variables;

For example, two existing variables in your dataset are GREVERBAL (GRE verbal test scores) and GREMATH (GRE math test scores). Suppose you wanted to create a new variable called GRECOMB. This variable includes each participant’s combined GRE score. For each participant, you need to add together GREVERBAL and GREMATH scores; therefore, the GRECOMB value is the sum of the values for GREVERBAL and GREMATH. The program repeats this operation for each participant in the sample, using just one statement:

GRECOMB = (GREVERBAL + GREMATH);

The preceding statement tells SAS to create a new variable called GRECOMB and set it equal to the sum of GREVERBAL and GREMATH.

Suppose that you want to calculate the average of GREVERBAL and GREMATH scores. The new variable might be called GREAVG. The program repeats this operation for each participant in the sample using the following statement:

GREAVG = (GREVERBAL + GREMATH) / 2;

The preceding statement tells SAS to create a new variable called GREAVG by first adding together the values of GREVERBAL and GREMATH, then dividing this sum by 2. The resulting quotient is labeled GREAVG. You can also arrive at the same result by using two statements instead of one, as shown here:

GRECOMB = (GREVERBAL + GREMATH);
GREAVG  = GRECOMB/2;

Very often, researchers need to calculate the average of several items on a questionnaire. For example, look at items 1 and 2 in the questionnaire shown previously. Both items seem to be measuring participants’ sense of personal responsibility to help the needy.

Rather than analyze responses to the items separately, it might be more useful to calculate the average of responses to those items. This average could then serve as participants’ scores on some “personal responsibility” variable. For example, consider the following:

RESPONSE = (Q1 + Q2) / 2;

The preceding statement tells SAS to create a new variable called RESPONSE by adding together participants’ scores for Q1 and Q2 and then dividing the resulting sum by 2. The resulting quotient creates the new RESPONSE variable.

When creating new variables in this manner, be sure that all variables on the right side of the equals sign are existing variables. This means that they already exist in the dataset, either because they are listed in the INPUT statement or because they were created with earlier data-manipulation statements.


Priority of Operators in Compound Expressions

A SAS expression (e.g., a formula) that contains just one operator is known as a simple expression. The following statement contains a simple expression. Notice that there is only one operator (+ sign) to the right of the = sign:

RESPONSE = Q1 + Q2;

In contrast, a compound expression contains more than one operator. A compound expression is illustrated in the following example. Notice that several different operators appear to the right of the = sign:

RESPONS = Q1 + Q2 - Q3 / Q4 * Q5;

When an expression contains more than one operator, SAS follows a set of rules that determine which operations are performed first, which are performed second, and so forth. The rules that pertain to mathematical operators (+, -, /, and *) are summarized here:

  • Multiplication and division operators (* and /) have equal priority, and they are performed first.

  • Addition and subtraction operators (+ and -) have equal priority, and they are performed second.

One point made in the preceding rules is that multiplication and division are performed prior to addition or subtraction. For example, consider the following statement:

RESPONS = Q1 + Q2 / Q3;

Since division has priority over addition, the operations in the preceding statement would be executed in this sequence:

  • Q2 would first be divided by Q3.

  • The resulting quotient would then be added to Q1.

Notice that division is performed first, even though the addition appears earlier in the formula (reading from left to right).

But what if multiple operators having equal priority appear in the same statement? In this situation, SAS reads the formula from left to right, and performs the operations in that sequence. For example, consider the following:

RESPONS = Q1 + Q2 - Q3;

The preceding expression contains only addition and subtraction operations that have equal priority. SAS therefore reads the statement from left to right: first Q1 is added to Q2; then Q3 is subtracted from the resulting sum.

Because different priority is given to different operators, it is unfortunately very easy to write a statement that results in operations being performed in a sequence other than that intended. For example, imagine that you want to create a new variable called RESPONSE. Each participant’s score for RESPONSE is created by adding responses to Q1, Q2, and Q3 and by dividing this sum by 3. Imagine further that you try to achieve this with the following statement:

RESPONSE = Q1 + Q2 + Q3 / 3;

The preceding statement will not create the RESPONSE variable as you had intended. Because division has priority over addition, SAS performs the operations in the following order:

  1. Q3 is divided by 3.

  2. The resulting quotient is then added to Q1 and Q2.

Obviously, this is not what you intended.

To avoid mistakes such as this, it is important to use parentheses when writing formulae. Because operations inside parentheses are performed first, the use of parentheses gives you control over the sequence in which operations are executed. For example, the following statement creates the RESPONSE variable in the way originally intended because the lower priority operations (adding together Q1 plus Q2 plus Q3) are now included within parentheses:

RESPONSE = (Q1 + Q2 + Q3) / 3;

This statement tells SAS to add together Q1 plus Q2 plus Q3; the sum of these operations is then divided by 3.

This section has provided a brief introduction to the priority of a few operators that can be performed with SAS.

Recoding Reversed Variables

Very often, a questionnaire contains a number of reversed items. A reversed item is a question stated so that its meaning is opposite the meaning of other items in that group. For example, consider the meaning of the following items from the volunteerism survey:

1  2  3  4  5      1.    I feel a personal
 responsibility to help
                         needy people in my community.

1  2  3  4  5      2.    I feel I am personally
 obligated to help
                         homeless families.

1  2  3  4  5      3.    I feel no personal
 responsibility to work
                         with poor people in my
 community.


In a sense, all of these questions are measuring the same thing (i.e., whether the participant feels some sense of personal responsibility to help the needy). Items 1 and 2 are stated so that the more strongly you agree with these statements, the greater your sense of personal responsibility. This means that scores of 5 indicate a strong sense of responsibility and scores of 1 indicate a weak sense of responsibility. However, item 3 is a reversed or negatively keyed item. It is stated so that the more strongly you agree, the weaker your sense of personal responsibility. Here, a response of 1 indicates a strong sense of responsibility whereas a response of 5 indicates a weak sense of responsibility (which is just the reverse of items 1 and 2).

For later analyses, all three items must be consistent so that scores of 5 always indicate a strong sense of responsibility whereas scores of 1 always indicate a weak sense of responsibility. This requires that you recode item 3 so that those who actually select 5, instead, are given a score of 1; those who actually circle 4 are given, instead, a score of 2; those who actually select 2 are given a score of 4; and those who select 1 are given a score of 5. This can be done very well with the following statement:

Q3 = 6 - Q3;

The preceding statement tells SAS to create a new version of the variable Q3, then take the number 6 and subtract from it participants’ existing (old) scores for Q3. The result is a new score for Q3. Notice that with this statement, if an initial score for Q3 was 5, the new score becomes 1; and if the initial score was 1, the new score is 5.

The syntax for this recoding statement is as follows:

existing-variable  =  constant  -  existing-variable;

The constant is always equal to the number of response points on your survey plus 1. For example, the volunteerism survey included 5 response points: participants could circle “1” for “Disagree Strongly” all the way through “5” for “Agree Strongly.” It was a 5-point scale, so the constant is 5 + 1 = 6. What would the constant be if the following 7-point scale had been used instead?

7 = Agree Very Strongly
6 = Agree Strongly
5 = Agree Somewhat
4 = Neither Agree nor Disagree
3 = Disagree Somewhat
2 = Disagree Strongly
1 = Disagree Very Strongly

It would be 8 because 7 + 1 = 8, and the recoding statement would read as follows:

Q3 = 8 - Q3;

Where should the recoding statements go?

In most cases, reversed items should be recoded before other data manipulations are performed. For example, assume that you want to create a new variable called RESPONSE, which stands for “personal responsibility.” With this scale, higher scores indicate higher levels of perceived personal responsibility. Scores on this scale are the average of participant responses to items 1, 2, and 3 from the survey. Because item 3 is a reversed item, it is important that it be recoded before it is added to items 1 and 2 when calculating the overall scale score. Therefore, the correct sequence of statements is as follows:

Q3 = 6 - Q3;
RESPONSE = (Q1 + Q2 + Q3) / 3;

The following sequence is not correct:

RESPONSE = (Q1 + Q2 + Q3) / 3;
 Q3 = 6 - Q3;


Using IF-THEN Control Statements

An IF-THEN control statement allows you to make sure that operations are performed on data only if certain conditions are true. The following comparison operators can be used with IF-THEN statements:

      = (equal to)
     NE (not equal to)
GT or > (greater than)
     GE (greater than or equal to)
LT or < (less than)
     LE (less than or equal to)

The general form for an IF-THEN statement is as follows:

IF  expression  THEN  statement ;

The expression usually consists of some comparison involving existing variables. The statement usually involves some operation performed on existing variables or new variables. For example, assume that you want to create a new variable called GREVGRP for “GRE-verbal group.” This variable will be created so that:

  • If you do not know participants’ GRE verbal test scores, they will be assigned a score of “.” (for “missing data”).

  • If participants’ scores are less than 500 on the GRE verbal test, they will be assigned a score of 1 for GREVGRP.

  • If the participant’s score is 500 or greater on the GRE verbal test, the participant will have a score of 2 for GREVGRP.

Assume that the variable GREVERBAL already exists in your dataset and that it contains each participant’s score for the GRE verbal test. You can use it to create the new variable GREVGRP by writing the following statements:

GREVGRP = .;
IF GREVERBAL LT 500 THEN GREVGRP = 1;
IF GREVERBAL GE 500 THEN GREVGRP = 2;

The preceding statements tell SAS to create a new variable called GREVGRP and begin by setting everyone’s score as equal to “.” (i.e., missing). If participants’ scores for GREVERBAL are less than 500, then their score for GREVGRP will be equal to 1. If participants’ scores for GREVERBAL are greater than or equal to 500, then their score for GREVGRP is equal to 2.

Using ELSE Statements

In reality, you can perform the preceding operations more efficiently by using the ELSE statement. The general form for using the ELSE statement, in conjunction with the IF-THEN statement, is presented as follows:

IF  expression  THEN  statement  ;
   ELSE  IF  expression  THEN  statement;

The ELSE statement provides alternative actions that SAS can take when the original IF expression is not true. For example, consider the following:

GREVGRP = .;
IF GREVERBAL LT 500 THEN GREVGRP = 1;
ELSE IF GREVERBAL GE 500 THEN GREVGRP = 2;

The preceding tells SAS to create a new variable called GREVGRP and initially assign all participants a value of “missing.” If a given participant has a GREVERBAL score less than 500, the system assigns that participant a score of 1 for GREVGRP. Otherwise, if the participant has a GREVERBAL score greater than or equal to 500, then the system assigns that participant a score of 2 for GREVGRP.

Obviously, the preceding statements are identical to the earlier statements that created GREVGRP, except that the word ELSE is added to the beginning of the third line. In fact, these two approaches actually result in assigning exactly the same values for GREVGRP to each participant. So, what is the advantage of including the ELSE statement? The answer has to do with efficiency. When an ELSE statement is included, the actions specified by that statement are executed only if the expression in the preceding IF statement is not true.

For example, consider the situation in which participant 1 has a GREVERBAL score less than 500. Line 2 in the preceding statements assigns that participant a score of 1 for GREVGRP. SAS then ignores line 3 (because it contains the ELSE statement), thus saving computing time. If line 3 did not contain the word ELSE, SAS would have executed the command, checking to see whether the GREVERBAL score for participant 1 is greater than or equal to 500 (which is actually unnecessary, given what was learned in line 2).

A word of caution regarding missing data is required at this point. Notice that line 2 of the preceding program assigns participants to group 1 (under GREVGRP) if their values for GREVERBAL are less than 0. Unfortunately, a value of “missing” (i.e., a value of “.”) for GREVERBAL is viewed as being less than 500 (actually, it is viewed as being less than 0) by SAS. This means that participants with missing data for GREVERBAL are assigned to group 1 under GREVGRP by line 2 of the preceding program. This is not desirable.

To prevent this from happening, you can rewrite the program in the following way:

GREVGRP = .;
IF GREVERBAL GT 0 AND GREVERBAL LT 500 THEN GREVGRP = 1;
ELSE IF GREVERBAL GE 500 THEN GREVGRP = 2;

Line 2 of the program now tells SAS to assign participants to group 1 only if their values for GREVERBAL are both greater than 0 and less than 500. This modification involves the use of the conditional AND statement, which is discussed in greater detail in the following section.

Finally, remember that the ELSE statement should be used only in conjunction with a preceding IF statement. In addition, always remember to place the ELSE statement immediately following the relevant IF statement.

Using the Conditional Statements AND and OR

As the preceding section indicates, you can also use the conditional statement AND within an IF-THEN statement or an ELSE statement. For example, consider the following:

GREVGRP = .;
IF GREVERBAL GT 0 AND GREVERBAL LT 500 THEN GREVGRP = 1;
ELSE IF GREVERBAL GE 500 THEN GREVGRP = 2;

The second statement in the preceding program tells SAS that if GREVERBAL is greater than 0 and less than 500, then a score of 1 is given to participants for the GREVGRP variable. This means that all are given a value of 1 only if they are both over 0 and under 500. What happens to those who have a score of 0 or less for GREVERBAL? They are given a value of “.” for GREVGRP. That is, they are classified as having a missing value for GREVGRP. This is because they (along with everyone else) were initially given a value of “.” in the first statement, and neither of the later statements replaces that “.” with 1 or 2. However, for those with GREVERBAL scores greater than 0, one of the subsequent statements replaces “.” with either 1 or 2.

You can also use the conditional statement OR within an IF-THEN statement or an ELSE statement. For example, assume that you have a variable in your dataset called ETHNIC. With this variable, participants were assigned the value 5 if they are Caucasian, 6 if they are African American, or 7 if they are Asian American. Assume that you now wish to create a new variable called MAJORITY. Participants are assigned a value of 1 for this variable if they are in the majority group (i.e., if they are Caucasians), and they are assigned a value of 2 for this variable if they are in a minority group (i.e., if they are either African Americans or Asian Americans). This variable is created with the following statements:

MAJORITY=.;
IF ETHNIC = 5 THEN MAJORITY = 1;
ELSE IF ETHNIC = 6 OR ETHNIC = 7 THEN MAJORITY = 2;

In the preceding statements, all participants are first assigned a value of “missing” for MAJORITY. If their value for ETHNIC is 5, their value for MAJORITY changes to 1 and SAS ignores the following ELSE statement. If their value for ETHNIC is not 5, then SAS proceeds to the ELSE statement. There, if participants’ value for ETHNIC is either 6 or 7, they are then assigned a value of 2 for MAJORITY.

Working with Character Variables

When working with character variables (i.e., variables in which the values consist of letters rather than numbers), you must enclose values within single quotation marks in the IF-THEN and ELSE statements. For example, suppose you want to create a new variable called SEXGRP. With this variable, males are given a score of 1 and females are given a score of 2. The variable SEX already exists in your dataset, and it is a character variable in which males are coded with the letter “M” and females are coded with the letter “F.” You can create the new SEXGRP variable using the following statements:

SEXGRP = .;
IF SEX = 'M' THEN SEXGRP = 1;
ELSE IF SEX = 'F' THEN SEXGRP = 2;

Using the IN Operator

The IN operator makes it easy to determine whether a given value is among a specified list of values. Because of this, a single IF statement including the IN operator can perform comparisons that could otherwise require a large number of IF statements. The general form for using the IN operator is as follows:

IF  variable  IN  value-1,value-2, ...value-n  THEN  statement;

Notice that each value in the preceding list must be separated by a comma.

For example, assume that you have a variable in your dataset called MONTH. The values assumed by this variable are the numbers 1 through 12. With these values, 1 represents January, 2 represents February, 3 represents March, and so forth. Assume that these values for MONTH indicate the month in which a given participant was born, and that you have data for 100 participants.

Imagine that you now wish to create a new variable called SEASON. This variable will indicate the season in which each participant was born. Participants are assigned values for SEASON according to the following guidelines:

  • Participants are assigned a value of 1 for SEASON if they were born in January, February, or March (months 1, 2, 3).

  • Participants are assigned a value of 2 for SEASON if they were born in April, May, or June (months 4, 5, 6).

  • Participants are assigned a value of 3 for SEASON if they were born in July, August, or September (months 7, 8, 9).

  • Participants are assigned a value of 4 for SEASON if they were born in October, November, or December (months 10, 11, 12).

One way to create the new SEASON variable involves using four IF-THEN statements, as shown here:

SEASON = .;
IF MONTH = 1  OR MONTH = 2  OR MONTH = 3  THEN SEASON = 1;
IF MONTH = 4  OR MONTH = 5  OR MONTH = 6  THEN SEASON = 2;
IF MONTH = 7  OR MONTH = 8  OR MONTH = 9  THEN SEASON = 3;
IF MONTH = 10 OR MONTH = 11 OR MONTH = 12 THEN SEASON = 4;

However, the same results can be achieved somewhat more easily by using the IN operator within IF-THEN statements, as shown here:

SEASON = .;
IF MONTH IN (1,2,3)    THEN SEASON = 1;
IF MONTH IN (4,5,6)    THEN SEASON = 2;
IF MONTH IN (7,8,9)    THEN SEASON = 3;
IF MONTH IN (10,11,12) THEN SEASON = 4;

In the preceding example, all variable values are numbers. However, the IN operator can also be used with character variables. As always, it is necessary to enclose all character variable values within single quotation marks. For example, assume that MONTH is actually a character variable that assumes values such as “Jan,” “Feb,” “Mar,” and so forth. Assume further that SEASON assumes the values “Winter,” Spring,” Summer,” and “Fall.” Under these circumstances, the preceding statements would be modified in the following way:

SEASON = '.';
IF MONTH IN ('Jan', 'Feb', 'Mar') THEN SEASON = 'Winter';
IF MONTH IN ('Apr', 'May', 'Jun') THEN SEASON = 'Spring';
IF MONTH IN ('Jul', 'Aug', 'Sep') THEN SEASON = 'Summer';
IF MONTH IN ('Oct', 'Nov', 'Dec') THEN SEASON = 'Fall';

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.128.206.8