The techniques described to this point in this chapter are designed to help you transform data within a single dataset (e.g., to recode a variable within a single dataset). However, you frequently need to perform transformations that involve combining more than one dataset to create a new dataset. For example, concatenating involves creating a new dataset by combining two or more previously existing datasets. With concatenation, the same variables typically appear in both of the previously existing datasets, but the two sets contain data from different participants. By concatenating the two previously existing sets, you create a new set that contains data from all participants.
In contrast, merging combines datasets in a different way. With merging, each previously existing dataset typically contains data from the same participants. However, the different, previously existing sets usually contain different variables. By merging these sets, you can create a new dataset that contains all variables found in the previously existing datasets. For example, assume that you conduct a study with 100 participants. Dataset A contains each participant’s age, while dataset B contains questionnaire responses from the same 100 participants. By merging datasets A and B, you can create a new dataset called C that, again, contains just 100 observations. A given observation in dataset C contains a given participant’s age as well as the questionnaire responses made by that participant. Now that the datasets are merged, it is possible to correlate participant age with responses to the questionnaire. These coefficients could not be calculated when AGE was in one dataset and the questionnaire responses were in another.
Imagine that you are conducting research that involves the Graduate Record Examinations (GRE). You obtain data from four participants: John; Sally; Miguel; and Mats. You enter information about these four participants into a SAS dataset called A. This dataset contains three variables:
NAME, which contains the participant’s first name;
GREVERBAL, which contains the participant’s score on the GRE verbal test;
GREMATH, which contains the participant’s score on the GRE math test.
The contents of dataset A appear below in Table 4.1. You can see that John has a score of 520 for GREVERBAL and a score of 500 for GREMATH, Sally had a score of 610 for GREVERBAL and 640 for GREMATH, and so forth.
Name | Greverbal | Gremath |
---|---|---|
John | 520 | 500 |
Sally | 610 | 640 |
Miguel | 490 | 470 |
Mats | 550 | 560 |
Imagine that later you create a second dataset called B that contains data from four different participants: Susan; Jiri; Cheri; and Zdeno. Values for these participants for GREVERBAL and GREMATH appear in Table 4.2.
Name | Greverbal | Gremath |
---|---|---|
Susan | 710 | 650 |
Jiri | 450 | 400 |
Cheri | 570 | 600 |
Zdeno | 680 | 700 |
Assume that you would like to perform some analyses on a single dataset that contains scores from all eight of these participants. But you encounter a problem; the values in dataset A were entered differently from the values of dataset B, making it impossible to read data from both sets with a single INPUT statement. For example, perhaps you entered GREVERBAL in columns 10 to 12 in dataset A, but entered it in columns 11 to 13 in dataset B. Because the variable was entered in different columns in the two datasets, it is not possible to write a single INPUT statement that will input this variable (assuming that you use a formatted input approach).
One way to deal with this problem is to input A and B as separate datasets and then concatenate them to create a single dataset that contains all eight observations. You can then perform analyses on the new dataset. The following is the general form for concatenating multiple datasets into a single dataset:
DATA new-dataset-name; SET dataset-1 dataset-2 ... dataset-n;
In the present situation, you want to concatenate two datasets (A and B) to create a new dataset named C. This could be done in the following statements:
DATA C; SET A B;
The entire program follows that places these statements in context. This program
inputs dataset A;
inputs dataset B;
concatenates A and B to create C;
uses PROC PRINT to print the contents of dataset C. (PROC PRINT is discussed in greater detail in Chapter 5, “Exploring Data with PROC MEANS, PROC FREQ, PROC PRINT, and PROC UNIVARIATE.”)
1 DATA A; 2 INPUT #1 @1 NAME $7. 3 @10 GREVERBAL 3. 4 @14 GREMATH 3. ; 5 6 DATALINES; 7 John 520 500 8 Sally 610 640 9 Miguel 490 470 10 Mats 550 560 11 ; 12 RUN; 13 14 DATA B; 15 INPUT #1 @1 NAME $7. 16 @11 GREVERBAL 3. 17 @15 GREMATH 3. ; 18 19 DATALINES; 20 Susan 710 650 21 Jiri 450 400 22 Cheri 570 600 23 Zdeno 680 700 24 ; 25 RUN; 26 27 DATA C; 28 SET A B; 29 30 PROC PRINT DATA=C 31 RUN;
In the preceding program, dataset A is input in program lines 1 through 12, and dataset B is input in lines 14 through 25. In lines 27 and 28, the two datasets are concatenated to create dataset C. In lines 30 and 31, PROC PRINT is used to print the contents of dataset C, and the results of this procedure are reproduced as Output 4.1. The results of Output 4.1 show that dataset C contains eight observations: the four observations from dataset A along with the four observations from dataset B. To perform additional statistical analyses on this combined dataset, you would specify DATA=C in the PROC statement of your SAS program.
OBS NAME GREVERBAL GREMATH 1 John 520 500 2 Sally 610 640 3 Miguel 490 470 4 Mats 550 560 5 Susan 710 650 6 Jiri 450 400 7 Cheri 570 600 8 Zdeno 680 700 |
As stated earlier, you would normally merge datasets when
you are working with two datasets;
both datasets contain information for the same participants, but one dataset contains one set of variables, while the other dataset contains a different set of variables.
Once these two datasets are merged, you have a single dataset that contains all variables. Having all dataset variables in one dataset allows you to assess the associations among variables, should you want to do so.
As an illustration, assume that your sample consists of just four participants: John; Sally; Miguel; and Mats. Assume that you have obtained the social security number for each participant, and that these numbers are included as a SAS variable named SOCSEC in both previously existing datasets. In dataset D, you have GRE verbal test scores and GRE math test scores for these participants (represented as SAS variables GREVERBAL and GREMATH, respectively). In dataset E, you have college cumulative grade point average for the same four participants (represented as GPA). Table 4.3 and Table 4.4 show the content of these two datasets.
Name | Socsec | Greverbal | Gremath |
---|---|---|---|
John | 232882121 | 520 | 500 |
Sally | 222773454 | 610 | 640 |
Miguel | 211447653 | 490 | 470 |
Mats | 222671234 | 550 | 560 |
Name | Socsec | GPA |
---|---|---|
John | 232882121 | 2.70 |
Sally | 222773454 | 3.25 |
Miguel | 211447653 | 2.20 |
Mats | 222671234 | 2.50 |
Assume that, in conducting your research, you would like to compute the correlation coefficient between GREVERBAL and GPA. (Let’s forget for the moment that you really shouldn’t perform a correlation using such a small sample!) Computing this correlation should be possible because you do have values for these two variables for all four of your participants. However, you cannot compute this correlation until both variables appear in the same dataset. Therefore, it is necessary to merge the variables contained in datasets D and E.
There are actually two ways of merging datasets. Perhaps the simplest way is the one-to-one approach. With one-to-one merging, observations are merged according to their order of appearance in the datasets. For example, imagine that you were to merge datasets D and E using one-to-one merging. In doing this, SAS would take the first observation from dataset D and pair it with the first observation from dataset E. The result would become the first observation in the new dataset (dataset F). If the observations in datasets D and E were in exactly the same sequence, this method would work fine. Unfortunately, if any of the observations were out of sequence, or if one dataset contained more observations than another, then this approach could result in the incorrect pairing of observations. For this reason, we recommend a different strategy for merging: the match-merging approach next described.
Match-merging seems to be the method that is least likely to produce undesirable results and errors. With match-merging, both datasets must contain a common variable, so that values for this common variable can be used to combine observations from the two previously existing datasets into observations for the new dataset (often the participant identification number). For example, consider datasets D and E from Table 4.3 and Table 4.4. The variable SOCSEC appears in both of these datasets; thus, it is a common variable. When SAS uses match-merging to merge these two datasets according to values on SOCSEC, it will:
read the social security number for the first participant in dataset D;
look for a participant in dataset E who has the same social security number;
merge the information from that participant’s observation in dataset D with his or her information from dataset E (if it finds a participant in dataset E with the same social security number);
combine the information into a single observation in the new dataset, F;
repeat this process for all participants.
As the preceding description suggests, the variable that you use as your common variable must be chosen carefully. Ideally, each participant should be assigned a unique value for this common variable. This means that no two participants should have the same value for the common variable. This objective would be achieved when social security numbers are used as the common variable, because no two people have the same social security number (assuming that the data are entered correctly).
The SAS procedure for match-merging datasets is somewhat more complex than for concatenating datasets. In part, this is because both previously existing datasets must be sorted according to values for the common variable prior to merging. This means that the observations must be rearranged in a consistent order with respect to values for the common variable. Fortunately, this is easy to do with PROC SORT, a SAS procedure that allows you to sort variables. This section shows how PROC SORT can be used to achieve this.
The general form for match-merging two previously existing datasets is presented as follows:
PROC SORT DATA=dataset-1; BY common-variable; RUN; PROC SORT DATA=dataset-2; BY common-variable; RUN; DATA new-dataset-name; MERGE dataset-1 dataset-2; BY common-variable; RUN;
To illustrate, assume that you want to match-merge datasets D and E from Table 4.3 and Table 4.4; to do this, use SOCSEC as the common variable. In the following program, these two datasets are entered, sorted, and then merged using the match-merge approach:
1 DATA D; 2 INPUT #1 @1 NAME $9. 3 @10 SOCSEC 9. 4 @20 GREVERBAL 4. 5 @23 GREMATH 4. ; 6 7 DATALINES; 8 John 232882121 520 500 9 Sally 222773454 610 640 10 Miguel 211447653 490 470 11 Mats 222671234 550 560 12 ; 13 RUN; 14 15 16 DATA E; 17 INPUT #1 @1 NAME $9. 18 @10 SOCSEC 9. 19 @20 GPA 4. ; 20 21 DATALINES; 22 John 232882121 2.70 23 Sally 222773454 3.25 24 Miguel 211447653 2.20 25 Mats 222671234 2.50 26 ; 27 RUN; 28 29 PROC SORT DATA=D; 30 BY SOCSEC; 31 RUN; 32 33 PROC SORT DATA=E; 34 BY SOCSEC; 35 RUN; 36 37 DATA F; 38 MERGE D E; 39 BY SOCSEC; 40 RUN; 41 42 PROC PRINT DATA=F; 43 RUN;
In the preceding program, dataset D was input in lines 1 to 11, and dataset E was input in lines 16 through 27. In lines 29 through 35, both datasets were sorted according to values for SOCSEC, and the two datasets were merged according to values of SOCSEC in lines 37 through 40. Finally, the PROC PRINT on lines 42 and 43 requests a printout of the raw data contained in the new dataset.
Output 4.2 contains the results of PROC PRINT, which printed the raw data now contained in dataset F. You can see that each observation in this new dataset now contains the merged data from the two previous datasets D and E. For example, the line for the participant named Miguel now contains his scores on the verbal and math sections of the GRE (which came from dataset D), as well as his grade point average score (which came from dataset E). The same is true for the remaining participants. It now is possible to correlate GREVERBAL with GPA, if that analysis were desired.
The SAS System Obs NAME SOCSEC GREVERBAL GREMATH GPA 1 Miguel 211447653 490 470 2.20 2 Mats 222671234 550 560 2.50 3 Sally 222773454 610 640 3.25 4 John 232882121 520 500 2.70 |
Notice that the observations in Output 4.2 are not in the same order in which they appeared in Tables 4.3 and 4.4. This is because they have now been sorted according to values for SOCSEC by the PROC SORT statements in the preceding SAS program.
3.129.210.91