Concatenating and Merging Datasets

The techniques described to this point in this chapter are designed to help you transform data within a single dataset (e.g., to recode a variable within a single dataset). However, you frequently need to perform transformations that involve combining more than one dataset to create a new dataset. For example, concatenating involves creating a new dataset by combining two or more previously existing datasets. With concatenation, the same variables typically appear in both of the previously existing datasets, but the two sets contain data from different participants. By concatenating the two previously existing sets, you create a new set that contains data from all participants.

In contrast, merging combines datasets in a different way. With merging, each previously existing dataset typically contains data from the same participants. However, the different, previously existing sets usually contain different variables. By merging these sets, you can create a new dataset that contains all variables found in the previously existing datasets. For example, assume that you conduct a study with 100 participants. Dataset A contains each participant’s age, while dataset B contains questionnaire responses from the same 100 participants. By merging datasets A and B, you can create a new dataset called C that, again, contains just 100 observations. A given observation in dataset C contains a given participant’s age as well as the questionnaire responses made by that participant. Now that the datasets are merged, it is possible to correlate participant age with responses to the questionnaire. These coefficients could not be calculated when AGE was in one dataset and the questionnaire responses were in another.

Concatenating Datasets

Imagine that you are conducting research that involves the Graduate Record Examinations (GRE). You obtain data from four participants: John; Sally; Miguel; and Mats. You enter information about these four participants into a SAS dataset called A. This dataset contains three variables:

  • NAME, which contains the participant’s first name;

  • GREVERBAL, which contains the participant’s score on the GRE verbal test;

  • GREMATH, which contains the participant’s score on the GRE math test.

The contents of dataset A appear below in Table 4.1. You can see that John has a score of 520 for GREVERBAL and a score of 500 for GREMATH, Sally had a score of 610 for GREVERBAL and 640 for GREMATH, and so forth.

Table 4.1. Contents of Dataset A
NameGreverbalGremath
John520500
Sally610640
Miguel490470
Mats550560

Imagine that later you create a second dataset called B that contains data from four different participants: Susan; Jiri; Cheri; and Zdeno. Values for these participants for GREVERBAL and GREMATH appear in Table 4.2.

Table 4.2. Contents of Dataset B
NameGreverbalGremath
Susan710650
Jiri450400
Cheri570600
Zdeno680700

Assume that you would like to perform some analyses on a single dataset that contains scores from all eight of these participants. But you encounter a problem; the values in dataset A were entered differently from the values of dataset B, making it impossible to read data from both sets with a single INPUT statement. For example, perhaps you entered GREVERBAL in columns 10 to 12 in dataset A, but entered it in columns 11 to 13 in dataset B. Because the variable was entered in different columns in the two datasets, it is not possible to write a single INPUT statement that will input this variable (assuming that you use a formatted input approach).

One way to deal with this problem is to input A and B as separate datasets and then concatenate them to create a single dataset that contains all eight observations. You can then perform analyses on the new dataset. The following is the general form for concatenating multiple datasets into a single dataset:

DATA  new-dataset-name;
     SET  dataset-1  dataset-2 ... dataset-n;

In the present situation, you want to concatenate two datasets (A and B) to create a new dataset named C. This could be done in the following statements:

     DATA C;
        SET A B;

The entire program follows that places these statements in context. This program

  • inputs dataset A;

  • inputs dataset B;

  • concatenates A and B to create C;

  • uses PROC PRINT to print the contents of dataset C. (PROC PRINT is discussed in greater detail in Chapter 5, “Exploring Data with PROC MEANS, PROC FREQ, PROC PRINT, and PROC UNIVARIATE.”)

 1     DATA  A;
 2        INPUT  #1   @1   NAME      $7.
 3                    @10  GREVERBAL  3.
 4                    @14  GREMATH    3.  ;
 5
 6     DATALINES;
 7     John     520 500
 8     Sally    610 640
 9     Miguel   490 470
10     Mats     550 560
11     ;
12     RUN;
13
14     DATA  B;
15        INPUT  #1   @1   NAME       $7.
16                    @11  GREVERBAL   3.
17                    @15  GREMATH     3.  ;
18
19     DATALINES;
20     Susan     710 650
21     Jiri      450 400
22     Cheri     570 600
23     Zdeno     680 700
24     ;
25     RUN;
26
27     DATA  C;
28        SET  A  B;
29
30     PROC PRINT  DATA=C
31     RUN;

In the preceding program, dataset A is input in program lines 1 through 12, and dataset B is input in lines 14 through 25. In lines 27 and 28, the two datasets are concatenated to create dataset C. In lines 30 and 31, PROC PRINT is used to print the contents of dataset C, and the results of this procedure are reproduced as Output 4.1. The results of Output 4.1 show that dataset C contains eight observations: the four observations from dataset A along with the four observations from dataset B. To perform additional statistical analyses on this combined dataset, you would specify DATA=C in the PROC statement of your SAS program.

Output 4.1. Results of Performing PROC PRINT on Dataset C
OBS    NAME      GREVERBAL      GREMATH

 1     John            520          500
 2     Sally           610          640
 3     Miguel          490          470
 4     Mats            550          560
 5     Susan           710          650
 6     Jiri            450          400
 7     Cheri           570          600
 8     Zdeno           680          700

Merging Datasets

As stated earlier, you would normally merge datasets when

  • you are working with two datasets;

  • both datasets contain information for the same participants, but one dataset contains one set of variables, while the other dataset contains a different set of variables.

Once these two datasets are merged, you have a single dataset that contains all variables. Having all dataset variables in one dataset allows you to assess the associations among variables, should you want to do so.

As an illustration, assume that your sample consists of just four participants: John; Sally; Miguel; and Mats. Assume that you have obtained the social security number for each participant, and that these numbers are included as a SAS variable named SOCSEC in both previously existing datasets. In dataset D, you have GRE verbal test scores and GRE math test scores for these participants (represented as SAS variables GREVERBAL and GREMATH, respectively). In dataset E, you have college cumulative grade point average for the same four participants (represented as GPA). Table 4.3 and Table 4.4 show the content of these two datasets.

Table 4.3. Contents of Dataset D
NameSocsecGreverbalGremath
John232882121520500
Sally222773454610640
Miguel211447653490470
Mats222671234550560

Table 4.4. Contents of Dataset E
NameSocsecGPA
John2328821212.70
Sally2227734543.25
Miguel2114476532.20
Mats2226712342.50

Assume that, in conducting your research, you would like to compute the correlation coefficient between GREVERBAL and GPA. (Let’s forget for the moment that you really shouldn’t perform a correlation using such a small sample!) Computing this correlation should be possible because you do have values for these two variables for all four of your participants. However, you cannot compute this correlation until both variables appear in the same dataset. Therefore, it is necessary to merge the variables contained in datasets D and E.

There are actually two ways of merging datasets. Perhaps the simplest way is the one-to-one approach. With one-to-one merging, observations are merged according to their order of appearance in the datasets. For example, imagine that you were to merge datasets D and E using one-to-one merging. In doing this, SAS would take the first observation from dataset D and pair it with the first observation from dataset E. The result would become the first observation in the new dataset (dataset F). If the observations in datasets D and E were in exactly the same sequence, this method would work fine. Unfortunately, if any of the observations were out of sequence, or if one dataset contained more observations than another, then this approach could result in the incorrect pairing of observations. For this reason, we recommend a different strategy for merging: the match-merging approach next described.

Match-merging seems to be the method that is least likely to produce undesirable results and errors. With match-merging, both datasets must contain a common variable, so that values for this common variable can be used to combine observations from the two previously existing datasets into observations for the new dataset (often the participant identification number). For example, consider datasets D and E from Table 4.3 and Table 4.4. The variable SOCSEC appears in both of these datasets; thus, it is a common variable. When SAS uses match-merging to merge these two datasets according to values on SOCSEC, it will:

  • read the social security number for the first participant in dataset D;

  • look for a participant in dataset E who has the same social security number;

  • merge the information from that participant’s observation in dataset D with his or her information from dataset E (if it finds a participant in dataset E with the same social security number);

  • combine the information into a single observation in the new dataset, F;

  • repeat this process for all participants.

As the preceding description suggests, the variable that you use as your common variable must be chosen carefully. Ideally, each participant should be assigned a unique value for this common variable. This means that no two participants should have the same value for the common variable. This objective would be achieved when social security numbers are used as the common variable, because no two people have the same social security number (assuming that the data are entered correctly).

The SAS procedure for match-merging datasets is somewhat more complex than for concatenating datasets. In part, this is because both previously existing datasets must be sorted according to values for the common variable prior to merging. This means that the observations must be rearranged in a consistent order with respect to values for the common variable. Fortunately, this is easy to do with PROC SORT, a SAS procedure that allows you to sort variables. This section shows how PROC SORT can be used to achieve this.

The general form for match-merging two previously existing datasets is presented as follows:

PROC SORT  DATA=dataset-1;
   BY  common-variable;
RUN;

PROC SORT  DATA=dataset-2;
   BY  common-variable;
RUN;

DATA  new-dataset-name;
   MERGE  dataset-1  dataset-2;
   BY  common-variable;
RUN;

To illustrate, assume that you want to match-merge datasets D and E from Table 4.3 and Table 4.4; to do this, use SOCSEC as the common variable. In the following program, these two datasets are entered, sorted, and then merged using the match-merge approach:

 1     DATA  D;
 2        INPUT  #1   @1  NAME       $9.
 3                    @10 SOCSEC      9.
 4                    @20 GREVERBAL   4.
 5                    @23 GREMATH     4.  ;
 6
 7     DATALINES;
 8     John     232882121 520 500
 9     Sally    222773454 610 640
10     Miguel   211447653 490 470
11     Mats     222671234 550 560
12     ;
13     RUN;
14
15
16     DATA  E;
17        INPUT  #1  @1   NAME    $9.
18                   @10  SOCSEC   9.
19                   @20  GPA      4.  ;
20
21     DATALINES;
22     John     232882121 2.70
23     Sally    222773454 3.25
24     Miguel   211447653 2.20
25     Mats     222671234 2.50
26     ;
27     RUN;
28
29     PROC SORT  DATA=D;
30        BY  SOCSEC;
31     RUN;
32
33     PROC SORT  DATA=E;
34        BY  SOCSEC;
35     RUN;
36
37     DATA  F;
38        MERGE  D  E;
39        BY  SOCSEC;
40     RUN;
41
42     PROC PRINT  DATA=F;
43     RUN;

In the preceding program, dataset D was input in lines 1 to 11, and dataset E was input in lines 16 through 27. In lines 29 through 35, both datasets were sorted according to values for SOCSEC, and the two datasets were merged according to values of SOCSEC in lines 37 through 40. Finally, the PROC PRINT on lines 42 and 43 requests a printout of the raw data contained in the new dataset.

Output 4.2 contains the results of PROC PRINT, which printed the raw data now contained in dataset F. You can see that each observation in this new dataset now contains the merged data from the two previous datasets D and E. For example, the line for the participant named Miguel now contains his scores on the verbal and math sections of the GRE (which came from dataset D), as well as his grade point average score (which came from dataset E). The same is true for the remaining participants. It now is possible to correlate GREVERBAL with GPA, if that analysis were desired.

Output 4.2. Results of Performing PROC PRINT on Dataset F
                      The SAS System

Obs     NAME       SOCSEC     GREVERBAL    GREMATH      GPA
 1     Miguel    211447653       490         470       2.20
 2     Mats      222671234       550         560       2.50
 3     Sally     222773454       610         640       3.25
 4     John      232882121       520         500       2.70

Notice that the observations in Output 4.2 are not in the same order in which they appeared in Tables 4.3 and 4.4. This is because they have now been sorted according to values for SOCSEC by the PROC SORT statements in the preceding SAS program.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.118.184.142