Combining Data with the DATA Step Match-Merge

The DATA Step Match-Merge

You should already know how to merge multiple data sets in the DATA step when there is a BY variable that is common to each of the input data sets. When you use the MERGE statement to perform a table lookup operation, your lookup values must be stored in one or more SAS data sets. This technique requires that both the base table and the lookup table or tables be either sorted by or indexed on the BY variable or variables.
You can specify any number of input data sets in the MERGE statement as long as all input data sets have a common BY variable. The MERGE statement can combine data sets of any size. The MERGE statement is capable of returning multiple values. You can use multiple BY variables to perform lookups that are dependent on more than one variable. The MERGE statement returns both matches and nonmatches by default, but you can use DATA step syntax to return only exact matches or to include only specific variables from the lookup table.
CAUTION:
Although you can use the MERGE statement to combine data from sources that have any type of relationship, this technique might not produce the desired results when you are working with a many-to-many match. When the data sets are merged in a DATA step, the observations are matched and combined sequentially. Once an observation is read, it is never reread. That is, the DATA step MERGE statement does not create a Cartesian product. Therefore, the DATA step MERGE statement is probably not an appropriate technique to use for performing lookup operations when you are working with a many-to-many match.

Working with Multiple Lookup Tables

Sometimes you might need to combine data from three or more related SAS data sets in order to create one new data set. For example, the three data sets listed below all contain different data about a fictional airline's flights and airports. Sasuser.Acities contains data about various airports, Sasuser.Revenue contains data about the revenue generated by various flights, and Sasuser.Expenses contains data about the expenses incurred by various flights. The variables in each of these data sets are listed here.
Working with Multiple Lookup Tables
Suppose you want to create a new data set, named Sasuser.Alldata, that contains data from each of these three input data sets. As shown below, the Sasuser.Alldata data set contains the new variable Profit, which is calculated from the revenue values that are stored in Sasuser.Revenue and the expense values that are stored in Sasuser.Expenses.
Working with Multiple Lookup Tables
You can specify any number of input data sets in the MERGE statement as long as all input data sets have a common BY variable. However, you can see from the data set variable lists above that these three data sets do not have one common variable. We will consider a method for performing a match-merge on these three data sets.
Although the three data sets Sasuser.Acities, Sasuser.Revenue, and Sasuser.Expenses do not have a common BY variable, there are several variables that are common to two of the three data sets. As shown below, Date and FlightID are both common to Revenue and Expenses. The variable Code in the Acities data set and the variable Dest in the Revenue data set, while named differently, contain the same data with the same type and length.
Working with Multiple Lookup Tables
Notice that Code in Acities and Dest in Revenue are listed as corresponding to one another even though they have different names. When you are looking for common variables between data sets, the variable names are not important since they can be changed with the RENAME= option in the MERGE statement. Instead, you should look for variables that record the same information and that have the same type in each input data set. Common variables do not need to have the same length, although you should remember that the length of the variable in the first-listed data set will determine the length of the variable in the output data set.
Note: Any variables that have the same name in multiple data sets in the MERGE statement must also have the same type. If any variables in different input data sets have identical names but do not have identical types, ERROR and WARNING messages are written to the SAS log, and the match-merge fails.
In this case, both Code in Acities and Dest in Revenue record the three-letter abbreviation of an airport.
Tip
You can use PROC CONTENTS to view information about variables such as type, length, and description.
Since there are variables that are common to two different pairs of the three data sets shown above, you can combine these data sets into one data set by using the MERGE statement in two subsequent DATA steps. That is, you can perform one match-merge on two of the data sets to create one new data set that combines data from both. Then you can perform another match-merge on the new data set and the remaining original data set. Consider the following example.

Example

In the following program, both Sasuser.Expenses and Sasuser.Revenue are sorted by FlightID and Date and are placed into temporary data sets in preparation for the merge. Then these two sorted data sets are merged in a DATA step that creates a temporary output data set named Revexpns. In order to reduce the total number of variables in the output data set, a new variable named Profit is created, and the variables that are used to create Profit are dropped from Revexpns.
proc sort data=sasuser.expenses out=expenses;
   by flightid date;
run;

proc sort data=sasuser.revenue out=revenue;
   by flightid date;
run;

data revexpns (drop=rev1st revbusiness revecon expenses);
   merge expenses(in=e) revenue(in=r);
   by flightid date;
   if e and r;
   Profit=sum(rev1st, revbusiness, revecon, -expenses);
run;
Note: The use of the temporary IN= variables E and R in the IF statement above ensures that only observations that contain data from each of the two input data sets are included in the output data set.
In the following program, the output data set named Revexpns is sorted by Dest. Sasuser.Actities is sorted by Code and is placed in a temporary data set. Remember that Dest and Code are corresponding variables even though they have different names.
The sorted data sets are then merged in a DATA step. Since two data sets must have at least one variable that matches exactly in order to be merged, the RENAME= option renames Code to Dest in the output data set. The DATA step merges Revexpns and Acities into a new output data set named Alldata.
proc sort data=revexpns;
   by dest;
run;

proc sort data=sasuser.acities out=acities;
   by code;
run;

data sasuser.alldata;
   merge revexpns(in=r) acities
         (in=a rename=(code=dest) keep=city name code);
   by dest;
   if r and a;
run;

proc print data=sasuser.alldata(obs=5) noobs;
   title 'Result of Merging Three Data Sets';
   format Date date9.;
run;
The PROC PRINT step prints the first five observations in the Sasuser.Alldata data set that is created in this example, as shown here.
Working with Multiple Lookup Tables
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.139.103.204