Match-Merge Processing

The Basics of Match-Merge Processing

The match-merging examples in this book are straightforward. However, match-merging can be more complex, depending on your data and on the output data set that you want to create. To predict the results of match-merges correctly, you need to understand how the DATA step performs match-merges.
When you submit a DATA step, it is processed in two phases:
  • the compilation phase, in which SAS checks the syntax of the SAS statements and compiles them (translates them into machine code). During this phase, SAS also sets up descriptor information for the output data set and creates the program data vector (PDV).
  • the execution phase, in which the DATA step reads data and executes any subsequent programming statements. When the DATA step executes, data values are read into the appropriate variables in the PDV. From the PDV, the values are written to the output data set as a single observation.

The Compilation Phase: Setting Up a New Data Set

To prepare to merge data sets, SAS does the following:
  • reads the descriptor portions of the data sets that are listed in the MERGE statement
  • reads the rest of the DATA step program
  • creates the PDV for the merged data set
  • assigns a tracking pointer to each data set that is listed in the MERGE statement
If a variable with the same name appears in more than one data set, its length is determined by the first data set in the MERGE statement that contains it.
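The length rule above can be seen in a small sketch. The data sets work.a and work.b, the variables ID and Plan, and all data values are hypothetical and are not taken from the book's example data.

```sas
/* Hypothetical data: Plan is 4 bytes long in work.a and 10 bytes long in work.b */
data work.a;
  length ID 8 Plan $ 4;
  input ID Plan $;
  datalines;
1 Gold
2 Gold
;
run;

data work.b;
  length ID 8 Plan $ 10;
  input ID Plan $;
  datalines;
1 Platinum
;
run;

/* work.a is listed first, so Plan is created in the PDV with a length of 4 */
data work.merged;
  merge work.a work.b;
  by ID;
run;

/* The descriptor portion of work.merged shows the length that was assigned to Plan */
proc contents data=work.merged;
run;

/* One common workaround: declare the longer length before the MERGE statement */
data work.fixed;
  length Plan $ 10;
  merge work.a work.b;
  by ID;
run;
```

In work.merged, the value that is read from work.b for ID 1 should be truncated to four characters, because the PDV variable was created with the length from work.a. Placing a LENGTH statement ahead of the MERGE statement, as in work.fixed, lets the program set the length explicitly.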
Figure 10.11 The Compilation Phase: Setting Up the New Data Set
After reading the descriptor portions of the data sets Clients and Amounts, SAS does the following:
  1. creates a PDV for the new Claims data set. The PDV contains all variables from the two data sets. Note that although Name appears in both input data sets, it appears in the PDV only once.
  2. assigns tracking pointers to Clients and Amounts.
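A DATA step along the following lines would trigger the compilation steps described above. This is a sketch under two assumptions: that the libref Cert is already assigned, and that Name is the BY variable, which is inferred from the fact that Name is the variable the two input data sets share. The sorted copies in the Work library are added here for illustration only.

```sas
/* Match-merging requires each input data set to be sorted or indexed by the BY variable */
proc sort data=cert.clients out=work.clients;
  by Name;
run;

proc sort data=cert.amounts out=work.amounts;
  by Name;
run;

/* Assumption: Name is the BY variable because both input data sets contain it */
data work.claims;
  merge work.clients work.amounts;
  by Name;
run;
```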

The Execution Phase: Match-Merging Observations

After compiling the DATA step, SAS match-merges observations sequentially: it advances each pointer through its data set one observation at a time and checks whether the BY values match.
  • If the BY values match, the observations are read into the PDV in the order in which the data sets appear in the MERGE statement. Values of any same-named variable are overwritten by values of the same-named variable from data sets that are listed later in the MERGE statement. SAS writes the combined observation to the new data set and retains the values in the PDV until the BY value changes in all the data sets.
  • If the BY values do not match, SAS determines which BY value comes first and reads the observation that contains this value into the PDV. Then the contents of the PDV are written to the output data set.
  • When the BY value changes in all the input data sets, the PDV is initialized to missing.
The DATA step merge continues to process every observation in each data set until it has processed all observations in all data sets.
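To watch the execution phase at work, the following sketch writes the contents of the PDV to the log on every iteration. The data values are hypothetical stand-ins, not the book's actual Clients and Amounts data, and the use of Name as the BY variable is an assumption.

```sas
/* Hypothetical sample data, already in ascending order of Name */
data work.clients;
  input Name $ ID $;
  datalines;
Ankerton 101
Davis 102
Masters 103
;
run;

data work.amounts;
  input Name $ Date :date9. Amt;
  datalines;
Ankerton 05OCT2017 1523
Ankerton 12OCT2017 1225
Masters 08OCT2017 833
Wright 15OCT2017 1498
;
run;

data work.claims;
  merge work.clients work.amounts;
  by Name;
  format Date date9.;
  /* Write the PDV contents to the log just before the implicit OUTPUT */
  put _all_;
run;
```

With this sample data, the log should show five iterations that write an observation. Ankerton produces two observations, and ID keeps the value 101 in the second one because the PDV values are retained within the BY group. Davis has missing values for Date and Amt, and Wright has a missing value for ID, because each of those BY values appears in only one input data set.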

Handling Unmatched Observations and Missing Values

By default, all observations that are read into the PDV, including observations that have missing data and no matching BY values, are written to the output data set. If you specify a subsetting IF statement to select observations, then only those that meet the IF condition are written.
  • If an observation contains missing values for a variable, then the observation in the output data set contains the missing values as well. Observations that have missing values for the BY variable appear at the top of the output data set because missing values sort first in ascending order.
  • If an input data set does not have a matching BY value, then the observation in the output data set contains missing values for the variables that are unique to that input data set.
  • The last observation in Cert.Clients would be added after the last observation in Cert.Amounts.
The PROC PRINT output is displayed below. Use the FORMAT statement for the date variable in the PRINT procedure. To learn how to apply a format, see SAS Formats and Informats.
proc print data=work.claims noobs;
  format date date9.;
run;
Figure 10.12 PROC PRINT Output of Merged Data
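The effect of a subsetting IF statement on unmatched observations, described earlier in this section, can be sketched as follows. The IN= data set option creates temporary variables that indicate whether each data set contributed to the current observation. The variable names inClients and inAmounts, the output data set work.matched, and the data values are hypothetical.

```sas
/* Hypothetical sample data, already in ascending order of Name */
data work.clients;
  input Name $ ID $;
  datalines;
Ankerton 101
Davis 102
;
run;

data work.amounts;
  input Name $ Amt;
  datalines;
Ankerton 1523
Wright 1498
;
run;

data work.matched;
  merge work.clients(in=inClients) work.amounts(in=inAmounts);
  by Name;
  /* Subsetting IF: keep an observation only when both inputs contributed to it */
  if inClients and inAmounts;
run;

proc print data=work.matched noobs;
run;
```

Without the IF statement, the output would contain three observations: Ankerton, Davis with a missing Amt, and Wright with a missing ID. With it, only the Ankerton observation should be written.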
Last updated: August 23, 2018