Example 3.2 Merging Observations from Multiple Data Sets by a Common Variable

Goal

Combine observations from multiple data sets based on a variable common to each contributing data set.

Example Features

Featured StepDATA step
Featured Step Options and StatementsMatch-merge with MERGE and BY statement
Related TechniquePROC SQL, multiway join with WHERE clause
A Closer LookUnderstanding How the DATA Step Processes a Match-Merge

Input Data Sets

Data set AGENTS contains the names and IDs of five real estate agents. Data set PROPERTIES contains a list of properties that have sold. The agent responsible for the sale is identified by the ID variable. Data set RATES contains the commission rates for the five agents with the agent identified by the ID variable.

The variable in common is ID. The data sets must be sorted or indexed by the values of ID. Each value of ID occurs only once in data sets AGENTS and RATES, but it might occur multiple times in PROPERTIES. All five ID values are present in the three data sets.

         AGENTS

Obs    id    lname
 1     991   Wong
 2    5144   Johnson
 3    8170   Cross
 4    8173   Chan
 5    9381   Ames


          PROPERTIES

Obs   id      salesprice
 1    991    $345,000.00
 2   5144     $80,000.00
 3   5144    $644,000.00
 4   8170     $72,000.00
 5   8170     $64,000.00
 6   8170    $426,500.00
 7   8173    $198,400.00
 8   9381    $278,000.00

      RATES

Obs    id     rate
 1     991   0.040
 2    5144   0.035
 3    8170   0.080
 4    8173   0.060
 5    9381   0.050

Resulting Data Set

Output 3.2 PAYOUTS Data Set

      Example 3.2 PAYOUTS Data Set Created with DATA Step

Obs     id     lname        salesprice     rate      commission

 1      991    Wong        $345,000.00    0.040      $13,800.00
 2     5144    Johnson      $80,000.00    0.035       $2,800.00
 3     5144    Johnson     $644,000.00    0.035      $22,540.00
 4     8170    Cross        $72,000.00    0.080       $5,760.00
 5     8170    Cross        $64,000.00    0.080       $5,120.00
 6     8170    Cross       $426,500.00    0.080      $34,120.00
 7     8173    Chan        $198,400.00    0.060      $11,904.00
 8     9381    Ames        $278,000.00    0.050      $13,900.00


Example Overview

The following DATA step demonstrates how to match-merge more than two data sets. The previous examples match-merged only two data sets or tables at a time.

The three input data sets contain information about real estate agents in data set AGENTS, properties that the agents have sold in data set PROPERTIES, and the agents' commission rates in RATES. The variable that links the three data sets is ID.

The objective of this program is to create a new data set that matches each individual with the correct sale and rate based on corresponding ID values. The DATA step match-merges the data sets AGENTS, PROPERTIES, and RATES to create a single data set, PAYOUTS, which contains the variables from the three data sets plus a new variable, COMMISSION. The values for COMMISSION are computed by multiplying SALESPRICE from data set PROPERTIES by RATE from data set RATES.

Data set PROPERTIES contains multiple occurrences of some values of ID, while AGENTS and RATES contain only one occurrence of each ID value. Because values of LNAME and RATE (which are read from AGENTS and RATES) are automatically retained across the BY group, multiple observations with the same value for ID found in PROPERTIES contain the correct values for LNAME and RATE.

Program

Create PAYOUTS. Combine observations from the three data sets based on matching the values of ID. Compute a new variable by using information merged from two of the input data sets.

data payouts;
  merge agents sales rates;
  by id;

  commission=rate*salesprice;
  format commission salesprice dollar12.2;

run;

Related Technique

This PROC SQL step creates a table with contents identical to the data set PAYOUTS that was created in the main example. It performs a multiway join with a WHERE clause to link the three tables by column ID. In processing a multiway join, PROC SQL joins two tables at a time and performs the joins in the most efficient order. The order it chooses minimizes the size of the Cartesian product.

Create table PAYOUTS. Specify the columns to select from the three tables. Create a new column by using columns from different tables. List the three tables to join. Specify how to link the three tables.

proc sql;
  create table payouts as
    select agents.id,lname,
           salesprice format=dollar12.2, rate,
           salesprice*rate as commission
                                      format=dollar12.2
                  from agents, properties, rates
           where agents.id=properties.id and
                 properties.id=rates.id;
quit;

A Closer Look

Understanding How the DATA Step Processes a Match-Merge

This section shows how the DATA step in the main example creates the Program Data Vector (PDV) and match-merges the three input data sets.

During the compilation phase, SAS reads the descriptor portions of the input data sets and creates the Program Data Vector (PDV). Also, SAS determines the BY groups in each input data set for the variables that are listed in the BY statement. The five unique ID values define five BY groups.



SAS looks at the first BY group in each input data set to determine whether the BY values match. If so, SAS reads the first observation of that BY group from the first input data set that has that BY value and records the values in the PDV. In this example, all three data sets have the same BY value of ID=991.



Because the BY values match in all three data sets, SAS reads the first observation of the same BY group from the second and third input data sets. It moves the variable values for SALESPRICE and RATE from PROPERTIES and RATES to the PDV, computes COMMISSION, and outputs the observation to data set PAYOUTS.



The first iteration is complete. SAS returns to the top of the DATA step to begin the second iteration and determines that there are no more observations in any of the three data sets with an ID=991. It clears the PDV and examines the next BY group.



The next BY group is ID=5144, and there are observations in all three data sets for this BY value. The variable values are moved into the PDV, COMMISSION is computed, and the observation is written to PAYOUTS.



SAS returns to the top of the DATA step to begin the third iteration. Data sets AGENTS and RATES do not have any more observations for ID=5144, the current BY value. There is a second observation for ID=5144 in PROPERTIES. The variable values for ID=5144 that are contributed from AGENTS and RATES remain in the PDV while the new value for SALESPRICE is moved into the PDV. The DATA step computes COMMISSION by using the new value for SALESPRICE and outputs the observation to PAYOUTS.



SAS continues to match-merge observations until all observations from the three input data sets have been read and written to the output data set.

The five BY values all have matches in the three data sets. If a BY value does not match, SAS will read the input data set with the lowest BY value. The PDV and the output data set will contain missing values for variables that are unique to the other data set.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.137.199.182