Match a master data set and transaction data set where the master data set has only unique combinations of the values of the matching variables and the transaction data set can have multiple combinations. Keep only the observations from the master data set in the output data set. Order the observations in the transaction data set so that a specific observation from the transaction data set is kept in the output data set when there are multiple matches for a combination of the values of the matching variables.
Examples 6.1, 6.2, 6.3, 6.4, and 6.5 illustrate ways to update common variables when combining data sets. The five examples apply the methods to the same master data set. Different transaction data sets are used in Examples 6.1, 6.2, and 6.3. Examples 6.4 and 6.5 use the same transaction data set and produce different resulting data sets.
Featured Step | DATA step |
Featured Step Options and Statements | Match-merge with MERGE and BY statements IN= data set option |
Related Technique | PROC SQL, left join, COALESCE function, subquery |
Master data set BOOKLIST contains information about the chapters of three books. Transaction data set SUBMISSIONS contains updates of information to apply to BOOKLIST.
BOOKLIST Obs bookid booktitle chapter author duedate editor 1 NF0586 Current Narratives 1 Smith, Rebecca 09/04/2010 Williams, Stephanie 2 NF0586 Current Narratives 2 Williams, Susan 09/04/2010 Williams, Stephanie 3 NF0586 Current Narratives 3 Torres, Christopher 09/11/2010 Williams, Stephanie 4 NF0586 Current Narratives 4 Torres, Christopher 09/11/2010 Williams, Stephanie 5 NF0586 Current Narratives 5 Powell, George 09/11/2010 Williams, Stephanie 6 NF0586 Current Narratives 6 Thompson, Tonya 09/11/2010 Williams, Stephanie 7 NF0586 Current Narratives 7 Allen, Linda 09/11/2010 Williams, Stephanie 8 NF0586 Current Narratives 8 Johnson, Tammy 09/11/2010 Williams, Stephanie 9 NF0586 Current Narratives 9 Kelly, Melissa 09/11/2010 Williams, Stephanie 10 NF0586 Current Narratives 10 Thompson, Tonya 09/11/2010 Williams, Stephanie 11 NF2413 Political Comments 1 Jones, Robin 07/31/2010 White, Michelle 12 NF2413 Political Comments 2 Sanchez, Brandon 08/07/2010 White, Michelle 13 NF2413 Political Comments 3 Jones, Robin 07/31/2010 White, Michelle 14 NF2413 Political Comments 4 Perez, Joshua 07/31/2010 White, Michelle 15 NF2413 Political Comments 5 Williams, Nicholas 07/31/2010 White, Michelle 16 NF2413 Political Comments 6 Patterson, Mary 08/14/2010 White, Michelle 17 NF2413 Political Comments 7 Torres, Christopher 08/07/2010 White, Michelle 18 NF2413 Political Comments 8 Robinson, Bonnie 08/07/2010 White, Michelle 19 NF2413 Political Comments 9 Brown, Patricia 08/07/2010 White, Michelle 20 NF8141 Favorite Essays 1 Clark, Todd 10/02/2010 Patterson, Daniel 21 NF8141 Favorite Essays 2 Barnes, David 10/02/2010 Patterson, Daniel 22 NF8141 Favorite Essays 3 Young, Richard 09/18/2010 Patterson, Daniel 23 NF8141 Favorite Essays 4 Barnes, David 10/02/2010 Patterson, Daniel 24 NF8141 Favorite Essays 5 Anderson, Daniel 09/18/2010 Patterson, Daniel 25 NF8141 Favorite Essays 6 Anderson, Daniel 09/18/2010 Patterson, Daniel 26 NF8141 Favorite Essays 7 Morris, Laura 09/18/2010 Patterson, Daniel 27 NF8141 Favorite Essays 8 Powell, George 09/18/2010 Patterson, Daniel
Data set BOOKLIST has unique occurrences of the combinations of BOOKID and CHAPTER. Data set SUBMISSIONS contains multiple occurrences for two combinations of BOOKID and CHAPTER.
SUBMISSIONS Obs bookid chapter author draftdate 1 NF0586 3 Torres, Christopher 05/13/2010 2 NF0586 3 Torres, Christopher 06/17/2010 3 NF0586 3 Torres, Christopher 06/30/2010 4 NF2413 2 Sanchez, Brandon 04/22/2010 5 NF2413 2 Sanchez, Brandon 06/02/2010 6 NF2413 8 Robinson, Bonnie 04/01/2010 7 NF8141 5 Anderson, Daniel 05/26/2010 8 NF8141 6 Anderson, Daniel 07/01/2010
Output 6.5 LASTDRAFT Data SetExample 6.5 LASTDRAFT Data Set Created with DATA Step Obs bookid booktitle chapter author duedate editor draftdate 1 NF0586 Current Narratives 1 Smith, Rebecca 09/04/2010 Williams, Stephanie . 2 NF0586 Current Narratives 2 Williams, Susan 09/04/2010 Williams, Stephanie . 3 NF0586 Current Narratives 3 Torres, Christopher 09/11/2010 Williams, Stephanie 06/30/2010 4 NF0586 Current Narratives 4 Torres, Christopher 09/11/2010 Williams, Stephanie . 5 NF0586 Current Narratives 5 Powell, George 09/11/2010 Williams, Stephanie . 6 NF0586 Current Narratives 6 Thompson, Tonya 09/11/2010 Williams, Stephanie . 7 NF0586 Current Narratives 7 Allen, Linda 09/11/2010 Williams, Stephanie . 8 NF0586 Current Narratives 8 Johnson, Tammy 09/11/2010 Williams, Stephanie . 9 NF0586 Current Narratives 9 Kelly, Melissa 09/11/2010 Williams, Stephanie . 10 NF0586 Current Narratives 10 Thompson, Tonya 09/11/2010 Williams, Stephanie . 11 NF2413 Political Comments 1 Jones, Robin 07/31/2010 White, Michelle . 12 NF2413 Political Comments 2 Sanchez, Brandon 08/07/2010 White, Michelle 06/02/2010 13 NF2413 Political Comments 3 Jones, Robin 07/31/2010 White, Michelle . 14 NF2413 Political Comments 4 Perez, Joshua 07/31/2010 White, Michelle . 15 NF2413 Political Comments 5 Williams, Nicholas 07/31/2010 White, Michelle . 16 NF2413 Political Comments 6 Patterson, Mary 08/14/2010 White, Michelle . 17 NF2413 Political Comments 7 Torres, Christopher 08/07/2010 White, Michelle . 18 NF2413 Political Comments 8 Robinson, Bonnie 08/07/2010 White, Michelle 04/01/2010 19 NF2413 Political Comments 9 Brown, Patricia 08/07/2010 White, Michelle . 20 NF8141 Favorite Essays 1 Clark, Todd 10/02/2010 Patterson, Daniel . 21 NF8141 Favorite Essays 2 Barnes, David 10/02/2010 Patterson, Daniel . 22 NF8141 Favorite Essays 3 Young, Richard 09/18/2010 Patterson, Daniel . 23 NF8141 Favorite Essays 4 Barnes, David 10/02/2010 Patterson, Daniel . 24 NF8141 Favorite Essays 5 Anderson, Daniel 09/18/2010 Patterson, Daniel 05/26/2010 25 NF8141 Favorite Essays 6 Anderson, Daniel 09/18/2010 Patterson, Daniel 07/01/2010 26 NF8141 Favorite Essays 7 Morris, Laura 09/18/2010 Patterson, Daniel . 27 NF8141 Favorite Essays 8 Powell, George 09/18/2010 Patterson, Daniel |
This example illustrates match-merge processing when your transaction data set contains multiple observations with the same matching variable values and you want to update one observation in the master data set. When there are multiple observations in the transaction data set for a combination of matching variable values, only one observation from the group updates the master data set. The DATA step is written so that all observations found in the master data set are copied to the output data set whether or not they have a match in the transaction data set.
The transaction data set contributes values to the output data set only for matched observations in the transaction data set. When a BY group has multiple observations in the transaction data set, only the values from the first observation in the BY group in the transaction data set update the matching observation in the master data set. Prior to the DATA step, PROC SORT arranges the observations in the transaction data set so that a specific observation is placed first in the BY group when there are multiple observations with the same combination of values of the matching variables.
The DATA step also adds to the output data set all variables that are unique to the transaction data set. Missing values are assigned to these variables for observations in the master data set without matches in the transaction data set.
Data set BOOKLIST is the master data set, and it tracks information for the chapters of three books. Data set SUBMISSIONS contains updates of information to apply to BOOKLIST.
The DATA step uses the MERGE and BY statements to combine the two data sets. The IN= data set option that is applied to BOOKLIST defines the temporary variable INLIST, which the program tests to determine whether data set BOOKLIST contributes to the current observation. The subsetting IF statement tests INLIST and writes out observations only if they are found in BOOKLIST. This means that the DATA step does not output an observation in SUBMISSIONS that has a combination of BOOKID and CHAPTER not found in BOOKLIST.
Compared to Example 6.4, this DATA step adds a statement to reset INLIST to zero with each iteration of the DATA step. Example 6.4 demonstrated that by default the value of INLIST is set at the start of each BY group and remains constant during processing of the BY group. This example instead forces INLIST to be reset to zero with each iteration of the DATA step. This action causes only the first observation in the BY group to be matched to the observation in BOOKLIST because the INLIST variable has a value of 1 only when processing the first occurrence of the combination of BOOKID and CHAPTER in BOOKLIST.
Before processing the DATA step, both data sets BOOKLIST and SUBMISSIONS must either be sorted or indexed by BOOKID and CHAPTER. Data set SUBMISSIONS adds a third BY variable, DRAFTDATE, to the PROC SORT step. Observations are arranged in descending order of DRAFTDATE within each combination of BOOKID and CHAPTER. Because each BY group is sorted in reverse chronological order by DRAFTDATE, the most recent DRAFTDATE value is the one that is saved in LASTDRAFT when there are multiple observations in a BY group in SUBMISSIONS.
Example 6.4 and Example 6.5 use the same transaction data set SUBMISSIONS, and produce different resulting data sets. The resulting data set for Example 6.4 has an observation for each matching observation from SUBMISSIONS. Thus, the data set has more observations than in BOOKLIST because of multiple observations for two combinations of BOOKID and CHAPTER values found in SUBMISSIONS. However, the resulting data set for Example 6.5, has only one observation from each of the two specific combinations of BOOKID and CHAPTER values found in SUBMISSIONS that have multiple observations. Therefore, the resulting data set in Example 6.5 has the same number of observations as found in BOOKLIST.
Sort the observations in SUBMISSIONS by the matching variables BOOKID and CHAPTER. Sort the observations within each combination of BOOKID and CHAPTER in reverse chronological order by DRAFTDATE. Create data set LASTDRAFT. Reset INLIST to 0 at the top of the DATA step so that a previous value of 1 is not retained throughout the processing of a BY group. Combine the two data sets. Place the master data set first in the MERGE statement. Define temporary variable INLIST so that the IF statement can tests its value to determine whether the current observation has information contributed from BOOKLIST. Match the two data sets by BOOKID and CHAPTER. Keep all observations from BOOKLIST.
proc sort data=submissions; by bookid chapter descending draftdate; run; data lastdraft; inlist=0; merge booklist(in=inlist) submissions; by bookid chapter; if inlist; run;
The following PROC SQL step creates a table equivalent to the data set that was created by the DATA step in the main example. A left join combines table BOOKLIST with the results of a query applied to SUBMISSIONS. This subquery selects rows from SUBMISSIONS with the most recent value for DRAFTDATE from each combination of BOOKID and CHAPTER values. The MAX aggregate function finds the most recent value of DRAFTDATE.
As with Example 6.4, the table alias that is attached to the three columns in common is for table BOOKLIST. Because the step saves all rows present in BOOKLIST and only those rows from the query applied to SUBMISSIONS that have a match in BOOKLIST, the values for the columns in common should come from BOOKLIST.
Create table LASTDRAFT. Specify the columns to save in the output table. Specify the origin of the three columns in common between the two tables being joined by preceding the column name with the table alias. Combine table BOOKLIST with the results of a query applied to SUBMISSIONS. Specify an alias for BOOKLIST. Specify the subquery. Select rows from SUBMISSIONS with the most recent value of DRAFTDATE for unique combinations of BOOKID and CHAPTER. Group the rows in SUBMISSIONS by the values of BOOKID and CHAPTER. Specify an alias for the subquery. Specify the columns that match the table and results of the subquery.
proc sql; create table lastdraft as select b.bookid, booktitle, b.chapter, b.author, duedate, editor, draftdate from booklist b left join (select bookid, chapter, max(draftdate) as draftdate format=mmddyy10. from submissions group by bookid,chapter) c on b.bookid=c.bookid and b.chapter=c.chapter; quit;
18.116.26.118