Example 6.5 Saving Observations from Only the Master Data Set When the Transaction Data Set Contains Duplicates

Goal

Match a master data set and transaction data set where the master data set has only unique combinations of the values of the matching variables and the transaction data set can have multiple combinations. Keep only the observations from the master data set in the output data set. Order the observations in the transaction data set so that a specific observation from the transaction data set is kept in the output data set when there are multiple matches for a combination of the values of the matching variables.

Examples 6.1, 6.2, 6.3, 6.4, and 6.5 illustrate ways to update common variables when combining data sets. The five examples apply the methods to the same master data set. Different transaction data sets are used in Examples 6.1, 6.2, and 6.3. Examples 6.4 and 6.5 use the same transaction data set and produce different resulting data sets.

Example Features

Featured StepDATA step
Featured Step Options and StatementsMatch-merge with MERGE and BY statements IN= data set option
Related TechniquePROC SQL, left join, COALESCE function, subquery

Input Data Sets

Master data set BOOKLIST contains information about the chapters of three books. Transaction data set SUBMISSIONS contains updates of information to apply to BOOKLIST.

                                         BOOKLIST

 Obs bookid     booktitle      chapter author                 duedate       editor
   1 NF0586 Current Narratives     1   Smith, Rebecca       09/04/2010 Williams, Stephanie
   2 NF0586 Current Narratives     2   Williams, Susan      09/04/2010 Williams, Stephanie
   3 NF0586 Current Narratives     3   Torres, Christopher  09/11/2010 Williams, Stephanie
   4 NF0586 Current Narratives     4   Torres, Christopher  09/11/2010 Williams, Stephanie
   5 NF0586 Current Narratives     5   Powell, George       09/11/2010 Williams, Stephanie
   6 NF0586 Current Narratives     6   Thompson, Tonya      09/11/2010 Williams, Stephanie
   7 NF0586 Current Narratives     7   Allen, Linda         09/11/2010 Williams, Stephanie
   8 NF0586 Current Narratives     8   Johnson, Tammy       09/11/2010 Williams, Stephanie
   9 NF0586 Current Narratives     9   Kelly, Melissa       09/11/2010 Williams, Stephanie
  10 NF0586 Current Narratives    10   Thompson, Tonya      09/11/2010 Williams, Stephanie
  11 NF2413 Political Comments     1   Jones, Robin         07/31/2010 White, Michelle
  12 NF2413 Political Comments     2   Sanchez, Brandon     08/07/2010 White, Michelle
  13 NF2413 Political Comments     3   Jones, Robin         07/31/2010 White, Michelle
  14 NF2413 Political Comments     4   Perez, Joshua        07/31/2010 White, Michelle
  15 NF2413 Political Comments     5   Williams, Nicholas   07/31/2010 White, Michelle
  16 NF2413 Political Comments     6   Patterson, Mary      08/14/2010 White, Michelle
  17 NF2413 Political Comments     7   Torres, Christopher  08/07/2010 White, Michelle
  18 NF2413 Political Comments     8   Robinson, Bonnie     08/07/2010 White, Michelle
  19 NF2413 Political Comments     9   Brown, Patricia      08/07/2010 White, Michelle
  20 NF8141 Favorite Essays        1   Clark, Todd          10/02/2010 Patterson, Daniel
  21 NF8141 Favorite Essays        2   Barnes, David        10/02/2010 Patterson, Daniel
  22 NF8141 Favorite Essays        3   Young, Richard       09/18/2010 Patterson, Daniel
  23 NF8141 Favorite Essays        4   Barnes, David        10/02/2010 Patterson, Daniel
  24 NF8141 Favorite Essays        5   Anderson, Daniel     09/18/2010 Patterson, Daniel
  25 NF8141 Favorite Essays        6   Anderson, Daniel     09/18/2010 Patterson, Daniel
  26 NF8141 Favorite Essays        7   Morris, Laura        09/18/2010 Patterson, Daniel
  27 NF8141 Favorite Essays        8   Powell, George       09/18/2010 Patterson, Daniel

Data set BOOKLIST has unique occurrences of the combinations of BOOKID and CHAPTER. Data set SUBMISSIONS contains multiple occurrences for two combinations of BOOKID and CHAPTER.

                         SUBMISSIONS

 Obs    bookid    chapter          author             draftdate
  1     NF0586       3       Torres, Christopher     05/13/2010
  2     NF0586       3       Torres, Christopher     06/17/2010
  3     NF0586       3       Torres, Christopher     06/30/2010
  4     NF2413       2       Sanchez, Brandon        04/22/2010
  5     NF2413       2       Sanchez, Brandon        06/02/2010
  6     NF2413       8       Robinson, Bonnie        04/01/2010
  7     NF8141       5       Anderson, Daniel        05/26/2010
  8     NF8141       6       Anderson, Daniel        07/01/2010

Resulting Data Set

Output 6.5 LASTDRAFT Data Set

                      Example 6.5 LASTDRAFT Data Set Created with DATA Step

Obs bookid     booktitle      chapter author                 duedate       editor          draftdate

  1 NF0586 Current Narratives     1   Smith, Rebecca      09/04/2010 Williams, Stephanie           .
  2 NF0586 Current Narratives     2   Williams, Susan     09/04/2010 Williams, Stephanie           .
  3 NF0586 Current Narratives     3   Torres, Christopher 09/11/2010 Williams, Stephanie  06/30/2010
  4 NF0586 Current Narratives     4   Torres, Christopher 09/11/2010 Williams, Stephanie           .
  5 NF0586 Current Narratives     5   Powell, George      09/11/2010 Williams, Stephanie           .
  6 NF0586 Current Narratives     6   Thompson, Tonya     09/11/2010 Williams, Stephanie           .
  7 NF0586 Current Narratives     7   Allen, Linda        09/11/2010 Williams, Stephanie           .
  8 NF0586 Current Narratives     8   Johnson, Tammy      09/11/2010 Williams, Stephanie           .
  9 NF0586 Current Narratives     9   Kelly, Melissa      09/11/2010 Williams, Stephanie           .
 10 NF0586 Current Narratives    10   Thompson, Tonya     09/11/2010 Williams, Stephanie           .
 11 NF2413 Political Comments     1   Jones, Robin        07/31/2010 White, Michelle               .
 12 NF2413 Political Comments     2   Sanchez, Brandon    08/07/2010 White, Michelle      06/02/2010
 13 NF2413 Political Comments     3   Jones, Robin        07/31/2010 White, Michelle               .
 14 NF2413 Political Comments     4   Perez, Joshua       07/31/2010 White, Michelle               .
 15 NF2413 Political Comments     5   Williams, Nicholas  07/31/2010 White, Michelle               .
 16 NF2413 Political Comments     6   Patterson, Mary     08/14/2010 White, Michelle               .
 17 NF2413 Political Comments     7   Torres, Christopher 08/07/2010 White, Michelle               .
 18 NF2413 Political Comments     8   Robinson, Bonnie    08/07/2010 White, Michelle      04/01/2010
 19 NF2413 Political Comments     9   Brown, Patricia     08/07/2010 White, Michelle               .
 20 NF8141 Favorite Essays        1   Clark, Todd         10/02/2010 Patterson, Daniel             .
 21 NF8141 Favorite Essays        2   Barnes, David       10/02/2010 Patterson, Daniel             .
 22 NF8141 Favorite Essays        3   Young, Richard      09/18/2010 Patterson, Daniel             .
 23 NF8141 Favorite Essays        4   Barnes, David       10/02/2010 Patterson, Daniel             .
 24 NF8141 Favorite Essays        5   Anderson, Daniel    09/18/2010 Patterson, Daniel    05/26/2010
 25 NF8141 Favorite Essays        6   Anderson, Daniel    09/18/2010 Patterson, Daniel    07/01/2010
 26 NF8141 Favorite Essays        7   Morris, Laura       09/18/2010 Patterson, Daniel             .
 27 NF8141 Favorite Essays        8   Powell, George      09/18/2010 Patterson, Daniel


Example Overview

This example illustrates match-merge processing when your transaction data set contains multiple observations with the same matching variable values and you want to update one observation in the master data set. When there are multiple observations in the transaction data set for a combination of matching variable values, only one observation from the group updates the master data set. The DATA step is written so that all observations found in the master data set are copied to the output data set whether or not they have a match in the transaction data set.

The transaction data set contributes values to the output data set only for matched observations in the transaction data set. When a BY group has multiple observations in the transaction data set, only the values from the first observation in the BY group in the transaction data set update the matching observation in the master data set. Prior to the DATA step, PROC SORT arranges the observations in the transaction data set so that a specific observation is placed first in the BY group when there are multiple observations with the same combination of values of the matching variables.

The DATA step also adds to the output data set all variables that are unique to the transaction data set. Missing values are assigned to these variables for observations in the master data set without matches in the transaction data set.

Data set BOOKLIST is the master data set, and it tracks information for the chapters of three books. Data set SUBMISSIONS contains updates of information to apply to BOOKLIST.

The DATA step uses the MERGE and BY statements to combine the two data sets. The IN= data set option that is applied to BOOKLIST defines the temporary variable INLIST, which the program tests to determine whether data set BOOKLIST contributes to the current observation. The subsetting IF statement tests INLIST and writes out observations only if they are found in BOOKLIST. This means that the DATA step does not output an observation in SUBMISSIONS that has a combination of BOOKID and CHAPTER not found in BOOKLIST.

Compared to Example 6.4, this DATA step adds a statement to reset INLIST to zero with each iteration of the DATA step. Example 6.4 demonstrated that by default the value of INLIST is set at the start of each BY group and remains constant during processing of the BY group. This example instead forces INLIST to be reset to zero with each iteration of the DATA step. This action causes only the first observation in the BY group to be matched to the observation in BOOKLIST because the INLIST variable has a value of 1 only when processing the first occurrence of the combination of BOOKID and CHAPTER in BOOKLIST.

Before processing the DATA step, both data sets BOOKLIST and SUBMISSIONS must either be sorted or indexed by BOOKID and CHAPTER. Data set SUBMISSIONS adds a third BY variable, DRAFTDATE, to the PROC SORT step. Observations are arranged in descending order of DRAFTDATE within each combination of BOOKID and CHAPTER. Because each BY group is sorted in reverse chronological order by DRAFTDATE, the most recent DRAFTDATE value is the one that is saved in LASTDRAFT when there are multiple observations in a BY group in SUBMISSIONS.

Example 6.4 and Example 6.5 use the same transaction data set SUBMISSIONS, and produce different resulting data sets. The resulting data set for Example 6.4 has an observation for each matching observation from SUBMISSIONS. Thus, the data set has more observations than in BOOKLIST because of multiple observations for two combinations of BOOKID and CHAPTER values found in SUBMISSIONS. However, the resulting data set for Example 6.5, has only one observation from each of the two specific combinations of BOOKID and CHAPTER values found in SUBMISSIONS that have multiple observations. Therefore, the resulting data set in Example 6.5 has the same number of observations as found in BOOKLIST.

Program

Sort the observations in SUBMISSIONS by the matching variables BOOKID and CHAPTER. Sort the observations within each combination of BOOKID and CHAPTER in reverse chronological order by DRAFTDATE. Create data set LASTDRAFT. Reset INLIST to 0 at the top of the DATA step so that a previous value of 1 is not retained throughout the processing of a BY group. Combine the two data sets. Place the master data set first in the MERGE statement. Define temporary variable INLIST so that the IF statement can tests its value to determine whether the current observation has information contributed from BOOKLIST. Match the two data sets by BOOKID and CHAPTER. Keep all observations from BOOKLIST.

proc sort data=submissions;
  by bookid chapter descending draftdate;
run;




data lastdraft;
  inlist=0;


  merge booklist(in=inlist)
        submissions;




  by bookid chapter;

  if inlist;

run;

Related Technique

The following PROC SQL step creates a table equivalent to the data set that was created by the DATA step in the main example. A left join combines table BOOKLIST with the results of a query applied to SUBMISSIONS. This subquery selects rows from SUBMISSIONS with the most recent value for DRAFTDATE from each combination of BOOKID and CHAPTER values. The MAX aggregate function finds the most recent value of DRAFTDATE.

As with Example 6.4, the table alias that is attached to the three columns in common is for table BOOKLIST. Because the step saves all rows present in BOOKLIST and only those rows from the query applied to SUBMISSIONS that have a match in BOOKLIST, the values for the columns in common should come from BOOKLIST.

Create table LASTDRAFT. Specify the columns to save in the output table. Specify the origin of the three columns in common between the two tables being joined by preceding the column name with the table alias. Combine table BOOKLIST with the results of a query applied to SUBMISSIONS. Specify an alias for BOOKLIST. Specify the subquery. Select rows from SUBMISSIONS with the most recent value of DRAFTDATE for unique combinations of BOOKID and CHAPTER. Group the rows in SUBMISSIONS by the values of BOOKID and CHAPTER. Specify an alias for the subquery. Specify the columns that match the table and results of the subquery.

proc sql;
  create table lastdraft as
    select b.bookid, booktitle, b.chapter,
           b.author, duedate, editor,
           draftdate


      from booklist b
        left join


      (select bookid, chapter,
            max(draftdate) as draftdate format=mmddyy10.
         from submissions


         group by bookid,chapter)

                 c
      on b.bookid=c.bookid and b.chapter=c.chapter;

quit;

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
52.15.137.91