Example 6.3 Merging Data Sets and Conditionally Overwriting Common Variables

Goal

Merge two data sets so that the values of common variables in the master data set are updated with the values of the common variables in another data set for all matching observations. When an observation in the master data set does not have a match in the second data set, do not overwrite values for the common variables with missing values.

Examples 6.1, 6.2, 6.3, 6.4, and 6.5 illustrate ways to update common variables when combining data sets. The five examples apply the methods to the same master data set. Different transaction data sets are used in Examples 6.1, 6.2, and 6.3. Examples 6.4 and 6.5 use the same transaction data set and produce different resulting data sets.

Example Features

Featured StepDATA step
Featured Step Options and StatementsMatch-merge with MERGE and BY statements
Related TechniquePROC SQL, full join, COALESCE function
A Closer LookUnderstanding Why the UPDATE Statement Would Not Produce the Required Output Data Set in This Example
Illustrating How SAS Merges Data Sets and Conditionally Overwrites Common Variables

Input Data Sets

Master data set BOOKLIST contains information about the chapters of three books. Transaction data set BOOKCHANGES contains updates of information to apply to BOOKLIST.

                                         BOOKLIST

Obs bookid     booktitle      chapter author                 duedate       editor
  1 NF0586 Current Narratives     1   Smith, Rebecca      09/04/2010 Williams, Stephanie
  2 NF0586 Current Narratives     2   Williams, Susan     09/04/2010 Williams, Stephanie
  3 NF0586 Current Narratives     3   Torres, Christopher 09/11/2010 Williams, Stephanie
  4 NF0586 Current Narratives     4   Torres, Christopher 09/11/2010 Williams, Stephanie
  5 NF0586 Current Narratives     5   Powell, George      09/11/2010 Williams, Stephanie
  6 NF0586 Current Narratives     6   Thompson, Tonya     09/11/2010 Williams, Stephanie
  7 NF0586 Current Narratives     7   Allen, Linda        09/11/2010 Williams, Stephanie
  8 NF0586 Current Narratives     8   Johnson, Tammy      09/11/2010 Williams, Stephanie
  9 NF0586 Current Narratives     9   Kelly, Melissa      09/11/2010 Williams, Stephanie
 10 NF0586 Current Narratives    10   Thompson, Tonya     09/11/2010 Williams, Stephanie
 11 NF2413 Political Comments     1   Jones, Robin        07/31/2010 White, Michelle
 12 NF2413 Political Comments     2   Sanchez, Brandon    08/07/2010 White, Michelle
 13 NF2413 Political Comments     3   Jones, Robin        07/31/2010 White, Michelle
 14 NF2413 Political Comments     4   Perez, Joshua       07/31/2010 White, Michelle
 15 NF2413 Political Comments     5   Williams, Nicholas  07/31/2010 White, Michelle
 16 NF2413 Political Comments     6   Patterson, Mary     08/14/2010 White, Michelle
 17 NF2413 Political Comments     7   Torres, Christopher 08/07/2010 White, Michelle
 18 NF2413 Political Comments     8   Robinson, Bonnie    08/07/2010 White, Michelle
 19 NF2413 Political Comments     9   Brown, Patricia     08/07/2010 White, Michelle
 20 NF8141 Favorite Essays        1   Clark, Todd         10/02/2010 Patterson, Daniel
 21 NF8141 Favorite Essays        2   Barnes, David       10/02/2010 Patterson, Daniel
 22 NF8141 Favorite Essays        3   Young, Richard      09/18/2010 Patterson, Daniel
 23 NF8141 Favorite Essays        4   Barnes, David       10/02/2010 Patterson, Daniel
 24 NF8141 Favorite Essays        5   Anderson, Daniel    09/18/2010 Patterson, Daniel
 25 NF8141 Favorite Essays        6   Anderson, Daniel    09/18/2010 Patterson, Daniel
 26 NF8141 Favorite Essays        7   Morris, Laura       09/18/2010 Patterson, Daniel
 27 NF8141 Favorite Essays        8   Powell, George      09/18/2010 Patterson, Daniel

Data set BOOKLIST has multiple occurrences for each of the three values of BOOKID. One of the three values of BOOKID in BOOKLIST does not have a match in BOOKCHANGES.

The data sets must be sorted or indexed by the values of BOOKID because this is the variable that will match the two data sets.

              BOOKCHANGES

Obs   bookid     duedate       editor
 1    NF2413  09/18/2010   Zhang, Amy
 2    NF8141  10/02/2010   McHale, Andrew

Resulting Data Set

Output 6.3 NEWBOOKLIST Data Set

               Example 6.3 NEWBOOKLIST Data Set Created with DATA Step

Obs bookid     booktitle      chapter author                 duedate editor

  1 NF0586 Current Narratives     1   Smith, Rebecca       09/04/2010 Williams, Stephanie
  2 NF0586 Current Narratives     2   Williams, Susan      09/04/2010 Williams, Stephanie
  3 NF0586 Current Narratives     3    Torres, Christopher 09/11/2010 Williams, Stephanie
  4 NF0586 Current Narratives     4    Torres, Christopher 09/11/2010 Williams, Stephanie
  5 NF0586 Current Narratives     5   Powell, George       09/11/2010 Williams, Stephanie
  6 NF0586 Current Narratives     6   Thompson, Tonya      09/11/2010 Williams, Stephanie
  7 NF0586 Current Narratives     7   Allen, Linda         09/11/2010 Williams, Stephanie
  8 NF0586 Current Narratives     8   Johnson, Tammy       09/11/2010 Williams, Stephanie
  9 NF0586 Current Narratives     9   Kelly, Melissa       09/11/2010 Williams, Stephanie
 10 NF0586 Current Narratives    10   Thompson, Tonya      09/11/2010 Williams, Stephanie
 11 NF2413 Political Comments     1   Jones, Robin         09/18/2010 Zhang, Amy
 12 NF2413 Political Comments     2   Sanchez, Brandon     09/18/2010 Zhang, Amy
 13 NF2413 Political Comments     3   Jones, Robin         09/18/2010 Zhang, Amy
 14 NF2413 Political Comments     4   Perez, Joshua        09/18/2010 Zhang, Amy
 15 NF2413 Political Comments     5    Williams, Nicholas  09/18/2010 Zhang, Amy
 16 NF2413 Political Comments     6   Patterson, Mary      09/18/2010 Zhang, Amy
 17 NF2413 Political Comments     7    Torres, Christopher 09/18/2010 Zhang, Amy
 18 NF2413 Political Comments     8   Robinson, Bonnie     09/18/2010 Zhang, Amy
 19 NF2413 Political Comments     9   Brown, Patricia      09/18/2010 Zhang, Amy
 20 NF8141 Favorite Essays        1   Clark, Todd          10/02/2010 McHale, Andrew
 21 NF8141 Favorite Essays        2   Barnes, David        10/02/2010 McHale, Andrew
 22 NF8141 Favorite Essays        3   Young, Richard       10/02/2010 McHale, Andrew
 23 NF8141 Favorite Essays        4   Barnes, David        10/02/2010 McHale, Andrew
 24 NF8141 Favorite Essays        5   Anderson, Daniel     10/02/2010 McHale, Andrew
 25 NF8141 Favorite Essays        6   Anderson, Daniel     10/02/2010 McHale, Andrew
 26 NF8141 Favorite Essays        7   Morris, Laura        10/02/2010 McHale, Andrew
 27 NF8141 Favorite Essays        8   Powell, George       10/02/2010 McHale, Andrew


Example Overview

This example shows you how to update selected variables in common when match-merging a master data set and another data set. The master data set can have duplicate occurrences of the matching variable while the second data set has only unique occurrences of the matching variable. This example illustrates how SAS updates the Program Data Vector (PDV) during match-merges and how you can control the process with your code.

Data set BOOKLIST is the master data set, and it tracks information about the chapters of three books. Data set BOOKCHANGES contains updates of information to apply to BOOKLIST.

The DATA step uses the MERGE and BY statements to update variables DUEDATE and EDITOR in BOOKLIST with values from BOOKCHANGES. The two data sets are matched by the values of variable BOOKID. The DATA step accomplishes two tasks:

  • For matches by BOOKID when there are multiple occurrences of BOOKID, the values for DUEDATE and EDITOR in BOOKCHANGES are assigned to all matching observations in BOOKLIST.

  • For nonmatches by BOOKID, the original values of DUEDATE and EDITOR in BOOKLIST are preserved.

Before the DATA step executes, the RENAME= option applied to master data set BOOKLIST renames the common variables DUEDATE and EDITOR to HOLDDATE and HOLDEDITOR, respectively. This action and the subsequent IF-THEN block prevent the replacement of existing values for DUEDATE and EDITOR in BOOKLIST with missing values when there are nonmatches. The renaming also causes all matching observations to be updated with new values for DUEDATE and EDITOR. Without the renaming and the IF-THEN block later in the DATA step, only the first observation in a BY group would be updated with the new values.

For diagrams of the processing steps of this program, see the "A Closer Look" section. Because variables BOOKTITLE, CHAPTER, and AUTHOR are found only in BOOKLIST, you do not need to include any additional programming to save their original values.

Assume both data sets were sorted or indexed by BOOKID prior to the DATA step.

Program

Create NEWBOOKLIST. Combine BOOKLIST and BOOKCHANGES. Rename the two common variables, DUEDATE and EDITOR, in BOOKLIST to preserve their original values in BOOKLIST. Define temporary variable INUPD so that statements can test whether BOOKCHANGES contributes to the current observation. Match the data sets by BOOKID. Drop the variables that came from BOOKLIST because they are needed only during execution of the DATA step to preserve the original values of DUEDATE and EDITOR in BOOKLIST. When BOOKCHANGES does not contribute to the current observation, reset the values of DUEDATE and EDITOR with their original values, which were preserved in HOLDDATE and HOLDEDITOR.

data newbooklist;
  merge booklist(rename=(duedate=holddate
                         editor=holdeditor))
        bookchanges(in=inupd);





    by bookid;
  drop holddate holdeditor;




  if not inupd then do;
    duedate=holddate;
    editor=holdeditor;
  end;


run;

Related Technique

The following PROC SQL step creates a table equivalent to the data set that was created by the DATA step in the main example. A full join combines the two tables so that all matching and nonmatching rows from BOOKLIST and NEWBOOKLIST are in the output table along with updates of specific rows that were contributed from BOOKCHANGES.

The COALESCE function performs the update of DUEDATE and EDITOR. The first argument to each COALESCE function call is the column from BOOKCHANGES. The second argument is the column from BOOKLIST. The COALESCE function checks each of its arguments from left to right until it finds a nonmissing value. When it finds a match on BOOKID in BOOKCHANGES, the values that were contributed from BOOKCHANGES are stored in NEWBOOKLIST. When it does not find a match for BOOKID in BOOKCHANGES, the first argument to each COALESCE function is missing so it checks whether the second argument is nonmissing. In this example, the second arguments for those rows without a match in BOOKCHANGES are all nonmissing so PROC SQL saves the values from master table BOOKLIST in NEWBOOKLIST.

Create table NEWBOOKLIST. Specify the columns to save in table NEWBOOKLIST. Precede the column names with the alias of the table from which they'r selected. Apply the COALESCE function to the columns that will be updated. Specify as the first argument to each COALESCE function call the column that was contributed from BOOKCHANGES. Name the columns that result from COALESCE the same as that in BOOKLIST. Combine the two tables with a full join so that all matching and nonmatching rows are present in the output table. Specify an alias for each of the two tables. Specify the column that matches the two tables.

proc sql;
  create table newbooklist as
    select coalesce(u.bookid,b.bookid) as bookid,
         booktitle, chapter, author,
         coalesce(u.duedate,b.duedate) as duedate
                            format=mmddyy10.,
         coalesce(u.editor,b.editor) as editor

      from booklist b
        full join
      bookchanges u


      on b.bookid=u.bookid;

quit;

A Closer Look

Understanding Why the UPDATE Statement Would Not Produce the Required Output Data Set in This Example

Example 6.1 used the UPDATE statement to perform an updating action that was similar to the main program in this example. The difference is that the master data set in the main DATA step in Example 6.1 did not have multiple observations per BY group while this example does. You would not produce the same output data set if you used the UPDATE statement in this example instead of the MERGE statement. The UPDATE statement is not designed to process multiple observations per BY group in the master data set.

Example 6.1 compares similar features of the MERGE and UPDATE statements in "Comparing the UPDATE and MERGE Statements" in the "A Closer Look" section. The section also discusses the differences in using the MERGE and UPDATE statements when processing data sets that have multiple observations in a BY group. For more information, see "Using UPDATE and MERGE When There Are Multiple Observations in a BY Group" in the "A Closer Look" section for Example 6.1.

Illustrating How SAS Merges Data Sets and Conditionally Overwrites Common Variables

This section shows how the DATA step in the main example match-merges data sets BOOKLIST and BOOKCHANGES to update the information in BOOKLIST.

During the compilation phase, SAS reads the descriptor portions of the input data sets and creates the Program Data Vector (PDV). Also, SAS determines the BY groups in each input data set for the variables that are listed in the BY statement. The three unique BOOKID values define three BY groups.

The PDV in this example has allocated space for variables HOLDDATE and HOLDEDITOR, which are the renamed versions of DUEDATE and EDITOR from data set BOOKLIST. It also has allocated space for variables DUEDATE and EDITOR from data set BOOKCHANGES.

SAS looks at the first BY group in each input data set to determine whether the BY values match. In this example, the first BY group, which is BOOKID= "NF0586", is found only in BOOKLIST. Its row is highlighted in the following display. The second and third BY groups, BOOKID= "NF2413" and BOOKID= "NF8141", are found in both input data sets.



The BY group for BOOKID= "NF0586" in BOOKLIST does not have a match in BOOKCHANGES. Therefore, no values are moved from BOOKCHANGES into variables DUEDATE and EDITOR and their values are missing in the PDV.

Program Data Vector (PDV)
INUPD

BOOKID

BOOKTITLE

CHAPTER

AUTHOR

HOLDDATE

HOLDEDITOR

DUEDATE

EDITOR

0

NF0586

Current
Narratives

1

Smith,
Rebecca

09/04/2009

Williams,
Stephanie

.

 

The value of INUPD is 0 (not true) so the IF-THEN block executes and moves the values from HOLDDATE and HOLDEDITOR into variables DUEDATE and EDITOR, respectively. This process continues for all 10 observations in the first BY group. Variables HOLDDATE and HOLDEDITOR are not saved in the output data set NEWBOOKLIST. Variable INUPD is also not saved in NEWBOOKLIST because it is a temporary variable defined with the IN= data set option in the MERGE statement.

The observations for this first BY group remain the same.

The DATA step now processes the second BY group, BOOKID= "NF2413". This BY group is found in both input data sets. The rows that SAS loads into the PDV at the beginning of this second BY group are highlighted in the following display.



The DATA step moves the values of DUEDATE and EDITOR found in BOOKCHANGES into the PDV.

Program Data Vector (PDV)
INUPD

BOOKID

BOOKTITLE

CHAPTER

AUTHOR

HOLDDATE

HOLDEDITOR

DUEDATE

EDITOR

1

NF2413

Political
Comments

1

Jones, Robin

07/31/2009

White,
Michelle

09/18/2009

Zhang,
Amy


The IF-THEN block does not execute because the value of INUPD is 1(true). Because variables HOLDDATE and HOLDEDITOR are dropped from the output data set, the original values for DUEDATE and EDITOR for this BY group are not saved in the output data set NEWBOOKLIST. Instead the values for DUEDATE and EDITOR for BOOKID= "NF2413" in the output data set are the values for DUEDATE and EDITOR from the transaction data set.

The DATA step moves to the second observation in the BOOKID= "NF2413" BY group in data set BOOKCHANGES. Because there is only one observation in the BOOKID= "NF2413" BY group in BOOKCHANGES, the values for DUEDATE and EDITOR that were contributed from that data set remain in the PDV.

Program Data Vector (PDV)
INUPD

BOOKID

BOOKTITLE

CHAPTER

AUTHOR

HOLDDATE

HOLDEDITOR

DUEDATE

EDITOR

1

NF2413

Political
Comments

2

Sanchez,
Brandon

08/07/2009

White,
Michelle

09/18/2009

Zhang,
Amy


The same process is repeated for the remaining seven observations in BY group BOOKID= "NF2413". All nine observations found in BOOKLIST now have the same new values for DUEDATE and EDITOR in NEWBOOKLIST.

Lastly, the DATA step processes the third BY group for BOOKID. It saves the new information found for BOOKID= "NF8141" for DUEDATE and EDITOR in output data set NEWBOOKLIST.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.145.54.7