Update a master data set with values from another data set. The second data set contains missing values. Control which missing values in the second data set can overwrite existing values in the master data set.
Examples 6.1, 6.2, 6.3, 6.4, and 6.5 illustrate ways to update common variables when combining data sets. The five examples apply the methods to the same master data set. Different transaction data sets are used in Examples 6.1, 6.2, and 6.3. Examples 6.4 and 6.5 use the same transaction data set and produce different resulting data sets.
Featured Step | DATA step |
Featured Step Options and Statements | MISSING statement
UPDATE and BY statements Special missing values |
Related Technique | PROC SQL, full join, CASE expression, COALESCE function, special missing values |
Master data set BOOKLIST contains information about the chapters of three books. Transaction data set BOOKMISS contains updates of information that should be applied to BOOKLIST. Some of the values for DUEDATE are missing, and some values are coded with the special missing value, "_" (underscore).
BOOKLIST Obs bookid booktitle chapter author duedate editor 1 NF0586 Current Narratives 1 Smith, Rebecca 09/04/2010 Williams, Stephanie 2 NF0586 Current Narratives 2 Williams, Susan 09/04/2010 Williams, Stephanie 3 NF0586 Current Narratives 3 Torres, Christopher 09/11/2010 Williams, Stephanie 4 NF0586 Current Narratives 4 Torres, Christopher 09/11/2010 Williams, Stephanie 5 NF0586 Current Narratives 5 Powell, George 09/11/2010 Williams, Stephanie 6 NF0586 Current Narratives 6 Thompson, Tonya 09/11/2010 Williams, Stephanie 7 NF0586 Current Narratives 7 Allen, Linda 09/11/2010 Williams, Stephanie 8 NF0586 Current Narratives 8 Johnson, Tammy 09/11/2010 Williams, Stephanie 9 NF0586 Current Narratives 9 Kelly, Melissa 09/11/2010 Williams, Stephanie 10 NF0586 Current Narratives 10 Thompson, Tonya 09/11/2010 Williams, Stephanie 11 NF2413 Political Comments 1 Jones, Robin 07/31/2010 White, Michelle 12 NF2413 Political Comments 2 Sanchez, Brandon 08/07/2010 White, Michelle 13 NF2413 Political Comments 3 Jones, Robin 07/31/2010 White, Michelle 14 NF2413 Political Comments 4 Perez, Joshua 07/31/2010 White, Michelle 15 NF2413 Political Comments 5 Williams, Nicholas 07/31/2010 White, Michelle 16 NF2413 Political Comments 6 Patterson, Mary 08/14/2010 White, Michelle 17 NF2413 Political Comments 7 Torres, Christopher 08/07/2010 White, Michelle 18 NF2413 Political Comments 8 Robinson, Bonnie 08/07/2010 White, Michelle 19 NF2413 Political Comments 9 Brown, Patricia 08/07/2010 White, Michelle 20 NF8141 Favorite Essays 1 Clark, Todd 10/02/2010 Patterson, Daniel 21 NF8141 Favorite Essays 2 Barnes, David 10/02/2010 Patterson, Daniel 22 NF8141 Favorite Essays 3 Young, Richard 09/18/2010 Patterson, Daniel 23 NF8141 Favorite Essays 4 Barnes, David 10/02/2010 Patterson, Daniel 24 NF8141 Favorite Essays 5 Anderson, Daniel 09/18/2010 Patterson, Daniel 25 NF8141 Favorite Essays 6 Anderson, Daniel 09/18/2010 Patterson, Daniel 26 NF8141 Favorite Essays 7 Morris, Laura 09/18/2010 Patterson, Daniel 27 NF8141 Favorite Essays 8 Powell, George 09/18/2010 Patterson, Daniel
Both data sets have unique occurrences of the combinations of values of BOOKID and CHAPTER.
BOOKMISS Obs bookid chapter author duedate 1 NF0586 3 _ 2 NF0586 4 _ 3 NF2413 4 Loren, Marie . 4 NF2413 7 _ 5 NF8141 5 10/03/2010 6 NF8141 6 10/03/2010
Output 6.2 REVISEDMISS Data SetExample 6.2 REVISEDMISS Data Set Created with DATA Step Obs bookid booktitle chapter author duedate editor 1 NF0586 Current Narratives 1 Smith, Rebecca 09/04/2010 Williams, Stephanie 2 NF0586 Current Narratives 2 Williams, Susan 09/04/2010 Williams, Stephanie 3 NF0586 Current Narratives 3 Torres, Christopher . Williams, Stephanie 4 NF0586 Current Narratives 4 Torres, Christopher . Williams, Stephanie 5 NF0586 Current Narratives 5 Powell, George 09/11/2010 Williams, Stephanie 6 NF0586 Current Narratives 6 Thompson, Tonya 09/11/2010 Williams, Stephanie 7 NF0586 Current Narratives 7 Allen, Linda 09/11/2010 Williams, Stephanie 8 NF0586 Current Narratives 8 Johnson, Tammy 09/11/2010 Williams, Stephanie 9 NF0586 Current Narratives 9 Kelly, Melissa 09/11/2010 Williams, Stephanie 10 NF0586 Current Narratives 10 Thompson, Tonya 09/11/2010 Williams, Stephanie 11 NF2413 Political Comments 1 Jones, Robin 07/31/2010 White, Michelle 12 NF2413 Political Comments 2 Sanchez, Brandon 08/07/2010 White, Michelle 13 NF2413 Political Comments 3 Jones, Robin 07/31/2010 White, Michelle 14 NF2413 Political Comments 4 Loren, Marie 07/31/2010 White, Michelle 15 NF2413 Political Comments 5 Williams, Nicholas 07/31/2010 White, Michelle 16 NF2413 Political Comments 6 Patterson, Mary 08/14/2010 White, Michelle 17 NF2413 Political Comments 7 Torres, Christopher . White, Michelle 18 NF2413 Political Comments 8 Robinson, Bonnie 08/07/2010 White, Michelle 19 NF2413 Political Comments 9 Brown, Patricia 08/07/2010 White, Michelle 20 NF8141 Favorite Essays 1 Clark, Todd 10/02/2010 Patterson, Daniel 21 NF8141 Favorite Essays 2 Barnes, David 10/02/2010 Patterson, Daniel 22 NF8141 Favorite Essays 3 Young, Richard 09/18/2010 Patterson, Daniel 23 NF8141 Favorite Essays 4 Barnes, David 10/02/2010 Patterson, Daniel 24 NF8141 Favorite Essays 5 Anderson, Daniel 10/03/2010 Patterson, Daniel 25 NF8141 Favorite Essays 6 Anderson, Daniel 10/03/2010 Patterson, Daniel 26 NF8141 Favorite Essays 7 Morris, Laura 09/18/2010 Patterson, Daniel 27 NF8141 Favorite Essays 8 Powell, George 09/18/2010 Patterson, Daniel |
This example shows you how to allow some missing values to replace existing values when updating a master data set. The UPDATE and BY statements update the information in the master data set for observations that are matched by the values of the BY variables. By default, the UPDATE statement does not replace existing values with missing values. Example 6.1 discussed in the "A Closer Look" section the use of UPDATE statement option UPDATEMODE= to control this replacement action.
This example does not use the UPDATEMODE= option to selectively replace existing values in the master data set with missing values. Instead it maintains the default value of MISSINGCHECK for option UPDATEMODE=, and it uses special missing values and regular missing values to replace some existing values with missing values.
Data set BOOKLIST is the master data set, and it tracks information about the chapters of three books.
The transaction data set BOOKMISS contains values for DUEDATE that have regular missing values and values that have the underscore character (_) assigned as a special missing value. Special missing values apply to numeric variables. The design of the UPDATE statement with UPDATEMODE=MISSINGCHECK in effect causes an existing value for a variable in the master data set to be overwritten with a regular missing value. This occurs if a special missing value has been assigned to the variable in its matched observation in the transaction data set. In this example, a regular missing value in the transaction data set still does not overwrite an existing value in the master data set.
The DATA step updates two variables, AUTHOR and DUEDATE, in data set BOOKLIST with information from BOOKMISS. Observations are matched by two BY variables, BOOKID and CHAPTER.
Missing values for DUEDATE are present in the first four observations of transaction data set BOOKMISS. Special missing values are found in observations 1, 2, and 4. Observations 5 and 6 have regular missing values for variable AUTHOR and new values for variable DUEDATE. Observation 3 has a new value for variable AUTHOR and a regular missing value for DUEDATE.
Assume both data sets were sorted or indexed by BOOKID and CHAPTER prior to the DATA step.
The DATA step in this example is identical to the one in Example 6.1. The difference between the two examples is the assignment of the special missing value prior to the DATA step.
Precede the DATA step that creates BOOKMISS with the MISSING statement. Assign the underscore (_) character as a special missing value. Create data set BOOKMISS. Note the underscores in the DUEDATE fields in rows 1, 2, and 4. Note the regular missing value for DUEDATE in row 3.
Create data set REVISEDMISS. Update data set BOOKLIST with information in BOOKMISS. Place the master data set first in the UPDATE statement. Specify how to match observations in the two data sets.
missing _; data bookmiss; input bookid $ 1-6 chapter author $ 11-30 duedate : mmddyy10.; format duedate mmddyy10.; datalines; NF0586 3 _ NF0586 4 _ NF2413 4 Loren, Marie . NF2413 7 _ NF8141 5 10/03/2010 NF8141 6 10/03/2010 ;;;; data revisedmiss; update booklist bookmiss; by bookid chapter; run;
The PROC SQL step that follows creates a table equivalent to the REVISEDMISS data set that was created by the DATA step in the main example. It uses CASE expressions and the COALESCE function to determine how to update the BOOKLIST table. The COALESCE function returns the first nonmissing value in the arguments that are supplied to it.
The two input tables are combined with a full join. All matching and nonmatching rows in both tables are present in REVISEDMISS along with the updates to specific rows and columns as specified in BOOKMISS.
As in the preceding DATA step, the following PROC SQL step distinguishes between regular and special missing values so that some missing values can replace existing values in the master data set when updating the master data set. The code is written so that a column with a special missing value in the transaction data set is assigned a regular missing value in the output table. A regular missing value in a column in the transaction data set does not replace an existing value in the master data set.
The CASE expression tests when the value for DUEDATE in the transaction table has the special missing value of underscore. When testing a numeric variable, the value is written as the regular missing value of period followed by the special missing value. When a value for DUEDATE is the special missing value of underscore, PROC SQL assigns a regular missing value to the value of DUEDATE in the output table. Otherwise, PROC SQL applies the COALESCE function to the two values of DUEDATE from the two input tables BOOKMISS and BOOKLIST in that order. When the value of DUEDATE in BOOKMISS is a regular missing value, the COALESCE function returns the value for DUEDATE from BOOKLIST if it is nonmissing. If both are missing, the result is missing.
Create table REVISEDMISS. Apply the COALESCE function to the two matching columns so that if there are nonmatches in the transaction table, the nonmissing unmatched value is copied to the output table. If the value for DUEDATE in the transaction table is the special missing value of underscore (_), assign a regular missing value to DUEDATE. Otherwise, assign it the first nonmissing value of the two arguments. End the CASE expression. Format the values of DUEDATE in the output table.
Combine the two tables with a full join so that all matching and nonmatching rows are present in the output table. Specify an alias for each of the two tables. Specify the columns that match the two tables.
proc sql; create table revisedmiss as select coalesce(m.bookid,b.bookid) as bookid, booktitle, coalesce(m.chapter,b.chapter) as chapter, coalesce(m.author,b.author) as author, case when m.duedate=._ then . else coalesce(m.duedate,b.duedate) end as duedate format=mmddyy10., editor from booklist b full join bookmiss m on b.bookid=m.bookid and b.chapter=m.chapter; quit;
18.119.111.70