Perform a fuzzy merge by merging observations from two data sets based on the formatted version of the matching variable rather than the stored value.
Featured Step | DATA step |
Featured Step Options and Statements | BY statement with GROUPFORMAT option |
Related Technique | PROC SQL, left join, ON clause with functions |
Data set PRIZES contains a list of four prizes and the last day of the month in which they are awarded. Data set PARTICIPANTS contains a list of participants who will receive a prize dependent on a completion date.
PRIZES Obs completed prize 1 01/31/2009 Restaurant Certificate 2 02/28/2009 Audio Player 3 03/31/2009 Theater Tickets 4 04/30/2009 Baseball Tickets
PARTICIPANTS Obs name completed 1 Moore, Kathryn 12/27/2008 2 Jackson, Barbara 01/15/2009 3 Brown, Shannon 03/23/2009 4 Williams, Debra 03/26/2009 5 Harris, Joseph 02/01/2009 6 Brown, Patricia 01/08/2009 7 Johnson, Christopher 02/17/2009 8 Rodriguez, Shawn 03/31/2009 9 Gonzalez, Patrick 01/14/2009 10 Wright, Nicholas 03/02/2009 11 Jones, Robert 02/28/2009 12 Miller, Christopher 03/25/2009
Output 3.4 PARTIC_PRIZES Data SetExample 3.4 PARTIC_PRIZES Data Set Created with DATA Step Obs completed prize name 1 DEC2008 (unknown) Moore, Kathryn 2 JAN2009 Restaurant Certificate Brown, Patricia 3 JAN2009 Restaurant Certificate Gonzalez, Patrick 4 JAN2009 Restaurant Certificate Jackson, Barbara 5 FEB2009 Audio Player Harris, Joseph 6 FEB2009 Audio Player Johnson, Christopher 7 FEB2009 Audio Player Jones, Robert 8 MAR2009 Theater Tickets Wright, Nicholas 9 MAR2009 Theater Tickets Brown, Shannon 10 MAR2009 Theater Tickets Miller, Christopher 11 MAR2009 Theater Tickets Williams, Debra 12 MAR2009 Theater Tickets Rodriguez, Shawn |
This example demonstrates how you can match-merge two data sets based on the formatted values of a variable instead of the exact values of the variable.
The DATA step combines two data sets, PRIZES and PARTICIPANTS, by the formatted values of one variable, COMPLETED. The goal is to find the prize to be awarded to a participant based on the month and year in which the participant finished a project. Variable COMPLETED in data set PRIZES is the last day of the month in which a specific prize is awarded. The actual date of completion for a participant is stored in variable COMPLETED in data set PARTICIPANTS.
Only two of the values of COMPLETED in PARTICIPANTS are exactly equal to values of COMPLETED in PRIZES. The values need to be grouped to make matches between the two data sets. A FORMAT statement for COMPLETED formats its values with the MONYY format, which groups the date values by month and year.
The BY statement in the DATA step includes the GROUPFORMAT option. This option causes the matching between the two data sets to be done by the formatted values of the BY variable COMPLETED rather than its internal value. If you omit the GROUPFORMAT option, only two observations in PARTICIPANTS match to an observation in PRIZES. These two observations are the ones with dates of completion on the last day of a month: Robert Jones on February 28, 2009, and Shawn Rodriguez on March 31, 2009.
Note that GROUPFORMAT can be specified anywhere in the BY statement and will apply to all BY variables.
The DATA step creates two temporary variables, INZ and INP, with the IN= data set options on data sets PRIZES and PARTICIPANTS, respectively. IF-THEN statements test the values of INZ and INP to determine how to handle observations found in only one data set.
Sort both data sets by the matching variable.
Create data set PARTIC_PRIZES. Combine the data sets. Add the IN= data set option to each input data set so that the data set origin of the current observation can be determined. Specify the matching variable. Specify with GROUPFORMAT that matches be made by the formatted values of COMPLETED and not its internal values. Group the values of COMPLETED by month and year. Keep all the observations that are found in PARTICIPANTS whether or not they have a match in PRIZES. Assign a value to variable PRIZE when a particular's completion date does not have a match in PRIZES.
proc sort data=prizes; by completed; run; proc sort data=participants; by completed; run; data partic_prizes; merge prizes(in=inz) participants(in=inp); by completed groupformat; format completed monyy7.; if inp; if not inz then prize='(unknown)'; run;
The related technique combines the two tables, PRIZES and PARTICIPANTS, with a PROC SQL left join. It creates a table equivalent to the data set in the main example.
The ON clause matches the rows by the formatted values of COMPLETED. As in the DATA step, the PROC SQL step uses format MONYY to group the values.
Create table PARTIC_PRIZES. Specify the columns to include in PARTIC_PRIZES. Specify the table on the left side of the join as the one from which to return the values of COMPLETED.
Return the first nonmissing value in the pair of arguments. For rows in PARTICIPANTS without a match in PRIZES, assign the value "(unknown)" to column PRIZE. Keep all rows in PARTICIPANTS by performing a left join of the two tables.
Match the rows by the formatted values of COMPLETED. Order the rows in the output table by COMPLETED.
proc sql; create table partic_prizes as select p.completed format=monyy7., name, coalesce(prize,'(unknown)') as Prize from participants p left join prizes z on put(p.completed,monyy7.)= put(z.completed,monyy7.) order by p.completed; quit;
3.135.198.174