Example 3.4 Combining Observations by the Formatted Value of a Variable

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Example 3.4 Combining Observations by the Formatted Value of a Variable

Goal

Perform a fuzzy merge by merging observations from two data sets based on the formatted version of the matching variable rather than the stored value.

Example Features

Featured Step	DATA step
Featured Step Options and Statements	BY statement with GROUPFORMAT option
Related Technique	PROC SQL, left join, ON clause with functions

Input Data Sets

Data set PRIZES contains a list of four prizes and the last day of the month in which they are awarded. Data set PARTICIPANTS contains a list of participants who will receive a prize dependent on a completion date.

                PRIZES

 Obs   completed  prize
  1   01/31/2009  Restaurant Certificate
  2   02/28/2009  Audio Player
  3   03/31/2009  Theater Tickets
  4   04/30/2009  Baseball Tickets

                 PARTICIPANTS

 Obs  name                  completed
  1   Moore, Kathryn       12/27/2008
  2   Jackson, Barbara     01/15/2009
  3   Brown, Shannon       03/23/2009
  4   Williams, Debra      03/26/2009
  5   Harris, Joseph       02/01/2009
  6   Brown, Patricia      01/08/2009
  7   Johnson, Christopher 02/17/2009
  8   Rodriguez, Shawn     03/31/2009
  9   Gonzalez, Patrick    01/14/2009
 10   Wright, Nicholas     03/02/2009
 11   Jones, Robert        02/28/2009
 12   Miller, Christopher  03/25/2009

Resulting Data Set

Output 3.4 PARTIC_PRIZES Data Set

   Example 3.4 PARTIC_PRIZES Data Set Created with DATA Step

Obs   completed   prize                    name

  1    DEC2008    (unknown)                Moore, Kathryn
  2    JAN2009    Restaurant Certificate   Brown, Patricia
  3    JAN2009    Restaurant Certificate   Gonzalez, Patrick
  4    JAN2009    Restaurant Certificate   Jackson, Barbara
  5    FEB2009    Audio Player             Harris, Joseph
  6    FEB2009    Audio Player             Johnson, Christopher
  7    FEB2009    Audio Player             Jones, Robert
  8    MAR2009    Theater Tickets          Wright, Nicholas
  9    MAR2009    Theater Tickets          Brown, Shannon
 10    MAR2009    Theater Tickets          Miller, Christopher
 11    MAR2009    Theater Tickets          Williams, Debra
 12    MAR2009    Theater Tickets          Rodriguez, Shawn

Example Overview

This example demonstrates how you can match-merge two data sets based on the formatted values of a variable instead of the exact values of the variable.

The DATA step combines two data sets, PRIZES and PARTICIPANTS, by the formatted values of one variable, COMPLETED. The goal is to find the prize to be awarded to a participant based on the month and year in which the participant finished a project. Variable COMPLETED in data set PRIZES is the last day of the month in which a specific prize is awarded. The actual date of completion for a participant is stored in variable COMPLETED in data set PARTICIPANTS.

Only two of the values of COMPLETED in PARTICIPANTS are exactly equal to values of COMPLETED in PRIZES. The values need to be grouped to make matches between the two data sets. A FORMAT statement for COMPLETED formats its values with the MONYY format, which groups the date values by month and year.

The BY statement in the DATA step includes the GROUPFORMAT option. This option causes the matching between the two data sets to be done by the formatted values of the BY variable COMPLETED rather than its internal value. If you omit the GROUPFORMAT option, only two observations in PARTICIPANTS match to an observation in PRIZES. These two observations are the ones with dates of completion on the last day of a month: Robert Jones on February 28, 2009, and Shawn Rodriguez on March 31, 2009.

Note that GROUPFORMAT can be specified anywhere in the BY statement and will apply to all BY variables.

The DATA step creates two temporary variables, INZ and INP, with the IN= data set options on data sets PRIZES and PARTICIPANTS, respectively. IF-THEN statements test the values of INZ and INP to determine how to handle observations found in only one data set.

Program

Sort both data sets by the matching variable.

Create data set PARTIC_PRIZES. Combine the data sets. Add the IN= data set option to each input data set so that the data set origin of the current observation can be determined. Specify the matching variable. Specify with GROUPFORMAT that matches be made by the formatted values of COMPLETED and not its internal values. Group the values of COMPLETED by month and year. Keep all the observations that are found in PARTICIPANTS whether or not they have a match in PRIZES. Assign a value to variable PRIZE when a particular's completion date does not have a match in PRIZES.

proc sort data=prizes;
  by completed;
run;
proc sort data=participants;
  by completed;
run;
data partic_prizes;
  merge prizes(in=inz) participants(in=inp);


  by completed groupformat;



  format completed monyy7.;

  if inp;

  if not inz then prize='(unknown)';
                            
run;

Related Technique

The related technique combines the two tables, PRIZES and PARTICIPANTS, with a PROC SQL left join. It creates a table equivalent to the data set in the main example.

The ON clause matches the rows by the formatted values of COMPLETED. As in the DATA step, the PROC SQL step uses format MONYY to group the values.

Create table PARTIC_PRIZES. Specify the columns to include in PARTIC_PRIZES. Specify the table on the left side of the join as the one from which to return the values of COMPLETED.

Return the first nonmissing value in the pair of arguments. For rows in PARTICIPANTS without a match in PRIZES, assign the value "(unknown)" to column PRIZE. Keep all rows in PARTICIPANTS by performing a left join of the two tables.

Match the rows by the formatted values of COMPLETED. Order the rows in the output table by COMPLETED.

proc sql;
  create table partic_prizes as
    select p.completed format=monyy7., name,


      coalesce(prize,'(unknown)') as Prize



    from participants p
       left join
     prizes z
       on put(p.completed,monyy7.)=
                             put(z.completed,monyy7.)
     order by p.completed;

quit;

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Example 3.4 Combining Observations by the Formatted Value of a Variable

Create new playlist

Sign In

Sign Up

Example 3.4 Combining Observations by the Formatted Value of a Variable

Goal

Example Features

Input Data Sets

Resulting Data Set

Output 3.4 PARTIC_PRIZES Data Set

Example Overview

Program

Related Technique

Table of Contents for
Example 3.4 Combining Observations by the Formatted Value of a Variable