Example 8.3 Separating Completely Duplicate Observations from Unique Observations

Goal

Identify observations that are a complete copy of another observation. Write the unique observations to one data set and write the duplicate observations to another data set.

Example Features

Featured StepPROC SORT
Featured Step Options and StatementsDUPOUT=, NODUPRECS, and OUT= options
Related TechniqueDATA step, BY statement and LAG function

Input Data Set

Data set PMTEMPS records the hourly afternoon weather data for three locations. Note that the 17:00 Rooftop measurement is recorded identically three times. The 18:00 Rooftop, 16:00 Beach, and 16:00 Downtown measurements are recorded twice. The 14:00 and 15:00 Downtown measurements are identical except for the time of the measurement.

                            PMTEMPS

   Obs    location    temptime    tempf    dewpoint    relhum
     1    Rooftop      13:00        73        50         44
     2    Rooftop      14:00        72        52         49
     3    Rooftop      15:00        72        53         51
     4    Rooftop      16:00        74        51         45
     5    Rooftop      17:00        75        51         43
     6    Rooftop      17:00        75        51         43
     7    Rooftop      17:00        75        51         43
     8    Rooftop      18:00        74        50         43
     9    Rooftop      18:00        74        50         43
    10    Beach        13:00        67        56         68
    11    Beach        14:00        71        58         63
    12    Beach        15:00        74        59         60
    13    Beach        16:00        76        58         54
    14    Beach        16:00        76        58         54
    15    Beach        17:00        79        58         48
    16    Beach        18:00        81        59         47
    17    Downtown     13:00        68        62         81
    18    Downtown     14:00        69        62         78
    19    Downtown     15:00        69        62         78
    20    Downtown     16:00        71        63         76
    21    Downtown     16:00        71        63         76
    22    Downtown     17:00        70        64         81
    23    Downtown     18:00        73        64         73

Resulting Data Sets

Output 8.3a UNIQUETEMPS Data Set

    Example 8.3 UNIQUETEMPS Data Set Created with PROC SORT

   Obs    location    temptime    tempf    dewpoint    relhum
     1    Beach        13:00        67        56         68
     2    Beach        14:00        71        58         63
     3    Beach        15:00        74        59         60
     4    Beach        16:00        76        58         54
     5    Beach        17:00        79        58         48
     6    Beach        18:00        81        59         47
     7    Downtown     13:00        68        62         81
     8    Downtown     14:00        69        62         78
     9    Downtown     15:00        69        62         78
    10    Downtown     16:00        71        63         76
    11    Downtown     17:00        70        64         81
    12    Downtown     18:00        73        64         73
    13    Rooftop      13:00        73        50         44
    14    Rooftop      14:00        72        52         49
    15    Rooftop      15:00        72        53         51
    16    Rooftop      16:00        74        51         45
    17    Rooftop      17:00        75        51         43
    18    Rooftop      18:00        74        50         43


Output 8.3b DUPTEMPS Data Set

      Example 8.3 DUPTEMPS Data Set Created with PROC SORT

   Obs    location    temptime    tempf    dewpoint    relhum
    1     Beach        16:00        76        58         54
    2     Downtown     16:00        71        63         76
    3     Rooftop      17:00        75        51         43
    4     Rooftop      17:00        75        51         43
    5     Rooftop      18:00        74        50         43


Example Overview

This example demonstrates how to separate unique observations from duplicate observations. With the DUPOUT=, NODUPRECS, and OUT= options, PROC SORT can write unique observations to one data set and duplicate observations to another. The NODUPRECS option finds multiple copies of an observation by comparing an observation to the previous observation. The first observation in a group of identical observations is written to the OUT= data set. Subsequent copies are written to the DUPOUT= data set.

The identification of duplicate observations does depend on how you specify the BY statement. Because PROC SORT identifies a duplicate observation by comparing an observation to the previous observation, it is possible that the input data set might be in such an order that the sort specified in the BY statement does not arrange the duplicates consecutively. In this situation, you could end up with a data set that still had complete duplicate observations.

Usually you sort a data set by its key identifier variables when using NODUPRECS. This will usually make it more likely that PROC SORT finds and removes all duplicate observations from the input data set.

Data set PMTEMPS has hourly weather observations for one afternoon for three locations. The goal is to clean the data set by removing duplicate observations. The key variables in PMTEMPS are LOCATION and TEMPTIME.

The EQUALS/NOEQUALS PROC SORT option can also affect the results of the NODUPRECS option. It specifies the order of the observations in the output data set and is not used in this example. Its default setting is EQUALS. For observations with identical BY-variable values, EQUALS maintains the relative order of the observations within the input data set in the output data set. The NOEQUALS option does not necessarily preserve this order in the output data set. For information about this option, see SAS documentation on PROC SORT.

Program

Sort data set PMTEMPS. Write unique observations in a BY group or observations that are first in a group of identical observations to the OUT= data set. Write the observations that are duplicate copies of the first in a group of identical observations to the DUPOUT= data set. Specify the BY variables. Write the BY statement so that identical observations will be arranged consecutively.

proc sort data=pmtemps
          noduprecs
          out=uniquetemps


          dupout=duptemps;


  by location temptime;


run;

Related Technique

The following DATA step produces two data sets equivalent to the data sets that were produced by PROC SORT in the preceding main example. It performs a similar process as PROC SORT with the NODUPRECS option by comparing an observation to the previous observation. As with PROC SORT, it is possible that your code might not find all duplicate observations because of the arrangement of the observations. Ensure that you sort your data set in a way that the duplicates would be arranged consecutively. Typically, you would order your data by its key identifiers. The key identifiers in data set PMTEMPS are LOCATION and TEMPTIME.

PROC SORT sorts data set PMTEMPS by LOCATION and TEMPTIME prior to the DATA step so that observations recorded at the same place and time are consecutive.

Assignment statements apply the LAGn function to each of the five variables. Because no numeric suffix is specified, the calls to the LAGn function retrieve values from the previous observation. An IF statement in the DATA step compares the variables that contain the lagged values to the variables in the observation that is currently being processed.

A queue of lagged values is initialized with the first call to a LAG function. Each iteration of the DATA step must execute the LAG function so that the queue is populated correctly.

If the values of all variables in an observation are equal to the values in the previous observation, the observation is output to DUPTEMPS. The first observation in a group of consecutive identical observations is always output to UNIQUETEMPS. This happens because when compared to the previous observation, the current observation's values are not identical because the previous observation is in a different group.

Note that the DATA step does not include a BY statement. The observations are arranged by LOCATION and TEMPTIME prior to the DATA step, but they are not processed in BY groups by the DATA step.

The PROC SORT step is easier to code than the multiple comparisons in the IF statement in the DATA step. However, you might need to use a DATA step because of additional processing you need to include.

Arrange the observations so that identical observations for a location and time of measurement are arranged consecutively. Create data sets UNIQUETEMPS and DUPTEMPS. Read each observation in PMTEMPS. Drop the variables that hold the lagged variables because they are needed only during execution of the DATA step. Retrieve variable values from the previous observation.

Test the values of all variable values in the observation that is currently being processed to their lagged values. Output an observation to DUPTEMPS if all comparisons show that the values are equal. Output an observation to UNIQUETEMPS if any of the comparison tests show that a value is unequal.

proc sort data=pmtemps;
  by location temptime;
run;

data uniquetemps duptemps;

  set pmtemps;
  drop lagloc lagtemptime lagtempf lagdewpoint
       lagrelhum;

  lagloc=lag(location);
  lagtemptime=lag(temptime);
  lagtempf=lag(tempf);
  lagdewpoint=lag(dewpoint);
  lagrelhum=lag(relhum);
  if lagloc=location and lagtemptime=temptime and
     lagtempf=tempf and lagdewpoint=dewpoint and
     lagrelhum=relhum then output duptemps;



  else output uniquetemps;


run;

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.15.22.163