Example 3.3 Combining Observations When Variable Values Do Not Match Exactly

Goal

Perform a fuzzy merge by merging observations from two data sets based on data values that do not match exactly.

Example Features

Featured StepPROC SQL
Featured Step Options and StatementsSELECT statement with FULL JOIN and ON clause

Input Tables

Table BATCH_ONE contains times and sample numbers for seven samples. Table BATCH_TWO contains times and sample numbers for nine samples.


        BATCH_ONE

Obs            samptime    sample
 1   23NOV2009:09:01:00      100
 2   23NOV2009:10:03:00      101
 3   23NOV2009:10:58:00      102
 4   23NOV2009:11:59:00      103
 5   23NOV2009:13:00:00      104
 6   23NOV2009:14:02:00      105
 7   23NOV2009:16:00:00      106


           BATCH_TWO

Obs             samptime    sample
  1   23NOV2009:09:00:00      200
  2   23NOV2009:09:59:00      201
  3   23NOV2009:11:04:00      202
  4   23NOV2009:12:02:00      203
  5   23NOV2009:14:01:00      204
  6   23NOV2009:14:59:00      205
  7   23NOV2009:15:59:00      206
  8   23NOV2008:16:59:00      207
  9   23NOV2008:18:00:00      208

Resulting Table

Output 3.3 MATCH_APPROX Table

     Example 3.3 MATCH_APPROX Table Created with PROC SQL

  Obs           samptime1 sample1            samptime2 sample2

   1   23NOV2009:09:01:00   100     23NOV2009:09:00:00   200
   2   23NOV2009:10:03:00   101     23NOV2009:09:59:00   201
   3   23NOV2009:11:59:00   103     23NOV2009:12:02:00   203
   4   23NOV2009:14:02:00   105     23NOV2009:14:01:00   204
   5   23NOV2009:16:00:00   106     23NOV2009:15:59:00   206
   6                    .     .     23NOV2009:11:04:00   202
   7                    .     .     23NOV2009:14:59:00   205
   8                    .     .     23NOV2009:16:59:00   207
   9                    .     .     23NOV2009:18:00:00   208
  10   23NOV2009:10:58:00   102                      .     .
  11   23NOV2009:13:00:00   104                      .     .


Example Overview

This example joins two tables with PROC SQL by using a full join. It matches rows where the difference in the values of the matching column is within an acceptable range.

Table BATCH_ONE contains times and sample numbers for seven samples. Table BATCH_TWO contains times and sample numbers for nine samples. The goal is to match rows where the difference in time between the two values of SAMPTIME is less than five minutes.

The full join results in an internal table that matches every row in BATCH_ONE with every row in BATCH_TWO. The ON clause subsets the internal table by those rows where there is less than a five-minute time difference.

This join is a full outer join, which returns rows that satisfy the condition in the ON clause. In addition, a full outer join returns all of the rows from each table that do not match a row from the other table, based on the condition in the ON clause. For example, samples 202, 205, 207, and 208 in table BATCH_TWO do not have a match within five minutes in BATCH_ONE and are included in output table MATCH_APPROX. The columns that originate from BATCH_ONE for these rows have missing values. Similarly, samples 102 and 104 in BATCH_ONE do not have a match in BATCH_TWO. The columns that originate from BATCH_TWO for these rows have missing values.

Because the columns have the same names in both input tables, they are referenced with the table alias in the SELECT and ON clauses. The SELECT clause assigns the columns new names. This ensures that the output table has values from both sources and PROC SQL does not overlay any values.

Program

Create table MATCH_APPROX. Select all columns from both input tables. Rename the columns so that values are not overwritten when there is a match between BATCH_ONE and BATCH_TWO. Perform a full join. Assign an alias to each table.

Match rows where the absolute difference in SAMPTIME from the two tables is less than five minutes (300 seconds).

proc sql;
  create table match_approx as
    select one.samptime as samptime1,
           one.sample as sample1,
           two.samptime as samptime2,
           two.sample as sample2

      from batch_one one
             full join
           batch_two two
    on abs(one.samptime-two.samptime)<=300;


quit;

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.145.46.18