Comparing SQL Joins and DATA Step Match-Merges

A Brief Overview

DATA step match-merges and PROC SQL joins can produce the same results. However, there are important differences between these two techniques. For example, a join does not require that you sort the data first; a DATA step match-merge requires that the data be sorted.
It is useful to compare the use of SQL joins and DATA step match-merges in the following situations:
  • when all of the values of the selected variable match
  • when only some of the values of the selected variable match

When All of the Values Match

When all of the values of the BY variable match, you can use a PROC SQL inner join to produce the same results as a DATA step match-merge.
Suppose you want to combine tables Five and Six.
Figure 3.4 Tables Certadv.Five and Certadv.Six
Tables Certadv.Five and Certadv.Six
These two tables have column X in common, and all values of X in each row match across the two tables. Both tables are already sorted by X.
The following DATA step match-merge (followed by a PROC PRINT step) and the PROC SQL inner join produce identical reports.
DATA Step Match-Merge
PROC SQL Inner Join
data merged;
   merge certadv.five certadv.six;
   by x;
run;
proc print data=merged noobs;
   title 'Table Merged';
run;
PROC PRINT Result: DATA Step Match-Merge
proc sql;
title 'Table Merged';
   select five.x, a, b
      from certadv.five, certadv.six
      where five.x = six.x
      order by x;
quit;
PROC SQL Query Result
Note: The DATA step match-merge creates a data set. By contrast, the PROC SQL inner join, as shown here, creates only a report as output. To make these two programs completely identical, the PROC SQL inner join could be rewritten to create a table.
Note: If the order of rows in the output does not matter, the ORDER BY clause can be removed from the PROC SQL join. Without the ORDER BY clause, this join is more efficient, because PROC SQL does not need to make a second pass through the data.

When Only Some of the Values Match

When only some of the values of the BY variable match, you can use a PROC SQL full outer join to produce the same result as a DATA step match-merge. Unlike the DATA step match-merge, however, a PROC SQL outer join does not overlay the two common columns by default. To overlay common columns, you must use the COALESCE function in the PROC SQL full outer join.
Note: The COALESCE function can also be used with inner join operators.
Consider what happens when you use a PROC SQL full outer join without the COALESCE function. Suppose you want to combine tables Three and Four. These two tables have the column X in common, but most of the values of X do not match across tables. Both tables are already sorted by X. The following DATA step match-merge (followed by a PROC PRINT step) and the PROC SQL outer join combine these tables, but do not generate the same output. The COALESCE function can also be used with inner join operators.
DATA Step Match-Merge
PROC SQL Full Outer Join
data merged;
   merge certadv.three certadv.four;
   by x;
run;

proc print data=merged noobs;
   title 'Table Merged';
run;
PROC PRINT Output of DATA Step Match-Merge
proc sql;
title 'Table Merged';
   select three.x, a, b
      from certadv.three
      full join
      certadv.four
      on three.x = four.x
      order by x;
quit;
PROC SQL Query Result
The DATA step match-merge automatically overlays the common column, X. The PROC SQL outer join selects the value of X from just one of the tables, table Three, so that no X values from table Four are included in the PROC SQL output. However, the PROC SQL outer join cannot overlay the columns by default. The values that vary across the two merged tables are highlighted.

COALESCE Function Syntax

When you add the COALESCE function to the SELECT clause of the PROC SQL outer join, the PROC SQL outer join can produce the same result as a DATA step match-merge.
Syntax, COALESCE function in a SELECT clause:
SELECT COALESCE (column-1<,...column-n>)
column-1 through column-n
are the names of two or more columns to be overlaid. The COALESCE function requires that all arguments have the same data type.
The COALESCE function overlays the specified columns:
  • It checks the value of each column in the order in which the columns are listed.
  • It returns the first value that is a SAS nonmissing value.
Note: If all returned values are missing, COALESCE returns a missing value.

Example: Using the COALESCE Function

When the COALESCE function is added to the preceding PROC SQL full outer join, the DATA step match-merge (with PROC PRINT step) and the PROC SQL full outer join combine rows in the same way. The two programs, the tables, and the output are shown below.
DATA Step Match-Merge
PROC SQL Full Outer Join
data merged;
    merge certadv.three certadv.four;
    by x;
run;
proc print data=merged noobs;
    title 'Table Merged';
run;
PROC PRINT Output of DATA Step Match-Merge
proc sql;
 title 'Table Merged';
    select coalesce(three.x, four.x)
       as X, a, b
       from certadv.three
       full join
       certadv.four
       on three.x = four.x;
PROC SQL Query Result

Understanding the Advantages of PROC SQL Joins

DATA step match-merges and PROC SQL joins both have advantages and disadvantages. Here are some of the advantages of PROC SQL joins.
Advantage
Example
PROC SQL joins do not require sorted or indexed tables.
proc sql;
    select table1.x, a, b
       from table1
            full join
            table2
            on table1.x = table2.x;
quit;
Note that table-1 is sorted by column X and table-2 is not.
PROC SQL joins do not require that the columns in join expressions have the same name.
proc sql;
    select table1.x, lastname,
            status
       from table1, table2
       where table1.id =
             table2.custnum;
PROC SQL joins can use comparison operators other than the equal sign (=).
proc sql;
   select a.itemnumber, cost,
          price
      from table1 as a,
           table2 as b
      where a.itemnumber = b.itemnumber
            and a.cost>b.price;
quit;
Note: Join performance can be substantially improved when the tables are indexed on the columns on which the tables are being joined.
Last updated: October 16, 2019
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
52.15.97.208