Understanding the Methods for Combining SAS Data Sets

Generally SAS data sets are combined either vertically or horizontally.

  • When combined vertically, you concatenate the data sets one after another, or you interleave observations from the data sets in order by one or more variables. Observations are not combined side-by-side horizontally nor are they overlaid.

  • When combined horizontally, you usually match data sets by the values of key variables in common or by programmatically aligning observations when they do not have variables in common. The observations in the data sets can be aligned side-by-side and data in one data set can be overlaid by data from another.

You can use these methods to combine SAS data sets vertically:

  • concatenating

  • interleaving

You can use these methods to combine SAS data sets horizontally:

  • one-to-one reading

  • one-to-one merging

  • match-merging

  • updating

Figures 1.5 through 1.11 show basic illustrations of these methods.

The DATA step and PROC SQL can combine SAS data sets. The methods are demonstrated in the following figures primarily with basic DATA steps. A few include PROC SQL code as well.

Although not described in this section, the APPEND procedure and the APPEND statement in the DATASETS procedure can also concatenate SAS data sets. For more information, see examples throughout this book and SAS documentation.

Figure 1.5 shows vertically combining two data sets one after the other with the DATA step and with PROC SQL.

Figure 1.5. Vertical: Concatenating SAS Data Sets

Concatenating appends the observations from one data set to another data set.

The DATA step reads ONE sequentially until all observations have been processed. Then it reads TWO sequentially until all its observations have been processed.

DATA Step

data all;
  set one two;
run;

The OUTER UNION CORR set operator in PROC SQL concatenates the two tables.

PROC SQL

proc sql;
  create table all as
    select * from one
      outer union corr
    select * from two;
quit;

Figure 1.6 presents an example of vertically combining two data sets by interleaving the values of one variable. It shows how to do this with the DATA step and with PROC SQL.

Figure 1.6. Vertical: Interleaving SAS Data Sets

Interleaving intersperses observations from two or more data sets based on values of one or more common variables.

Assume that data sets ONE and TWO were sorted by YEAR prior to the DATA step.

DATA Step

data all;
  set one two;
  by year;
run;

The ORDER BY clause arranges the observations in the table ALL in the PROC SQL step. It is not necessary to sort the table before interleaving the rows.

PROC SQL

proc sql;
  create table all as
    select * from one
      outer union corr
    select * from two
    order by year;
quit;

The DATA step in Figure 1.7 combines two data sets horizontally by doing a one-to-one reading of two data sets.

Figure 1.7. Horizontal: One-to-One Reading

One-to-one reading combines observations from two or more data sets by creating observations that contain all of the variables from each contributing data set. Observations are combined based on their relative position in each data set. That is, the first observation in one data set is aligned with the first observation in the other data set, and so on.

The DATA step stops after it has read the last observation from the smallest data set. The fifth observation in Y is not present in XY.

DATA Step

data xy
  set x;
  set y;
run;

The DATA step in Figure 1.8 combines two data sets horizontally by performing a one-to-one merge of two data sets.

Figure 1.8. Horizontal: One-to-One Merging

One-to-one merging is the same as a one-to-one reading, except that all observations from the input data sets are read. Compared to Figure 1.7, the fifth observation from data set Y is now present in XY.

The DATA step uses the MERGE statement instead of the multiple SET statements as in Figure 1.7.

DATA Step

data xy;
  merge x y;
run;

Figures 1.7 and 1.8 show only DATA steps to perform one-to-one reading and one-to-one merging and do not show equivalent PROC SQL code. Conceptually, when PROC SQL joins two tables, every row in the first table is combined with every row in the second table. The result is called a Cartesian product. This kind of join can produce a very large table or report when joining tables in this way. More likely you will want to include ON clauses, WHERE clauses, and set operators in your SELECT statements that subset the Cartesian product.

Figures 1.7 and 1.8 do not contain any common variables that can be used to subset the results. Therefore, no equivalent PROC SQL code is included. Figure 1.9 presents a PROC SQL step that produces the Cartesian product of the two data sets used in both Figures 1.7 and 1.8.

Figure 1.9. Cartesian Product

A Cartesian product is produced when tables are combined with PROC SQL and no conditions are specified to subset the results.

PROC SQL Step

proc sql;
  create table xy as
    select * from x,y;
quit;

Figure 1.10 presents an example of combining two data sets horizontally by match-merging them by the values of a common variable. It shows how to do this with the DATA step and with PROC SQL.

Figure 1.10. Horizontal: Match-Merging

Match-merging combines observations from two or more data sets into a single observation in a new data set based on the values of one or more common variables.

Assume that data sets X and Y were sorted by DAY prior to the DATA step.

DATA Step

data xy;
  merge x y;
  by day;
run;

Tables X and Y do not have to be sorted before submitting the PROC SQL step. The COALESCE function saves the first nonmissing value of DAY from each of the two tables and the resulting column is named DAY.

PROC SQL Step

proc sql;
  create table xy as
    select coalesce(x.day,y.day) as day, xcoord, ycoord
      from x full join y
        on x.day=y.day;

The DATA step in Figure 1.11 combines two data sets horizontally by updating one data set with information in another.

Figure 1.11. Horizontal: Updating

Updating uses information from observations in a transaction data set to delete, add, or alter information in observations in a master data set.

Assume that MASTER and TRANSACTIONS were sorted by DT prior to the DATA step. Updating a data set with the DATA step requires that the data be sorted or indexed by the values of the common variable. You can update a master data set with the UPDATE or the MODIFY statements.

Note that by default UPDATE and MODIFY do not replace nonmissing values in a master data set with missing values in a transaction data set.

DATA Step

data master;
  update master transactions;
  by dt;
run;

The PROC SQL step performs a full join of the two tables. The COALESCE function saves the first nonmissing value of the columns that are supplied to it. Note that the order of the columns that are supplied to the COALESCE function represents the column from TRANSACTIONS first so that a nonmissing value updates the value in MASTER.

PROC SQL

proc sql;
  create table master as
    select coalesce(transactions.dt, master.dt) as date
                        format=mmddyy10.,
           coalesce(transactions.manager, master.manager) as
                       manager,
           coalesce(transactions.sales, master.sales) as sales
      from master full join transactions
        on transactions.dt=master.dt;
quit;

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.222.116.146