Combining SAS Data Sets: Methods (1/6)

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Testing Your Program

As a final step in preparing your data sets, you should test your program. Create small

temporary SAS data sets that contain a sample of observations that test all of your

program's logic. If your logic is faulty and you get unexpected output, you can use the

DATA step debugger to debug your program. For complete information about the DATA

Step Debugger, see SAS Data Set Options: Reference.

Combining SAS Data Sets: Methods

Concatenating

Definition

Concatenating data sets is the combining of two or more data sets, one after the other,

into a single data set. The number of observations in the new data set is the sum of the

number of observations in the original data sets. The order of observations is sequential.

All observations from the first data set are followed by all observations from the second

data set, and so on.

In the simplest case, all input data sets contain the same variables. If the input data sets

contain different variables, observations from one data set have missing values for

variables defined only in other data sets. In either case, the variables in the new data set

are the same as the variables in the old data sets.

Syntax

Use this form of the SET statement to concatenate data sets:

SET data-set(s);

where

data-set

specifies any valid SAS data set name.

For a complete description of valid SAS data set names, see the SET statement in SAS

Statements: Reference.

DATA Step Processing during Concatenation

Compilation phase

SAS reads the descriptor information of each data set that is named in the SET

statement and then creates a program data vector that contains all the variables from

all data sets as well as variables created by the DATA step.

Execution — Step 1

SAS reads the first observation from the first data set into the program data vector. It

processes the first observation and executes other statements in the DATA step. It

then writes the contents of the program data vector to the new data set.

The SET statement does not reset the values in the program data vector to missing,

except for variables whose value is calculated or assigned during the DATA step.

Variables that are created by the DATA step are set to missing at the beginning of

each iteration of the DATA step. Variables that are read from a data set are not.

Execution — Step 2

SAS continues to read one observation at a time from the first data set until it finds

an end-of-file indicator. The values of the variables in the program data vector are

478 Chapter 21 • Reading, Combining, and Modifying SAS Data Sets

then set to missing, and SAS begins reading observations from the second data set,

and so on, until it reads all observations from all data sets.

Example 1: Concatenation Using the DATA Step

In this example, each data set contains the variables Common and Number, and the

observations are arranged in the order of the values of Common. Generally, you

concatenate SAS data sets that have the same variables. In this case, each data set also

contains a unique variable to show the effects of combining data sets more clearly. The

following shows the Animal and the Plant input data sets in the library that is referenced

by the libref Example:

Animal Plant

OBS Common Animal Number OBS Common Plant Number

1 a Ant 5 1 g Grape 69

2 b Bird 2 h Hazelnut 55

3 c Cat 17 3 i Indigo .

4 d Dog 9 4 j Jicama 14

5 e Eagle 5 k Kale 5

6 f Frog 76 6 l Lentil 77

The following program uses a SET statement to concatenate the data sets and then prints

the results:

data concatenation;

set animal plant;

run;

proc print data=concatenation;

var Common Animal Plant Number;

title 'Data Set CONCATENATION';

run;

Combining SAS Data Sets: Methods 479

Output 21.1 Concatenated Data Sets (DATA Step)

The resulting data set CONCATENATION has 12 observations, which is the sum of the

observations from the combined data sets. The program data vector contains all variables

from all data sets. The values of variables found in one data set but not in another are set

to missing.

Example 2: Concatenation Using SQL

You can also use the SQL language to concatenate tables. In this example, SQL reads

each row in both tables and creates a new table named Combined. The following shows

the YEAR1 and YEAR2 input tables:

YEAR1 YEAR2

Date1 Date2

2009

2010 2010

2011 2011

2012 2012

2013

2014

The following SQL code creates and prints the table Combined.

proc sql;

title 'SQL Table Combined';

create table combined as

select * from year1

480 Chapter 21 • Reading, Combining, and Modifying SAS Data Sets

union all

select * from year2;

select * from combined;

quit;

Output 21.2 Concatenated Tables (SQL)

Appending Files

Instead of concatenating data sets or tables, you can append them and produce the same

results as concatenation. SAS concatenates data sets (DATA step) and tables (SQL) by

reading each row of data to create a new file. To avoid reading all the records, you can

append the second file to the first file by using the APPEND procedure:

proc append base=year1 data=year2;

run;

The YEAR1 file contains all rows from both tables.

Note: You cannot use PROC APPEND to add observations to a SAS data set in a

sequential library.

Efficiency

If no additional processing is necessary, using PROC APPEND or the APPEND

statement in PROC DATASETS is more efficient than using a DATA step to concatenate

data sets.

Interleaving

Definition

Interleaving uses a SET statement and a BY statement to combine multiple data sets into

one new data set. The number of observations in the new data set is the sum of the

number of observations from the original data sets. However, the observations in the

new data set are arranged by the values of the BY variable or variables and, within each

Combining SAS Data Sets: Methods 481

BY group, by the order of the data sets in which they occur. You can interleave data sets

either by using a BY variable or by using an index.

Syntax

Use this form of the SET statement to interleave data sets when you use a BY variable:

SET data-set(s);

BY variable(s);

where

data-set

specifies a one-level name, a two-level name, or one of the special SAS data set

names.

variable

specifies each variable by which the data set is sorted. These variables are referred to

as BY variables for the current DATA or PROC step.

Use this form of the SET statement to interleave data sets when you use an index:

SET data-set-1 . . . data-set-n KEY= index;

where

data-set

specifies a one-level name, a two-level name, or one of the special SAS data set

names.

index

provides nonsequential access to observations in a SAS data set, which are based on

the value of an index variable or key.

For a complete description of the SET statement, including SET with the KEY= option,

see the SET statement in SAS Statements: Reference.

Sort Requirements

Before you can interleave data sets, the observations must be sorted or grouped by the

same variable or variables that you use in the BY statement, or you must have an

appropriate index for the data sets.

DATA Step Processing during Interleaving

Compilation phase

• SAS reads the descriptor information of each data set that is named in the SET

statement and then creates a program data vector that contains all the variables

from all data sets as well as variables created by the DATA step.

• SAS creates the FIRST.variable and LAST.variable for each variable listed in the

BY statement.

Execution — Step 1

SAS compares the first observation from each data set that is named in the SET

statement to determine which BY group should appear first in the new data set. It

reads all observations from the first BY group from the selected data set. If this BY

group appears in more than one data set, it reads from the data sets in the order in

which they appear in the SET statement. The values of the variables in the program

data vector are set to missing each time SAS starts to read a new data set and when

the BY group changes.

482 Chapter 21 • Reading, Combining, and Modifying SAS Data Sets

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Combining SAS Data Sets: Methods (1/6)

Create new playlist

Sign In

Sign Up

Table of Contents for
Combining SAS Data Sets: Methods (1/6)