Create a subset of a SAS data set efficiently by selecting for processing only observations that meet a particular condition.
Featured Step | DATA step |
Featured Step Options and Statements | WHERE statement |
Related Technique | PROC SQL, WHERE clause |
A Closer Look | Comparing the WHERE Statement in the DATA Step and the Subsetting IF Statement |
Data set CELLPHONES contains the support records for one day. Only the first 25 of 10,000 observations are shown.
CELLPHONES Obs callhour calllength status reason 1 00 10 Resolved Text Msg Help 2 00 2 Transferred Other Feature Help 3 01 3 Resolved Start Service 4 01 2 Resolved Other 5 01 4 Resolved Voice Mail Help 6 01 4 Resolved Billing Change 7 01 3 Resolved Other Feature Help 8 02 3 Resolved Start Service 9 02 6 Transferred Other 10 02 3 Resolved Account Status 11 02 8 Resolved Account Status 12 02 5 Transferred Other 13 02 6 Transferred Ringtones Help 14 02 2 Transferred Other Feature Help 15 02 7 Resolved Account Status 16 02 1 Resolved Password Question 17 02 5 Resolved Other Feature Help 18 02 5 Resolved Address Change 19 02 4 Resolved Text Msg Help 20 03 7 Resolved Password Question 21 03 8 Resolved Start Service 22 03 12 Resolved Start Service 23 03 2 Resolved Start Service 24 03 5 Transferred Other 25 03 3 Resolved Other Feature Help . . .
Output 8.1 FEATUREQUESTIONS Data SetExample 8.1 FEATUREQUESTIONS Data Set Created with DATA Step Obs callhour calllength status reason 1 00 10 Resolved Text Msg Help 2 00 2 Transferred Other Feature Help 3 01 4 Resolved Voice Mail Help 4 01 3 Resolved Other Feature Help 5 02 6 Transferred Ringtones Help 6 02 2 Transferred Other Feature Help 7 02 5 Resolved Other Feature Help 8 02 4 Resolved Text Msg Help 9 03 3 Resolved Other Feature Help 10 03 1 Resolved Other Feature Help 11 03 2 Resolved Voice Mail Help 12 03 4 Resolved Other Feature Help 13 03 5 Resolved Ringtones Help 14 03 8 Transferred Text Msg Help 15 03 2 Resolved Other Feature Help 16 03 2 Resolved Ringtones Help 17 04 3 Resolved Text Msg Help 18 04 1 Transferred Other Feature Help 19 04 5 Transferred Other Feature Help 20 04 3 Transferred Text Msg Help 21 04 3 Resolved Text Msg Help 22 04 10 Resolved Text Msg Help 23 04 1 Resolved Other Feature Help 24 04 4 Resolved Voice Mail Help 25 04 5 Resolved Voice Mail Help . . . |
This example demonstrates how to efficiently create a subset of a data set by using the WHERE statement. This technique is efficient because a WHERE statement evaluates observations before the SET statement executes. Only those observations that satisfy the conditions in the WHERE statement are moved into the Program Data Vector (PDV).
Data set CELLPHONES contains 10,000 cell phone support records. The goal is to select all observations where the reason entered for the technical support call contains the text "Help".
Note that variations on the specification of the text, such as all lowercase, are not checked by this example's WHERE statement. It is common to uppercase a character variable when specifying a condition so that all possible spellings can be found. However, if you add functions to your WHERE statement, you will slow down the processing of the DATA step and reduce the advantage of using WHERE processing.
Create data set FEATUREQUESTIONS. Read observations from CELLPHONES. Specify that the SET statement read only those observations that satisfy the condition in the WHERE statement.
data featurequestions; set cellphones; where reason contains "Help"; run;
The PROC SQL step in this related technique creates a table equivalent to the data set that was created by the DATA step in the main example. This example selects all columns from CELLPHONES, including the column REASON that is specified on the WHERE clause.
Although not demonstrated in this example, you can specify columns on your WHERE clause that are not in the columns specified on the SELECT clause.
Create table FEATUREQUESTIONS. Select all columns from table CELLPHONES. Read only those rows that satisfy the condition in the WHERE statement.
proc sql; create table featurequestions as select * from cellphones where reason contains "Help"; quit;
You can create subsets of your data sets in the DATA step with either the WHERE statement or the subsetting IF statement. While they both test a condition to determine whether SAS should process an observation, only the WHERE statement can prevent observations from being read into the Program Data Vector (PDV), thereby potentially yielding significant savings in processing time.
WHERE Statement in the DATA Step: A WHERE statement tests the condition before an observation is read into the PDV. If the condition is true, SAS reads the observation into the PDV and processes it. If the condition is false, SAS does not read the observation into the PDV and moves on to evaluating the next observation. Preventing unneeded observations from being read into the PDV and then processed in the DATA step can save processing time. Other advantages of the WHERE statement over the subsetting IF statement include that it can be optimized with indexes on the variables that define the subsets and that more operators, such as LIKE and CONTAINS, can be used with the WHERE statement.
Subsetting IF Statement: A subsetting IF statement tests the condition after an observation is read into the PDV. If the condition is true, SAS continues processing this observation. If the condition is false, SAS discards the observation, ends the current iteration of the DATA step, and continues processing with the next observation. The subsetting IF statement does not use indexes even if they exist on the subsetting variables. A reason to use the subsetting IF statement instead of the WHERE statement is if you have complex conditions to test or complex calculations to perform that would make it difficult to write the WHERE statement.
Generally it is more efficient to use the WHERE statement, but if your data set is small (variables, or observations, or both), or if the subset is a large proportion of your data set, or both, it might not make much difference in processing time to use the subsetting IF statement. Also, if you need to use functions in your WHERE expression, the advantage of using WHERE might also diminish.
18.218.239.182