Using an Index to Combine Data

Overview

Suppose you want to combine data from two data sets, and one of the data sets is much larger than the other. Also, suppose you want to select only those observations from the larger data set that match an observation from the smaller data set according to the value of one or more common variables.
You should already know how to create an index on a SAS data set. You have learned that PROC SQL can take advantage of an index to improve performance on a join. You can also take advantage of an index in a DATA step to combine data from matching observations in multiple data sets if the index is built on variables that are common to all input data sets.
For example, suppose you want to combine data from the matching observations in Sasuser.Dnunder and Sasuser.Sale2000. Only a portion of the flights that are in Sasuser.Sale2000 (which has 156 observations) are also in Sasuser.Dnunder (which has only 57 observations). Suppose you want to select only the matching observations.
Assume that Sasuser.Sale2000 has a composite index named Flightdate associated with it. The values for Flightdate are unique and are based on the values of the variables FlightID and Date. You can use the FLIGHTDATE index to efficiently select only the matching observations via direct access.
Using an Index to Combine Data
The next few sections show how to use the Flightdate index to combine matching observations from the Sasuser.Sale2000 data set and the Sasuser.Dnunder data set.

The KEY= Option

You have seen how to use multiple SET statements in a DATA step in order to combine summary data and detail data in a new data set. You can also use multiple SET statements to read only the matching observations.
You specify the KEY= option in the SET statement to use an index to retrieve matching observations from the lookup data set.
General form, SET statement with KEY= option:
SET SAS-data-set-name KEY= index-name;
Here is an explanation of the syntax:
index-name
is the name of an index that is associated with the SAS-data-set-name data set.
To use the SET statement with the KEY= option to perform a lookup operation, your lookup values must be stored in a SAS data set that has an index. This technique is appropriate only when you are working with one-to-one matches, with a lookup table of any size. It is possible to return multiple values with this technique and use other DATA step syntax as well.
When SAS encounters a SET statement with the KEY= option, SAS uses the index to retrieve an observation with a key value that matches the key value from the PDV.
For example, if the Sasuser.Sale2000 data set has an index named Flightdate associated with it, the following SET statement uses the Flightdate index to locate observations in Sale2000 that have specific values for FlightID and Date:
set sasuser.sale2000 key=flightdate;
When the SET statement in the example above begins to execute, there must already be a value for FlightID and a value for Date in the PDV. SAS then uses the Flightdate index to retrieve an observation from Sasuser.Sale2000. This observation must have values for FlightID and Date that match the values for FlightID and Date that are already in the PDV.
In order to assign a key value in the PDV before the SET statement with the KEY= option executes, you precede that SET statement with another SET statement in the DATA step. Consider this example in context.

Example

Remember that you want to combine Sasuser.Sale2000 and Sasuser.Dnunder, and that Sasuser.Sale2000 has an index named Flightdate that is based on the values of the FlightID and Date variables. You can use two SET statements to combine these two data sets, and use the KEY= option in the second SET statement to take advantage of the index.
In the following example, these results occur:
  • the first SET statement reads an observation sequentially from the Sasuser.Dnunder data set. SAS writes the values from this observation to the PDV, and then moves to the second SET statement.
  • the second SET statement uses the Flightdate index on Sasuser.Sale2000 to find an observation in Sasuser.Sale2000 that has values for FlightID and Date that match the values of FlightID and Date that were populated by the first SET statement.
  • Work.Profit is the output data set.
CAUTION:
If you use the KEY= option to read a SAS data set, you cannot use WHERE processing on that data set in the same DATA step.
  1. This example shows the execution of a DATA step that uses two SET statements to combine data from two input data sets (Sasuser.Sale2000 and Sasuser.Dnunder) into one output data set (Work.Profit). The DATA step uses an index on the larger of the two input data sets, Sasuser.Sale2000, to find matching observations.
    data work.profit;
       set sasuser.dnunder;
       set sasuser.sale2000(keep=routeid flightid date rev1st
                            revbusiness revecon revcargo)
                            key=flightdate;
       Profit=sum(rev1st, revbusiness, revecon, revcargo,
                 -expenses);
    run;
  2. SAS reads the descriptor portions of the input data sets and creates the PDV.
  3. The first SET statement executes and creates the PDV. SAS reads the first observation in Sasuser.Dnunder into the PDV.
  4. When the second SET statement executes, the KEY= option uses the Flightdate index to directly access the observation in Sasuser.Sale2000 that has values for FlightID and Date that match the values already in the PDV. The matching observation is then read into the PDV.
  5. SAS calculates the value for Profit and records it in the PDV. Then, SAS writes the current observation from the PDV to the output data set.
  6. The DATA step continues to iterate. Only the variable Profit is reinitialized to missing. The first SET statement reads the second observation in Sasuser.Dnunder into the PDV, overwriting the previous values.
  7. The second SET statement uses the Flightdate index to find a matching observation in Sasuser.Sale2000. The matching observation is read into the PDV, overwriting the previous values. A new value for Profit is calculated and recorded. The current observation is written to the output data set.
  8. The DATA step continues to iterate until end of file on Sasuser.Dnunder.
Remember that when SAS encounters a SET statement with the KEY= option, the value of the key variable on which the KEY= index is built must already exist in the PDV. Therefore, it is very important for the two SET statements to be in the exact order shown.

Example

If you examine the Work.Profit output data set closely, you will notice that the final observation in the output data set contains values for several variables that are identical to values in the previous observation. This action happened when the second SET statement failed to find a matching observation in sasuser.sale2000.
The observation that contains unmatched data is printed to the log. As you can see in the log sample below, the unmatched observation includes an _Error_ variable whose value is 1, which indicates unmatched data. The _N_ variable indicates the iteration of the DATA step in which the error occurred.
Table 15.8 SAS Log
FlightID=IA11802 RouteID=0000108 Date=30DEC2000 Expenses=3720
Rev1st=1270 RevBusiness=. RevEcon=5292 RevCargo=1940 Profit=4782

_ERROR_=1 _IORC_=1230015 _N_=57
NOTE: There were 57 observations read from the data set
      SASUSER.DNUNDER.
NOTE: The data set WORK.PROFIT has 57 observations and 9 variables.
NOTE: DATA statement used (Total process time):
      real time           0.38 seconds
      cpu time            0.04 seconds
Notice that the observation that is printed in the SAS log above also contains a variable named _IORC_.

The _IORC_ Variable

When you use the KEY= option, SAS creates an automatic variable named _IORC_, which stands for INPUT/OUTPUT Return Code. You can use _IORC_ to determine whether the index search was successful. If the value of _IORC_ is zero, SAS found a matching observation. If the value of _IORC_ is not zero, SAS did not find a matching observation.
To prevent writing the data error to the log (and to your output data set), do the following:
  • check the value of _IORC_ to determine whether a match has been found
  • set _ERROR_ to 0 if there is no match
  • delete the nonmatching data or write the nonmatching data to an errors data set

Example

The following example uses the Flightdate index to combine data from Sasuser.Sale2000 with data from Sasuser.Dnunder, and writes the combined data to a new data set named Work.Profit3. Unmatched observations are written to Work.Errors. No observations should be written to the SAS log.
data work.profit3 work.errors;
   set sasuser.dnunder;
   set sasuser.sale2000(keep=routeid flightid date rev1st
       revbusiness revecon revcargo)key=flightdate;
   if _iorc_=0 then do;
      Profit=sum(rev1st, revbusiness, revecon, revcargo,
            -expenses);
      output work.profit3;
   end;
   else do;
      _error_=0;
      output work.errors;
   end;
run;
If you examine the results from the program above, you will notice that there is one fewer observation in the Work.Profit3 output data set than there was in the Work.Profit output data set. The unmatched observation is written to the Work.Errors data set.
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.216.151.164