Example 3.11 Generating Every Combination of Observations between Data Sets Based on a Common Variable When an Index Is Available

Goal

Combine two data sets that have a common variable where the common variable has duplicate values in both data sets and at least one of the data sets is indexed by the common variable. Produce a data set that contains all possible combinations of the observations where the values from the common variable match. Use indexing features to match the two data sets.

Example Features

Featured StepDATA step
Featured Step Options and StatementsKEY= option in the SET statement Automatic variable _IORC_ and %SYRC autocall macro program
Related TechniquePROC SQL, WHERE clause and subqueries
A Closer LookChecking for Errors When Using the MODIFY Statement or the SET Statement with the KEY= Option Finding a Match for Consecutive Duplicate Values When the Data Set Is Indexed

Input Data Sets

Data set FURNITURE_SALES contains six sales records. Data set FURNITURE_STOCK contains information about four products. Each product has more than one piece. Variable PRODUCT is common to both data sets. Multiple occurrences of the values of PRODUCT occur in both data sets and some occurences of the values of PRODUCT are found in only one data set.

             FURNITURE_SALES

Obs    product     salesrep       orderno
 1       309      J.Corrigan     09173JC018
 2       310      K.Roland       09173KR001
 3       310      Y.Alvarez      09173YA015
 4       312      J.Corrigan     09173JC021
 5       313      J.Corrigan     09173JC031
 6       313      K.Roland       09173KR008

FURNITURE_SALES is sorted by PRODUCT. Data set FURNITURE_STOCK is indexed on PRODUCT.

                        FURNITURE_STOCK

Obs   product      product_desc       pieceid     piece_desc
  1     310     oak pedestal table    310.0103    tabletop
  2     310     oak pedestal table    310.0203    pedestal
  3     310     oak pedestal table    310.0303    two leaves
  4     311     upholstered chair     311.0103    chair base
  5     311     upholstered chair     311.0203    one cushion
  6     311     upholstered chair     311.0303    two arm covers
  7     312     brass floor lamp      312.0102    lamp base
  8     312     brass floor lamp      312.0202    lamp shade
  9     313     oak bookcase, short   313.0102    bookcase
 10     313     oak bookcase, short   313.0202    two shelves

Resulting Data Sets

Output 3.11a SHIPLIST Data Set

           Example 3.11 SHIPLIST Data Set Created with DATA Step

Obs product  salesrep   orderno      product_desc     pieceid  piece_desc

  1   310   K.Roland   09173KR001 oak pedestal table  310.0103 tabletop
  2   310   K.Roland   09173KR001 oak pedestal table  310.0203 pedestal
  3   310   K.Roland   09173KR001 oak pedestal table  310.0303 two leaves
  4   310   Y.Alvarez  09173YA015 oak pedestal table  310.0103 tabletop
  5   310   Y.Alvarez  09173YA015 oak pedestal table  310.0203 pedestal
  6   310   Y.Alvarez  09173YA015 oak pedestal table  310.0303 two leaves
  7   312   J.Corrigan 09173JC021 brass floor lamp    312.0102 lamp base
  8   312   J.Corrigan 09173JC021 brass floor lamp    312.0202 lamp shade
  9   313   J.Corrigan 09173JC031 oak bookcase, short 313.0102 bookcase
 10   313   J.Corrigan 09173JC031 oak bookcase, short 313.0202 two shelves
 11   313   K.Roland   09173KR008 oak bookcase, short 313.0102 bookcase
 12   313   K.Roland   09173KR008 oak bookcase, short 313.0202 two shelves


Output 3.11b NOSTOCKINFO Data Set

Example 3.11 NOSTOCKINFO Data Set Created with DATA Step

       Obs    product     salesrep      orderno

        1       309      J.Corrigan    09173JC018


Example Overview

As in Example 3.10, this example illustrates how to make all possible combinations of the observations from two data sets that match by a common variable. Here, one of the data sets is indexed, and the DATA step code uses indexing features. Additionally, both data sets can have duplicates of the values of the common variable.

Data set FURNITURE_SALES contains information about products sold. Data set FURNITURE_STOCK contains information about the products in stock. Both data sets identify the products by the values of variable PRODUCT. Each product has multiple pieces. Variable PIECEID in FURNITURE_STOCK contains both the product number and the piece number.

Product 309 in FURNITURE_SALES does not have a match in FURNITURE_STOCK. Product 311 in FURNITURE_STOCK does not have any sales data in FURNITURE_SALES.

The following DATA step has two objectives in combining FURNITURE_SALES and FURNITURE_STOCK by common variable PRODUCT:

  • generate a shipping list of items sold, which includes all the pieces for a product

  • generate a list of ordered items not found in FURNITURE_STOCK

The output data set, SHIPLIST, contains data for each sold product as identified in FURNITURE_SALES and this includes a list of all of the pieces that make up a product as stored in FURNITURE_STOCK. For example, the second observation in FURNITURE_SALES shows that product 310, an oak pedestal table, was sold. Data set FURNITURE_STOCK shows that product 310 consists of three pieces: a top, a base, and two leaves. Therefore, the resulting data set SHIPLIST contains three observations for the second sold item recorded in FURNITUTE_SALES.

The program requires that data set FURNITURE_SALES be sorted by or indexed on PRODUCT and that FURNITURE_STOCK be indexed on PRODUCT. The DATA step reads each observation sequentially from FURNITURE_SALES. The BY statement on PRODUCT enables BY-group processing of the observations in FURNITURE_SALES.

Because there can be multiple observations in FURNITURE_STOCK for one value of PRODUCT, the index on variable PRODUCT does not have unique keys.

A DO UNTIL loop executes for each observation in FURNITURE_SALES. This loop controls the lookup process for a match by PRODUCT in FURNITURE_STOCK. With the KEY=PRODUCT option in the SET statement, the DATA step reads matching observations directly from FURNITURE_STOCK and outputs all matches to SHIPLIST. A SELECT block within the loop evaluates the return codes from the indexed searches by using the %SYSRC autocall macro program. When no match occurs, the loop takes one of two actions depending on whether it has finished processing the current BY group in FURNITURE_SALES:

  • If the current observation is the last observation in FURNITURE_SALES for the current BY group, the DO UNTIL loop condition is met causing the loop to end and to return processing to the top of the DATA step to read the first observation from the next BY group in FURNITURE_SALES.

  • If the current observation is not the last observation in FURNITURE_SALES for the current BY group, the code resets the automatic variable _IORC_ to 0 and the flag variable MULTSAMEPROD to 1. Resetting_IORC_ to 0 keeps the loop executing and finds the multiple pieces per product from FURNITURE_STOCK. Assigning a value of 1 to MULTSAMEPROD causes the IF statement at the top of the DO UNTIL loop to assign a nonexistent PRODUCT value to the KEY=variable. Changing the value of the KEY= variable to this "dummy" number forces the index pointer to return to the beginning of the index so that multiple matches for one value of PRODUCT are found in FURNITURE_STOCK.

The variable INSTOCK that is defined in the DATA step tracks when a match for the current value of PRODUCT in FURNITURE_SALES is found in FURNITURE_STOCK. The value of INSTOCK is initialized to 0 prior to the DO UNTIL loop. If the value of INSTOCK is still 0 after the loop ends, the program concludes that the current value of PRODUCT from FURNITURE_SALES does not have a match in FURNITURE_STOCK. These nonmatched sales are written to data set NOSTOCKINFO and not to SHIPLIST.

Program

Define an index on PRODUCT for the data set that will be directly accessed. Allow nonunique keys in this simple index by omitting the UNIQUE option in the INDEX statement. Create data set SHIPLIST. Create data set NOSTOCKINFO. Keep in NOSTOCKINFO the variables found only in FURNITURE_SALES because this data set will contain observations for orders without a match in FURNITURE_STOCK. Read FURNITURE_SALES sequentially. Specify BY-group processing for data set FURNITURE_SALES. Initialize a flag variable that tracks whether there are multiple occurrences of the same PRODUCT value in FURNITURE_SALES. Initialize a flag variable that tracks whether a match is found in FURNITURE_STOCK for the current value of PRODUCT from FURNITURE_SALES. Execute a DO UNTIL loop for each observation in FURNITURE_SALES. Specify that it test whether the return code from the indexed search saved in automatic variable _IORC_ equates to the not found condition. Assign a value to PRODUCT that is nonexistent in FURNITURE_STOCK when a duplicate occurrence of a PRODUCT value has been found in FURNITURE_SALES. This action forces the pointer to return to the beginning of the index so that later observations in FURNITURE_SALES in the same BY group can find matches in FURNITURE_STOCK. Search from the current pointer position in FURNITURE_STOCK for a match by PRODUCT in FURNITURE_SALES. Direct processing of the results of the indexed search based on the value of the automatic variable _IORC_. Execute this DO group when the current observation from FURNITURE_SALES has a match in FURNITURE_STOCK. Set flag variable INSTOCK to 1. Output an observation to SHIPLIST. Execute this DO group when the current observation from FURNITURE_SALES does not have a match in FURNITURE_STOCK. Reset automatic variable _ERROR_ to 0 to prevent writing error messages to the SAS log when no match is found. When the search does not find a match, SAS sets _ERROR_ to 1. Execute this DO group to set flag variable MULTSAMEPROD and reset _IORC_ so that processing of multiple observations of a PRODUCT value from FURNITURE_SALES can occur. Set flag variable MULTSAMEPROD to 1 so that the IF statement at the top of the loop executes to assign a nonexistent key value to PRODUCT. Reset automatic variable _IORC_ to 0 so that the DO UNTIL loop continues to execute and repeat matching the multiple occurrences of the current value of PRODUCT in FURNITURE_SALES.

For an unexpected _IORC_ condition, write a message to the SAS log and stop the DATA step.

Output observations from FURNITURE_SALES that have no match in FURNITURE_STOCK.

proc datasets library=work;
  modify furniture_stock;
  index create product;
run;
quit;

data shiplist(drop=multsameprod instock)
     nostockinfo(keep=product salesrep orderno);



  set furniture_sales;

  by product;

  multsameprod=0;


  instock=0;



  do until(_iorc_=%sysrc(_dsenom));

    if multsameprod then product=0;



    set furniture_stock key=product;


    select (_iorc_);

      when (%sysrc(_sok)) do;
        instock=1;
        output shiplist;
      end;


      when (%sysrc(_dsenom)) do;


             _error_=0;



         if not last.product and multsameprod=0 then do;



          multsameprod=1;


          _iorc_=0;




        end;
      end;
      otherwise do;
        putlog 'ERROR: Unexpected ERROR: _IORC_= '
                              _iorc_;
        stop;
      end;
    end;
   end;
  if not instock then output nostockinfo;

run;

Related Technique

The following PROC SQL step creates two tables equivalent to the data sets that were created in the main example. A separate CREATE TABLE statement is needed for each table.

The SELECT statement that creates table SHIPLIST performs an inner join of the two tables on the values of the PRODUCT column.

The second SELECT statement specifies a subquery that selects the unique values of PRODUCT from FURNITURE_STOCK. The main query then uses the NOT IN condition test to select rows from FURNITURE_SALES that have values for PRODUCT not in the group of values that were returned by the subquery.

The PROC SQL step does not require that the tables be sorted or indexed. However, your PROC SQL programs might run more efficiently if the tables are sorted or indexed by the matching columns.

While the PROC SQL code is much simpler to follow than the preceding DATA step, it does not provide you with return codes from indexed searches that you can evaluate for error-checking purposes.

Create table SHIPLIST. Select all columns from both tables. Assign an alias to each table. Select rows that match by the values of PRODUCT. Create table NOSTOCKINFO. Select all columns from FURNITURE_SALES. Specify a subquery. Select rows from FURNITURE_SALES that have values for PRODUCT not found in FURNITURE_STOCK. Include the DISTINCT keyword so that the subquery returns the unique values of PRODUCT in FURNITURE_STOCK.

proc sql;
  create table shiplist as
    select * from furniture_sales as s,
                  furniture_stock as k
      where s.product=k.product;

  create table nostockinfo as
    select * from furniture_sales

     where product not in
         (select distinct product from furniture_stock);

quit;

A Closer Look

Checking for Errors When Using the MODIFY Statement or the SET Statement with the KEY= Option

The DATA step in the main example reads observations from the indexed data set, FURNITURE_STOCK, by using the SET statement and the KEY= option. The DO UNTIL loop that follows the SET statement examines the return codes that result from the keyed search. The return codes determine the next step the program takes.

When reading observations with the KEY= option in the SET statement or the MODIFY statement, it is important to check the results of the keyed search. Because these tools use nonsequential access methods, there is no guarantee that the keyed search will locate an observation that satisfies the request.

Error checking enables you to direct execution to specific paths depending on the outcome of the keyed search. When you include error checking, you can prevent your programs from ending abnormally. Your program will continue executing for expected conditions and terminate execution when unexpected results occur.

Two SAS tools make error checking easier when using the MODIFY statement or the SET statement with the KEY= option:

  • _IORC_ automatic variable

  • SYSRC autocall macro program

SAS automatically creates variable _IORC_ when you use the MODIFY statement or the SET statement with the KEY= option. The value assigned to _IORC_ is a numeric return code that indicates the status of the I/O operation from the most recently executed MODIFY statement or the SET statement with the KEY= option.

Checking the value of _IORC_ for abnormal I/O conditions enables you to detect them and direct execution down specific code paths instead of having the application terminate abnormally. For example, if the KEY= value does find a match, you might want to process the data that were obtained from the matched observation. If the value does not find a match in the data set being searched, you might want to only write a note to the SAS log.

Because the values of the _IORC_ automatic variable are internal and subject to change, SAS created the %SYSRC macro program to enable you to test for specific I/O conditions while protecting your code from future changes that SAS might make in the values it assigns to _IORC_. Using %SYSRC, you can check the value of _IORC_ by specifying a mnemonic, the most common of which are listed in Table 3.1.

For more information about the values of _IORC_ and mnemonics you can pass to %SYSRC, see SAS documentation.

Table 3.1. List of Most Common Mnemonic Values of _IORC_
Mnemonic ValueMeaning of Return Code
_DSENMRThe transaction data set observation does not exist in the master data set.
_DSEMTRMultiple transaction data set observations with the same BY variable value do not exist in the master data set. This return code occurs when consecutive observations with the same BY values do not find a match in the first data set. In this situation, the first observation to fail to find a match returns _DSENMR. Following observations return _DSEMTR.
_DSENOMNo matching observation was found in the master data set.
_SOKThe I/O operation was successful. When using the MODIFY statement or the SET statement with the KEY= option, it means a match was found.

Finding a Match for Consecutive Duplicate Values When the Data Set Is Indexed

Much of the logic of the DATA step in the main example focuses on the need to successfully match observations that contain consecutive duplicate values of the BY variable PRODUCT in FURNITURE_SALES with observations in FURNITURE_STOCK that contain the same value for PRODUCT. Unless you reposition the pointer at the beginning of the PRODUCT index for FURNITURE_STOCK, consecutive duplicate values of PRODUCT in FURNITURE_SALES will not be successfully matched.

The SELECT group in the DO UNTIL loop begins this process. When there are no more matches in the index on FURNITURE_STOCK for the current value of PRODUCT in FURNITURE_SALES, your code must determine whether there are more observations in the current BY group in FURNITURE_SALES. If there are more observations to process in the same BY group in FURNITURE_SALES and MULTSAMEPROD has not already been set to 1, assign values to automatic variable _IORC_ and the variable MULTSAMEPROD:

  when (%sysrc(_dsenom)) do;
    _error_=0;
    if not last.product and multsameprod=0 then do;
      multsameprod=1;
      _iorc_=0;
    end;
  end;

By resetting the value of _IORC_ to 0, you cause the DO UNTIL loop to iterate again:

do until(_iorc_=%sysrc(_dsenom));
  if multsameprod then product=0;
  set furniture_stock key=product;

Because MULTSAMEPROD is true (equals 1), PRODUCT is set to 0, a nonexisting, "dummy" value. When the SET statement executes again, the pointer is forced to the beginning of the index on FURNITURE_STOCK because the value of PRODUCT, which is the KEY= variable, has changed. No match is found for PRODUCT=0, so the DO UNTIL loop ends and processing returns to the top of the DATA step. Then the DATA step reads the next observation from FURNITURE_SALES:

  data shiplist(drop=multsameprod instock)
       nostockinfo(keep=product salesrep orderno);
    set furniture_sales;
    by product;

    multsameprod=0;

Because the pointer is at the beginning of the index on FURNITURE_STOCK, the observation in FURNITURE_SALES with a consecutive duplicate value for PRODUCT in FURNITURE_STOCK finds the appropriate match in FURNITURE_STOCK. MULTSAMEPROD is reset to 0 at the top of the DATA step so that its value does not trigger a change in the value of PRODUCT when it is not needed.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.218.114.244