Example 3.9 Generating Every Combination of Rows (Cartesian Product) between Tables

Goal

Combine two tables that have no common columns in order to produce every possible combination of rows.

Example Features

Featured StepPROC SQL
Featured Step Options and StatementsCartesian product
Related TechniqueDATA step, multiple SET statements, NOBS= and POINT= options

Input Tables

Table TRIAL_A has the identifiers for a group of patients who are participating in a clinical trial. Table TRIAL_TESTS contains the lab tests for this trial. There are no columns in common between the two tables.

          TRIAL_A

Obs  patientid  patientinits
 1     KPGY         AHB
 2     MWGM         DOH
 3     PQZU         LRH
 4     FQ82         HCI
 5     EYPS         MEF

       TRIAL_TESTS

Obs  testcode  testtype
 1     L001    Cholesterol
 2     L002    Glucose
 3     L003    HDL
 4     L004    LDL
 5     L005    Triglycerides

Resulting Table

Output 3.9 ALLTESTS Table

       Example 3.9 ALLTESTS Table Created with PROC SQL

 Obs patientid patientinits testcode testtype         result
   1   KPGY        AHB        L001   Cholesterol   Not done yet
   2   KPGY        AHB        L002   Glucose       Not done yet
   3   KPGY        AHB        L003   HDL           Not done yet
   4   KPGY        AHB        L004   LDL           Not done yet
   5   KPGY        AHB        L005   Triglycerides Not done yet
   6   MWGM        DOH        L001   Cholesterol   Not done yet
   7   MWGM        DOH        L002   Glucose       Not done yet
   8   MWGM        DOH        L003   HDL           Not done yet
   9   MWGM        DOH        L004   LDL           Not done yet
  10   MWGM        DOH        L005   Triglycerides Not done yet
  11   PQZU        LRH        L001   Cholesterol   Not done yet
  12   PQZU        LRH        L002   Glucose       Not done yet
  13   PQZU        LRH        L003   HDL           Not done yet
  14   PQZU        LRH        L004   LDL           Not done yet
  15   PQZU        LRH        L005   Triglycerides Not done yet
  16   FQ82        HCI        L001   Cholesterol   Not done yet
  17   FQ82        HCI        L002   Glucose       Not done yet
  18   FQ82        HCI        L003   HDL           Not done yet
  19   FQ82        HCI        L004   LDL           Not done yet
  20   FQ82        HCI        L005   Triglycerides Not done yet
  21   EYPS        MEF        L001   Cholesterol   Not done yet
  22   EYPS        MEF        L002   Glucose       Not done yet
  23   EYPS        MEF        L003   HDL           Not done yet
  24   EYPS        MEF        L004   LDL           Not done yet
  25   EYPS        MEF        L005   Triglycerides Not done yet


Example Overview

This example shows you how to produce a table that has every possible combination of the rows in two tables that are being joined. This result is called a Cartesian product.

Note that a Cartesian product of large tables can produce huge tables, and usually you would add code to produce a subset of the Cartesian product. This example does not restrict the combinations so that you can see how a Cartesian product is formed.

Table TRIAL_A has the identifiers for a group of patients who are participating in a clinical trial. Table TRIAL_TESTS contains the lab tests for this trial. The two tables do not have any columns in common.

The goal of the following PROC SQL step is to combine each row in TRIAL_A with each row in TRIAL_TESTS so that there is one row for each combination of patient and lab test. The Cartesian product of the five rows in each table produces 25 rows in ALLTESTS.

The intention is to edit ALLTESTS later as each patient's test results are obtained. The SELECT statement adds one new column, RESULT, which PROC SQL initializes to the text "Not done yet."

While the Related Technique in a later section creates a data set equivalent to table ALLTESTS that was produced by PROC SQL, it is easier to code a Cartesian product with PROC SQL when there is minimal additional processing necessary.

Program

Create table ALLTESTS. Select all columns from the two input tables. Add the column, RESULT, to ALLTESTS and initialize its value. Specify the two input tables.

proc sql;
  create table alltests as
    select *,

           'Not done yet' length=12 as result

           from trial_a, trial_tests;
quit;

Related Technique

The following DATA step creates a data set equivalent to the table that was created by PROC SQL earlier. If you need the programming capabilities of the DATA step that are not easily coded in PROC SQL, you can adapt this code to form a Cartesian product with the DATA step.

You cannot combine the two data sets with a MERGE statement to produce a Cartesian product because a MERGE statement does a one-to-one match of the observations in the two data sets.

The DATA step uses both sequential and direct processing. It starts by sequentially reading each observation in TRIAL_A with the first SET statement. For each observation read from TRIAL_A, the DO loop iterates the number of times equal to the number of observations in TRIAL_TESTS. The SET statement in the DO loop reads each observation in TRIAL_TESTS by directly using the POINT= option. The OUTPUT statement in the DO loop outputs an observation for each observation in TRIAL_TESTS.

The upper bound of the DO loop is set to NTESTS, which equals the number of observations in TRIAL_TESTS. The second SET statement assigns a value to NTESTS using the NOBS= option. SAS assigns a value to a NOBS= variable during compilation of the DATA step before the DATA step executes.

The POINT= option is set to TEST, which is the index variable of the DO loop. The SET statement in the iterative DO loop directly accesses each observation in TRIAL_TESTS in observation order.

This technique of forming a Cartesian product requires the POINT= option in the second SET statement because this option prevents the end-of-file condition from being set on data set TRIAL_TESTS. By default, a DATA step stops when it detects an end-of-file condition. Without the POINT= option, the DATA step would stop when it reads the second observation in TRIAL_A because it detects that data set TRIAL_TESTS has already been completely read. The DATA step would combine the first observation from TRIAL_A with every observation in TRIAL_TESTS and would produce a data set that has the number of observations found in TRIAL_TESTS.

Create data set ALLTESTS. On each iteration of the DATA step, read one observation from TRIAL_A. Execute an iterative DO loop for each observation in TRIAL_A. Name the index variable the same as the POINT= variable in the following SET statement. Set the upper bound of the DO loop as the number of observations in TRIAL_TESTS, which is saved in NTESTS, the NOBS= variable that the following SET statement defines. Specify the data set to read directly.

Specify the index variable of the iterative DO loop as the POINT= variable. For each observation in TRIAL_A, output every observation in TRIAL_TESTS.

data alltests;
  set trial_a;

  do test=1 to ntests;



    set trial_tests nobs=ntests

        point=test;

    output;

  end;
run;

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.21.248.119