Combine two tables that have no common columns in order to produce every possible combination of rows.
Featured Step | PROC SQL |
Featured Step Options and Statements | Cartesian product |
Related Technique | DATA step, multiple SET statements, NOBS= and POINT= options |
Table TRIAL_A has the identifiers for a group of patients who are participating in a clinical trial. Table TRIAL_TESTS contains the lab tests for this trial. There are no columns in common between the two tables.
TRIAL_A Obs patientid patientinits 1 KPGY AHB 2 MWGM DOH 3 PQZU LRH 4 FQ82 HCI 5 EYPS MEF TRIAL_TESTS Obs testcode testtype 1 L001 Cholesterol 2 L002 Glucose 3 L003 HDL 4 L004 LDL 5 L005 Triglycerides
Output 3.9 ALLTESTS TableExample 3.9 ALLTESTS Table Created with PROC SQL Obs patientid patientinits testcode testtype result 1 KPGY AHB L001 Cholesterol Not done yet 2 KPGY AHB L002 Glucose Not done yet 3 KPGY AHB L003 HDL Not done yet 4 KPGY AHB L004 LDL Not done yet 5 KPGY AHB L005 Triglycerides Not done yet 6 MWGM DOH L001 Cholesterol Not done yet 7 MWGM DOH L002 Glucose Not done yet 8 MWGM DOH L003 HDL Not done yet 9 MWGM DOH L004 LDL Not done yet 10 MWGM DOH L005 Triglycerides Not done yet 11 PQZU LRH L001 Cholesterol Not done yet 12 PQZU LRH L002 Glucose Not done yet 13 PQZU LRH L003 HDL Not done yet 14 PQZU LRH L004 LDL Not done yet 15 PQZU LRH L005 Triglycerides Not done yet 16 FQ82 HCI L001 Cholesterol Not done yet 17 FQ82 HCI L002 Glucose Not done yet 18 FQ82 HCI L003 HDL Not done yet 19 FQ82 HCI L004 LDL Not done yet 20 FQ82 HCI L005 Triglycerides Not done yet 21 EYPS MEF L001 Cholesterol Not done yet 22 EYPS MEF L002 Glucose Not done yet 23 EYPS MEF L003 HDL Not done yet 24 EYPS MEF L004 LDL Not done yet 25 EYPS MEF L005 Triglycerides Not done yet |
This example shows you how to produce a table that has every possible combination of the rows in two tables that are being joined. This result is called a Cartesian product.
Note that a Cartesian product of large tables can produce huge tables, and usually you would add code to produce a subset of the Cartesian product. This example does not restrict the combinations so that you can see how a Cartesian product is formed.
Table TRIAL_A has the identifiers for a group of patients who are participating in a clinical trial. Table TRIAL_TESTS contains the lab tests for this trial. The two tables do not have any columns in common.
The goal of the following PROC SQL step is to combine each row in TRIAL_A with each row in TRIAL_TESTS so that there is one row for each combination of patient and lab test. The Cartesian product of the five rows in each table produces 25 rows in ALLTESTS.
The intention is to edit ALLTESTS later as each patient's test results are obtained. The SELECT statement adds one new column, RESULT, which PROC SQL initializes to the text "Not done yet."
While the Related Technique in a later section creates a data set equivalent to table ALLTESTS that was produced by PROC SQL, it is easier to code a Cartesian product with PROC SQL when there is minimal additional processing necessary.
Create table ALLTESTS. Select all columns from the two input tables. Add the column, RESULT, to ALLTESTS and initialize its value. Specify the two input tables.
proc sql; create table alltests as select *, 'Not done yet' length=12 as result from trial_a, trial_tests; quit;
The following DATA step creates a data set equivalent to the table that was created by PROC SQL earlier. If you need the programming capabilities of the DATA step that are not easily coded in PROC SQL, you can adapt this code to form a Cartesian product with the DATA step.
You cannot combine the two data sets with a MERGE statement to produce a Cartesian product because a MERGE statement does a one-to-one match of the observations in the two data sets.
The DATA step uses both sequential and direct processing. It starts by sequentially reading each observation in TRIAL_A with the first SET statement. For each observation read from TRIAL_A, the DO loop iterates the number of times equal to the number of observations in TRIAL_TESTS. The SET statement in the DO loop reads each observation in TRIAL_TESTS by directly using the POINT= option. The OUTPUT statement in the DO loop outputs an observation for each observation in TRIAL_TESTS.
The upper bound of the DO loop is set to NTESTS, which equals the number of observations in TRIAL_TESTS. The second SET statement assigns a value to NTESTS using the NOBS= option. SAS assigns a value to a NOBS= variable during compilation of the DATA step before the DATA step executes.
The POINT= option is set to TEST, which is the index variable of the DO loop. The SET statement in the iterative DO loop directly accesses each observation in TRIAL_TESTS in observation order.
This technique of forming a Cartesian product requires the POINT= option in the second SET statement because this option prevents the end-of-file condition from being set on data set TRIAL_TESTS. By default, a DATA step stops when it detects an end-of-file condition. Without the POINT= option, the DATA step would stop when it reads the second observation in TRIAL_A because it detects that data set TRIAL_TESTS has already been completely read. The DATA step would combine the first observation from TRIAL_A with every observation in TRIAL_TESTS and would produce a data set that has the number of observations found in TRIAL_TESTS.
Create data set ALLTESTS. On each iteration of the DATA step, read one observation from TRIAL_A. Execute an iterative DO loop for each observation in TRIAL_A. Name the index variable the same as the POINT= variable in the following SET statement. Set the upper bound of the DO loop as the number of observations in TRIAL_TESTS, which is saved in NTESTS, the NOBS= variable that the following SET statement defines. Specify the data set to read directly.
Specify the index variable of the iterative DO loop as the POINT= variable. For each observation in TRIAL_A, output every observation in TRIAL_TESTS.
data alltests; set trial_a; do test=1 to ntests; set trial_tests nobs=ntests point=test; output; end; run;
3.21.248.119