Combine rows from two tables based on specific criteria, even when there is no column common to the two tables.
Featured Step | PROC SQL |
Featured Step Options and Statements | Full join, WHERE clause with BETWEEN operator |
Related Technique | DATA step, multiple SET statements, POINT= and NOBS= options |
Tables BUILD_PROJECTS and BUILD_BILLS have no common column. The dates associated with each project in BUILD_PROJECTS do not overlap with any other project.
BUILD_PROJECTS Obs start_date end_date project 1 01/08/2010 01/27/2010 Basement 2 02/01/2010 02/12/2010 Frame 3 02/15/2010 02/20/2010 Roofing 4 02/22/2010 02/27/2010 Plumb 5 03/02/2010 03/05/2010 Wire 6 03/08/2010 03/29/2010 Brick BUILD_BILLS completion_ Obs workid date charge 1 1234 01/18/2010 $944.80 2 2225 02/18/2010 $1280.94 3 3879 03/04/2010 $888.90 4 8888 03/19/2010 $2280.87
Output 3.6 BUILD_COMPLETE TableExample 3.6 BUILD_COMPLETE Table Created with PROC SQL completion_ Obs start_date end_date project workid date charge 1 01/08/2010 01/27/2010 Basement 1234 01/18/2010 $944.80 2 02/15/2010 02/20/2010 Roofing 2225 02/18/2010 $1280.94 3 03/02/2010 03/05/2010 Wire 3879 03/04/2010 $888.90 4 03/08/2010 03/29/2010 Brick 8888 03/19/2010 $2280.87 |
The PROC SQL step in this example shows how you can combine two tables that do not have a variable in common.
Table BUILD_PROJECTS contains a list of start and end dates for six home repair projects. Table BUILD_BILLS contains a list of four charges with a completion date. The two data sets do not have a variable in common. None of the start and end dates of one project overlap with another.
The objective of this program is to bill charges to the correct phase of a construction project. A row in BUILD_BILLS is matched to a row in BUILD_PROJECTS when the completion date falls within the start and end dates of a project.
Conceptually, the PROC SQL join results in an internal table that matches every row in BUILD_PROJECTS with every row in BUILD_BILLS. From that internal table, the WHERE clause outputs the rows with a value of COMPLETION_DATE that is between START_DATE and END_DATE.
Create table BUILD_COMPLETE. Select all columns from both tables. Combine every row from BUILD_PROJECTS with every row from BUILD_BILLS. Select from the joined rows those where the completion date is between the start and end date of a project.
proc sql; create table build_complete as select * from build_projects, build_bills where completion_date between start_date and end_date; quit;
The DATA step in this related technique creates a data set equivalent to the table that was created by the PROC SQL step in the main example. It programmatically combines the two data sets by using an iterative DO-UNTIL loop and SET statements that control how the two input data sets are read. It does not use MERGE and BY statements to do this.
For each completion date in BUILD_BILLS, the DATA step looks for a range of dates in BUILD_PROJECTS in which the value of COMPLETION_DATE fits. All observations from BUILD_BILLS are output to BUILD_COMPLETE. Not necessarily all observations from BUILD_PROJECTS are output to BUILD_COMPLETE.
Each iteration of the DATA step reads one observation from the first data set, BUILD_BILLS, and each iteration also potentially reads all observations from the second, BUILD_COMPLETE. It starts with a simple SET statement that identifies the first data set. The iterative DO-UNTIL loop that follows processes the second data set.
The POINT= and OBS= data set options in the SET statement for the second data set access observations in BUILD_BILLS sequentially by observation number until a match is found or until all observations in BUILD_BILLS have been read. The POINT= option references the index variable I whose value directly accesses each observation in BUILD_BILLS by observation number. The NOBS= option assigns the number of observations in BUILD_BILLS to the variable N. This assignment occurs before the DATA step executes.
The iterative DO-UNTIL loop iterates once for each observation in BUILD_BILLS until a match is found. If a match is not found, the loop stops when it has iterated the number of times equal to the value saved in N.
When a match is found, the DATA step writes the matching observation to BUILD_COMPLETE, and it sets variable FOUND to 1. The condition of FOUND=1 stops the iterative DO-UNTIL loop so that no more observations are read from BUILD_BILLS. Variable FOUND is initialized to zero prior to the DO-UNTIL loop.
This program might be inefficient if the second data set is large because potentially this large data set could be read over and over again.
The program could be improved to include programming statements that write messages to the SAS log when an observation does not have a match in BUILD_BILLS or when multiple matches are found. Neither of these conditions exists in this example.
Create BUILD_COMPLETE. Read an observation from BUILD_PROJECTS.
Initialize FOUND whose value is used to control execution of the iterative DO-UNTIL loop.
Read observations from BUILD_BILLS until a match is found or until all observations have been read. Directly access an observation in BUILD_BILLS by pointing to the Ith observation. Save the total number of observations in BUILD_BILLS in variable N. Set the upper index value of the iterative DO-UNTIL loop to N. When the condition is met, set FOUND to 1 so that the iterative DO-UNTIL loop stops, and then write the matching observation to BUILD_COMPLETE.
data build_complete; set build_projects; drop found; found=0; do i=1 to n until (found); set build_bills point=i nobs=n; if start_date le completion_date le end_date then do; found=1; output; end; end; run;
18.222.168.152