Combine multiple tables where the matching column has different attributes such as length and can be stored as either numeric or character.
Featured Step | PROC SQL |
Featured Step Options and Statements | FULL OUTER JOIN COALESCE function Subquery |
Related Technique | PROC SORT and DATA step match-merge |
Table WEBCLASSES contains information about six classes. Table CLASS_UPDATES contains updates on some of the classes in WEBCLASSES, and it includes new classes. Table WEBCLASS_SCHEDULE contains scheduling information for classes and their section in fall 2009.
Column CLASSID is in common among the three tables. However, it is defined as character with different lengths in tables WEBCLASSES and CLASS_UPDATES, and it is defined as numeric in WEBCLASS_SCHEDULE.
WEBCLASSES Obs classid maxstudents class_title 1 101 25 Web Design 1 2 201 15 Web Design 2 3 301 15 Web Design 3 4 210 35 Increasing Website Traffic 5 203 20 Web Graphics 1 6 303 15 Web Graphics 2 CLASS_UPDATES Obs classid maxstudents class_title 1 101 30 Basic Web Design 2 201 20 Intermediate Web Design 3 301 15 Advanced Web Design 4 220 15 Internet Security 5 010 40 Keyboarding WEBCLASS_SCHEDULE Obs classid sessionid startdate 1 101 A 09/03/2009 2 101 B 10/05/2009 3 201 A 10/05/2009 4 210 A 09/15/2009 5 220 A 09/21/2009 6 10 A 09/02/2009 7 10 B 10/07/2009 8 10 C 10/27/2009
Output 3.5 ALLWEBCLASSES TableExample 3.5 ALLWEBCLASSES Table Created with PROC SQL Obs classid sessionid startdate class_title maxstudents 1 010 A 09/02/2009 Keyboarding 40 2 010 B 10/07/2009 Keyboarding 40 3 010 C 10/27/2009 Keyboarding 40 4 101 A 09/03/2009 Basic Web Design 30 5 101 B 10/05/2009 Basic Web Design 30 6 201 A 10/05/2009 Intermediate Web Design 20 7 203 none . Web Graphics 1 20 8 210 A 09/15/2009 Increasing Website Traffic 35 9 220 A 09/21/2009 Internet Security 15 10 301 none . Advanced Web Design 15 11 303 none . Web Graphics 2 15 |
This example shows you how to combine several tables where the matching column can have different attributes.
The goal in combining all three tables is to update information in WEBCLASSES with information from CLASS_UPDATES, and then join all the rows in these two tables with the rows in the third table, WEBCLASS_SCHEDULE. The resulting table should have a row for each class whether or not it's scheduled to be held. Classes with multiple sessions have a row for each session.
A PROC SQL step combines the three tables by matching column CLASSID, which has different attributes in each of the tables. CLASSID is defined as character in tables WEBCLASSES and CLASS_UPDATES and as numeric in table WEBCLASS_SCHEDULE. The length of the character version of CLASSID is 8 in WEBCLASSES and 3 in CLASS_UPDATES.
Because all matching and nonmatching rows from all three tables are required in the resulting table, a full outer join combines the three tables. And because a full outer join can combine only two tables at a time, two full outer joins are required to produce the final table. The result of a nested subquery that combines WEBCLASSES and CLASS_UPDATES with a full join is combined with WEBCLASS_SCHEDULE in the second full join.
The matching variable for both full joins is CLASSID. To ensure a length of 3 for CLASSID in the subquery, the LENGTH= option is added to the column definition of CLASSID. Joins can be completed only on columns with the same type. In order to perform the second full join, which is a join based on a numeric column and a character column, one of the columns must be converted to the type of the other. The column from WEBCLASS_SCHEDULE is converted to character by using the PUT function.
Create table ALLWEBCLASSES. Select columns from the subquery results (alias c) and the WEBCLASS_SCHEDULE table (alias s). Place CLASSID from the subquery first so that the resulting column's attributes are the same as in the subquery. Convert the value of CLASSID from WEBCLASS_SCHEDULE to character because the arguments to COALESCE must have the same type. Assign the text 'name' to column SESSIONID for classes that are not found in WEBCLASS_SCHEDULE.
Assign an alias.
Perform a full join of WEBCLASS_SCHEDULE and the result of the subquery.
Specify a subquery that joins WEBCLASSES and CLASS_UPDATES and enclose it in parentheses. Apply the COALESCE function to the two CLASSID columns so that nonmatching rows will have a value for CLASSID. Override the length of the first argument to COALESCE, which is 8, by assigning a length of 3 to the results of the function. Apply the COALESCE function to columns in common. List first in each call the columns from the table with the updates, CLASS_UPDATES, which ensures new information overwrites old information. Perform a full join. Assign aliases.
Specify how to match the rows in the subquery. Assign an alias to the subquery result. Specify how to match the rows of table WEBCLASS_SCHEDULE and the subquery. Convert the numeric CLASSID column in WEBCLASS_SCHEDULE to character. Order the output rows.
proc sql; create table allwebclasses as select coalesce(c.classid,put(s.classid,z3.-l)) as classid, coalesce(sessionid,'none') as sessionid, startdate, class_title, maxstudents from webclass_schedule s full join (select coalesce(old.classid,new.classid) as classid length=3, coalesce(new.maxstudents,old.maxstudents) as maxstudents, coalesce(new.class_title,old.class_title) as class_title from webclasses old full join class_updates new on old.classid=new.classid) c on c.classid=put(s.classid,z3.-l) order by classid, startdate; quit;
A combination of PROC SORT steps and DATA steps can create a data set equivalent to the table that was created by PROC SQL in the main example.
The following program combines the three data sets by match-merging on variable CLASSID. As with PROC SQL, the DATA step cannot match observations by a variable that has different types without additional programming.
The goal is to match CLASSID in its character form and to define CLASSID as a 3-byte character variable in the output data set. Two steps are taken to accomplish this:
The first DATA step converts the numeric version of CLASSID to a 3-byte character variable.
The match-merge DATA step places a LENGTH statement prior to the MERGE statement that assigns a length of 3 bytes to CLASSID. The MERGE statement lists WEBCLASSES first. By default, the attributes of the first occurrence of a variable in the data sets in the MERGE statement define the attributes of the variable in the output data set. Because the length of CLASSID in WEBCLASSES is 8 bytes and the goal is to output CLASSID as a 3-byte character variable, a LENGTH statement is needed prior to the MERGE statement.
The match-merge DATA step produces the following message:
WARNING: Multiple lengths were specified for the BY variable Classid by input data sets and LENGTH, FORMAT, INFORMAT, or ATTRIB statements. This may cause unexpected results.
SAS generates this message because the length of CLASSID in WEBCLASSES is 8 and its length in the other two input data sets is 3. Because the data values are never wider than 3 bytes, this match-merge completes successfully. If you are uncertain about your data in a similar situation, you might want to add DATA steps to convert the smaller length versions of your variables to the longest length.
Create SCHEDULE2 to contain a character version of CLASSID. Prevent the numeric version of CLASSID from being put in the PDV by renaming it. Define the length and type of the character version of CLASSID.
Drop the numeric version of CLASSID, which was renamed earlier when it was read in. Convert the numeric values of NCLASSID to character.
Sort the input data sets. Sort the observations in SCHEDULE2 also by STARTDATE so that the observations in ALLWEBCLASSES will be arranged chronologically within each CLASSID.
Create data set ALLWEBCLASSES. Assign a type and length to CLASSID, which overrides the length of 8 that it has in WEBCLASSES. Name the data sets to match-merge. Add the IN= data set option so that it can be determined whether the current observation has data contributed from SCHEDULE2. Specify the matching variable. Assign text to SESSIONID for observations that are not found in SCHEDULE2.
data schedule2; set webclass_schedule(rename=(classid=nclassid)); length classid $ 3; drop nclassid; classid=put(nclassid,z3.-l); run; proc sort data=webclasses; by classid; run; proc sort data=class_updates; by classid; run; proc sort data=schedule2; by classid startdate; run; data allwebclasses; length classid $ 3; merge webclasses class_updates schedule2(in=insched); by classid; if not insched then sessionid='none'; run;
18.116.90.246