Example 3.5 Combining Multiple Tables When the Matching Column Has Different Attributes

Goal

Combine multiple tables where the matching column has different attributes such as length and can be stored as either numeric or character.

Example Features

Featured StepPROC SQL
Featured Step Options and StatementsFULL OUTER JOIN COALESCE function Subquery
Related TechniquePROC SORT and DATA step match-merge

Input Tables

Table WEBCLASSES contains information about six classes. Table CLASS_UPDATES contains updates on some of the classes in WEBCLASSES, and it includes new classes. Table WEBCLASS_SCHEDULE contains scheduling information for classes and their section in fall 2009.

Column CLASSID is in common among the three tables. However, it is defined as character with different lengths in tables WEBCLASSES and CLASS_UPDATES, and it is defined as numeric in WEBCLASS_SCHEDULE.

              WEBCLASSES

Obs classid  maxstudents class_title
 1    101         25     Web Design 1
 2    201         15     Web Design 2
 3    301         15     Web Design 3
 4    210         35     Increasing Website Traffic
 5    203         20     Web Graphics 1
 6    303         15     Web Graphics 2

             CLASS_UPDATES

Obs classid  maxstudents class_title
 1    101         30     Basic Web Design
 2    201         20     Intermediate Web Design
 3    301         15     Advanced Web Design
 4    220         15     Internet Security
 5    010         40     Keyboarding

        WEBCLASS_SCHEDULE

Obs classid sessionid   startdate
 1    101       A      09/03/2009
 2    101       B      10/05/2009
 3    201       A      10/05/2009
 4    210       A      09/15/2009
 5    220       A      09/21/2009
 6     10       A      09/02/2009
 7     10       B      10/07/2009
 8     10       C      10/27/2009

Resulting Table

Output 3.5 ALLWEBCLASSES Table

       Example 3.5 ALLWEBCLASSES Table Created with PROC SQL

Obs classid sessionid  startdate class_title                maxstudents

  1   010     A       09/02/2009 Keyboarding                     40
  2   010     B       10/07/2009 Keyboarding                     40
  3   010     C       10/27/2009 Keyboarding                     40
  4   101     A       09/03/2009 Basic Web Design                30
  5   101     B       10/05/2009 Basic Web Design                30
  6   201     A       10/05/2009 Intermediate Web Design         20
  7   203     none             . Web Graphics 1                  20
  8   210     A       09/15/2009 Increasing Website Traffic      35
  9   220     A       09/21/2009 Internet Security               15
 10   301     none             . Advanced Web Design             15
 11   303     none             . Web Graphics 2                  15


Example Overview

This example shows you how to combine several tables where the matching column can have different attributes.

The goal in combining all three tables is to update information in WEBCLASSES with information from CLASS_UPDATES, and then join all the rows in these two tables with the rows in the third table, WEBCLASS_SCHEDULE. The resulting table should have a row for each class whether or not it's scheduled to be held. Classes with multiple sessions have a row for each session.

A PROC SQL step combines the three tables by matching column CLASSID, which has different attributes in each of the tables. CLASSID is defined as character in tables WEBCLASSES and CLASS_UPDATES and as numeric in table WEBCLASS_SCHEDULE. The length of the character version of CLASSID is 8 in WEBCLASSES and 3 in CLASS_UPDATES.

Because all matching and nonmatching rows from all three tables are required in the resulting table, a full outer join combines the three tables. And because a full outer join can combine only two tables at a time, two full outer joins are required to produce the final table. The result of a nested subquery that combines WEBCLASSES and CLASS_UPDATES with a full join is combined with WEBCLASS_SCHEDULE in the second full join.

The matching variable for both full joins is CLASSID. To ensure a length of 3 for CLASSID in the subquery, the LENGTH= option is added to the column definition of CLASSID. Joins can be completed only on columns with the same type. In order to perform the second full join, which is a join based on a numeric column and a character column, one of the columns must be converted to the type of the other. The column from WEBCLASS_SCHEDULE is converted to character by using the PUT function.

Program

Create table ALLWEBCLASSES. Select columns from the subquery results (alias c) and the WEBCLASS_SCHEDULE table (alias s). Place CLASSID from the subquery first so that the resulting column's attributes are the same as in the subquery. Convert the value of CLASSID from WEBCLASS_SCHEDULE to character because the arguments to COALESCE must have the same type. Assign the text 'name' to column SESSIONID for classes that are not found in WEBCLASS_SCHEDULE.

Assign an alias.

Perform a full join of WEBCLASS_SCHEDULE and the result of the subquery.

Specify a subquery that joins WEBCLASSES and CLASS_UPDATES and enclose it in parentheses. Apply the COALESCE function to the two CLASSID columns so that nonmatching rows will have a value for CLASSID. Override the length of the first argument to COALESCE, which is 8, by assigning a length of 3 to the results of the function. Apply the COALESCE function to columns in common. List first in each call the columns from the table with the updates, CLASS_UPDATES, which ensures new information overwrites old information. Perform a full join. Assign aliases.

Specify how to match the rows in the subquery. Assign an alias to the subquery result. Specify how to match the rows of table WEBCLASS_SCHEDULE and the subquery. Convert the numeric CLASSID column in WEBCLASS_SCHEDULE to character. Order the output rows.

proc sql;
  create table allwebclasses as
    select coalesce(c.classid,put(s.classid,z3.-l))
                          as classid,







           coalesce(sessionid,'none') as sessionid,

           startdate, class_title, maxstudents
        from webclass_schedule s
       
               full join


             (select coalesce(old.classid,new.classid)
                              as classid length=3,
  

                coalesce(new.maxstudents,old.maxstudents)
                               as maxstudents,
                coalesce(new.class_title,old.class_title)
                               as class_title


             from webclasses old
                    full join
                  class_updates new
             on old.classid=new.classid)

           c
           on c.classid=put(s.classid,z3.-l)




        order by classid, startdate;
quit;

Related Technique

A combination of PROC SORT steps and DATA steps can create a data set equivalent to the table that was created by PROC SQL in the main example.

The following program combines the three data sets by match-merging on variable CLASSID. As with PROC SQL, the DATA step cannot match observations by a variable that has different types without additional programming.

The goal is to match CLASSID in its character form and to define CLASSID as a 3-byte character variable in the output data set. Two steps are taken to accomplish this:

  • The first DATA step converts the numeric version of CLASSID to a 3-byte character variable.

  • The match-merge DATA step places a LENGTH statement prior to the MERGE statement that assigns a length of 3 bytes to CLASSID. The MERGE statement lists WEBCLASSES first. By default, the attributes of the first occurrence of a variable in the data sets in the MERGE statement define the attributes of the variable in the output data set. Because the length of CLASSID in WEBCLASSES is 8 bytes and the goal is to output CLASSID as a 3-byte character variable, a LENGTH statement is needed prior to the MERGE statement.

The match-merge DATA step produces the following message:

WARNING: Multiple lengths were specified for the BY variable
         Classid by input data sets and LENGTH, FORMAT, INFORMAT,
         or ATTRIB statements. This may cause unexpected results.

SAS generates this message because the length of CLASSID in WEBCLASSES is 8 and its length in the other two input data sets is 3. Because the data values are never wider than 3 bytes, this match-merge completes successfully. If you are uncertain about your data in a similar situation, you might want to add DATA steps to convert the smaller length versions of your variables to the longest length.

Create SCHEDULE2 to contain a character version of CLASSID. Prevent the numeric version of CLASSID from being put in the PDV by renaming it. Define the length and type of the character version of CLASSID.

Drop the numeric version of CLASSID, which was renamed earlier when it was read in. Convert the numeric values of NCLASSID to character.

Sort the input data sets. Sort the observations in SCHEDULE2 also by STARTDATE so that the observations in ALLWEBCLASSES will be arranged chronologically within each CLASSID.

Create data set ALLWEBCLASSES. Assign a type and length to CLASSID, which overrides the length of 8 that it has in WEBCLASSES. Name the data sets to match-merge. Add the IN= data set option so that it can be determined whether the current observation has data contributed from SCHEDULE2. Specify the matching variable. Assign text to SESSIONID for observations that are not found in SCHEDULE2.

data schedule2;

  set webclass_schedule(rename=(classid=nclassid));


  length classid $ 3;

  drop nclassid;


  classid=put(nclassid,z3.-l);

run;
proc sort data=webclasses;
  by classid;
run;
proc sort data=class_updates;
  by classid;
run;
proc sort data=schedule2;
  by classid startdate;
run;
data allwebclasses;
  length classid $ 3;


  merge webclasses class_updates schedule2(in=insched);


    by classid;
  if not insched then sessionid='none';


run;

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.116.90.246