Joining Multiple Tables and Views

So far, this chapter has presented PROC SQL queries that combine only two tables horizontally. However, there might be situations in which you have to create complex queries to combine more than two tables. Here is an example of a complex query that combines four different tables.

Example: Complex Query That Combines Four Tables

Suppose you want to list the names of supervisors for the crew on the flight to Copenhagen on March 4, 2000. To solve this problem, you need to use the following four tables.
Table
Relevant Columns
Sasuser.Flightschedule identifies the crew who flew to Copenhagen on March 4, 2000
EmpID, Date, Destination
Sasuser.Staffmaster identifies the names and states of residence for the employees
EmpID, FirstName, LastName, State
Sasuser.Payrollmaster identifies the job categories for the employees
EmpID, JobCode
Sasuser.Supervisors identifies the employees who are supervisors
EmpID, State, JobCategory
Note: Supervisors live in the same state as the employees that they supervise. There is one supervisor for each state and job category.
This problem can be handled in a number of different ways. Examine and compare three different techniques:
  • Technique 1: using PROC SQL subqueries, joins, and in-line views
  • Technique 2: using a multi-way join that combines four different tables and a reflexive join (joining a table with itself)
  • Technique 3: using traditional SAS programming (a series of PROC SORT and DATA steps, followed by a PROC PRINT step)

Example: Technique 1 (PROC SQL Subqueries, Joins, and In-Line Views)

Overview

Task
List the names of supervisors for the crew on the flight to Copenhagen on March 4, 2000.
Data
Sasuser.Flightschedule (EmpID, Date, Destination)
Sasuser.Staffmaster (EmpID, FirstName, LastName, State)
Sasuser.Payrollmaster (EmpID, JobCode)
Sasuser.Supervisors (EmpID, State, JobCategory)
Note: Supervisors live in the same state as the employees that they supervise. There is one supervisor for each state and job category.
Completing the stated task requires a complex query that includes several subqueries, joins, and an in-line view. To make the task more manageable, build the complex query piece-by-piece in four steps:
  1. Identify the crew for the Copenhagen flight.
  2. Find the states and job categories of the crew members that were returned by the first query.
  3. Find the employee numbers of the crew supervisors, based on the states and job categories that were returned by the second query.
  4. Find the names of the supervisors, based on the employee numbers that were returned by the third query.
Note that at each of the four steps, a new piece of the final query is added. The final query is included in the four separate pieces.

Query 1: Identify the Crew for the Copenhagen (CPH) Flight

This query lists the employee ID numbers of all six crew members on the Copenhagen flight:
proc sql;
   select empid
      from sasuser.flightschedule
      where date='04mar2000'd
            and destination='CPH';
Employee ID Numbers for Copenhagen Flight

Query 2: Find the States and Job Categories of the Crew Members

Query 1 becomes a subquery and returns the employee ID numbers of the six Copenhagen crew members to the outer query, Query 2. (Query 2 is shaded.) Query 2 uses an inner join to combine two tables. Query 2 selects the job category (by using the SUBSTR function to extract the job category from JobCode) and state for each of the six crew members.
proc sql;
   select substr(JobCode,1,2) as JobCategory,
          state 
      from sasuser.staffmaster as s,
           sasuser.payrollmaster as p
      where s.empid=p.empid and s.empid in
         (select empid
            from sasuser.flightschedule
            where date='04mar2000'd
                  and destination='CPH'),
Job Category and State for Copenhagen Crew Members

Query 3: Find the Employee Numbers of the Crew Supervisors

Query 2 becomes an in-line view within Query 3, and the alias c has been assigned to the in-line view. Query 2 returns to Query 3 the job category and state for each crew member. Query 3 selects the employee ID numbers for supervisors whose job category and state match the job category and state of a crew member.
Note: Sasuser.Supervisors specifies the label Supervisor ID for the EmpID column, and this label appears in the output.
proc sql;
   select empid
      from sasuser.supervisors as m,
         (select substr(jobcode,1,2) as JobCategory,
                 state
            from sasuser.staffmaster as s,
                 sasuser.payrollmaster as p
            where s.empid=p.empid and s.empid in
               (select empid
                  from sasuser.flightschedule
                  where date='04mar2000'd
                        and destination='CPH')) as c
      where m.jobcategory=c.jobcategory
            and m.state=c.state;
Supervisor ID Numbers, Matching Job Category and State for Crew Members
Note that two rows contain the same value of EmpID: 1983. This duplication indicates that two different crew members have the same manager. In all, there are five supervisors for the six crew members of the Copenhagen flight.

Query 4: Find the Names of the Supervisors

Query 3 becomes a subquery within Query 4. Query 3 returns to Query 4 the employee numbers (supervisor IDs) for the supervisors of the Copenhagen crew. Query 4 selects the names of the supervisors.
proc sql;
  select firstname, lastname
    from sasuser.staffmaster 
    where empid in
      (select empid
        from sasuser.supervisors as m,
          (select substr(jobcode,1,2) 
                  as JobCategory, 
                  state
             from sasuser.staffmaster as s,
                  sasuser.payrollmaster as p
             where s.empid=p.empid
                   and s.empid in
               (select empid
                 from sasuser.flightschedule
                 where date='04mar2000'd
                       and destination='CPH'))
                       as c
        where m.jobcategory=c.jobcategory
              and m.state=c.state);
Supervisor Names
Note that the output has five rows, one for each supervisor. The duplicate name of a supervisor has been eliminated.
Technique 1 produces a PROC SQL query that includes
  • four SELECT statements
  • four tables, each read separately.
This program is not optimized and, in addition, includes complex code that is likely to take a long time to write.

Example: Technique 2 (PROC SQL Multi-way Join with Reflexive Join)

Task
List the names of supervisors for the crew on the flight to Copenhagen on March 4, 2000.
Data
Sasuser.Flightschedule (EmpID, Date, Destination)
Sasuser.Staffmaster (EmpID, FirstName, LastName, State)
Sasuser.Payrollmaster (EmpID, JobCode)
Sasuser.Supervisors (EmpID, State, JobCategory)
Note: Supervisors live in the same state as the employees that they supervise. There is one supervisor for each state and job category.
You can also solve this problem by using a multi-way join with a reflexive join (joining a table to itself). The code is shown below:
proc sql;
   select distinct e.firstname, e.lastname
      from sasuser.flightschedule as a,
           sasuser.staffmaster as b,
           sasuser.payrollmaster as c,
           sasuser.supervisors as d,
           sasuser.staffmaster as e
      where a.date='04mar2000'd and
            a.destination='CPH' and
            a.empid=b.empid and
            a.empid=c.empid and
            d.jobcategory=substr(c.jobcode,1,2)
            and d.state=b.state
            and d.empid=e.empid;
Output: Multi-way Join with a Reflexive Join
Technique 2, which uses a multi-way join, provides a more efficient solution to the problem than Technique 1. In a multi-way join, PROC SQL joins two tables at a time and performs the joins in the most efficient order (the order minimizes the size of the Cartesian product). This multi-way join code is more difficult to build step-by-step than the code in Technique 1.
Note that Sasuser.Staffmaster is read two separate times in this query: this is the reflexive join. As you can see in the FROM clause, Sasuser.Staffmaster is assigned a different table alias each time it is read: first b, then e. The table is read the first time (alias b) to look up the states of the Copenhagen crew members, the second time (alias e) to look up the names of the supervisors.

Example: Technique 3 (Traditional SAS Programming)

Task
List the names of supervisors for the crew on the flight to Copenhagen on March 4, 2000.
Data
Sasuser.Flightschedule (EmpID, Date, Destination)
Sasuser.Staffmaster (EmpID, FirstName, LastName, State)
Sasuser.Payrollmaster (EmpID, JobCode)
Sasuser.Supervisors (EmpID, State, JobCategory)
Note: Supervisors live in the same state as the employees that they supervise. There is one supervisor for each state and job category.
For comparison, look at the traditional SAS programming that can be used to solve this problem. The code is shown below, followed by the output.
/* Find the crew for the flight. */

proc sort data=sasuser.flightschedule (drop=flightnumber)
          out=crew (keep=empid);
   where destination='CPH' and date='04MAR2000'd;
   by empid;
run;

/* Find the State and job code for the crew. */

proc sort data=sasuser.payrollmaster
               (keep=empid jobcode)
          out=payroll;
   by empid;
run;

proc sort data=sasuser.staffmaster
             (keep=empid state firstname lastname)
          out=staff;
   by empid;
run;

data st_cat (keep=state jobcategory);
   merge crew (in=c)
         staff
         payroll;
   by empid;
   if c;
   jobcategory=substr(jobcode,1,2);
run;

/* Find the supervisor IDs. */

proc sort
     data=st_cat;
        by jobcategory state;
run;

proc sort data=sasuser.supervisors
          out=superv;
   by jobcategory state;
run;

data super (keep=empid);
   merge st_cat(in=s)
         superv;
   by jobcategory state;
   if s;
run;

/* Find the names of the supervisors. */

proc sort data=super;
   by empid;
run;

data names(drop=empid);
   merge super (in=super)
         staff (keep=empid firstname lastname);
   by empid;
   if super;
run;

proc print data=names noobs uniform;
run;
Traditional SAS Programming
This output is not identical to the output of the PROC SQL approaches (Techniques 1 and 2). The SQL queries eliminated the duplicate names that are seen here. When you use Technique 3, you can eliminate duplicates by adding the NODUPKEY option to the last PROC SORT statement, as shown below:
proc sort data=super nodupkey;
Based on a mainframe benchmark in batch mode, the SQL queries use less CPU time, but more I/O operations, than this non-SQL program.
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.220.184.6