Example 3.1 Merging Data Sets by a Common Variable

Goal

Match and merge observations from two data sets based on a variable common to both. To make the new data set more informative, create a new variable whose values indicate the data set origin of each observation, and replace the missing values that result from the merge operation with meaningful values.

Example Features

Featured StepDATA step
Featured Step Options and StatementsMatch-merge with MERGE and BY statements IN= data set option
Related TechniquePROC SQL, FULL JOIN, COALESCE function

Input Data Sets

Data set TASKS contains task assignments for five employees. Data set HOURS contains the available hours for five employees. Some employees have data in only one of the two data sets while others have data in both. Both TASKS and HOURS are sorted by ID.

               TASKS

Obs   id    name     dept     task
 1   DDQ02  Diana    B45    Document
 2   FXB03  Fred     B45    Survey
 3   MRF08  Monique  A12    Document
 4   MSG01  Miguel   A12    Document
 5   VTN18  Vien     D03    Survey

          HOURS
                    available_
Obs   id    name       hours
 1   DDQ02  Diana        40
 2   FXB03  Fred         35
 3   MRF08  Monique      37
 4   SWL14  Steve         0
 5   VTN18  Vien         42

Resulting Data Set

Output 3.1 TASK_STATUS Data Set

                Example 3.1 TASK_STATUS Data Set
                                                    available_
 Obs   origin    id     name      dept   task          hours

  1    both     DDQ02   Diana     B45    Document       40
  2    both     FXB03   Fred      B45    Survey         35
  3    both     MRF08   Monique   A12    Document       37
  4    tasks    MSG01   Miguel    A12    Document        0
  5    hours    SWL14   Steve     NEW    NONE            0
  6    both     VTN18   Vien      D03    Survey         42


Example Overview

This program demonstrates simple match-merging of two data sets by the values of one variable in common. The two data sets are sorted by the common variable prior to the match-merge DATA step.

Data set TASKS contains task assignments for five employees. Data set HOURS contains available hours for five employees. The two data sets have two variables in common: ID and NAME. The DATA step uses the values of variable ID to match the two data sets by value.

The program has three objectives:

  • create a single data set that matches each individual with the correct departmental and project information based on corresponding ID values

  • add a new variable that indicates the origin of that information (i.e., data set TASKS or HOURS or both data sets)

  • add meaningful information where values are missing

The MERGE and BY statements in the DATA step match and merge data sets TASKS and HOURS by variable ID to create output data set TASK_STATUS. The IN= data set option is applied to both input data sets in the MERGE statement so that the data set origin of the observation can be determined.

The first group of IF-THEN/ELSE statements tests the values of temporary variables INTASKS and INHOURS that are defined by the IN= data set options to determine the origin of the observation and assign a value to ORIGIN.

The second group of IF-THEN/ELSE statements tests if the values of variables that are unique to one data set are missing so that meaningful values can replace the missing values that result from the merge operation.

Program

Create data set TASK_STATUS.

Merge the observations. Add the IN= data set option to both data sets so that the data set origin of each observation can be determined. Match the observations by the values of ID. Assign values to ORIGIN based on the data set origin of the observation that is currently being processed. Assign 'both' to ORIGIN when the employee has data in both data sets. Assign 'tasks' to ORIGIN when the employee has data only in TASKS. Assign 'hours' to ORIGIN when the employee has data only in HOURS. Assign values to DEPT and TASK when either is missing, which occurs when an observation in HOURS does not have a match in TASKS. Assign a value to AVAILABLE_HOURS when it is missing, which occurs when an observation in TASKS does not have a match in HOURS.

data task_status;
  length origin $ 5;
  merge tasks(in=intasks) hours(in=inhours);


  by id;


  if intasks and inhours then origin='both';

  else if intasks then origin='tasks';

  else if inhours then origin='hours';

  if missing(dept) then dept='NEW';

  if missing(task) then task='NONE';

  if missing(available_hours) then available_hours=0;


run;

Related Technique

The following PROC SQL step creates a table equivalent to the data set that was created by the DATA step in the main example. A full join by ID combines the two tables.

A CASE expression defines column ORIGIN. The WHEN clauses test whether an ID value is missing in either TASKS or HOURS. When a row from each table is joined by ID, neither value of ID is missing. When an ID value in one table is not found in the other, the output row contains data from only the one table.

The COALESCE function returns the first nonmissing value in the list of arguments that is supplied to it. The arguments to COALESCE can be any combination of column names, text values, or expressions. The last three calls to COALESCE replace missing values with text in the DEPT, TASK, and AVAILABLE_HOURS columns. Replacement with text occurs when a row is found in only one table.

Create table TASK_STATUS. Define column ORIGIN with a CASE expression. Assign 'both' to ORIGIN when twp rows are joined by an ID. Assign 'hours' to ORIGIN when an ID value is found only in HOURS, which occurs when TASKS does not contain a matching rows. Complete the case expression by assigning a value to ORIGIN when a row has missing values in both TASKS and HOURS, which does not occur in this example. Terminate the case expression. Name the column and assign it a length. Return the first nonmissing value of the pair of columns within parentheses. Assign a value if the first argument to the COALESCE function is missing.

Select all matching and nonmatching rows by performing a full join of the two tables by ID

proc sql;
  create table task_status as
    select case

              when tasks.id is not missing and
                    hours.ids is not missing then 'both'
              when hours.id is not missing then 'hours'


              when tasks.id is not missing then 'tasks'


              else '?????'



           end as origin length=5,

           coalesce(tasks.id,hours.id) as id,
           coalesce(tasks.name,hours.name) as name,
           
           coalesce(tasks.task,'NONE') as task,
           coalesce(tasks.dept,'NEW') as dept,
           coalesce(hours.available_hours,0)as
                              available_hours
        from tasks full join hours on tasks.id=hours.id;

quit;

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.12.104.243