Match and merge observations from two data sets based on a variable common to both. To make the new data set more informative, create a new variable whose values indicate the data set origin of each observation, and replace the missing values that result from the merge operation with meaningful values.
Featured Step | DATA step |
Featured Step Options and Statements | Match-merge with MERGE and BY statements IN= data set option |
Related Technique | PROC SQL, FULL JOIN, COALESCE function |
Data set TASKS contains task assignments for five employees. Data set HOURS contains the available hours for five employees. Some employees have data in only one of the two data sets while others have data in both. Both TASKS and HOURS are sorted by ID.
TASKS Obs id name dept task 1 DDQ02 Diana B45 Document 2 FXB03 Fred B45 Survey 3 MRF08 Monique A12 Document 4 MSG01 Miguel A12 Document 5 VTN18 Vien D03 Survey
HOURS available_ Obs id name hours 1 DDQ02 Diana 40 2 FXB03 Fred 35 3 MRF08 Monique 37 4 SWL14 Steve 0 5 VTN18 Vien 42
Output 3.1 TASK_STATUS Data SetExample 3.1 TASK_STATUS Data Set available_ Obs origin id name dept task hours 1 both DDQ02 Diana B45 Document 40 2 both FXB03 Fred B45 Survey 35 3 both MRF08 Monique A12 Document 37 4 tasks MSG01 Miguel A12 Document 0 5 hours SWL14 Steve NEW NONE 0 6 both VTN18 Vien D03 Survey 42 |
This program demonstrates simple match-merging of two data sets by the values of one variable in common. The two data sets are sorted by the common variable prior to the match-merge DATA step.
Data set TASKS contains task assignments for five employees. Data set HOURS contains available hours for five employees. The two data sets have two variables in common: ID and NAME. The DATA step uses the values of variable ID to match the two data sets by value.
The program has three objectives:
create a single data set that matches each individual with the correct departmental and project information based on corresponding ID values
add a new variable that indicates the origin of that information (i.e., data set TASKS or HOURS or both data sets)
add meaningful information where values are missing
The MERGE and BY statements in the DATA step match and merge data sets TASKS and HOURS by variable ID to create output data set TASK_STATUS. The IN= data set option is applied to both input data sets in the MERGE statement so that the data set origin of the observation can be determined.
The first group of IF-THEN/ELSE statements tests the values of temporary variables INTASKS and INHOURS that are defined by the IN= data set options to determine the origin of the observation and assign a value to ORIGIN.
The second group of IF-THEN/ELSE statements tests if the values of variables that are unique to one data set are missing so that meaningful values can replace the missing values that result from the merge operation.
Create data set TASK_STATUS.
Merge the observations. Add the IN= data set option to both data sets so that the data set origin of each observation can be determined. Match the observations by the values of ID. Assign values to ORIGIN based on the data set origin of the observation that is currently being processed. Assign 'both' to ORIGIN when the employee has data in both data sets. Assign 'tasks' to ORIGIN when the employee has data only in TASKS. Assign 'hours' to ORIGIN when the employee has data only in HOURS. Assign values to DEPT and TASK when either is missing, which occurs when an observation in HOURS does not have a match in TASKS. Assign a value to AVAILABLE_HOURS when it is missing, which occurs when an observation in TASKS does not have a match in HOURS.
data task_status; length origin $ 5; merge tasks(in=intasks) hours(in=inhours); by id; if intasks and inhours then origin='both'; else if intasks then origin='tasks'; else if inhours then origin='hours'; if missing(dept) then dept='NEW'; if missing(task) then task='NONE'; if missing(available_hours) then available_hours=0; run;
The following PROC SQL step creates a table equivalent to the data set that was created by the DATA step in the main example. A full join by ID combines the two tables.
A CASE expression defines column ORIGIN. The WHEN clauses test whether an ID value is missing in either TASKS or HOURS. When a row from each table is joined by ID, neither value of ID is missing. When an ID value in one table is not found in the other, the output row contains data from only the one table.
The COALESCE function returns the first nonmissing value in the list of arguments that is supplied to it. The arguments to COALESCE can be any combination of column names, text values, or expressions. The last three calls to COALESCE replace missing values with text in the DEPT, TASK, and AVAILABLE_HOURS columns. Replacement with text occurs when a row is found in only one table.
Create table TASK_STATUS. Define column ORIGIN with a CASE expression. Assign 'both' to ORIGIN when twp rows are joined by an ID. Assign 'hours' to ORIGIN when an ID value is found only in HOURS, which occurs when TASKS does not contain a matching rows. Complete the case expression by assigning a value to ORIGIN when a row has missing values in both TASKS and HOURS, which does not occur in this example. Terminate the case expression. Name the column and assign it a length. Return the first nonmissing value of the pair of columns within parentheses. Assign a value if the first argument to the COALESCE function is missing.
Select all matching and nonmatching rows by performing a full join of the two tables by ID
proc sql; create table task_status as select case when tasks.id is not missing and hours.ids is not missing then 'both' when hours.id is not missing then 'hours' when tasks.id is not missing then 'tasks' else '?????' end as origin length=5, coalesce(tasks.id,hours.id) as id, coalesce(tasks.name,hours.name) as name, coalesce(tasks.task,'NONE') as task, coalesce(tasks.dept,'NEW') as dept, coalesce(hours.available_hours,0)as available_hours from tasks full join hours on tasks.id=hours.id; quit;
18.118.32.222