According to their BY-group values, identify unique and duplicate observations in a data set.
Featured Step | DATA step |
Featured Step Options and Statements | FIRST.variable and LAST.variable temporary variables |
Data set COMPSCISTUDENTS contains enrollment information about four students. Note that three students have duplicate observations: two each for Fred Lopez and Martyn Ross and three for Janie Nguyen.
COMPSCISTUDENTS Obs studentid studentname department classid credits 1 91838 Nguyen, Janie Computer Science 520 2 2 91838 Nguyen, Janie Speech 476 3 3 91838 Nguyen, Janie Speech 476 3 4 91838 Nguyen, Janie Speech 476 3 5 103722 Lopez, Fred Computer Science 210 3 6 103722 Lopez, Fred Computer Science 210 3 7 137193 Ross, Martyn Computer Science 201 3 8 137193 Ross, Martyn Computer Science 220 2 9 137193 Ross, Martyn Technical Communication 201 4 10 137193 Ross, Martyn Technical Communication 201 4 11 987175 Young, Kaitlyn Design 301 4 12 987175 Young, Kaitlyn Electrical Engineering 301 3
Output 8.2a OKCLASSES Data SetExample 8.2 OKCLASSES Data Set Created with DATA Step Obs studentid studentname department classid credits 1 91838 Nguyen, Janie Computer Science 520 2 2 137193 Ross, Martyn Computer Science 201 3 3 137193 Ross, Martyn Computer Science 220 2 4 987175 Young, Kaitlyn Design 301 4 5 987175 Young, Kaitlyn Electrical Engineering 301 3 |
Output 8.2b DUPCLASSES Data SetExample 8.2 DUPCLASSES Data Set Created with DATA Step Obs studentid studentname department classid credits 1 91838 Nguyen, Janie Speech 476 3 2 91838 Nguyen, Janie Speech 476 3 3 91838 Nguyen, Janie Speech 476 3 4 103722 Lopez, Fred Computer Science 210 3 5 103722 Lopez, Fred Computer Science 210 3 6 137193 Ross, Martyn Technical Communication 201 4 7 137193 Ross, Martyn Technical Communication 201 4 |
This example shows how to determine when there is more than one observation in a BY group and to separate the BY groups with multiple observations from BY groups with only one observation.
The DATA step requires BY-group processing. When you process your data in BY groups, SAS automatically creates the FIRST. and LAST. temporary variables for the variables named in the BY statement, which are used in this example. These variables are not saved to the output data sets.
Data set COMPSCISTUDENTS contains enrollment information about four computer science students. Two of the four have two identical observations each. One of the four has three identical observations. The goal is to separate the observations that are represented only once in a BY group from those represented more than once.
Assume that data set COMPSCISTUDENTS is sorted by the three variables, STUDENTID, DEPARTMENT, and CLASSID prior to the DATA step. The BY statement in the DATA step lists these three variables.
The third variable in the BY statement is CLASSID. The DATA step tests the temporary variables FIRST.CLASSID and LAST.CLASSID to determine when an observation is the only one in the BY group. It outputs observations to the appropriate data set based on the values of these temporary variables.
With data set COMPSCISTUDENTS sorted, the FIRST.CLASSID and LAST.CLASSID temporary variables will both equal 1 (true) when processing a single occurrence of a BY group. If either or both of the two variables are 0 (false), the DATA step is processing an observation from a BY group that has more than one observation with the same BY values.
Note that the DATA step determines duplicates only by the values of the variables in the BY statement. To make a more complete check of duplicate information in all variables in an observation, you would need to add code to test the values. This can be done by defining variables with a RETAIN statement that hold values across the observation that can be tested, or by using the LAG function.
Example 8.3 and Example 8.4 are similar to this example in that they also separate unique observations from duplicate observations. Both examples use PROC SORT, the DUPOUT= option, and either the NODUPRECS or NUPKEY option. They differ from this example in that they keep the first observation in a BY group that has multiple observations rather than separating it from BY groups with only one observation.
Sort by the variables that define the BY groups.
Create two data sets, OKCLASSES and DUPCLASSES. Read each observation from COMPSCISTUDENTS. Define BY-group processing for COMPSCISTUDENTS. Test the values of the FIRST. and LAST. temporary variables defined for the third BY variable, CLASSID. Output an observation to OKCLASSES if it is the only one in the BY group. Output an observation to DUPCLASSES when either or both FIRST.CLASSID and LAST.CLASSID are 0 (false), which is when there is more than one observation in the BY group.
proc sort data=compscistudents; by studentid department classid; run; data okclasses dupclasses; set compscistudents; by studentid department classid; if first.classid and last.classid then output okclasses; else output dupclasses; run;
18.118.9.197