Example 8.2 Separating Unique Observations from Duplicate Observations Based on BY Values

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Example 8.2 Separating Unique Observations from Duplicate Observations Based on BY Values

Goal

According to their BY-group values, identify unique and duplicate observations in a data set.

Example Features

Featured Step	DATA step
Featured Step Options and Statements	FIRST.variable and LAST.variable temporary variables

Input Data Set

Data set COMPSCISTUDENTS contains enrollment information about four students. Note that three students have duplicate observations: two each for Fred Lopez and Martyn Ross and three for Janie Nguyen.

                              COMPSCISTUDENTS

 Obs  studentid   studentname    department               classid   credits
   1     91838   Nguyen, Janie   Computer Science           520        2
   2     91838   Nguyen, Janie   Speech                     476        3
   3     91838   Nguyen, Janie   Speech                     476        3
   4     91838   Nguyen, Janie   Speech                     476        3
   5    103722   Lopez, Fred     Computer Science           210        3
   6    103722   Lopez, Fred     Computer Science           210        3
   7    137193   Ross, Martyn    Computer Science           201        3
   8    137193   Ross, Martyn    Computer Science           220        2
   9    137193   Ross, Martyn    Technical Communication    201        4
  10    137193   Ross, Martyn    Technical Communication    201        4
  11    987175   Young, Kaitlyn  Design                     301        4
  12    987175   Young, Kaitlyn  Electrical Engineering     301        3

Resulting Data Sets

Output 8.2a OKCLASSES Data Set

           Example 8.2 OKCLASSES Data Set Created with DATA Step

 Obs  studentid   studentname    department              classid  credits
  1      91838   Nguyen, Janie   Computer Science          520       2
  2     137193   Ross, Martyn    Computer Science          201       3
  3     137193   Ross, Martyn    Computer Science          220       2
  4     987175   Young, Kaitlyn  Design                    301       4
  5     987175   Young, Kaitlyn  Electrical Engineering    301       3

Output 8.2b DUPCLASSES Data Set

          Example 8.2 DUPCLASSES Data Set Created with DATA Step

 Obs  studentid   studentname   department               classid  credits

  1      91838   Nguyen, Janie  Speech                     476       3
  2      91838   Nguyen, Janie  Speech                     476       3
  3      91838   Nguyen, Janie  Speech                     476       3
  4     103722   Lopez, Fred    Computer Science           210       3
  5     103722   Lopez, Fred    Computer Science           210       3
  6     137193   Ross, Martyn   Technical Communication    201       4
  7     137193   Ross, Martyn   Technical Communication    201       4

Example Overview

This example shows how to determine when there is more than one observation in a BY group and to separate the BY groups with multiple observations from BY groups with only one observation.

The DATA step requires BY-group processing. When you process your data in BY groups, SAS automatically creates the FIRST. and LAST. temporary variables for the variables named in the BY statement, which are used in this example. These variables are not saved to the output data sets.

Data set COMPSCISTUDENTS contains enrollment information about four computer science students. Two of the four have two identical observations each. One of the four has three identical observations. The goal is to separate the observations that are represented only once in a BY group from those represented more than once.

Assume that data set COMPSCISTUDENTS is sorted by the three variables, STUDENTID, DEPARTMENT, and CLASSID prior to the DATA step. The BY statement in the DATA step lists these three variables.

The third variable in the BY statement is CLASSID. The DATA step tests the temporary variables FIRST.CLASSID and LAST.CLASSID to determine when an observation is the only one in the BY group. It outputs observations to the appropriate data set based on the values of these temporary variables.

With data set COMPSCISTUDENTS sorted, the FIRST.CLASSID and LAST.CLASSID temporary variables will both equal 1 (true) when processing a single occurrence of a BY group. If either or both of the two variables are 0 (false), the DATA step is processing an observation from a BY group that has more than one observation with the same BY values.

Note that the DATA step determines duplicates only by the values of the variables in the BY statement. To make a more complete check of duplicate information in all variables in an observation, you would need to add code to test the values. This can be done by defining variables with a RETAIN statement that hold values across the observation that can be tested, or by using the LAG function.

Example 8.3 and Example 8.4 are similar to this example in that they also separate unique observations from duplicate observations. Both examples use PROC SORT, the DUPOUT= option, and either the NODUPRECS or NUPKEY option. They differ from this example in that they keep the first observation in a BY group that has multiple observations rather than separating it from BY groups with only one observation.

Program

Sort by the variables that define the BY groups.

Create two data sets, OKCLASSES and DUPCLASSES. Read each observation from COMPSCISTUDENTS. Define BY-group processing for COMPSCISTUDENTS. Test the values of the FIRST. and LAST. temporary variables defined for the third BY variable, CLASSID. Output an observation to OKCLASSES if it is the only one in the BY group. Output an observation to DUPCLASSES when either or both FIRST.CLASSID and LAST.CLASSID are 0 (false), which is when there is more than one observation in the BY group.

proc sort data=compscistudents;
  by studentid department classid;
run;
data okclasses dupclasses;

  set compscistudents;

  by studentid department classid;

  if first.classid and last.classid then
                output okclasses;



  else output dupclasses;




run;

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Example 8.2 Separating Unique Observations from Duplicate Observations Based on BY Values

Create new playlist

Sign In

Sign Up

Example 8.2 Separating Unique Observations from Duplicate Observations Based on BY Values

Goal

Example Features

Input Data Set

Resulting Data Sets

Output 8.2a OKCLASSES Data Set

Output 8.2b DUPCLASSES Data Set

Example Overview

Program

Table of Contents for
Example 8.2 Separating Unique Observations from Duplicate Observations Based on BY Values