Identifying and removing duplicate cases

Datasets may contain duplicate records that often must be removed before data mining can begin. For example, the same individual may appear multiple times in a dataset with different addresses. The Distinct node finds or removes duplicate records in a dataset. The Distinct node, located in the Record Ops palette, checks for duplicate records and identifies the cases that appear more than once in a file so they can be reviewed and/or removed.

A duplicate case is defined by having identical data values on one or more fields that are specified. Any number or combination of fields may be used to specify a duplicate:

Place a Distinct node from the Record Ops palette onto the canvas.
Connect the Sort node to the Distinct node.
Edit the Distinct node.

The Distinct node can be a bit tricky to use; this is why we will run this node a couple of times, and hopefully in this way its options will become well-defined. The Mode option controls how the Distinct node is used. To remove duplicates in a dataset, the Mode should be set to Include only the first record in each group; in this way, the first time a record is identified it will pass, however subsequent encounters will be removed. To identify duplicates, set the mode to Discard only the first record in each group; this option removes the record the first time it is identified, however subsequent encounters will be kept, and in this way the duplicates will be identified. The Create a composite record for each group option provides a way to aggregate non-numeric fields which you can specify on the Composite tab to create composite records.

In this example, we will use the Distinct node to identify all the duplicate cases in a data file. The fields that provide the basis for identification of duplicates are selected in the Key fields for grouping box.

Note that the Distinct node will function even if the data is not sorted. However, it is best to sort the data beforehand, either by using the Sort node or by specifying a sort field(s) in the Distinct node dialog itself, in the Within groups, sort records by: area. Sorting makes checking the operation of the Distinct node more straightforward, and clearly, if there is a natural ordering to the data such that the first record in a group of duplicates is to be preferred, sorting beforehand is a necessity.

Place all the fields in the Key fields for grouping: box (by using all of the fields, we are looking for cases that are exact duplicates in all of the fields).

Make sure Mode: Discard only the first record in each group is selected:

Click OK to return to the stream canvas.
Connect the Distinct node to a Table node.
Run the Table node:

There is only one case in the data file that is an exact duplicate. We will now remove this case:

Edit the Distinct node.
Select Include only the first record in each group for the Mode:

We will now remove the exact duplicate that was found previously in the data file:

Click OK.
Run the Table node.

Notice that we have one fewer record in the data file since we have now removed the duplicate case.

Table of Contents for Identifying and removing duplicate cases

Create new playlist

Sign In

Sign Up

Table of Contents for
Identifying and removing duplicate cases