Datasets may contain duplicate records that often must be removed before data mining can begin. For example, the same individual may appear multiple times in a dataset with different addresses. The Distinct node finds or removes duplicate records in a dataset. The Distinct node, located in the Record Ops palette, checks for duplicate records and identifies the cases that appear more than once in a file so they can be reviewed and/or removed.
A duplicate case is defined by having identical data values on one or more fields that are specified. Any number or combination of fields may be used to specify a duplicate:
- Place a Distinct node from the Record Ops palette onto the canvas.
- Connect the Sort node to the Distinct node.
- Edit the Distinct node.
The Distinct node can be a bit tricky to use; this is why we will run this node a couple of times, and hopefully in this way its options will become well-defined. The Mode option controls how the Distinct node is used. To remove duplicates in a dataset, the Mode should be set to Include only the first record in each group; in this way, the first time a record is identified it will pass, however subsequent encounters will be removed. To identify duplicates, set the mode to Discard only the first record in each group; this option removes the record the first time it is identified, however subsequent encounters will be kept, and in this way the duplicates will be identified. The Create a composite record for each group option provides a way to aggregate non-numeric fields which you can specify on the Composite tab to create composite records.
In this example, we will use the Distinct node to identify all the duplicate cases in a data file. The fields that provide the basis for identification of duplicates are selected in the Key fields for grouping box.
- Place all the fields in the Key fields for grouping: box (by using all of the fields, we are looking for cases that are exact duplicates in all of the fields).
- Make sure Mode: Discard only the first record in each group is selected:
- Click OK to return to the stream canvas.
- Connect the Distinct node to a Table node.
- Run the Table node:
There is only one case in the data file that is an exact duplicate. We will now remove this case:
- Edit the Distinct node.
- Select Include only the first record in each group for the Mode:
We will now remove the exact duplicate that was found previously in the data file:
- Click OK.
- Run the Table node.
Notice that we have one fewer record in the data file since we have now removed the duplicate case.