Identifying and removing duplicate data

One problem when cleaning up data is dealing with duplicates. How do we find them? What do we do with them once we have them? While a part of this process can be automated, often merging duplicated data is a manual task, because a person has to look at potential matches and determine whether they are duplicates or not and determining what needs to be done with the overlapping data. We can code heuristics, of course, but at some point, a person needs to make the final call.

The first question that needs to be answered is what constitutes identity for the data. If you have two items of data, which fields do you have to look at in order to determine whether they are duplicates? Then, you must determine how close they need to be.

For this recipe, we'll examine some data and decide on duplicates by doing a fuzzy comparison of the name fields. We'll simply return all of the pairs that appear to be duplicates.

Getting ready

First, we need to add the library to do the fuzzy string matching to our Leiningen project.clj file:

(defproject cleaning-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [clj-diff "1.0.0-SNAPSHOT"]])

And to make sure that's available to our script or REPL:

(use 'clj-diff.core)

How to do it…

We'll first define a function to test for fuzzy equality. Then, we'll write another function that uses fuzzy equality to test whether two records match.

  1. Here are the main parameters for fuzzy string matching. We'll see how to use these later in the recipe:
    (def fuzzy-max-diff 2)
    (def fuzzy-percent-diff 0.1)
    (def fuzzy-dist edit-distance)
  2. Now, we can define a function that uses these parameters to determine whether two strings are equal to each other:
    (defn fuzzy= [a b]
      (let [dist (fuzzy-dist a b)]
        (or (<= dist fuzzy-max-diff)
            (<= (/ dist (min (count a) (count b)))
                fuzzy-percent-diff))))
  3. Building on this, we can write a function that determines whether two records are the same. It also takes one or more key functions, which returns the values that the items should be compared on:
    (defn records-match
     [key-fn a b]
      (let [kfns (if (sequential? key-fn) key-fn [key-fn])
            rfn (fn [prev next-fn]
                  (and prev (fuzzy= (next-fn a)
                                    (next-fn b))))]
        (reduce rfn true kfns)))
  4. These should allow you to test whether two records are approximately equal. Let's create some data to test this out:
    (def data
      {:mulder {:given-name "Fox" :surname "Mulder"}
       :molder {:given-name "Fox" :surname "Molder"}
       :mulder2 {:given-name "fox" :surname "mulder"}
       :scully {:given-name "Dana" :surname "Scully"}
       :scully2 {:given-name "Dan" :surname "Scully"}})
  5. Now, we can test some of these for equality:
    user=> (records-match [:given-name :surname]
                          (data :mulder) (data :molder))
    true
    user=> (records-match [:given-name :surname] 
                          (data :mulder) (data :mulder2))
    true
    user=> (records-match [:given-name :surname] 
                          (data :scully) (data :scully2))
    true
    user=> (records-match [:given-name :surname] 
                          (data :mulder) (data :scully))
    false

How it works…

The fuzzy string matching function uses several parameters. Let's take a look at each individually:

(def fuzzy-dist edit-distance)

fuzzy-dist is a function that returns a similarity metric for the two strings. Lower numbers indicate that the two strings are more similar. In this case, we're using clj-diff.core/edit-distance. The edit-distance parameter is the number of editing operations (usually inserting and deleting a single character, with a change being a combination of these) required to transform one string into another. For example, here are the edit distances of a few simple strings:

user=> (edit-distance "abc" "bc")
1
user=> (edit-distance "abc" "abcd")
1
user=> (edit-distance "abc" "bac")
2
user=> (edit-distance "abc" "bbc")
2

Based off of these values, the maximum allowable distance is determined by the following two parameters:

(def fuzzy-max-diff 2)

First, for equality, the distance has to be at most fuzzy-max-diff. Setting it to 2 allows replacements, which are generally two changes (delete and insert):

(def fuzzy-percent-diff 0.1)

The maximum distance can also be a percentage of the length of the shortest input string. In this case, we're using 10 percent as the maximum difference value that 2 can be.

If either of these two conditions is met, the strings are determined to be the same. This leads to two scenarios. No matter the length of the string, if only two characters change, it's the same. This is problematic for very short strings.

On the other hand, a hard maximum distance doesn't work for very long strings either. If, say for example, the value is 200 characters or more, you'll want to allow more absolute characters of difference than you would for a string of 20 characters. fuzzy-percent-diff provides this flexibility.

There's more…

As I mentioned, this will not handle short strings very well. For example, it will judge ace and are to be the same. We can make the logic more complicated by adding a clause that says only to use fuzzy-max-diff if the length of the string is greater than some value.

In this recipe, we used clj-diff.core/edit-distance. This measures the number of changes that need to be made in order to transform one string into the other with the single-character operations insert and delete. Another option is to use clj-diff.core/levenshtein-distance, which also uses a single-character replace operation.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.135.246.245