One problem when cleaning up data is dealing with duplicates. How do we find them? What do we do with them once we have them? While a part of this process can be automated, often merging duplicated data is a manual task, because a person has to look at potential matches and determine whether they are duplicates or not and determining what needs to be done with the overlapping data. We can code heuristics, of course, but at some point, a person needs to make the final call.
The first question that needs to be answered is what constitutes identity for the data. If you have two items of data, which fields do you have to look at in order to determine whether they are duplicates? Then, you must determine how close they need to be.
For this recipe, we'll examine some data and decide on duplicates by doing a fuzzy comparison of the name fields. We'll simply return all of the pairs that appear to be duplicates.
First, we need to add the library to do the fuzzy string matching to our Leiningen project.clj
file:
(defproject cleaning-data "0.1.0-SNAPSHOT" :dependencies [[org.clojure/clojure "1.6.0"] [clj-diff "1.0.0-SNAPSHOT"]])
And to make sure that's available to our script or REPL:
(use 'clj-diff.core)
We'll first define a function to test for fuzzy equality. Then, we'll write another function that uses fuzzy equality to test whether two records match.
(def fuzzy-max-diff 2) (def fuzzy-percent-diff 0.1) (def fuzzy-dist edit-distance)
(defn fuzzy= [a b] (let [dist (fuzzy-dist a b)] (or (<= dist fuzzy-max-diff) (<= (/ dist (min (count a) (count b))) fuzzy-percent-diff))))
(defn records-match [key-fn a b] (let [kfns (if (sequential? key-fn) key-fn [key-fn]) rfn (fn [prev next-fn] (and prev (fuzzy= (next-fn a) (next-fn b))))] (reduce rfn true kfns)))
(def data {:mulder {:given-name "Fox" :surname "Mulder"} :molder {:given-name "Fox" :surname "Molder"} :mulder2 {:given-name "fox" :surname "mulder"} :scully {:given-name "Dana" :surname "Scully"} :scully2 {:given-name "Dan" :surname "Scully"}})
user=> (records-match [:given-name :surname] (data :mulder) (data :molder)) true user=> (records-match [:given-name :surname] (data :mulder) (data :mulder2)) true user=> (records-match [:given-name :surname] (data :scully) (data :scully2)) true user=> (records-match [:given-name :surname] (data :mulder) (data :scully)) false
The fuzzy string matching function uses several parameters. Let's take a look at each individually:
(def fuzzy-dist edit-distance)
fuzzy-dist
is a function that returns a similarity metric for the two strings. Lower numbers indicate that the two strings are more similar. In this case, we're using clj-diff.core/edit-distance
. The edit-distance
parameter is the number of editing operations (usually inserting and deleting a single character, with a change being a combination of these) required to transform one string into another. For example, here are the edit distances of a few simple strings:
user=> (edit-distance "abc" "bc") 1 user=> (edit-distance "abc" "abcd") 1 user=> (edit-distance "abc" "bac") 2 user=> (edit-distance "abc" "bbc") 2
Based off of these values, the maximum allowable distance is determined by the following two parameters:
(def fuzzy-max-diff 2)
First, for equality, the distance has to be at most fuzzy-max-diff
. Setting it to 2
allows replacements, which are generally two changes (delete and insert):
(def fuzzy-percent-diff 0.1)
The maximum distance can also be a percentage of the length of the shortest input string. In this case, we're using 10 percent as the maximum difference value that 2
can be.
If either of these two conditions is met, the strings are determined to be the same. This leads to two scenarios. No matter the length of the string, if only two characters change, it's the same. This is problematic for very short strings.
On the other hand, a hard maximum distance doesn't work for very long strings either. If, say for example, the value is 200 characters or more, you'll want to allow more absolute characters of difference than you would for a string of 20 characters. fuzzy-percent-diff
provides this flexibility.
As I mentioned, this will not handle short strings very well. For example, it will judge ace and are to be the same. We can make the logic more complicated by adding a clause that says only to use fuzzy-max-diff
if the length of the string is greater than some value.
In this recipe, we used clj-diff.core/edit-distance
. This measures the number of changes that need to be made in order to transform one string into the other with the single-character operations insert
and delete
. Another option is to use clj-diff.core/levenshtein-distance
, which also uses a single-character replace
operation.
3.135.246.245