Maintaining consistency with synonym maps

One common problem with data is inconsistency. Sometimes a value is capitalized and sometimes it is not. Sometimes it is abbreviated and sometimes it is spelled out in full. At other times, it is simply misspelled.

When the field is an open domain, such as words in free text, the problem can be quite difficult. However, when the data represents a limited vocabulary (such as US state names, our example here), there's a simple trick that can help. Full state names are common, but standard postal codes are also often used. A mapping from common forms or mistakes to a normalized form is an easy way to fix variants in a field.

Getting ready

For the project.clj file, we'll use a very simple configuration:

(defproject cleaning-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]])

We just need to make sure that the clojure.string/upper-case function is available to us:

(use '[clojure.string :only (upper-case)])

How to do it…

  1. For this recipe, we'll define the synonym map and a function to use it. Then, we'll see it in action. First, we'll define the mapping to a normalized form. We won't list all of the states here, but you should get the idea:
    (def state-synonyms
      {"ALABAMA" "AL",
       "ALASKA" "AK",
       "ARIZONA" "AZ",
       …
       "WISCONSIN" "WI",
       "WYOMING" "WY"})
  2. We'll wrap it in a function that uppercases the input before looking it up in the mapping, as shown here:
    (defn normalize-state [state]
      (let [uc-state (upper-case state)]
        (state-synonyms uc-state uc-state)))
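  3. Then, we just call normalize-state with the strings we want to fix. Note that "Fla" only becomes "FL" if the map contains an entry for "FLA"; the abbreviated listing in step 1 does not show one: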
    user=> (map normalize-state
            ["Alabama" "OR" "Va" "Fla"])
    ("AL" "OR" "VA" "FL")
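In practice, the values to fix usually sit inside records rather than in a bare vector of strings. The following is a minimal sketch of that case, not part of the original recipe. The names demo-synonyms, normalize-demo, and rows, and the :state key, are hypothetical, and the two-entry map stands in for the full mapping. It uses update-in so that it runs on the Clojure 1.6.0 dependency listed earlier.

```clojure
(require '[clojure.string :as str])

;; Hypothetical two-entry stand-in for the full synonym map.
(def demo-synonyms
  {"ALABAMA"  "AL"
   "VIRGINIA" "VA"})

(defn normalize-demo
  "Uppercases state and returns its synonym, or the uppercased
  input when the map has no entry for it."
  [state]
  (let [uc-state (str/upper-case state)]
    (demo-synonyms uc-state uc-state)))

;; Hypothetical input records with an inconsistent :state field.
(def rows
  [{:name "Ann" :state "Alabama"}
   {:name "Bo"  :state "va"}])

;; Rewrite only the :state field of each record.
(def cleaned
  (map #(update-in % [:state] normalize-demo) rows))

cleaned
;; => ({:name "Ann", :state "AL"} {:name "Bo", :state "VA"})
```

Keeping the normalization in its own function means the same mapping can be reused for any field or data source that holds state values.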

How it works…

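The only wrinkle here is that we have to normalize the input a little by uppercasing it before we apply the synonym mapping. Otherwise, we'd need an entry for every way in which the input could be capitalized. The lookup itself relies on the fact that a Clojure map can be called as a function with a key and a default value: passing uc-state as both arguments means that an input with no synonym, such as a valid postal code, simply comes back uppercased.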
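The lookup-with-default behavior can be seen in isolation with a small sketch. The names sample-synonyms and normalize-sample are hypothetical, and the "FLA" entry is an assumption added for illustration only; it is not taken from the recipe's elided map.

```clojure
(require '[clojure.string :as str])

;; Hypothetical, deliberately tiny synonym map.
;; The "FLA" entry is an assumption made for this example.
(def sample-synonyms
  {"ALABAMA" "AL"
   "FLA"     "FL"})

(defn normalize-sample [state]
  (let [uc-state (str/upper-case state)]
    ;; (a-map key default) returns default when key is absent.
    (sample-synonyms uc-state uc-state)))

(normalize-sample "alabama") ;; => "AL"    (entry found)
(normalize-sample "Fla")     ;; => "FL"    (entry found)
(normalize-sample "or")      ;; => "OR"    (no entry, uppercased input returned)
(normalize-sample "Oregn")   ;; => "OREGN" (no entry, misspelling passes through)
```

The last call shows the limit of the technique: a variant is only corrected if someone has added it to the map, so unknown misspellings are uppercased but otherwise left as they are.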

See also

  • The Fixing spelling errors recipe later in this chapter