Fixing spelling errors

One of the issues that we'll need to deal with at some point is spelling errors. Especially when you're trying to work with raw text, spelling errors can throw a wrench in the works.

At one time, spell checkers were major pieces of software with lots of optimizations to run in the constrained environments that were once everyday desktops. Now, that's not the case. Peter Norvig has published a piece on the Internet titled How to Write a Spelling Corrector (http://norvig.com/spell-correct.html). This shows how to take some text that is assumed to be spelled correctly and generate a spell checker built on it. He included a 21-line implementation in Python.

For this recipe, we'll convert the Python code to Clojure. Our version will be longer, but less dense. We can certainly implement it more concisely than we will do it, but it will be helpful for our explanation to break it out more.

Getting ready

We'll use the minimal project.clj file that we've seen in several recipes already in this chapter:

(defproject cleaning-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]])

We require clojure.string and one function from clojure.set:

(require '[clojure.string :as string])
(use '[clojure.set :only (union)])

How to do it…

This algorithm works by comparing a set of permutations of a word against a map of correctly spelled words and their frequencies. The spelling that occurs most frequently wins:

  1. We need a function to tokenize a string into words. We'll use a regular expression for this:
    (defn words
     [text]
      (re-seq #"[a-z]+" (string/lower-case text)))
  2. The training data structure is just a map of words and their frequencies:
    (defn train
     [feats] (frequencies feats))
  3. Now, we can train our spell checker. We'll use the dataset that Norvig has linked to in his article (http://norvig.com/big.txt), which I've downloaded locally:
    (def n-words
        (train (words (slurp "data/big.txt"))))
    (def alphabet
        "abcdefghijklmnopqrstuvwxyz")
  4. We need to define some operations on the words in our training corpus:
    (defn split-word [word i]
      [(.substring word 0 i) (.substring word i)])
    (defn delete-char [[w1 w2]]
      (str w1 (.substring w2 1)))
    (defn transpose-split [[w1 w2]]
      (str w1 (second w2) (first w2) (.substring w2 2)))
    (defn replace-split [[w1 w2]]
      (let [w2-0 (.substring w2 1)]
        (map #(str w1 % w2-0) alphabet)))
    (defn insert-split [[w1 w2]]
      (map #(str w1 % w2) alphabet))
  5. We're now ready to define the two functions that are the heart of the algorithm. The first function calculates all of the possible edits that can be made to a word, based on the operators we just defined:
    (defn edits-1 [word]
      (let [splits (map (partial split-word word)
                        (range (inc (count word))))
            long-splits (filter #(> (count (second %)) 1)
                                splits)
            deletes (map delete-char long-splits)
            transposes (map transpose-split long-splits)
            replaces (mapcat replace-split long-splits)
            inserts (remove nil?
                            (mapcat insert-split splits))]
        (set (concat deletes transposes
                     replaces inserts))))
  6. The second primary function gets the edits of the edits of a word, but only if they're known in the training set:
    (defn known-edits-2 [word]
      (set (filter (partial contains? n-words)
                   (apply union
                          (map #(edits-1 %)
                               (edits-1 word))))))
  7. Now, we need another utility function that takes a sequence of words and returns the set of those seen in the training corpus:
    (defn known [words]
      (set (filter (partial contains? n-words) words)))
  8. Finally, we can put it all together to create the correct function:
    (defn correct [word]
      (let [candidate-thunks [#(known (list word))
                              #(known (edits-1 word))
                              #(known-edits-2 word)
                              #(list word)]]
        (->>
          candidate-thunks
          (map (fn [f] (f)))
          (filter #(> (count %) 0))
          first
          (map (fn [w] [(get n-words w 1) w]))
          (reduce (partial max-key first))
          second)))

Let's see how it works:

user=> (correct "deete")
"delete"
user=> (correct "editr")
"editor"
user=> (correct "tranpsose")
"tranpsose"
user=> (correct "eidtor")
"editor"
user=> (correct "eidtr")
"elder"

It doesn't recognize transpose, and it miscorrects eidtr as elder. Let's take a look at the training data to see why:

user=> (n-words "transpose")
nil
user=> (n-words "elder")
40
user=> (n-words "editor")
17

That explains it. Transpose doesn't occur in the training set, and elder is there more than twice as often as editor, so it's the more likely correction.

How it works…

The heart of this are the edits-1 and known-edits-2 functions. They perform a search over the space between strings, looking for all of the known words that are one or two edits away from the word to be checked. Before the operations are applied, the words are split into two by the split-word function. The operations that constitute one edit are defined in a series of functions:

  • delete-char: This removes one character from the word (word to wod)
  • transpose-char: This transposes two characters (word to wrod)
  • replace-split: This replaces one letter by another character from the alphabet (word to wobd)
  • insert-split: This inserts a character into the word (word to wobrd)

The correct function looks at all of the edits returned that are in the training set and picks the one that is seen most frequently.

There's more…

If you want more information on the statistics that make this work (and you should—it's quite interesting), see Norvig's explanation in his article at http://norvig.com/spell-correct.html.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.149.250.11