Cleaning data with regular expressions

Often, cleaning data involves text transformations. Some, such as adding or removing a set of static strings, are pretty simple. Others, such as parsing a complex data format like JSON or XML, require a complete parser. However, many fall within a middle range of complexity. These need more processing power than simple string manipulation, but full-fledged parsing is too much. For these tasks, regular expressions are often useful.

A regular expression is probably the most basic and pervasive tool for cleaning data of any kind. Although they're sometimes overused, regular expressions are often truly the best tool for the job. Moreover, Clojure has a built-in syntax for compiled regular expressions, so they are convenient too.
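
As a small illustration of that built-in syntax, the sketch below compares a regex literal with the same pattern built from a string. The names literal-pattern and built-pattern are only illustrative and are not part of the recipe.

```clojure
;; The #"..." literal is compiled to a java.util.regex.Pattern when the
;; code is read, and backslashes need no extra escaping.
(def literal-pattern #"\d{3}")

;; re-pattern builds the same pattern at runtime from an ordinary string,
;; so each backslash has to be doubled.
(def built-pattern (re-pattern "\\d{3}"))

(instance? java.util.regex.Pattern literal-pattern) ; => true
(= (str literal-pattern) (str built-pattern))       ; => true
```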

In this example, we'll write a function that normalizes U.S. phone numbers.

Getting ready

For this recipe, we will only require a very basic project.clj file. It should have these lines:

(defproject cleaning-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]])

We also only need to have the clojure.string library available for our script or REPL. We will get this by using:

(require '[clojure.string :as string])

How to do it…

  1. First, let's define a regular expression:
    (def phone-regex
      #"(?x)
      (\d{3})     # Area code.
      \D{0,2}     # Separator. Probably one of (, ), -, space.
      (\d{3})     # Prefix.
      \D?         # Separator.
      (\d{4})     # Line number.
      ")
  2. Now, we'll define a function that uses this regular expression to pull apart a string that contains a phone number and put it back together in the form of (999)555-1212. If the string doesn't appear to be a phone number, it returns nil:
    (defn clean-us-phone [phone]
      (when-let [[_ area-code prefix post]
                 (re-find phone-regex phone)]
        (str \( area-code \) prefix \- post)))
  3. The function works the way we expected:
    user=> (clean-us-phone "123-456-7890")
    "(123)456-7890"
    user=> (clean-us-phone "1 2 3 a b c 0 9 8 7")
    nil
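
In practice, a function like this is usually mapped over a whole column of values. The following is a minimal, self-contained sketch of that usage. It restates the pattern in compact form, and the sample data and the names raw-phones, cleaned, and valid-only are made up for illustration.

```clojure
;; Compact form of the recipe's pattern (no comments mode).
(def phone-regex #"(\d{3})\D{0,2}(\d{3})\D?(\d{4})")

(defn clean-us-phone [phone]
  (when-let [[_ area-code prefix post] (re-find phone-regex phone)]
    (str \( area-code \) prefix \- post)))

;; Hypothetical sample data in a few common formats.
(def raw-phones ["123-456-7890" "(123) 456-7890" "123.456.7890" "not a phone"])

;; mapv keeps one result per input, with nil marking values that did not match.
(def cleaned (mapv clean-us-phone raw-phones))
;; => ["(123)456-7890" "(123)456-7890" "(123)456-7890" nil]

;; keep drops the nils, leaving only the values that were normalized.
(def valid-only (vec (keep clean-us-phone raw-phones)))
;; => ["(123)456-7890" "(123)456-7890" "(123)456-7890"]
```

Keeping the nil values, as mapv does here, preserves the alignment with the input rows, which makes it easy to find the records that need manual attention.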

How it works…

The most complicated part of this process is the regular expression. Let's break it down:

  • (?x): This flag doesn't match anything by itself. Instead, it turns on comments mode, which lets you spread the regular expression over several lines: whitespace and comments starting with # are ignored. Writing regular expressions this way makes them considerably easier to read and work with, especially when you are trying to remember what they do after six months.
  • (\d{3}): This matches three digits.
  • \D{0,2}: This matches zero to two non-numeric characters. This allows optional separators between the area code and the prefix.
  • (\d{3}): This matches another three digits.
  • \D?: This is an optional non-numeric character. This allows a dash or something similar.
  • (\d{4}): This matches the final four digits of the phone number.
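
To see what the (?x) flag buys you, the sketch below puts the commented pattern next to an equivalent one-line version. The names verbose-regex and compact-regex are illustrative only.

```clojure
;; Comments mode: whitespace and # comments inside the pattern are ignored.
(def verbose-regex
  #"(?x)
  (\d{3})   # Area code.
  \D{0,2}   # Separator.
  (\d{3})   # Prefix.
  \D?       # Separator.
  (\d{4})   # Line number.
  ")

;; The same pattern written on one line, without the flag.
(def compact-regex #"(\d{3})\D{0,2}(\d{3})\D?(\d{4})")

(re-find verbose-regex "123-456-7890")
;; => ["123-456-7890" "123" "456" "7890"]
(re-find compact-regex "123-456-7890")
;; => ["123-456-7890" "123" "456" "7890"]
```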

The items in parentheses are capturing groups. If the regular expression contains no groups, re-find returns just the matching string. If it does contain groups, re-find returns a vector: the entire matching string is the first element, and the groups follow in the order in which they appear in the regular expression. If nothing matches, re-find returns nil, which is why when-let can guard the destructuring. In this recipe, we use the groups that are returned to build the output.
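
The different return shapes of re-find are easy to check at the REPL. This is a small sketch on a made-up input string. The names no-groups, with-groups, and no-match are illustrative.

```clojure
;; No capturing groups: re-find returns the matched text as a string.
(def no-groups (re-find #"\d{3}" "call 555-1212"))
;; => "555"

;; With capturing groups: re-find returns a vector of
;; [whole-match group-1 group-2].
(def with-groups (re-find #"(\d{3})-(\d{4})" "call 555-1212"))
;; => ["555-1212" "555" "1212"]

;; No match at all: re-find returns nil.
(def no-match (re-find #"(\d{3})-(\d{4})" "no digits here"))
;; => nil
```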

There's more...

Regular expressions are complex, and heavy books have been written about them. Here are some more resources:

See also

Jamie Zawinski is credited with this saying:

Some people, when confronted with a problem, think, "I know, I'll use regular expressions." Now they have two problems.

Regular expressions are a complex, dense, and often fiddly tool. Sometimes, they are the right tool, but sometimes they are not. We'll see a more powerful, and often better, solution in the Parsing custom data formats recipe.
