Cleaning data with regular expressions

Often, cleaning data involves text transformations. Some, such as adding or removing a set of static strings, are pretty simple. Others, such as parsing a complex data format such as JSON or XML, require a complete parser. However, many fall within a middle range of complexity: they need more power than simple string manipulation, but full-fledged parsing is overkill. For these tasks, regular expressions are often useful.

Regular expressions are probably the most basic and pervasive tool for cleaning data of any kind. Although they're sometimes overused, they are often the best tool for the job. Moreover, Clojure has a built-in syntax for compiled regular expressions, so they are convenient too.
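As a small illustration of that built-in syntax, the sketch below compares a regex literal with a pattern built from a string. The names `digits` and `digits-from-string` are made up for this example.

```clojure
;; A regex literal (#"...") is read directly as a compiled
;; java.util.regex.Pattern, so no explicit compile step is needed.
(def digits #"\d+")

(class digits)   ; java.util.regex.Pattern

;; re-pattern builds the same kind of object from an ordinary string.
;; Note the doubled backslash that a plain string requires.
(def digits-from-string (re-pattern "\\d+"))

(re-find digits "abc 123 def")               ; "123"
(re-find digits-from-string "abc 123 def")   ; "123"
```

The literal form avoids the double escaping, which is a large part of why it is convenient for patterns full of `\d` and `\D`.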

In this example, we'll write a function that normalizes U.S. phone numbers.

Getting ready

For this recipe, we will only require a very basic project.clj file. It should have these lines:

(defproject cleaning-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]])

We also only need to have the clojure.string library available for our script or REPL. We will get this by using:

(require '[clojure.string :as string])
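Once the library is loaded, a few calls at the REPL can confirm that it is available. This is only a sketch of the simple, static-string end of the spectrum described in the introduction; the sample strings are made up.

```clojure
(require '[clojure.string :as string])

;; Static-string cleanup: no regular expression needed.
(string/trim "  123-456-7890  ")             ; "123-456-7890"
(string/replace "123.456.7890" "." "-")      ; "123-456-7890"

;; string/replace also accepts a regex literal as the match argument.
;; Here every non-digit character is removed.
(string/replace "(123) 456-7890" #"\D" "")   ; "1234567890"
```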

How to do it…

First, let's define a regular expression:

(def phone-regex
  #"(?x)
  (\d{3})     # Area code.
  \D{0,2}     # Separator. Probably one of (, ), -, space.
  (\d{3})     # Prefix.
  \D?         # Separator.
  (\d{4})     # Final four digits.
  ")

Now, we'll define a function that uses this regular expression to pull apart a string that contains a phone number and put it back together in the form of (999)555-1212. If the string doesn't appear to be a phone number, it returns nil:
```
(defn clean-us-phone [phone]
  (when-let [[_ area-code prefix post]
             (re-find phone-regex phone)]
    (str "(" area-code ")" prefix "-" post)))
```

The function works the way we expected:

user=> (clean-us-phone "123-456-7890")
"(123)456-7890"
user=> (clean-us-phone "1 2 3 a b c 0 9 8 7")
nil
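In practice, a cleaning function like this is usually applied to a whole column of values. The following is a minimal sketch of that. The definitions are repeated so that the snippet can be pasted into a REPL on its own, and `sample-phones` is made-up illustration data.

```clojure
(def phone-regex
  #"(?x)
  (\d{3})     # Area code.
  \D{0,2}     # Separator.
  (\d{3})     # Prefix.
  \D?         # Separator.
  (\d{4})     # Final four digits.
  ")

(defn clean-us-phone [phone]
  (when-let [[_ area-code prefix post] (re-find phone-regex phone)]
    (str "(" area-code ")" prefix "-" post)))

;; Hypothetical input data in several common formats, plus one bad value.
(def sample-phones
  ["123-456-7890" "(987) 654-3210" "555.867.5309" "not a phone"])

;; keep applies the function to each item and drops the nil results,
;; so values that don't look like phone numbers are discarded.
(keep clean-us-phone sample-phones)
;; => ("(123)456-7890" "(987)654-3210" "(555)867-5309")
```

Because `re-find` looks for a match anywhere in the string, input with surrounding text or extra trailing digits is still normalized from the first match it finds. Whether that is acceptable depends on the data; a stricter variant could anchor the pattern with `^` and `$`.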

How it works…

The most complicated part of this process is the regular expression. Let's break it down:

(?x): This is a flag that doesn't match anything by itself. Instead, it lets you spread the regular expression over several lines, because the pattern then ignores whitespace and comments. Writing regular expressions this way makes them considerably easier to read and work with, especially when you are trying to remember what they do after six months.
(\d{3}): This matches three digits.
\D{0,2}: This matches zero to two non-numeric characters. This is to allow optional separators between the area code and the prefix.
(\d{3}): This matches another three digits.
\D?: This is an optional non-numeric character. This allows a dash or something similar.
(\d{4}): This matches the final four digits of the phone number.
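To see that the (?x) flag changes only the layout and not the meaning, the sketch below compares the spread-out pattern with a one-line equivalent. The names `verbose-regex` and `compact-regex` are made up for this comparison.

```clojure
;; The commented, multi-line form.
(def verbose-regex
  #"(?x)
  (\d{3})   # Area code.
  \D{0,2}   # Separator.
  (\d{3})   # Prefix.
  \D?       # Separator.
  (\d{4})   # Final four digits.
  ")

;; The same pieces written without whitespace or comments.
(def compact-regex #"(\d{3})\D{0,2}(\d{3})\D?(\d{4})")

;; Both patterns produce the same match and the same groups.
(= (re-find verbose-regex "123-456-7890")
   (re-find compact-regex "123-456-7890"))   ; true
```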

The items in parentheses are captured by the regular expression as groups. If the regular expression contains no groups, re-find just returns the matching string. If there are groups, it returns a vector. The entire matching string is the first element, and the groups follow in the order in which they appear in the regular expression. If nothing matches, re-find returns nil. In this recipe, we use the groups that are returned to build the output.
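The different return shapes of re-find can be checked quickly at the REPL. The patterns and input strings below are made up for illustration.

```clojure
;; No groups: the match itself comes back as a string.
(re-find #"\d{3}" "call 555-1212")
;; => "555"

;; With groups: a vector of the whole match followed by each group.
(re-find #"(\d{3})-(\d{4})" "call 555-1212")
;; => ["555-1212" "555" "1212"]

;; No match: nil, which is why when-let works well around re-find.
(re-find #"(\d{3})-(\d{4})" "no digits here")
;; => nil

;; Destructuring the vector gives each group a name; _ skips the whole match.
(let [[_ prefix line] (re-find #"(\d{3})-(\d{4})" "call 555-1212")]
  (str prefix "/" line))
;; => "555/1212"
```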

There's more...

Regular expressions are complex, and heavy books have been written about them. Here are some more resources:

The JavaDocs for the Pattern class (http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html). This summarizes the syntax of Java's style of regular expressions.
Oracle's Java tutorial on regular expressions (http://docs.oracle.com/javase/tutorial/essential/regex/).
RegexPlanet's online tester (http://www.regexplanet.com/advanced/java/index.html). However, the REPL is usually what I use to build and test regular expressions.
