Parsing dates and times

One difficult issue when normalizing and cleaning up data is how to deal with time. People enter dates and times in a bewildering variety of formats; some of them are ambiguous, and some of them are vague. However, we have to do our best to interpret them and normalize them into a standard format.

In this recipe, we'll define a function that attempts to parse a date string, whatever its format, into a standard date-time value. We'll use the clj-time Clojure library, which is a wrapper around the Joda-Time Java library (http://joda-time.sourceforge.net/).

Getting ready

First, we need to declare our dependencies in the Leiningen project.clj file:

(defproject cleaning-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [clj-time "0.9.0-beta1"]])

Then, we need to load these dependencies into our script or REPL. We'll exclude extend and second from clj-time.core to keep them from clashing with clojure.core/extend and clojure.core/second:

(use '[clj-time.core :exclude (extend second)]
     '[clj-time.format])

How to do it…

In order to solve this problem, we'll specify a sequence of date/time formats and walk through them. The first that doesn't throw an exception will be the one that we'll use.

  1. Here's a list of formats that you can try:
    (def ^:dynamic *default-formats*
      [:date
       :date-hour-minute
       :date-hour-minute-second
       :date-hour-minute-second-ms
       :date-time
       :date-time-no-ms
       :rfc822
       "YYYY-MM-dd HH:mm"
       "YYYY-MM-dd HH:mm:ss"
       "dd/MM/YYYY"
       "YYYY/MM/dd"
       "d MMM YYYY"])
  2. Notice that some of these are keywords and some are strings. Each needs to be handled differently: a string is a custom pattern, while a keyword names one of clj-time's built-in formatters. We'll define a protocol with the method ->formatter, which converts a value to a date formatter, and extend that protocol to both of the types that appear in the format list:
    (defprotocol ToFormatter
      (->formatter [fmt]))
    
    (extend-protocol ToFormatter
      java.lang.String
      (->formatter [fmt] (formatter fmt))
      clojure.lang.Keyword
      (->formatter [fmt] (formatters fmt)))
  3. Next, parse-or-nil will take a format and a date string, attempt to parse the date string, and return nil if there are any errors:
    (defn parse-or-nil [fmt date-str]
      (try
        (parse (->formatter fmt) date-str)
        (catch Exception ex
          nil)))
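  4. With these in place, here is normalize-datetime. We attempt to parse the date string with each of the formats, filter out any nil values, and return the first non-nil result. Because map and remove return lazy sequences, parsing happens only on demand. Note, though, that Clojure realizes the seq of a vector in chunks of up to 32 elements, so with a short list like this one, all of the formats may be tried even after one succeeds. The result is the same either way: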
    (defn normalize-datetime [date-str]
      (first
        (remove nil?
                (map #(parse-or-nil % date-str)
                     *default-formats*))))

Now we can try this out:

user=> (normalize-datetime "2012-09-12")
#<DateTime 2012-09-12T00:00:00.000Z>
user=> (normalize-datetime "2012/09/12")
#<DateTime 2012-09-12T00:00:00.000Z>
user=> (normalize-datetime "28 Sep 2012")
#<DateTime 2012-09-28T00:00:00.000Z>
user=> (normalize-datetime "2012-09-28 13:45")
#<DateTime 2012-09-28T13:45:00.000Z>
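
Note that normalize-datetime returns a Joda-Time DateTime object, not a string. If a normalized string is needed, one option is to print the result with a single fixed formatter. The following is a minimal sketch rather than part of the recipe itself: the helper name datetime->iso-string is hypothetical, and the clj-time namespaces are loaded under the aliases t and f so that the block stands on its own. It relies on clj-time.format's unparse and the built-in :date-time formatter.

```clojure
(require '[clj-time.core :as t]
         '[clj-time.format :as f])

;; Hypothetical helper: render a DateTime as an ISO 8601 string,
;; passing nil through so it can be chained after a failed parse.
(defn datetime->iso-string [dt]
  (when dt
    (f/unparse (f/formatters :date-time) dt)))

(datetime->iso-string (t/date-time 2012 9 28 13 45))
;; => "2012-09-28T13:45:00.000Z"
```

In a session where the recipe's functions are already loaded, the same helper could be applied to the value returned by normalize-datetime.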

There's more…

This approach to parsing dates has a number of problems. For example, because some date formats are ambiguous, the first match might not be the correct one.
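
To make the ambiguity concrete, here is a small sketch that reads one string under two conventions. The "MM/dd/YYYY" pattern is not in the recipe's format list and is used here only for illustration, as are the names ambiguous, day-first, and month-first. The clj-time.format namespace is loaded under the alias f so that the block runs on its own.

```clojure
(require '[clj-time.format :as f])

;; One input, two plausible readings.
(def ambiguous "01/02/2012")

;; Day first, as in the recipe's "dd/MM/YYYY": 1 February 2012.
(def day-first (f/parse (f/formatter "dd/MM/YYYY") ambiguous))

;; Month first, the usual U.S. convention: 2 January 2012.
(def month-first (f/parse (f/formatter "MM/dd/YYYY") ambiguous))

;; Both parses succeed, so neither raises an error to warn us.
(= day-first month-first)
;; => false
```

Whichever of the two patterns comes first in the format list decides how such a string is interpreted.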

However, trying out a list of formats is probably about the best we can do. Knowing something about our data allows us to prioritize the list appropriately, and we can augment it with ad hoc formats as we run across new data. We might also need to normalize data from different sources (for instance, U.S. date formats versus the rest of the world) before we merge the data together.
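
One way to act on this is to keep a separate, ordered format list for each data source and pass it in explicitly. The sketch below is an illustration under stated assumptions, not the recipe's own code: parse-with, us-formats, and intl-formats are hypothetical names, the lists hold only string patterns, and some is used so that the search stops at the first format that parses.

```clojure
(require '[clj-time.format :as f])

;; Hypothetical helper: try each pattern in order and return the
;; first successful parse, or nil if none of them match.
(defn parse-with [formats date-str]
  (some (fn [fmt]
          (try
            (f/parse (f/formatter fmt) date-str)
            (catch Exception _ nil)))
        formats))

;; Assumed per-source priorities; adjust them to the data at hand.
(def us-formats   ["MM/dd/YYYY" "YYYY-MM-dd"])
(def intl-formats ["dd/MM/YYYY" "YYYY-MM-dd"])

(parse-with us-formats "03/04/2012")   ; read as 4 March 2012
(parse-with intl-formats "03/04/2012") ; read as 3 April 2012
(parse-with us-formats "not a date")   ; => nil
```

Each source is then normalized with its own list before the data sets are merged.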
