Regularizing numbers

If we need to read in numbers as strings, we have to worry about how they're formatted. We'll probably want the computer to deal with them as numbers, not as strings, so that they can be sorted and passed to mathematical functions. This can't happen while the string still contains a comma or period to separate the thousands.

In this recipe, we'll write a short function that takes a number string and returns the number. The function will strip out all of the extra punctuation inside the number and only leave the last separator. Hopefully, this will be the one that marks the decimal place.

Of course, the version of this function that we'll see here only handles numbers that use commas and periods as separators, in either role. However, it would be relatively easy to write versions that work for a locale with other conventions.
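As one possible way to handle a particular locale, here is a minimal sketch that swaps in a different technique from the one this recipe uses: it delegates parsing to the JDK's java.text.NumberFormat. The name parse-localized is hypothetical and is not part of the recipe.

```clojure
(import '[java.text NumberFormat]
        '[java.util Locale])

;; Hypothetical helper (not from the recipe): let the JDK parse the
;; string according to the conventions of the given locale.
(defn parse-localized [n ^Locale locale]
  (.doubleValue (.parse (NumberFormat/getInstance locale) n)))

(parse-localized "1,000.50" Locale/US)      ;; => 1000.5
(parse-localized "1.000,50" Locale/GERMANY) ;; => 1000.5
```

Unlike the function in this recipe, this approach needs to be told the locale up front, so it only helps when you know where the data came from.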

Getting ready

For this recipe, we're back to the simplest of project.clj files:

(defproject cleaning-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]])

To write this function, we just need to have access to the clojure.string library:

(require '[clojure.string :as string])

How to do it…

  1. The function itself is pretty short:
    (defn normalize-number [n]
      ;; Split on every comma and period, then treat the last piece
      ;; as the fractional part.
      (let [v (string/split n #"[,.]")
            [pre post] (split-at (dec (count v)) v)]
        (Double/parseDouble (apply str (concat pre [\.] post)))))
  2. Using the function is straightforward:
    user=> (normalize-number "1,000.00")
    1000.0
    user=> (normalize-number "1.000,00")
    1000.0
    user=> (normalize-number "3.1415")
    3.1415
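To illustrate why the conversion matters, the following sketch compares sorting the raw strings with sorting by their parsed values. The prices vector is made-up sample data, and the function is repeated only so that the snippet runs on its own.

```clojure
(require '[clojure.string :as string])

(defn normalize-number [n]
  (let [v (string/split n #"[,.]")
        [pre post] (split-at (dec (count v)) v)]
    (Double/parseDouble (apply str (concat pre [\.] post)))))

;; Made-up sample data for illustration.
(def prices ["1,000.00" "25.50" "3.1415"])

;; Sorting the strings compares them character by character.
(sort prices)
;; => ("1,000.00" "25.50" "3.1415")

;; Sorting by the parsed value gives the numeric order.
(sort-by normalize-number prices)
;; => ("3.1415" "25.50" "1,000.00")
```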

How it works…

This function is fairly simple. So, let's take it apart, step by step:

  1. We take the input and use a regular expression to split it on every comma and period. This handles the thousands separators and the decimal mark both for locales that use commas for thousands and periods for decimals, and for those that do the reverse:
    (string/split n #"[,.]")
  2. We take the split input and partition it into the integer part (everything up to the last element) and the fractional part (the last element):
    (split-at (dec (count v)) v)
  3. We join them back together as a new string, using a period for the decimal and leaving out any thousands separators:
    (apply str (concat pre [\.] post))
  4. We use the standard Java Double class to parse this into a double:
    (Double/parseDouble …)
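The steps above can be traced at the REPL. This sketch binds each intermediate value to a name so that it can be inspected; the input string and the names v, parts, and joined are chosen here only for illustration.

```clojure
(require '[clojure.string :as string])

;; Step 1: split on every comma and period.
(def v (string/split "1,234,567.89" #"[,.]"))
;; v is ["1" "234" "567" "89"]

;; Step 2: everything but the last element, then the last element.
(def parts (split-at (dec (count v)) v))
;; parts is [("1" "234" "567") ("89")]

;; Step 3: rejoin with a single period as the decimal mark.
(def joined (apply str (concat (first parts) [\.] (second parts))))
;; joined is "1234567.89"

;; Step 4: parse the cleaned-up string.
(Double/parseDouble joined)
;; => 1234567.89
```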

This version of the function assumes that the numbers are represented with a decimal component. If that's not the case, there will be problems:

user=> (normalize-number "1,000")
1.0

How would you go about fixing this? It might be easier to have separate versions of this function for integers and floats. In the end, you need to know your data in order to decide how best to handle it.
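One possible shape for the integer version is sketched below. The name normalize-integer is hypothetical, and the sketch assumes that every comma or period in the input is a thousands separator, so it should only be used on data known to contain whole numbers.

```clojure
(require '[clojure.string :as string])

;; Hypothetical helper (not from the recipe): remove every comma and
;; period, then parse what remains as a long.
(defn normalize-integer [n]
  (Long/parseLong (string/replace n #"[,.]" "")))

(normalize-integer "1,000")     ;; => 1000
(normalize-integer "1.000.000") ;; => 1000000
```

Note that passing a decimal string such as "3.1415" to this function would silently produce 31415, which is why the choice between the two versions depends on knowing the data.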
