Maintaining data consistency with validators

Clojure has a number of tools for working with agents. One of them is validators. When an agent's action function returns a value, the validator function assigned to that agent receives that value before the agent's state is updated. If the validator returns true, all is well: the agent is updated and processing continues. However, if the validator returns false or throws an exception, the agent's state is left unchanged and an error is raised on the agent.

This can be a handy tool to make sure that the data assigned to your agent conforms to your expectations, and it can be an important check on the consistency and validity of your data.
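
To make the accept and reject paths concrete, here is a minimal sketch that is separate from the recipe's data. The names counter, failure, and rejection are made up for illustration. It passes the validator through the :validator option of agent rather than set-validator!, and it supplies an :error-handler so that the rejection can be observed from the calling thread.

```clojure
;; `failure` receives the exception raised when the validator rejects a value.
(def failure (promise))

;; A counter that must never go negative. Supplying an :error-handler makes
;; the agent's error mode default to :continue, so it keeps running.
(def counter
  (agent 0
         :validator #(>= % 0)
         :error-handler (fn [_ ex] (deliver failure ex))))

(send counter inc)    ; 0 -> 1, accepted by the validator
(send counter - 10)   ; 1 -> -9, rejected by the validator

;; Wait (up to one second) for the handler to report the rejection.
(def rejection (deref failure 1000 nil))

(class rejection)     ;=> java.lang.IllegalStateException
@counter              ;=> 1, the rejected value was never stored
```

Without an error handler, the agent defaults to the :fail error mode: it stops accepting actions until it is restarted, and agent-error returns the exception.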

For this recipe, we'll read data from a CSV file and convert the values in some of the columns to integers. We'll use a validator to ensure that this actually happens.

Getting ready

For this recipe, we'll use the same dependencies and requirements that we used in the Managing program complexity with STM recipe. We'll also use the lazy-read-csv and with-header functions from that recipe, along with the same data file. We'll keep that filename bound to data-file.

How to do it…

This recipe will be built from a number of shorter functions:

  1. Let's define a vector of the columns whose values need to be converted to integers (despite the name int-rows, these are column keys). Looking at the data file, we can come up with this:
    (def int-rows
      [:GEOID :SUMLEV :STATE :POP100 :HU100 :POP100.2000
        :HU100.2000 :P035001 :P035001.2000])
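  2. Now, we'll define a predicate function to check whether a value is an integer or not (on Clojure 1.9 and later, this definition replaces the built-in clojure.core/int? in the current namespace and prints a warning):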
    (defn int? [x]
      (or (instance? Integer x) (instance? Long x)))
  3. We'll create a function that attempts to read a string as an integer, but silently returns the original value if there's an exception:
    (defn try-read-string [x]
      (try
        (read-string x)
        (catch Exception ex
          x)))
  4. This system will have three agents, each performing a different task. Here is the function for the agent that converts all whole number fields to integer values. It sends the output to another agent and uses that output as its own new value so it can be validated:
    (defn coerce-row [_ row sink]
      (let [cast-row
            (apply assoc row
                  (mapcat
                    (fn [k]
                      [k (try-read-string (k row))])
                    int-rows))]
        (send sink conj cast-row)
        cast-row))
  5. Here is the function for the agent that reads the input. It sends an item of the input to the coerce-row agent, queues itself to read another item of the input, and sets its value to the rest of the input:
    (defn read-row [rows caster sink]
      (when-let [[item & items] (seq rows)]
        (send caster coerce-row item sink)
        (send *agent* read-row caster sink)
        items))
  6. Here is the validator for the coerce-row agent. It checks that the integer fields are either integers or empty strings:
    (defn int-val? [x] (or (int? x) (empty? x)))
    (defn validate [row]
      (or (nil? row)
          (reduce #(and %1 (int-val? (%2 row)))
                  true int-rows)))
  7. Finally, we'll define a function that creates the agents, starts them processing, and returns them:
    (defn agent-ints [input-file]
      (let [reader (agent (seque
                            (with-header
                              (lazy-read-csv
                                input-file))))
            caster (agent nil)
            sink (agent [])]
        (set-validator! caster validate)
        (send reader read-row caster sink)
        {:reader reader
         :caster caster
         :sink sink}))
  8. If we run this, we get a map containing the agents. We can get the output data by dereferencing the :sink agent (the agents work asynchronously, so the sink fills up as rows are processed):
    user=> (def ags (agent-ints data-file))
    #'user/ags
    user=> (first @(:sink ags))
    {:SUMLEV 160, :P035001 2056, :HU100.2000 3788, :HU100 4271, :NAME "Abingdon town", :GEOID 5100148, :NECTA "", :CBSA "", :CSA "", :P035001.2000 2091, :POP100.2000 7780, :CNECTA "", :POP100 8191, :COUNTY "", :STATE 51}
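
As a variation on step 3, note that read-string accepts any Clojure form, so a field such as "n/a" comes back as a symbol rather than the original string, and digits with a leading zero are read as octal. The following is a minimal sketch of a stricter parser built on Java's Long/parseLong. The name try-parse-long is hypothetical and not part of the recipe, and the function assumes that every field is a string, as CSV fields are.

```clojure
(defn try-parse-long
  "Hypothetical stand-in for try-read-string. Returns a Long when the
  string x is an optionally signed run of decimal digits, and returns x
  unchanged otherwise. Assumes x is a string."
  [^String x]
  (try
    (Long/parseLong x)
    (catch NumberFormatException _
      x)))

(try-parse-long "8191")   ;=> 8191
(try-parse-long "010")    ;=> 10 (decimal, not octal)
(try-parse-long "")       ;=> ""
(try-parse-long "n/a")    ;=> "n/a"
```

Because non-numeric input is returned as the original string, the validator from step 6 rejects it by returning false instead of throwing on a value it cannot test for emptiness.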

How it works…

The agent-ints function is pretty busy. It defines the agents, sets everything up, and returns the map containing the agents.

Let's break it down:

   (let [reader (agent (seque
                         (with-header
                           (lazy-read-csv input-file))))
         caster (agent nil)
         sink (agent [])]

These lines define the agents. One reads in the data, one converts it to integers, and one accumulates the results. This figure illustrates that process:

[Figure: the reader, caster, and sink agents]

Next, read-row takes the first item of the input, sends it to the caster agent, and queues itself again to handle the rest. The coerce-row function tries to convert the data in the columns listed in int-rows to integers and sends the result to the sink agent. That send is held until coerce-row returns and its new state has passed the agent's validator function, validate.

The validator allows nil rows (for the agent's initial state) or rows whose integer fields contain either integers or empty strings. Finally, conj is sent to the sink agent, which accumulates the converted results.
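
The interplay between the validator and the send to the sink can be seen in a scaled-down sketch. Everything here is a simplified stand-in for the recipe's code: forward plays the role of coerce-row, the values are plain numbers and strings instead of rows, and the caster is given :error-mode :continue (an assumption that differs from the recipe, which uses the default mode) so that it keeps running after a rejection.

```clojure
(def sink (agent []))

;; Accepts nil (the initial state) or integers, and keeps going after
;; a rejected value instead of entering the failed state.
(def caster
  (agent nil
         :validator #(or (nil? %) (integer? %))
         :error-mode :continue))

(defn forward
  "Action for caster: sends x on to sink and makes x caster's new value."
  [_ x]
  (send sink conj x)
  x)

(send caster forward 1)
(send caster forward "two")   ; rejected by the validator
(send caster forward 3)

;; Wait for caster to finish its queue, then for sink to catch up.
(await caster)
(await sink)

@sink     ;=> [1 3]
@caster   ;=> 3
```

The value "two" never reaches the sink: sends made inside an agent action are released only after the action's result has passed validation, and they are discarded when it fails.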

See also

  • To learn how to use a nice DSL to validate data, see Validating data with Valip, in Chapter 2, Cleaning and Validating Data.