Transforming data with Cascalog

Often, simply querying data isn't enough. The data might not be in a form you can use, for instance, in which case you'll need to transform it first. Cascalog makes this easy.

For this recipe, we'll define a custom operation and use it to split year ranges such as '1963–66' into two fields: a start year and an end year.

Getting ready

We'll use the same dependencies and inclusions that we used in the Initializing Cascalog and Hadoop for distributed processing recipe. We'll also use the Doctor Who companion data from that recipe.

How to do it…

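  1. We'll define a new, custom operation to take a date range string and split it into two values. In this dataset, we're splitting them on an en dash (#"\u2013"). If the input isn't a range (that is, it's just a year), then the year is returned for both the start and end of the range. Otherwise, the first two characters of the start year are prepended to the end of the range, so the operation expects a two-digit end year:
    (defmapfn split-range [date-range]
      ;; Split on the en dash into at most two parts: start and (optional) end.
      (let [[from to] (string/split (str date-range) #"\u2013" 2)]
        ;; No end part: reuse the start year. Otherwise prepend the century.
        [from (if (nil? to) from (str (.substring from 0 2) to))]))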
  2. Then we can use this to transform the tenure dates in the actors' data:
    user=> (?<- (stdout)
                [?n ?name ?from ?to]
                (actor ?n ?name ?range)
                (split-range ?range :> ?from ?to))
    …
    RESULTS
    -----------------------
    1       William Hartnell        1963    1966
    2       Patrick Troughton       1966    1969
    3       Jon Pertwee     1970    1974
    …

How it works…

In the split-range operator, we return a vector containing two years for the output. Then, in the query, we use the :> operator to bind the output values from split-range to the names ?from and ?to. Destructuring makes this especially easy.
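The splitting logic can be tried at a REPL without Cascalog or Hadoop. The following is a minimal sketch in plain Clojure, not the recipe's own code: the names split-range*, sample-actors, and rows are hypothetical, and the sample rows are illustrative stand-ins rather than rows taken from the dataset. The for expression only approximates what the query does for each tuple, using ordinary destructuring in place of the :> binding.

```clojure
(require '[clojure.string :as string])

;; Plain-function stand-in for split-range (hypothetical name), so the
;; logic can be checked without Cascalog.
(defn split-range* [date-range]
  ;; \u2013 is the en dash; the limit of 2 yields at most [start end].
  (let [[from to] (string/split (str date-range) #"\u2013" 2)]
    ;; A bare year has no end part, so the start year is used twice.
    ;; Otherwise the first two characters of the start are prepended.
    [from (if (nil? to) from (str (subs from 0 2) to))]))

(split-range* "1963\u201366")   ;; => ["1963" "1966"]
(split-range* "2005")           ;; => ["2005" "2005"]

;; Illustrative tuples shaped like [?n ?name ?range]; assumed, not real data.
(def sample-actors
  [[1 "William Hartnell" "1963\u201366"]
   [99 "Example Actor" "2005"]])

;; For each tuple, bind the two outputs positionally, much as
;; (split-range ?range :> ?from ?to) does in the query.
(def rows
  (for [[n actor-name tenure] sample-actors
        :let [[from to] (split-range* tenure)]]
    [n actor-name from to]))
```

Because the output is an ordinary two-element vector, the order of the values is what ties them to the names: the first element becomes the start year and the second the end year.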
