Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Aggregating data with Cascalog

So far, the Cascalog queries you saw have all returned tables of results. However, sometimes you'll want to aggregate the tables in order to boil them down to a single value or into a table where groups from the original data are aggregated.

Cascalog also makes this easy to do, and it includes a number of aggregate functions. For this recipe, we'll only use two—cascalog.logic.opts/distinct-count and cascalog.logic.ops/sumsum—but you can find more easily in the API documentation on the Cascalog website (http://nathanmarz.github.io/cascalog/cascalog.logic.ops.html).

Getting ready

We'll use the same dependencies and imports as we did in Parsing CSV Files with Cascalog. We'll also use the same data that we defined in that recipe.

How to do it…

We'll take a look at a couple of examples on how to aggregate data with the count function:

First, we'll query how many:
```
user=> (?<- (stdout)
     [?count]
     ((hfs-text-delim "data/16285/flights_with_colnames.csv"
                      :has-header true)
      ?origin_airport _ _ _ _)
     (:distinct true)
     (c/distinct-count ?origin_airport :> ?count))
…
RESULTS
-----------------------
683
683
-----------------------
```
For this, we need to specify that we want to have distinct results for entire rows (the default). Then specify that we just include the aggregate operator as a predicate and give its results to a new name binding (?count). We use this binding—and only this binding—in the results. The other predicates in the query are used to select the data that we want aggregated.
Now let's try something more complicated. We'll find out how many flights leave each airport. To do this, we first need to define a mapping function in order to convert the values in a column to longs, using defmapfn. We'll use this to convert the flights column to numbers, and we'll use the c/sum function to aggregate those by airport:
```
user=> (defmapfn ->long
  "Converts a value to a long."
  [value]
  (Long/parseLong value))
user=> (?<- (stdout)
     [?origin_airport ?count]
     ((hfs-text-delim "data/16285/flights_with_colnames.csv"
                      :has-header true)
      ?origin_airport _ _ ?flights _)
     (:distinct true)
     (->long ?flights :> ?f)
     (c/sum ?f :> ?count))
…
RESULTS
-----------------------
1B11B1     1
ABE     197049
ABI     50043
ABQ     758168
ABR     30832
ABY     34298
…
```
This query is very similar. We use the map function to prepare the column that we want aggregated. We also include an aggregator predicate. Next, in the output bindings, we include both the value that we want the data grouped on (?origin_airport) and the aggregated binding (?count).

It's this simple. Cascalog takes care of the rest.

There's more

Cascalog provides a number of other aggregator functions as well. Some functions that you'll want to use regularly include count, max, min, sum, and avg. See the documentation for the build-in operations (https://github.com/nathanmarz/cascalog/wiki/Built-in-operations) for a more complete list.

We'll also talk more about defmapfn in the next recipe, Defining new Cascalog operators.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Aggregating data with Cascalog

Create new playlist

Sign In

Sign Up

Aggregating data with Cascalog

Getting ready

How to do it…

There's more

Table of Contents for
Aggregating data with Cascalog