Generating summary statistics with $rollup

One of the basic ways of getting a grip on a dataset is to look at some summary statistics: measures of central tendency and spread, such as the mean and standard deviation. These provide useful insights into our data and help us decide what questions to ask next and how best to proceed.

Getting ready

First, we'll need to make sure Incanter is listed in the dependencies of our Leiningen project.clj file:

(defproject statim "0.1.0"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [incanter "1.5.5"]])

And we'll need to require these libraries in our script or REPL:

(require '[incanter.core :as i]
         'incanter.io
         '[incanter.stats :as s])

Finally, we'll use the dataset of census race data that we compiled for the Grouping data with $group-by recipe in Chapter 6, Working with Incanter Datasets. We'll bind the file name to the name data-file. You can download this from http://www.ericrochester.com/clj-data-analysis/data/all_160.P3.csv:

(def data-file "data/all_160.P3.csv")

How to do it…

To generate summary statistics in Incanter, we'll use the $rollup function.

First, we'll load the dataset and bind it to the name census:

(def census (incanter.io/read-dataset data-file :header true))

Then, we'll use $rollup to get the statistics for groups of data:

user=> (i/$rollup :mean :POP100 :STATE census)
| :STATE |       :POP100 |
|--------+---------------|
|     34 |   1054049/109 |
|      6 | 35184222/1523 |
|     18 |   4413508/681 |
|      5 |   1941247/541 |
…
user=> (i/$rollup s/sd :POP100 :STATE census)
| :STATE |       :POP100 |
|--------+---------------|
…

How it works…

The $rollup function takes the dataset (the fourth parameter) and groups its rows by the values of the grouping field (the third parameter). From each group, it extracts the values of the field to aggregate (the second parameter) and passes them to the aggregate function (the first parameter). The results make up the final table, with one row per group. That's a lot for one small function. Here's a snapshot to make it clearer:

[Snapshot: how $rollup groups the rows and aggregates each group]
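
The same four steps can also be sketched by hand in plain Clojure. This is only an illustration of the idea, not Incanter's actual implementation: the names rollup-sketch, mean-of, and sample-rows are hypothetical, the sample values are made up, and the sketch returns a map rather than a dataset.

```clojure
;; Hypothetical sample rows. The field names mirror the census data,
;; but the values are invented for illustration.
(def sample-rows
  [{:STATE 1 :POP100 10}
   {:STATE 1 :POP100 20}
   {:STATE 2 :POP100 5}
   {:STATE 2 :POP100 6}])

(defn mean-of
  "Arithmetic mean of a non-empty collection of numbers."
  [xs]
  (/ (apply + xs) (count xs)))

(defn rollup-sketch
  "Groups rows by group-field, extracts value-field from each group,
  and applies agg-fn to each group's values. Returns a map from group
  value to aggregate (a simplification; $rollup returns a dataset)."
  [agg-fn value-field group-field rows]
  (into {}
        (for [[group-value group-rows] (group-by group-field rows)]
          [group-value (agg-fn (map value-field group-rows))])))

;; Same argument order as $rollup: aggregate, field, grouping field, data.
(rollup-sketch mean-of :POP100 :STATE sample-rows)
;; => {1 15, 2 11/2}
```

The second group's mean comes back as the ratio 11/2 because its integer sum does not divide evenly by the count.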

$rollup defines some standard aggregation functions (:count, :sum, :min, :max, and :mean), but we can also use any other function that takes a collection of values and returns a single value. This is what we did with incanter.stats/sd. For full details of the $rollup function and the keyword aggregate functions it provides, see the documentation at http://liebke.github.com/incanter/core-api.html#incanter.core/$rollup.
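
As a minimal sketch of what "any other function" can look like, here is a small custom aggregate. The name value-range is hypothetical and the sample numbers are made up. The function only needs to accept a collection and return one value.

```clojure
(defn value-range
  "Hypothetical custom aggregate: the difference between the largest
  and smallest value in a non-empty collection of numbers."
  [xs]
  (- (apply max xs) (apply min xs)))

;; Tried on its own with made-up values:
(value-range [120 45 300 87])
;; => 255

;; With the census dataset loaded as in this recipe, it could be passed
;; in the aggregate position, just like incanter.stats/sd:
;; (i/$rollup value-range :POP100 :STATE census)
```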

As an aside, the numbers in the first example, which calculated the mean, are expressed as rational numbers. Rationals are exact, whereas the IEEE floating-point numbers that Clojure uses for its doubles are approximations. When Clojure divides two integers that don't divide evenly, the result is a rational number. If you want to see floating-point numbers instead, convert the values by passing them to float (or to double, for double precision):

user=> (/ 695433 172)
695433/172
user=> (float 695433/172)
4043.215
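
The following sketch, which uses only core Clojure functions, shows when division yields a ratio and how a whole sequence of ratios might be converted. The name r and the sample ratios are chosen here purely for illustration.

```clojure
(def r (/ 695433 172))

(ratio? r)          ;; => true
(ratio? (/ 6 3))    ;; => false (6 divides evenly by 3, so the result is the integer 2)

(class (float r))   ;; => java.lang.Float  (32-bit, single precision)
(class (double r))  ;; => java.lang.Double (64-bit, double precision)

;; Converting a sequence of ratios, such as a column of means:
(map double [1/2 3/4 7/2])
;; => (0.5 0.75 3.5)
```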