Selecting columns with $

Often, you need to cut the data to make it more useful. One common transformation is to pull out all the values from one or more columns into a new dataset. This can be useful for generating summary statistics or aggregating the values of some columns.

The Incanter macro $ slices out parts of a dataset. In this recipe, we'll see this in action.

Getting ready

For this recipe, we'll need to have Incanter listed in our project.clj file:

(defproject inc-dsets "0.1.0"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [incanter "1.5.5"]
                 [org.clojure/data.csv "0.1.2"]])

We'll also need to include these libraries in our script or REPL:

(require '[clojure.java.io :as io]
         '[clojure.data.csv :as csv]
         '[clojure.string :as str]
         '[incanter.core :as i])

Moreover, we'll need some data. This time, we'll use some country data from the World Bank. Point your browser to http://data.worldbank.org/country and select a country. I picked China. Under World Development Indicators, there is a button labeled Download Data. Click on this button and select CSV. This will download a ZIP file. I extracted its contents into the data/chn directory in my project. I bound the filename for the primary data file to the data-file name.

How to do it…

We'll use the $ macro in several different ways to get different results. First, however, we'll need to load the data into a dataset, which we'll do in steps 1 and 2:

  1. Before we start, we'll need a couple of utilities that load the data file into a sequence of maps and makes a dataset out of those:
    (defn with-header [coll]
      (let [headers (map #(keyword (str/replace % space -))
                         (first coll))]
        (map (partial zipmap headers) (next coll))))
    
    (defn read-country-data [filename]
      (with-open [r (io/reader filename)]
        (i/to-dataset
          (doall (with-header
                   (drop 2 (csv/read-csv r)))))))

  2. Now, using these functions, we can load the data:
    user=> (def chn-data (read-country-data data-file))
  3. We can select columns to be pulled out from the dataset by passing the column names or numbers to the $ macro. It returns a sequence of the values in the column:
    user=> (i/$ :Indicator-Code chn-data)
    ("AG.AGR.TRAC.NO" "AG.CON.FERT.PT.ZS" "AG.CON.FERT.ZS" …
  4. We can select more than one column by listing all of them in a vector. This time, the results are in a dataset:
    user=> (i/$ [:Indicator-Code :1992] chn-data)
    
    |           :Indicator-Code |               :1992 |
    |---------------------------+---------------------|
    |            AG.AGR.TRAC.NO |              770629 |
    |         AG.CON.FERT.PT.ZS |                     |
    |            AG.CON.FERT.ZS |                     |
    |            AG.LND.AGRI.K2 |             5159980 |
    …
  5. We can list as many columns as we want, although the formatting might suffer:
    user=> (i/$ [:Indicator-Code :1992 :2002] chn-data)
    
    |           :Indicator-Code |               :1992 |               :2002 |
    |---------------------------+---------------------+---------------------|
    |            AG.AGR.TRAC.NO |              770629 |                     |
    |         AG.CON.FERT.PT.ZS |                     |     122.73027213719 |
    |            AG.CON.FERT.ZS |                     |    373.087159048868 |
    |            AG.LND.AGRI.K2 |             5159980 |             5231970 |
    …

How it works…

The $ function is just a wrapper over Incanter's sel function. It provides a good way to slice columns out of the dataset, so we can focus only on the data that actually pertains to our analysis.

There's more…

The indicator codes for this dataset are a little cryptic. However, the code descriptions are in the dataset too:

user=> (i/$ [0 1 2] [:Indicator-Code :Indicator-Name] chn-data)

|   :Indicator-Code |                                               :Indicator-Name |
|-------------------+---------------------------------------------------------------|
|    AG.AGR.TRAC.NO |                              Agricultural machinery, tractors |
| AG.CON.FERT.PT.ZS |           Fertilizer consumption (% of fertilizer production) |
|    AG.CON.FERT.ZS | Fertilizer consumption (kilograms per hectare of arable land) |
…

See also…

  • For information on how to pull out specific rows, see the next recipe, Selecting rows with $.
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.191.233.205