We don't always work with numbers as they are. For example, population is often given in thousands. In this recipe, we'll scale some values to make them easier to work with. In fact, some algorithms work better with scaled data. For instance, linear regression models are sometimes able to fit the data better after the data has been scaled logarithmically.
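To make the two kinds of scaling mentioned above concrete, here is a minimal sketch in plain Clojure, with no Incanter required. The population figures and the names raw-populations, in-thousands, log10, and log-scaled are made up for illustration; they aren't part of the recipe's data.

```clojure
;; Hypothetical raw population counts, invented for illustration.
(def raw-populations [1200 45000 3800000 1340000000])

;; Express each count in thousands by dividing by 1000.
(def in-thousands (map #(/ % 1000.0) raw-populations))

;; Base-10 logarithm via java.lang.Math.
(defn log10 [x] (Math/log10 (double x)))

;; Logarithmic scaling compresses the wide range of the raw counts
;; into values that differ by only a few units.
(def log-scaled (map log10 raw-populations))
```

The raw counts span about six orders of magnitude, while the log-scaled values all fall between roughly 3 and 9, which is the kind of compression that can help some models.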
We'll use these dependencies in our project.clj file:
(defproject statim "0.1.0"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [incanter "1.5.5"]])
And we'll use these namespaces in our script or REPL:
(require '[incanter.core :as i] 'incanter.io)
For data, we'll use the Chinese development data from the World Bank, which we originally saw in the Selecting columns with $ recipe from Chapter 6, Working with Incanter Datasets. I've pulled out the data related to agricultural land use and rearranged the columns. You can download this from http://www.ericrochester.com/clj-data-analysis/data/chn-land.csv. For this chapter, I've downloaded it into the data directory:
(def data-file "data/chn-land.csv")
In this recipe, we'll scale the data in two ways:
(def data (incanter.io/read-dataset data-file :header true))
(def data
  (->> data
       (i/$where {:AG.LND.AGRI.K2 {:$ne nil}
                  :AG.SRF.TOTL.K2 {:$ne nil}})
       (i/add-derived-column :AGRI.PRC
                             [:AG.LND.AGRI.K2 :AG.SRF.TOTL.K2]
                             #(float (/ %1 %2)))))
The workhorse of this recipe is the function incanter.core/add-derived-column. This takes the values of one or more existing columns, passes them through a function, and then injects the result into the dataset under a new column. This kind of manipulation is done all the time, and this function makes that workflow a lot easier.
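As a rough illustration of that workflow on something smaller than the World Bank file, here's a sketch that runs on a toy dataset. The dataset itself, the column names :region, :population, and :population-k, and the divide-by-1000 function are all invented for this example; only dataset, add-derived-column, and $ come from incanter.core.

```clojure
(require '[incanter.core :as i])

;; A tiny in-memory dataset; the rows are made up for illustration.
(def toy
  (i/dataset [:region :population]
             [["north" 1500000]
              ["south" 250000]]))

;; Derive :population-k from :population by passing each value
;; through a function that converts it to thousands.
(def toy-scaled
  (i/add-derived-column :population-k
                        [:population]
                        #(/ % 1000.0)
                        toy))

;; Pull out the new column to inspect it.
(i/$ :population-k toy-scaled)
```

The original columns are left in place, so the derived column can be compared against its source values. When more than one source column is listed, the function receives one argument per column, as the two-column ratio in the recipe does.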