Scaling variables to simplify variable relationships

We don't always work with numbers as they are. For example, population is often given in thousands. In this recipe, we'll scale some values to make them easier to work with. In fact, some algorithms work better with scaled data. For instance, linear regression models are sometimes able to fit the data better after the data has been scaled logarithmically.
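
As a rough illustration of why a logarithmic scale can help, here is a minimal sketch in plain Clojure. The values are made up for the example (they are not from the World Bank data), and the names raw, logged, and steps are hypothetical. Because the raw values grow by a constant factor, their logarithms grow by a constant step, which is the kind of relationship a straight line can fit:

```clojure
;; Hypothetical values that triple at each step (not taken from the recipe's data).
(def raw [100.0 300.0 900.0 2700.0])

;; Scale the values logarithmically (natural log).
(def logged (map #(Math/log %) raw))

;; Differences between neighbouring values in a sequence.
(defn steps [xs]
  (map - (rest xs) xs))

(steps raw)     ; the steps keep growing
(steps logged)  ; every step is roughly (Math/log 3.0)
```

This only shows the idea. Whether a log scale actually improves a fit depends on the data.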

Getting ready

We'll use these dependencies in our project.clj file:

(defproject statim "0.1.0"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [incanter "1.5.5"]])

And we'll use these namespaces in our script or REPL:

(require '[incanter.core :as i]
         'incanter.io)

For data, we'll use the Chinese development data from the World Bank, which we originally saw in the Selecting columns with $ recipe from Chapter 6, Working with Incanter Datasets. I've pulled out the data related to agricultural land use and rearranged the columns. You can download this from http://www.ericrochester.com/clj-data-analysis/data/chn-land.csv. For this chapter, I've downloaded it into the data directory:

(def data-file "data/chn-land.csv")

How to do it…

In this recipe, we'll scale the data in two ways:

  1. Before we start scaling anything, we'll read in the data:
    (def data
      (incanter.io/read-dataset data-file :header true))
  2. We'll filter out null values and then scale the amount of land used for agriculture by the total amount of land:
    (def data
      (->> data
           (i/$where {:AG.LND.AGRI.K2 {:$ne nil},
                      :AG.SRF.TOTL.K2 {:$ne nil}})
           (i/add-derived-column :AGRI.PRC
                                 [:AG.LND.AGRI.K2 
                                  :AG.SRF.TOTL.K2]
                                 #(float (/ %1 %2)))))

How it works…

The workhorse of this recipe is the incanter.core/add-derived-column function. It takes the values of one or more existing columns, passes them to a function row by row, and adds the result to the dataset as a new column. This kind of manipulation comes up all the time, and this function makes that workflow a lot easier.
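
To see the function in isolation, here is a minimal sketch on a tiny, made-up dataset. The dataset and the names toy, toy-scaled, :part, :whole, and :ratio are hypothetical and exist only for this example. It assumes the same Incanter dependency as the rest of the recipe:

```clojure
(require '[incanter.core :as i])

;; A tiny made-up dataset: two columns, three rows.
(def toy
  (i/dataset [:part :whole]
             [[1 4]
              [2 8]
              [3 4]]))

;; Derive :ratio from :part and :whole. The function receives one value
;; from each source column, in the order the columns are listed.
(def toy-scaled
  (i/add-derived-column :ratio
                        [:part :whole]
                        #(double (/ %1 %2))
                        toy))

(i/$ :ratio toy-scaled)
;; => (0.25 0.25 0.75)
```

Dividing two integers in Clojure produces a ratio such as 1/4, so the sketch converts the result with double, much as the recipe does with float.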
