Projecting from multiple datasets with $join

So far, we've focused on splitting datasets up, dividing them into groups of rows or groups of columns with functions and macros such as $ and $where. Sometimes, however, we'd like to move in the other direction: we have two related datasets and want to join them into a larger one. For example, we might join crime data to census data, or take any two related datasets that come from separate sources and analyze them together.

Getting ready

First, we'll need to include these dependencies in our project.clj file:

(defproject inc-dsets "0.1.0"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [incanter "1.5.5"]
                 [org.clojure/data.csv "0.1.2"]])

We'll also require these namespaces:

(require '[clojure.java.io :as io]
         '[clojure.data.csv :as csv]
         '[clojure.string :as str]
         '[incanter.core :as i])

For our data file, we'll use the same data that we introduced in the Selecting columns with $ recipe: China's development dataset from the World Bank.

How to do it…

In this recipe, we'll take a look at how to join two datasets using Incanter:

  1. To begin with, we'll load the data from the data/chn/chn_Country_en_csv_v2.csv file. We'll use the with-header and read-country-data functions that were defined in the Selecting columns with $ recipe:
    (def data-file "data/chn/chn_Country_en_csv_v2.csv")
    (def chn-data (read-country-data data-file))
  2. Currently, each row contains the data for one indicator across many years. However, for some analyses, it's more helpful to have each row contain the data for one indicator for one year. To do this, let's first pull out the data for two years into separate datasets. Note that for the second dataset, we'll include only the column used to match the first dataset (:Indicator-Code) and the data column (:2000):
    (def chn-1990
      (i/$ [:Indicator-Code :Indicator-Name :1990]
           chn-data))
    (def chn-2000
      (i/$ [:Indicator-Code :2000] chn-data))
  3. Now, we'll join these datasets back together. This example is contrived, but it shows the pattern we'd follow in a more meaningful case, such as joining the datasets for two different countries:
    (def chn-decade
      (i/$join [:Indicator-Code :Indicator-Code]
               chn-1990 chn-2000))

From this point on, we can use chn-decade just as we use any other Incanter dataset.
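
To experiment with the join without the World Bank file, the same steps can be tried on a small in-memory dataset. This is only a minimal sketch: the names demo-1990, demo-2000, and demo-decade are hypothetical, the indicator codes and values are made-up placeholders rather than figures from the dataset, and it assumes that Incanter 1.5.5 is on the classpath as declared in project.clj.

```clojure
(require '[incanter.core :as i])

;; Hypothetical stand-ins for chn-1990 and chn-2000. The codes and
;; numbers are invented for illustration only.
(def demo-1990
  (i/dataset [:Indicator-Code :Indicator-Name :1990]
             [["X.1" "First indicator" 10.0]
              ["X.2" "Second indicator" 20.0]]))

(def demo-2000
  (i/dataset [:Indicator-Code :2000]
             [["X.1" 15.0]
              ["X.2" 25.0]]))

;; Join on the shared :Indicator-Code column, as in step 3.
(def demo-decade
  (i/$join [:Indicator-Code :Indicator-Code]
           demo-1990 demo-2000))

;; Inspect the result: its column names and its row count.
(println (i/col-names demo-decade))
(println (i/nrow demo-decade))
```

Working with a tiny dataset like this makes it easier to check by eye that the columns from both inputs appear in the joined result.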

How it works…

Let's take a look at this in more detail:

  (i/$join [:Indicator-Code :Indicator-Code] chn-1990 chn-2000)

The pair of column keywords in the vector ([:Indicator-Code :Indicator-Code]) names the keys that the datasets will be joined on. In this case, the :Indicator-Code column is used for both datasets, but the keys can be different for the two datasets. The first column listed comes from the first dataset (chn-1990), and the second column listed comes from the second dataset (chn-2000).
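
The idea of joining on differently named key columns can be illustrated with plain Clojure maps and clojure.set/join from the standard library. This is an analogy under stated assumptions, not a description of how $join is implemented: the column names :code and :indicator and all of the values are hypothetical, and clojure.set/join performs an inner join that returns a set of maps rather than an Incanter dataset.

```clojure
(require '[clojure.set :as set])

;; Hypothetical rows in which the key column has a different name
;; on each side (:code on the left, :indicator on the right).
(def left-rows
  #{{:code "X.1" :y1990 10.0}
    {:code "X.2" :y1990 20.0}})

(def right-rows
  #{{:indicator "X.1" :y2000 15.0}
    {:indicator "X.2" :y2000 25.0}})

;; The key map pairs the left-hand :code with the right-hand :indicator,
;; much as the vector passed to $join pairs one column from each dataset.
(def joined
  (set/join left-rows right-rows {:code :indicator}))

;; Each joined map carries the keys of both matching input rows.
(doseq [row joined]
  (println row))
```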

This returns a new dataset. Each row of this new dataset is a superset of the corresponding rows from the two input datasets.
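
The superset relationship can be checked on a single pair of rows. The sketch below uses plain maps and merge as a stand-in for one joined row; the names row-1990, row-2000, joined-row, and key-set are hypothetical, and the values are placeholders rather than real data.

```clojure
(require '[clojure.set :as set])

;; Hypothetical rows that share the same :Indicator-Code value.
(def row-1990
  {:Indicator-Code "X.1" :Indicator-Name "First indicator" :1990 10.0})
(def row-2000
  {:Indicator-Code "X.1" :2000 15.0})

;; A stand-in for one row of the joined dataset.
(def joined-row (merge row-1990 row-2000))

(defn key-set
  "Returns the keys of map m as a set."
  [m]
  (set (keys m)))

;; The joined row's keys include every key of each input row.
(println (set/superset? (key-set joined-row) (key-set row-1990))) ; true
(println (set/superset? (key-set joined-row) (key-set row-2000))) ; true
```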
