Using PCA to graph multi-dimensional data

So far, we've been limiting ourselves to two-dimensional data. After all, the human mind has a lot of trouble dealing with more than three dimensions, and even two-dimensional visualizations of three-dimensional space can be difficult to comprehend.

However, principal component analysis (PCA) can help. It projects higher-dimensional data onto fewer dimensions while preserving the most significant relationships in the data: the new axes are chosen to capture as much of the data's variance as possible. This makes the data easier to visualize in two or three dimensions, and it also provides a way to identify the most relevant features in a dataset.
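
To make "capturing variance" concrete, here is a minimal sketch in plain Clojure (no Incanter). The five points are made up for illustration and are not census data, and the helper names (mean, variance, project) are our own. It compares how much variance survives when the same points are projected onto two different directions:

```clojure
;; Toy data (an assumption for illustration): five 2-D points near the line y = x.
(def points [[1.0 1.1] [2.0 1.9] [3.0 3.2] [4.0 3.8] [5.0 5.0]])

(defn mean [xs] (/ (reduce + xs) (count xs)))

(defn variance
  "Sample variance of a sequence of numbers."
  [xs]
  (let [m (mean xs)]
    (/ (reduce + (map #(let [d (- % m)] (* d d)) xs))
       (dec (count xs)))))

(defn project
  "Projects each point onto a unit-length direction (one dot product per point)."
  [direction pts]
  (map (fn [p] (reduce + (map * p direction))) pts))

;; Two candidate one-dimensional views of the same data.
(def along-x-axis [1.0 0.0])
(def along-diagonal (let [r (/ 1.0 (Math/sqrt 2.0))] [r r]))

(def var-x (variance (project along-x-axis points)))
(def var-diag (variance (project along-diagonal points)))

;; Total variance across both original axes, for comparison.
(def total-var (+ (variance (map first points))
                  (variance (map second points))))
```

Here var-x is 2.5, while var-diag is about 4.86 out of a total of about 4.88. The diagonal is not necessarily the exact direction PCA would choose, but it is close to it for these points, which is why a single axis retains nearly all of the variability.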

In this recipe, we'll take the data from the US census by race that we've worked with in previous chapters, and create a two-dimensional scatter plot of it.

Getting ready

We'll use the same dependencies in our project.clj file as we did in the Creating Scatter Plots with Incanter recipe, and this set of imports in our script or REPL:

(require '[incanter.core :as i]
         '[incanter.charts :as c]
         '[incanter.io :as iio]
         '[incanter.stats :as s])

We'll use the aggregated census race data for all states. You can download this from http://www.ericrochester.com/clj-data-analysis/data/all_160.P3.csv. We'll assign it to the race-data variable:

(def race-data (iio/read-dataset "data/all_160.P3.csv"
                                 :header true))

How to do it...

We'll first summarize the data to make it more manageable and easier to visualize. Then we'll use PCA to project it onto a two-dimensional space and graph that view of the data:

  1. First, we need to summarize the columns that we're interested in, getting the total population of each racial group by state:
    (def fields [:P003002 :P003003 :P003004 :P003005
                 :P003006 :P003007 :P003008])
    (def race-by-state
      (reduce #(i/$join [:STATE :STATE] %1 %2)
              (map #(i/$rollup :sum % :STATE race-data)
                   fields)))
  2. Next, we'll take the summary and create a matrix from it. From that matrix, we'll extract the columns that we're interested in analyzing and graphing:
    (def race-by-state-matrix (i/to-matrix race-by-state))
    (def x (i/sel race-by-state-matrix :cols (range 1 8)))
  3. Now we'll perform the principal component analysis:
    (def pca (s/principal-components x))
  4. From the output of the PCA, we'll take the first two principal components (the first two columns of the rotation matrix) and multiply the data matrix by each of them. This gives every row a coordinate along each component:
    (def components (:rotation pca))
    (def pc1 (i/sel components :cols 0))
    (def pc2 (i/sel components :cols 1))
    (def x1 (i/mmult x pc1))
    (def x2 (i/mmult x pc2))
  5. Now we can use x1 and x2 as the coordinates for a two-dimensional scatter plot:
    (def pca-plot
      (c/scatter-plot
        x1 x2
        :x-label "PC1", :y-label "PC2"
        :title "Census Race Data by State"))
  6. We can view that chart as we normally would:
    (i/view pca-plot)

    This provides us with a graph expressing the most salient features of the dataset in two dimensions:

    [Figure: scatter plot of PC1 against PC2, titled "Census Race Data by State"]
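
The matrix multiplication in step 4 amounts to one dot product per row. The following is a minimal sketch of that idea in plain Clojure rather than Incanter. The rows and the component vector are made-up stand-ins, not values from the census data, and dot is a helper defined here for illustration:

```clojure
;; Toy stand-ins (assumptions): three rows with two features each,
;; and one unit-length component vector.
(def toy-rows [[3.0 4.0]
               [1.0 0.0]
               [0.0 2.0]])
(def toy-component [0.6 0.8])

(defn dot
  "Sum of the pairwise products of two equal-length vectors."
  [u v]
  (reduce + (map * u v)))

;; One score per row: that row's coordinate along the component.
;; This mirrors the shape of (i/mmult x pc1): n rows in, n scores out.
(def toy-scores (mapv #(dot % toy-component) toy-rows))
```

The scores come out at approximately 5.0, 0.6, and 1.6. In the recipe, x1 and x2 are the same kind of per-row scores, computed for the first and second components.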

How it works...

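Conceptually, PCA rotates the dataset's coordinate axes so that the first new axis points in the direction of greatest variance, the second in the direction of greatest remaining variance, and so on. Keeping only the first few axes projects the data onto a lower-dimensional space while retaining as much of its variability as possible.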
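
One way to see where such an axis comes from is power iteration on a covariance matrix. This is a swapped-in technique chosen because it fits in a few lines of plain Clojure; it is not a description of how Incanter computes its components. The data and all function names below are assumptions made for this sketch:

```clojure
;; Toy 2-D data (made up for illustration).
(def raw [[2.0 1.0] [3.0 2.5] [4.0 2.9] [5.0 4.2] [6.0 4.4]])

(defn mean [xs] (/ (reduce + xs) (count xs)))

(defn center
  "Subtracts the per-column mean from every point."
  [pts]
  (let [mx (mean (map first pts))
        my (mean (map second pts))]
    (mapv (fn [[x y]] [(- x mx) (- y my)]) pts)))

(defn covariance-2d
  "Sample covariance matrix [[sxx sxy] [sxy syy]] of centered 2-D points."
  [pts]
  (let [n (dec (count pts))
        s (fn [f g] (/ (reduce + (map #(* (f %) (g %)) pts)) n))]
    [[(s first first) (s first second)]
     [(s first second) (s second second)]]))

(defn mat-vec
  "Multiplies a 2x2 matrix by a 2-element vector."
  [[[a b] [c d]] [x y]]
  [(+ (* a x) (* b y)) (+ (* c x) (* d y))])

(defn normalize
  "Scales a 2-element vector to unit length."
  [[x y]]
  (let [len (Math/sqrt (+ (* x x) (* y y)))]
    [(/ x len) (/ y len)]))

(defn first-component
  "Power iteration: repeatedly apply the covariance matrix and renormalize.
  Converges to the direction of greatest variance for this kind of data."
  [cov iterations]
  (nth (iterate #(normalize (mat-vec cov %)) [1.0 0.0]) iterations))

(def toy-cov (covariance-2d (center raw)))
(def toy-pc1 (first-component toy-cov 100))
```

The resulting toy-pc1 is a unit vector, and the variance of the data along it is larger than the variance along either original axis, which is the property the first principal component is defined by.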

In the preceding chart, we can see that most of the data clusters around the origin, while a few points trail off toward larger values.

There's more...
