Robust principal components

In this recipe, we will work with robust principal components. Principal components are used to project data into a smaller subspace that is easier to work with. It is probably the most important dimensionality reduction technique. 

Analyzing and working with lots of features is usually complicated for two main reasons:

  • It's difficult to find patterns among them, because many of them might be correlated with each other.
  • Using many variables to predict another variable often introduces a significant amount of noise. Ideally, we would like to compress the information contained in the data in order to work with a simpler model.

In order to introduce principal components, let's review a basic example. Let's assume we have football and rugby scores for two students:

Person     Football    Rugby
John       10          8
Michael    3           5

It would actually be best if we could find a single variable that collapses the information from both variables. We could, for example, compute 0.5 x 10 + 0.5 x 8 = 9 for the first student and 0.5 x 3 + 0.5 x 5 = 4 for the second one. This would be useful, because we could now work with a single metric. But a natural question arises: can we multiply the scores by a set of weights different from (0.5, 0.5) in order to capture more of the variability that exists between those two variables?

Principal components are just linear combinations of variables that capture the main directions of variability in our datasets. They have an interesting property: each one of them is orthogonal to the others. Intuitively, this means that the variability captured by each component is not contained in the other ones.

Mathematically, principal components are extracted by computing the eigenvalues and eigenvectors of the covariance matrix. Each eigenvector has as many elements as there are variables, and it is associated with an eigenvalue whose size determines how much variability is explained by that eigenvector. Going back to our example, we would have two principal components, each with two entries; projecting the data amounts to multiplying each observation by each eigenvector (a dot product) to obtain its score on that component.
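The following is a minimal sketch of this computation using NumPy. The extra rows in the score table are made up purely for illustration, since two observations are not enough for a meaningful covariance matrix; the variable names are our own and not part of any particular library's API.

```python
import numpy as np

# Hypothetical football and rugby scores (made-up data for illustration)
scores = np.array([
    [10, 8],
    [3, 5],
    [7, 9],
    [4, 2],
], dtype=float)

# Center the data so the covariance structure drives the components
centered = scores - scores.mean(axis=0)

# Covariance matrix of the two variables (football, rugby)
cov = np.cov(centered, rowvar=False)

# Eigen-decomposition: eigenvectors are the principal components,
# eigenvalues measure how much variance each component explains
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Sort components from largest to smallest eigenvalue
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Project the data onto the components (one score per component per person)
projected = centered @ eigenvectors

print("explained variance:", eigenvalues)
print("loadings (weights per variable):\n", eigenvectors)
print("projected data:\n", projected)
```

The columns of the loadings matrix play the role of the weight pairs discussed above: the first column is the weighting that captures the most variability, and the second captures whatever is left, orthogonally to the first.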

Principal component analysis is usually used in two ways:

  • To project features into a smaller subspace that is more robust and provides more stable predictions.
  • To find groups of variables that are related. This can be determined by finding variables that have high loadings into some principal components.

A separate question is how many principal components should be kept. A scree plot is usually used, which shows the eigenvalue for each principal component; a common rule of thumb is to keep all the components whose eigenvalues are greater than 1 (this makes most sense when the variables have been standardized, so that each original variable contributes a variance of 1).
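As an illustration, the following sketch draws a scree plot and applies the eigenvalue-greater-than-1 rule to a hypothetical set of eigenvalues; the numbers are made up and stand in for the output of a real PCA on standardized data.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical eigenvalues from a PCA on standardized data
eigenvalues = np.array([3.1, 1.4, 0.9, 0.4, 0.2])

# Scree plot: eigenvalue against component index
plt.plot(range(1, len(eigenvalues) + 1), eigenvalues, marker="o")
plt.axhline(1, linestyle="--", label="eigenvalue = 1 rule")
plt.xlabel("Principal component")
plt.ylabel("Eigenvalue")
plt.legend()
plt.show()

# Keep the components whose eigenvalues exceed 1
kept = int(np.sum(eigenvalues > 1))
print(f"components kept: {kept}")
```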
