Principal Component Analysis

"The only easy day was yesterday."

- A Special Forces motivational saying

This is the second chapter to focus on unsupervised learning techniques. In the previous chapter, we covered cluster analysis, which groups similar observations together. In this chapter, we will see how to reduce dimensionality and improve our understanding of the data by grouping correlated variables with principal component analysis (PCA). Then, we will use the principal components in supervised learning.

In many datasets, particularly in the social sciences, you will see many variables that are highly correlated with each other. They may additionally suffer from high dimensionality or, as it is better known, the curse of dimensionality. This is a problem because the number of samples needed to estimate a function grows exponentially with the number of input features. In such datasets, some variables may be redundant because they end up measuring the same constructs, for example, income and poverty or depression and anxiety. The goal then is to use PCA to create a smaller set of variables that captures most of the information (the variance) in the original set, thus simplifying the dataset and often leading to hidden insights. These new variables, the principal components, are uncorrelated with each other. Beyond supervised learning, it is also very common to use these components for data visualization.
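To make the idea of redundant variables concrete, here is a minimal sketch in Python using only the standard library. The data is synthetic and invented purely for illustration: `var_a` and `var_b` are hypothetical noisy measurements of one shared construct. The sketch relies on a known property of the two-variable case: once both variables are standardized, their correlation matrix is [[1, r], [r, 1]], whose eigenvalues are 1 + r and 1 - r, so the two components can be written down directly. It assumes r is positive; with more than two variables you would use a numerical eigendecomposition instead.

```python
import math
import random

def standardize(values):
    """Center to mean 0 and scale to unit (sample) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

def correlation(a, b):
    """Pearson correlation of two equally long lists."""
    za, zb = standardize(a), standardize(b)
    return sum(x * y for x, y in zip(za, zb)) / (len(a) - 1)

# Synthetic data (made up for illustration): two noisy measurements of one
# shared underlying construct, so the variables are strongly correlated.
random.seed(42)
construct = [random.gauss(0, 1) for _ in range(500)]
var_a = [c + random.gauss(0, 0.3) for c in construct]
var_b = [c + random.gauss(0, 0.3) for c in construct]

r = correlation(var_a, var_b)

# For two standardized variables the correlation matrix is [[1, r], [r, 1]].
# Its eigenvectors are (1, 1)/sqrt(2) and (1, -1)/sqrt(2), with eigenvalues
# 1 + r and 1 - r, so (assuming r > 0) the components are:
za, zb = standardize(var_a), standardize(var_b)
pc1 = [(x + y) / math.sqrt(2) for x, y in zip(za, zb)]
pc2 = [(x - y) / math.sqrt(2) for x, y in zip(za, zb)]

share_pc1 = (1 + r) / 2  # PC1's fraction of the total variance (which is 2)
pc_correlation = correlation(pc1, pc2)

print(f"correlation of original variables: {r:.3f}")
print(f"share of variance in PC1:          {share_pc1:.3f}")
print(f"correlation between PC1 and PC2:   {pc_correlation:.3f}")
```

The two original variables are strongly correlated, yet the two components are uncorrelated up to floating-point error, and the first component alone carries most of the variance. That is the sense in which one new variable can stand in for two redundant ones.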

From over a decade of either doing or supporting analytics using PCA, it has been my experience that it is widely used but poorly understood, especially among people who don't do the analysis but consume the results. It is intuitive to understand that you are creating a new variable from other correlated variables. However, the technique itself is shrouded in easily misunderstood terminology and mathematical concepts that often bewilder the layperson. The intention here is to provide a good foundation in what PCA is and how to use it by covering the following:

Preparing a dataset for PCA
Conducting PCA
Selecting our principal components
Building a predictive model using principal components
Making out-of-sample predictions using the predictive model
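
As a rough preview of how these five steps fit together, the following Python sketch walks through them on a toy problem. It is not the chapter's own code: the data is synthetic, the names (`x1`, `x2`, `first_component`, and so on) are hypothetical, and it uses the two-variable shortcut in which the first component of two standardized, positively correlated variables is their scaled sum. The point it tries to show is the ordering: scaling constants and the component are learned on the training rows and then reused unchanged on the holdout rows.

```python
import math
import random

def mean_sd(values):
    """Return the mean and sample standard deviation of a list."""
    n = len(values)
    m = sum(values) / n
    return m, math.sqrt(sum((v - m) ** 2 for v in values) / (n - 1))

# Synthetic data (made up): x1 and x2 both track a latent factor that drives y.
random.seed(7)
rows = []
for _ in range(400):
    latent = random.gauss(0, 1)
    x1 = 10 + 2 * latent + random.gauss(0, 0.5)
    x2 = 50 + 5 * latent + random.gauss(0, 1.5)
    y = 3 + 1.5 * latent + random.gauss(0, 0.3)
    rows.append((x1, x2, y))
train, test = rows[:300], rows[300:]

# 1. Prepare: learn centering and scaling constants from the training rows only.
m1, s1 = mean_sd([row[0] for row in train])
m2, s2 = mean_sd([row[1] for row in train])

def first_component(x1, x2):
    """Score on PC1 of two standardized, positively correlated variables."""
    return ((x1 - m1) / s1 + (x2 - m2) / s2) / math.sqrt(2)

# 2. Conduct PCA: with two standardized variables, PC1 has eigenvalue 1 + r12.
z1 = [(row[0] - m1) / s1 for row in train]
z2 = [(row[1] - m2) / s2 for row in train]
r12 = sum(a * b for a, b in zip(z1, z2)) / (len(train) - 1)

# 3. Select: keep only PC1 when it holds most of the total variance.
share_pc1 = (1 + r12) / 2

# 4. Build the model: least-squares regression of y on the PC1 scores.
scores = [first_component(row[0], row[1]) for row in train]
ys = [row[2] for row in train]
score_mean, y_mean = sum(scores) / len(scores), sum(ys) / len(ys)
slope = (sum((s - score_mean) * (y - y_mean) for s, y in zip(scores, ys))
         / sum((s - score_mean) ** 2 for s in scores))
intercept = y_mean - slope * score_mean

# 5. Predict out of sample, reusing the training constants unchanged.
predictions = [intercept + slope * first_component(row[0], row[1]) for row in test]
rmse = math.sqrt(sum((p - row[2]) ** 2 for p, row in zip(predictions, test)) / len(test))
baseline = math.sqrt(sum((y_mean - row[2]) ** 2 for row in test) / len(test))

print(f"PC1 variance share: {share_pc1:.3f}")
print(f"holdout RMSE: {rmse:.3f} (predicting the training mean: {baseline:.3f})")
```

Comparing the holdout error against a naive baseline that always predicts the training mean gives a quick check that the single retained component carries real predictive signal.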
