Chapter 6. Principal Component Analysis

Principal component analysis, or PCA, is one of the minor miracles of machine learning. It’s a dimensionality reduction technique that reduces the number of dimensions in a dataset without sacrificing a commensurate amount of information. While that might seem underwhelming on the face of it, it has profound implications for engineers and software developers working to build predictive models from their data.

What if I told you that you could take a dataset with 1,000 columns, use PCA to reduce it to 100 columns, and retain 90% or more of the information in the original dataset? That’s relatively common, believe it or not. And it lends itself to a variety of practical uses, including:

  • Reducing high-dimensional data to two or three dimensions so that it can be plotted and explored

  • Reducing the number of dimensions in a dataset and then restoring the original number of dimensions, which finds application in anomaly detection and noise filtering

  • Anonymizing datasets so that they can be shared with others without revealing the nature or meaning of the data

And that’s not all. A side effect of applying PCA to a dataset is that less important features (columns of data that have less relevance to the outcome of a predictive model) are removed, while dependencies between columns are eliminated. And in datasets with a low ratio of samples (rows) to features (columns), PCA can be used to increase that ratio. As a rule of thumb, you typically want a dataset used for machine learning to have at least five times as many rows as it has columns. If you can’t add rows, an alternative is to use PCA to shave columns.
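Here’s a minimal sketch of that last point (the dataset is synthetic and the shapes are arbitrary), trimming a wide dataset until it satisfies the five-to-one guideline:

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical wide dataset: 400 rows, 120 columns (a ratio of just over 3:1)
rng = np.random.default_rng(0)
x = rng.normal(size=(400, 120))

# Shave columns until there are at least five times as many rows as columns
max_columns = x.shape[0] // 5    # 400 rows -> at most 80 columns
pca = PCA(n_components=max_columns, random_state=0)
x_reduced = pca.fit_transform(x)

print(x_reduced.shape)           # (400, 80)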

Once you learn about PCA, you’ll wonder how you lived without it. Let’s take a few moments to understand what it is and how it works. Then we’ll look at some examples demonstrating why it’s such an indispensable tool.

Understanding Principal Component Analysis

One way to wrap your head around PCA is to see how it reduces a two-dimensional dataset to one dimension. Figure 6-1 depicts a 2D dataset comprising a somewhat random collection of x and y values. If you reduced this dataset to a single dimension by simply dropping the x column or the y column, you’d be left with a horizontal or vertical line that bears little resemblance to the original dataset.

Figure 6-1. Two-dimensional dataset

Figure 6-2 adds arrows representing the dataset’s two principal components. Essentially, the coordinate system has been transformed so that one axis (the longer of the two arrows) captures most of the variance in the dataset. This is the dataset’s primary principal component. The other axis contains a narrower range of values and represents the secondary principal component. The number of principal components equals the number of dimensions in a dataset, so in this example, there are two principal components.

Figure 6-2. Arrows depicting the principal components of a two-dimensional dataset

To reduce a two-dimensional dataset to one dimension, PCA finds the two principal components and eliminates the one with less variance. This effectively projects the data points onto the primary principal component axis, as shown in Figure 6-3. The red data points don’t retain all of the information in the original dataset, but they contain most of it. In this example, the PCAed dataset retains more than 95% of the information in the original. PCA reduced the number of dimensions by 50%, but it sacrificed less than 5% of the meaningful information in the dataset. That’s the gist of PCA: reducing the number of dimensions without incurring a commensurate loss of information.
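If you’d like to see numbers like these for yourself, here’s a minimal sketch (with synthetic data standing in for the points in the figure, so the exact percentage will differ) that reduces a correlated 2D dataset to one dimension and reports how much of the variance the primary principal component captures:

import numpy as np
from sklearn.decomposition import PCA

# Synthetic 2D dataset in which the x and y values are strongly correlated
rng = np.random.default_rng(0)
x = rng.normal(size=(250, 1))
points = np.hstack([x, 0.8 * x + rng.normal(scale=0.15, size=(250, 1))])

# Keep only the primary principal component and see how much variance it captures
pca = PCA(n_components=1)
reduced = pca.fit_transform(points)
print(pca.explained_variance_ratio_)    # close to 1.0 for data this correlated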

Under the hood, PCA works its magic by building a covariance matrix that quantifies the variance of each dimension with respect to the others, and then computing the matrix’s eigenvectors and eigenvalues to identify the dataset’s principal components. If you’d like to dig deeper, I suggest reading “A Step-by-Step Explanation of Principal Component Analysis (PCA)” by Zakaria Jaadi. The good news is that you don’t have to understand the math to make PCA work, because Scikit-Learn’s PCA class does the math for you. The following statements reduce the dataset x to five dimensions, regardless of the number of dimensions it originally contains:

from sklearn.decomposition import PCA

pca = PCA(n_components=5)
x = pca.fit_transform(x)

Figure 6-3. Two-dimensional dataset (blue) reduced to one dimension (red) with PCA

You can also invert a PCA transform to restore the original number of dimensions:

x = pca.inverse_transform(x)

The inverse_transform method restores the dataset to its original number of dimensions, but it doesn’t restore the original dataset. The information that was discarded when the PCA transform was applied will be missing from the restored dataset.
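A quick way to convince yourself of that is to round-trip a small dataset and compare it to the original (a minimal sketch using made-up data):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 10))                 # made-up 10-dimensional dataset

pca = PCA(n_components=5)
x_reduced = pca.fit_transform(x)               # 10 dimensions down to 5
x_restored = pca.inverse_transform(x_reduced)  # back up to 10 dimensions

print(x_restored.shape == x.shape)             # True: the shape is restored...
print(np.allclose(x, x_restored))              # False: ...but the discarded information isn't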

You can visualize the loss of information when a PCA transform is applied and then inverted using the Labeled Faces in the Wild (LFW) dataset introduced in Chapter 5. To demonstrate, fire up a Jupyter notebook and run the following code:

%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_lfw_people

# Load the LFW faces and crop each image to 62 x 47 pixels
faces = fetch_lfw_people(min_faces_per_person=100, slice_=None)
faces.images = faces.images[:, 35:97, 39:86]

# Flatten each image into a row of 2,914 pixel values
faces.data = faces.images.reshape(faces.images.shape[0], faces.images.shape[1] *
                                  faces.images.shape[2])

# Plot the first 24 images
fig, ax = plt.subplots(3, 8, figsize=(18, 10))

for i, axi in enumerate(ax.flat):
    axi.imshow(faces.images[i], cmap='gist_gray')
    axi.set(xticks=[], yticks=[], xlabel=faces.target_names[faces.target[i]])

The output shows the first 24 images in the dataset:

Each image measures 47 × 62 pixels, for a total of 2,914 pixels per image. That means the dataset has 2,914 dimensions. Now use the following code to reduce the number of dimensions to 150 (roughly 5% of the original number), restore the original 2,914 dimensions, and plot the restored images:

from sklearn.decomposition import PCA

pca = PCA(n_components=150, random_state=0)
pca_faces = pca.fit_transform(faces.data)

# Restore all 1,140 images to their original 62 x 47 shape
unpca_faces = pca.inverse_transform(pca_faces).reshape(1140, 62, 47)

fig, ax = plt.subplots(3, 8, figsize=(18, 10))

for i, axi in enumerate(ax.flat):
    axi.imshow(unpca_faces[i], cmap='gist_gray')
    axi.set(xticks=[], yticks=[], xlabel=faces.target_names[faces.target[i]])

Even though you removed almost 95% of the dimensions in the dataset, little meaningful information was discarded. The restored images are slightly blurrier than the originals, but the faces are still recognizable:

To reiterate, you reduced the number of dimensions from 2,914 to 150, but because PCA found 2,914 principal components and removed the ones that are least important (the ones with the least variance), you retained the bulk of the information in the original dataset. That raises a question: precisely how much of the original information was retained?

After a PCA object is fit to a dataset, you can find out how much variance is encoded in each principal component from the explained_variance_ratio_ attribute. It’s an array with one element for each principal component in the transformed dataset.
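To see how it looks after the LFW dataset is reduced to 150 dimensions, evaluate the attribute in a notebook cell (pca here is the 150-component transform fitted earlier):

pca.explained_variance_ratio_

The output is as follows: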

array([0.20098166, 0.1436709 , 0.0694095 , 0.0554688 , 0.04888214,
       0.02838693, 0.02344352, 0.02056908, 0.01904505, 0.01790946,
       0.01446775, 0.0141357 , 0.01173403, 0.01033751, 0.00927581,
       0.00900304, 0.00895557, 0.00830898, 0.00770731, 0.00712525,
       0.0064077 , 0.00619189, 0.00582111, 0.00557892, 0.00535471,
       0.00494034, 0.00482188, 0.00446195, 0.00443723, 0.00404091,
       0.00382839, 0.00370596, 0.00363874, 0.00352478, 0.00335927,
       0.00328615, 0.00315452, 0.00309412, 0.00290268, 0.00284517,
       0.00278296, 0.00267253, 0.00258336, 0.00249151, 0.00243766,
       0.00239778, 0.00237984, 0.00231506, 0.00223432, 0.00220306,
       0.00208555, 0.00207567, 0.00204288, 0.00196099, 0.00192303,
       0.00189352, 0.0018381 , 0.00180081, 0.00178862, 0.00174389,
       0.00168321, 0.00165759, 0.00162565, 0.00159976, 0.00153559,
       0.00152782, 0.00150262, 0.0014841 , 0.00147757, 0.00144323,
       0.00140246, 0.00138122, 0.00136053, 0.00132581, 0.00130121,
       0.00128062, 0.00126851, 0.00123904, 0.00123427, 0.00120644,
       0.00118998, 0.00117278, 0.00116551, 0.00115161, 0.00111428,
       0.00108951, 0.00107443, 0.00105793, 0.00104903, 0.00104119,
       0.00099986, 0.00098006, 0.00097077, 0.00095622, 0.00093874,
       0.00092516, 0.00091716, 0.00091061, 0.00090051, 0.00087887,
       0.00086778, 0.0008543 , 0.00084502, 0.00082587, 0.00081203,
       0.00080346, 0.00079375, 0.00077893, 0.00077295, 0.00077045,
       0.00075456, 0.00073704, 0.00073038, 0.00072013, 0.0007093 ,
       0.00070115, 0.00069389, 0.00067964, 0.00067382, 0.00065503,
       0.0006506 , 0.00063969, 0.00063328, 0.00062684, 0.00062352,
       0.0006103 , 0.00060463, 0.00059769, 0.00058182, 0.00057901,
       0.00056648, 0.00056551, 0.00054979, 0.00054543, 0.00053753,
       0.0005361 , 0.00053067, 0.00051841, 0.00051382, 0.00050711,
       0.00049933, 0.0004919 , 0.00048888, 0.00047992, 0.00047919,
       0.00046916, 0.00046408, 0.00046142, 0.00045397, 0.0004432 ],
      dtype=float32)

This reveals that 20% of the variance in the dataset is explained by the primary principal component, 14% is explained by the secondary principal component, and so on. Observe that the numbers decrease as the index increases. By definition, each principal component in a PCAed dataset contains more information than the principal component after it. In this example, the 2,764 principal components that were discarded contained so little information that their loss was barely noticeable when the transform was inverted. In fact, the sum of the 150 numbers in the preceding example is 0.938. This means reducing the dataset from 2,914 dimensions to 150 retained 93.8% of the information in the original dataset. In other words, you reduced the number of dimensions by almost 95%, and yet you retained almost 94% of the information in the dataset. If that’s not awesome, I don’t know what is.
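If you want to verify that sum for yourself, it’s a one-liner (pca is the fitted 150-component transform from earlier):

import numpy as np

np.sum(pca.explained_variance_ratio_)   # approximately 0.938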

A logical question to ask is, what is the “right” number of components? In other words, what number of components strikes the best balance between reducing the number of dimensions in the dataset and retaining most of the information? One way to find that number is with a scree plot, which charts the proportion of explained variance for each dimension. The following code produces a scree plot for the PCA transform used on the facial images:

import seaborn as sns
sns.set()

plt.plot(pca.explained_variance_ratio_)
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance')

Here is the output:

Another way to look at it is to plot the cumulative sum of the variances as a function of component count:

import numpy as np

plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance');

Here is the output:

Either way you look at it, the bulk of the information is contained in the first 50 to 100 dimensions. Based on these plots, if you reduced the number of dimensions to 50 instead of 150, would you expect the restored facial images to look substantially different? If you’re not sure, try it and see.
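Here’s one way to try it: a minimal sketch that repeats the reduce-and-restore exercise with 50 components instead of 150 (it assumes faces is still loaded from the earlier cells):

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

pca_50 = PCA(n_components=50, random_state=0)
pca_faces_50 = pca_50.fit_transform(faces.data)
unpca_faces_50 = pca_50.inverse_transform(pca_faces_50).reshape(-1, 62, 47)

# How much of the variance do 50 components retain?
print(pca_50.explained_variance_ratio_.sum())

fig, ax = plt.subplots(3, 8, figsize=(18, 10))

for i, axi in enumerate(ax.flat):
    axi.imshow(unpca_faces_50[i], cmap='gist_gray')
    axi.set(xticks=[], yticks=[], xlabel=faces.target_names[faces.target[i]])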

Filtering Noise

One very practical use for PCA is to filter noise from data. Noise is data that is random, corrupt, or otherwise meaningless, and it’s particularly likely to occur when the data comes from physical devices such as pressure sensors or accelerometers. The basic approach to using PCA for noise reduction is to PCA-transform the data and then invert the transform, reducing the dataset from m dimensions to n and then restoring it to m. Because PCA discards the least important information when reducing dimensions and noise tends to have little or no informational value, this ideally eliminates much of the noise while retaining most of the meaningful data.

You can test this supposition with the LFW dataset. Use the following statements to add noise to the facial images using a random-number generator and plot the first 24 images:

%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_lfw_people
import numpy as np

faces = fetch_lfw_people(min_faces_per_person=100, slice_=None)
faces.images = faces.images[:, 35:97, 39:86]
faces.data = faces.images.reshape(faces.images.shape[0], faces.images.shape[1] *
                                  faces.images.shape[2])
 
# Add Gaussian noise to every pixel value
np.random.seed(0)
noisy_faces = np.random.normal(faces.data, 0.0765)

fig, ax = plt.subplots(3, 8, figsize=(18, 10))

for i, axi in enumerate(ax.flat):
    axi.imshow(noisy_faces[i].reshape(62, 47), cmap='gist_gray')
    axi.set(xticks=[], yticks=[], xlabel=faces.target_names[faces.target[i]])

The resulting facial images resemble a staticky 1960s TV screen:

Now use PCA to reduce the number of dimensions. Rather than specify the number of dimensions (components), we’ll specify that we want to retain 80% of the information in the dataset. We’ll let Scikit decide how many dimensions that requires, and then show the count:

from sklearn.decomposition import PCA

pca = PCA(0.8, random_state=0)
pca_faces = pca.fit_transform(noisy_faces)
pca.n_components_

PCA reduced the number of dimensions from 2,914 to 179, but the remaining dimensions contain 80% of the information in the original 2,914. Now reconstruct the facial images from the PCAed faces and show the results:

unpca_faces = pca.inverse_transform(pca_faces)

fig, ax = plt.subplots(3, 8, figsize=(18, 10))

for i, axi in enumerate(ax.flat):
    axi.imshow(unpca_faces[i].reshape(62, 47), cmap='gist_gray')
    axi.set(xticks=[], yticks=[], xlabel=faces.target_names[faces.target[i]])

Here is the output:

The reconstructed dataset isn’t quite as clean as the original, but it’s clean enough that you can make out the faces in the photos.

Anonymizing Data

Chapter 3 demonstrated how to use various learning algorithms to build a binary classification model that detects credit card fraud. The dataset used in the example contained real credit card data that had been anonymized to protect the card holders (and the credit card company’s intellectual property). The first 10 rows of that dataset are pictured in Figure 6-4.

Figure 6-4. Anonymized fraud detection dataset

Another practical use for PCA is to anonymize data in this manner. It’s generally a two-step process:

  1. Use PCA to “reduce” the dataset from m dimensions to m, where m is the original number of dimensions. No information is discarded; the data is simply re-expressed in terms of its m principal components.

  2. Normalize the data so that it has unit variance.

The second step isn’t required, but it does make the ranges of values more uniform. Data anonymized this way can still be used to train a machine learning model, but its original meaning can’t be inferred.

Try it with a dataset of your own. First, use the following code to load Scikit’s breast cancer dataset and display the first five rows:

import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data=data.data, columns=data.feature_names)
pd.set_option('display.max_columns', 6)
df.head()

The output is as follows:

The dataset contains 30 columns, not counting the label column. Now use the following statements to find the 30 principal components and apply StandardScaler to the transformed data:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

pca = PCA(n_components=30, random_state=0)
pca_data = pca.fit_transform(df)

scaler = StandardScaler()
anon_df = pd.DataFrame(scaler.fit_transform(pca_data))
pd.set_option('display.max_columns', 8)
anon_df.head()

The result is as follows:

The dataset is unrecognizable after the PCA transform. Without the fitted transform, it’s impossible to work backward and reconstruct the original data. Yet the sum of the explained_variance_ratio_ values is 1.0, which means no information was lost. You can prove it this way:

import numpy as np

np.sum(pca.explained_variance_ratio_)

The PCAed dataset is just as useful for machine learning as the original. Furthermore, if you want to share the dataset with others so that they can train models of their own, there is no risk of divulging sensitive or proprietary information.
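If you’d like evidence of that, here’s a quick sanity check (a sketch, not part of the original example; the random forest is an arbitrary choice) that trains the same model on the original features and on the anonymized features and compares cross-validated accuracy:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

model = RandomForestClassifier(random_state=0)

# Cross-validated accuracy on the original features...
print(cross_val_score(model, df, data.target, cv=5).mean())

# ...and on the anonymized (PCAed and scaled) features
print(cross_val_score(model, anon_df, data.target, cv=5).mean())

The two scores should come out comparable, which is what you’d expect given that no information was lost.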

Visualizing High-Dimensional Data

Yet another use for PCA is to reduce a dataset to two or three dimensions so that it can be plotted with libraries such as Matplotlib. You can’t plot a dataset that has 1,000 columns. You can plot a dataset that has two or three columns. The fact that PCA can reduce high-dimensional data to two or three dimensions while retaining much of the original information makes it a great tool for exploring data and visualizing relationships between classes.

Suppose you’re building a classification model and want to assess up front whether there is sufficient separation between classes to support such a model. Take the Optical Recognition of Handwritten Digits dataset built into Scikit, for example. Each digit in the dataset is represented by an 8 × 8 array of pixel values, meaning the dataset has 64 dimensions. If you could plot a 64-dimensional diagram, you might be able to inspect the dataset and look for separation between classes. But 64 dimensions is 61 too many for most humans.

Enter PCA. The following code loads the dataset, uses PCA to reduce it to two dimensions, and plots the result, with different colors representing different classes (digits):

from sklearn.decomposition import PCA
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
%matplotlib inline

digits = load_digits()
pca = PCA(n_components=2, random_state=0)
pca_digits = pca.fit_transform(digits.data)

plt.figure(figsize=(12, 8))
plt.scatter(pca_digits[:, 0], pca_digits[:, 1], c=digits.target,
            cmap=plt.cm.get_cmap('Paired', 10))
plt.colorbar(ticks=range(10))
plt.clim(-0.5, 9.5)

The resulting plot provides an encouraging sign that you might be able to train a classifier with the data. While there is clearly some overlap between classes, the different classes form rather distinct clusters. There is significant overlap between red (the digit 4) and light purple (the digit 6), indicating that a model might have some difficulty distinguishing between 4s and 6s. However, 0s and 1s lie at the top and bottom, while 3s and 4s fall on the far left and far right. A model would presumably be proficient at telling these digits apart:

You can better visualize relationships between classes with a 3D plot. The following code uses PCA to reduce the dataset to three dimensions and Matplotlib’s mplot3d toolkit to produce an interactive plot. Note that if you run this code in JupyterLab, you’ll probably have to change the first line to %matplotlib widget:

%matplotlib notebook
from mpl_toolkits.mplot3d import Axes3D

digits = load_digits()
pca = PCA(n_components=3, random_state=0)
pca_digits = pca.fit_transform(digits.data)

ax = plt.figure(figsize=(12, 8)).add_subplot(111, projection='3d')
ax.scatter(xs = pca_digits[:, 0], ys = pca_digits[:, 1], zs = pca_digits[:, 2],
           c=digits.target, cmap=plt.cm.get_cmap('Paired', 10))

You can rotate the resulting plot in 3D and look at it from different angles. Here we can see that there is more separation between 4s and 6s than was evident in two dimensions:

PCA isn’t the only way to reduce a dataset to two or three dimensions for plotting. You can also use Scikit’s Isomap class or its TSNE class. TSNE implements t-distributed stochastic neighbor embedding, or t-SNE for short. t-SNE is a dimensionality reduction algorithm that is used almost exclusively for visualizing high-dimensional data. Whereas PCA uses a linear function to transform data, t-SNE uses a nonlinear transform that tends to heighten the separation between classes by keeping similar data points close together in low-dimensional space. (PCA, by contrast, focuses on keeping dissimilar points far apart.) Here’s an example that plots the Digits dataset in two dimensions after reducing it with t-SNE:

%matplotlib inline
from sklearn.manifold import TSNE

digits = load_digits()
tsne = TSNE(n_components=2, init='pca', learning_rate='auto',
            random_state=0)
tsne_digits = tsne.fit_transform(digits.data)

plt.figure(figsize=(12, 8))
plt.scatter(tsne_digits[:, 0], tsne_digits[:, 1], c=digits.target,
            cmap=plt.cm.get_cmap('Paired', 10))
plt.colorbar(ticks=range(10))
plt.clim(-0.5, 9.5)

And here is the output:

t-SNE does a better job of separating groups of digits into clusters, indicating there are patterns in the data that machine learning can exploit. The chief drawback is that t-SNE is compute intensive, which means it can take a prohibitively long time to run on large datasets. One way to mitigate that is to run t-SNE on a subset of rows rather than the entire dataset. Another strategy is to use PCA to reduce the number of dimensions, and then subject the PCAed dataset to t-SNE.
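Here’s what that last strategy might look like: a sketch that first compresses the Digits dataset to 50 dimensions with PCA (50 is an arbitrary choice) and then hands the result to t-SNE:

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt

digits = load_digits()

# Compress to 50 dimensions with PCA before handing the data to t-SNE
pca_digits = PCA(n_components=50, random_state=0).fit_transform(digits.data)

tsne = TSNE(n_components=2, init='pca', learning_rate='auto', random_state=0)
tsne_digits = tsne.fit_transform(pca_digits)

plt.figure(figsize=(12, 8))
plt.scatter(tsne_digits[:, 0], tsne_digits[:, 1], c=digits.target,
            cmap=plt.cm.get_cmap('Paired', 10))
plt.colorbar(ticks=range(10))
plt.clim(-0.5, 9.5)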

Anomaly Detection

Anomaly detection is a branch of machine learning that seeks to identify anomalies in datasets or data streams. Airbus uses it to predict failures in jet engines and detect anomalies in telemetry data beamed down from the International Space Station. Credit card companies use it to detect credit card fraud. The goal of anomaly detection is to identify outliers in data—samples that aren’t “normal” when compared to others. In the case of credit card fraud, the assumption is that if transactions are subjected to an anomaly detection algorithm, fraudulent transactions will show up as anomalous, while legitimate transactions will not.

There are many ways to perform anomaly detection. They go by names such as isolation forests, one-class SVMs, and local outlier factor (LOF). Most rely on unsupervised learning methods and therefore do not require labeled data. They simply look at a collection of samples and determine which ones are anomalous. Unsupervised anomaly detection is particularly interesting because it doesn’t require a priori knowledge of what constitutes an anomaly, nor does it require a dataset to be meticulously labeled.

One of the most popular forms of anomaly detection relies on principal component analysis. You already know that PCA can be used to reduce data from m dimensions to n, and that a PCA transform can be inverted to restore the original m dimensions. You also know that inverting the transform doesn’t recover the data that was lost when the transform was applied. The gist of PCA-based anomaly detection is that an anomalous sample should exhibit more loss or reconstruction error (the difference between the original data and the same data after a PCA transform is applied and inverted) than a normal one. In other words, the loss incurred when an anomalous sample is PCAed and un-PCAed should be higher than the loss incurred when the same operation is applied to a normal sample. Let’s see if this assumption holds up in the real world.

Using PCA to Detect Credit Card Fraud

Supervised learning isn’t the only option for detecting credit card fraud. Here’s an alternative approach that uses PCA-based anomaly detection to identify fraudulent transactions. Begin by loading the dataset, separating the samples by class into one dataset representing legitimate transactions and another representing fraudulent transactions, and dropping the Time and Class columns. If you didn’t download the dataset in Chapter 3, you can get it now from the ZIP file.

import pandas as pd

df = pd.read_csv('Data/creditcard.csv')
df.head()

# Separate the samples by class
legit = df[df['Class'] == 0]
fraud = df[df['Class'] == 1]

# Drop the "Time" and "Class" columns
legit = legit.drop(['Time', 'Class'], axis=1)
fraud = fraud.drop(['Time', 'Class'], axis=1)

Use PCA to reduce the two datasets from 29 to 26 dimensions, and then invert the transform to restore each dataset to 29 dimensions. The transform is fitted to legitimate transactions only because we need a baseline value for reconstruction error that allows us to discriminate between legitimate and fraudulent transactions. It is applied, however, to both datasets:

from sklearn.decomposition import PCA

pca = PCA(n_components=26, random_state=0)
legit_pca = pd.DataFrame(pca.fit_transform(legit), index=legit.index)
fraud_pca = pd.DataFrame(pca.transform(fraud), index=fraud.index)

legit_restored = pd.DataFrame(pca.inverse_transform(legit_pca),
                              index=legit_pca.index)

fraud_restored = pd.DataFrame(pca.inverse_transform(fraud_pca),
                              index=fraud_pca.index)

Some information was lost in the transition. Hopefully, the fraudulent transactions incurred more loss than the legitimate ones, and we can use that to differentiate between them. The next step is to compute the loss for each row in the two datasets by summing the squares of the differences between the values in the original rows and the restored rows:

import numpy as np

def get_anomaly_scores(df_original, df_restored):
    loss = np.sum((np.array(df_original) - np.array(df_restored)) ** 2, axis=1)
    loss = pd.Series(data=loss, index=df_original.index)
    return loss

legit_scores = get_anomaly_scores(legit, legit_restored)
fraud_scores = get_anomaly_scores(fraud, fraud_restored)

Now plot the losses incurred when the legitimate transactions were transformed and restored:

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

legit_scores.plot(figsize = (12, 6))

Here is the result:

Next, plot the losses for the fraudulent transactions:

fraud_scores.plot(figsize = (12, 6))

Here is the result:

The plots reveal that most of the rows in the dataset representing legitimate transactions incurred a loss of less than 200, while many of the rows in the dataset representing fraudulent transactions incurred a loss greater than 200. Separate the rows on this basis—classifying transactions with a loss of less than 200 as legitimate and transactions with a higher loss as fraudulent—and use a confusion matrix to visualize the results:

threshold = 200

true_neg = legit_scores[legit_scores < threshold].count()
false_pos = legit_scores[legit_scores >= threshold].count()
true_pos = fraud_scores[fraud_scores >= threshold].count()
false_neg = fraud_scores[fraud_scores < threshold].count()

labels = ['Legitimate', 'Fraudulent']
mat = [[true_neg, false_pos], [false_neg, true_pos]]

sns.heatmap(mat, square=True, annot=True, fmt='d', cbar=False, cmap='Blues',
            xticklabels=labels, yticklabels=labels)

plt.xlabel('Predicted label')
plt.ylabel('True label')

Here is the result:

The results aren’t quite as good as they were with the random forest, but the model still caught about 50% of the fraudulent transactions while mislabeling just 76 out of 284,315 legitimate transactions. That’s an error rate of less than 0.03% for legitimate transactions, compared to 0.007% for the supervised learning model.

Two parameters in this model drive the error rate: the number of dimensions the datasets were reduced to (26), and the threshold chosen to distinguish between legitimate and fraudulent transactions (200). You can tweak the accuracy by experimenting with different values. I did some informal testing and concluded that this was a reasonable combination. Picking a lower threshold improves the model’s ability to identify fraudulent transactions, but at the cost of misclassifying more legitimate transactions. In the end, you have to decide what error rate you’re willing to live with, keeping in mind that declining a legitimate credit card purchase is likely to anger a customer.
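If you’d like to explore that trade-off yourself, here’s a sketch (the candidate thresholds are arbitrary) that tallies the fraudulent transactions caught and the legitimate transactions misclassified at several threshold values:

for threshold in [50, 100, 200, 400]:
    caught = fraud_scores[fraud_scores >= threshold].count()
    false_alarms = legit_scores[legit_scores >= threshold].count()
    print(f'Threshold {threshold}: caught {caught} of {len(fraud_scores)} '
          f'fraudulent transactions with {false_alarms} false alarms')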

Using PCA to Predict Bearing Failure

One of the classic uses for anomaly detection is to predict failures in rotating machinery. Let’s apply PCA-based anomaly detection to a subset of a dataset published by NASA to predict failures in bearings. The dataset contains vibration data for four bearings supporting a rotating shaft with a radial load of 6,000 pounds applied to it. The bearings were run to failure, and vibration data was captured by high-sensitivity quartz accelerometers at regular intervals until failure occurred.

First, download the CSV file containing the subset that I culled from the larger NASA dataset. Then create a Jupyter notebook and load the data:

import pandas as pd

df = pd.read_csv('Data/bearings.csv', index_col=0, parse_dates=[0])
df.head()

Here are the first five rows in the dataset:

The dataset contains 984 samples. Each sample contains vibration data for four bearings, and the samples were taken 10 minutes apart. Plot the vibration data for all four bearings as a time series:

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

df.plot(figsize = (12, 6))

Here is the output:

About four days into the test, vibrations in bearing 1 began increasing. They spiked a day later, and about two days after that, bearing 1 suffered a catastrophic failure. Our goal is to build a model that recognizes increased vibration in any bearing as a sign of impending failure, and to do it without a labeled dataset.

The next step is to extract samples representing “normal” operation from the dataset (x_train in the following code) and reduce four dimensions to one using PCA—essentially combining the data from all four bearings. Then apply the same PCA transform to the remainder of the dataset (x_test), combine the two partial datasets, and plot the result:

from sklearn.decomposition import PCA

x_train = df['2004-02-12 10:32:39':'2004-02-13 23:42:39']
x_test = df['2004-02-13 23:52:39':]

pca = PCA(n_components=1, random_state=0)
x_train_pca = pd.DataFrame(pca.fit_transform(x_train))
x_train_pca.index = x_train.index

x_test_pca = pd.DataFrame(pca.transform(x_test))
x_test_pca.index = x_test.index

df_pca = pd.concat([x_train_pca, x_test_pca])
df_pca.plot(figsize = (12, 6))
plt.legend().remove()

The output is shown here:

Now invert the PCA transform and plot the “restored” dataset:

df_restored = pd.DataFrame(pca.inverse_transform(df_pca), index=df_pca.index)
df_restored.plot(figsize = (12, 6))

The results are as follows:

It is obvious that a loss was incurred by applying and inverting the transform. Let’s define a function that computes the loss in a range of samples, then apply that function to all of the samples in the original dataset and the restored dataset and plot the differences over time:

import numpy as np

def get_anomaly_scores(df_original, df_restored):
    loss = np.sum((np.array(df_original) - np.array(df_restored)) ** 2, axis=1)
    loss = pd.Series(data=loss, index=df_original.index)
    return loss

scores = get_anomaly_scores(df, df_restored)
scores.plot(figsize = (12, 6))

Here is the output:

The loss is very small when all four bearings are operating normally, but it begins to rise when one or more bearings exhibit greater-than-normal vibration. From the chart, it’s apparent that when the loss rises above a threshold value of approximately 0.002, that’s an indication a bearing might fail.

Now that you’ve selected a tentative loss threshold, you can use it to detect anomalous behavior in the bearings. Begin by defining a function that takes a sample and returns True or False indicating whether the sample is anomalous by applying and inverting a PCA transform, measuring the loss for each bearing, and comparing it to a specified loss threshold:

def is_anomaly(row, pca, threshold):
    # Apply and invert the PCA transform
    pca_row = pca.transform(row)
    restored_row = pca.inverse_transform(pca_row)

    # Compute the reconstruction loss for each bearing
    losses = np.sum((row - restored_row) ** 2)

    # Flag the sample as anomalous if any bearing's loss exceeds the threshold
    for loss in losses:
        if loss > threshold:
            return True

    return False

Apply the function to a row early in the time series that represents normal behavior and confirm that it returns False:

x = df.loc[['2004-02-16 22:52:39']]
is_anomaly(x, pca, 0.002)

Apply the function to a row later in the time series that represents anomalous behavior and confirm that it returns True:

x = df.loc[['2004-02-18 22:52:39']]
is_anomaly(x, pca, 0.002)

Now apply the function to all the samples in the dataset and shade anomalous samples red in order to visualize when anomalous behavior is detected:

df.plot(figsize = (12, 6))

for index, row in df.iterrows():
    if is_anomaly(pd.DataFrame([row]), pca, 0.002):
        plt.axvline(row.name, color='r', alpha=0.2)

Here is the output:

Repeat this procedure, but this time use a loss threshold of 0.0002 rather than 0.002:

df.plot(figsize = (12, 6))

for index, row in df.iterrows():
    if is_anomaly(pd.DataFrame([row]), pca, 0.0002):
        plt.axvline(row.name, color='r', alpha=0.2)

Here is the output:

You can adjust the sensitivity of the model by adjusting the threshold value used to detect anomalies. Using a loss threshold of 0.002 predicts bearing failure about two days before it occurs, while a loss threshold of 0.0002 predicts the failure about three days before. You typically want to choose a loss threshold that predicts failure as early as possible without raising false alarms.
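One way to quantify that trade-off is to compare how early each threshold would have raised an alarm. Here’s a rough sketch that treats the final timestamp in the dataset as the failure time (an approximation, since the bearings were run to failure) and reports the warning each threshold provides:

# Treat the last timestamp in the dataset as the (approximate) failure time
failure_time = scores.index[-1]

for threshold in [0.002, 0.0002]:
    first_alarm = scores[scores > threshold].index[0]
    print(f'Threshold {threshold}: first alarm at {first_alarm}, '
          f'roughly {failure_time - first_alarm} before failure')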

Multivariate Anomaly Detection

Could we have predicted failure in the preceding example by simply monitoring individual bearings? Perhaps. But what if impending failure is indicated by marginally elevated vibrations in two bearings rather than just one? Engineers frequently find that it isn’t individual sensors but a combination of readings from several sensors that signal impending trouble. These readings may come from sensors of different types: temperature sensors and pressure gauges in automotive and aerospace applications, for example, or heart monitors and blood pressure monitors in health-care applications. Reducing the number of dimensions to one with PCA is an attempt to capture relationships between data emanating from individual sensors and treat the readings systemically, a technique known as multivariate anomaly detection.

One limitation of using PCA to detect anomalies in multivariate systems is that because it uses linear transforms, PCA is better at modeling linear relationships between variables than nonlinear relationships. Neural networks, by contrast, excel at modeling nonlinear data. That’s the primary reason why state-of-the-art multivariate anomaly detection today commonly relies on deep learning.

As the number of variables increases, so too does the challenge of modeling the interdependencies between them. It is not uncommon for overall system health to be determined by dozens of otherwise independent variables. In September 2020, a team of researchers at Microsoft and Peking University published a paper titled “Multivariate Time-series Anomaly Detection via Graph Attention Network” that proposed a novel architecture for multivariate anomaly detection. It combines two deep-learning models: one that relies on prediction error and another that relies on reconstruction error. Microsoft uses this architecture in its Azure Multivariate Anomaly Detector service, which can model dependencies between up to 300 independent data sources and is used by companies such as Airbus and Siemens to detect irregularities in space-station telemetry and to test medical devices before they’re sent to market. The Azure Multivariate Anomaly Detector service is part of Azure Cognitive Services, which is covered in Chapter 14.

Summary

Principal component analysis is a technique for reducing the number of dimensions in a dataset without incurring a commensurate loss of information. It enjoys a number of uses in machine learning, including visualizing high-dimensional data, anonymizing data, reducing noise, and increasing the ratio of rows to columns by reducing the number of dimensions. It can also be used to perform anomaly detection by measuring the loss incurred when a PCA transform is applied and then inverted. Anomalous samples tend to incur more loss.

When I teach classes, I often introduce PCA as “the best-kept secret in machine learning.” It shouldn’t remain a secret, because it’s an indispensable tool in the hands of machine learning engineers. Now that you know about it, I can just about guarantee that you’ll find ways to put it to work.
