Chapter 1. Unsupervised Machine Learning

In this chapter, you will learn how to apply unsupervised learning techniques to identify patterns and structure within datasets.

Unsupervised learning techniques are a valuable set of tools for exploratory analysis. They bring out patterns and structure within datasets, which may be informative in themselves or serve as a guide to further analysis. It's critical to have a solid set of unsupervised learning tools that you can apply to break unfamiliar or complex datasets down into actionable information.

We'll begin by reviewing Principal Component Analysis (PCA), a fundamental data manipulation technique with a range of dimensionality reduction applications. Next, we will discuss k-means clustering, a widely used and approachable unsupervised learning technique. Then, we will discuss Kohonen's Self-Organizing Map (SOM), a method of topological clustering that enables the projection of complex datasets into two dimensions.

Throughout the chapter, we will spend some time discussing how to effectively apply these techniques to make high-dimensional datasets readily accessible. We will use the UCI Handwritten Digits dataset to demonstrate technical applications of each algorithm. In the course of discussing and applying each technique, we will review practical applications and methodological questions, particularly regarding how to calibrate and validate each technique as well as which performance measures are valid. To recap, then, we will be covering the following topics in order:

  • Principal component analysis
  • k-means clustering
  • Self-organizing maps

Principal component analysis

In order to work effectively with high-dimensional datasets, it is important to have a set of techniques that can reduce this dimensionality down to manageable levels. The advantages of this dimensionality reduction include the ability to plot multivariate data in two dimensions, capture the majority of a dataset's informational content within a minimal number of features, and, in some contexts, identify collinear model components.

Note

For those in need of a refresher, collinearity in a machine learning context refers to model features that share an approximately linear relationship. For reasons that are probably obvious, such features tend to be unhelpful, as neither is likely to add much information beyond what the other already provides. Moreover, collinear features may emphasize local minima or other false leads.
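As a quick illustration, near-collinear features can often be spotted by inspecting pairwise correlations. The following is a minimal sketch using synthetic data; the feature construction and the 0.95 threshold are arbitrary choices for the example, not part of the dataset used later in this chapter:

import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 2.0 * x1 + rng.normal(scale=0.05, size=500)  # approximately linear in x1
x3 = rng.normal(size=500)                         # unrelated feature

features = np.column_stack([x1, x2, x3])
corr = np.corrcoef(features, rowvar=False)

# Flag feature pairs whose absolute correlation exceeds the chosen threshold
threshold = 0.95
for i in range(corr.shape[0]):
    for j in range(i + 1, corr.shape[1]):
        if abs(corr[i, j]) > threshold:
            print('features %d and %d look collinear (r = %.3f)' % (i, j, corr[i, j]))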

Probably the most widely-used dimensionality reduction technique today is PCA. As we'll be applying PCA in multiple contexts throughout this book, it's appropriate for us to review the technique, understand the theory behind it, and write Python code to effectively apply it.

PCA – a primer

PCA is a powerful decomposition technique; it allows one to break down a highly multivariate dataset into a set of orthogonal components. When taken together in sufficient number, these components can explain almost all of the dataset's variance. In essence, these components deliver an abbreviated description of the dataset. PCA has a broad set of applications and its extensive utility makes it well worth our time to cover.

Note

Note the slightly cautious phrasing here—a given set of components of length less than the number of variables in the original dataset will almost always lose some amount of the information content within the source dataset. This lossiness is typically minimal, given enough components, but in cases where small numbers of principal components are composed from very high-dimensional datasets, there may be substantial lossiness. As such, when performing PCA, it is always appropriate to consider how many components will be necessary to effectively model the dataset in question.

PCA works by successively identifying the axes of greatest variance in a dataset (the principal components). It does this as follows:

  1. Identifying the center point of the dataset.
  2. Calculating the covariance matrix of the data.
  3. Calculating the eigenvectors of the covariance matrix.
  4. Orthonormalizing the eigenvectors.
  5. Calculating the proportion of variance represented by each eigenvector.

Let's unpack these concepts briefly:

  • Covariance is effectively variance applied to multiple dimensions; it is the variance between two or more variables. While a single value can capture the variance in one dimension or variable, it is necessary to use a 2 x 2 matrix to capture the covariance between two variables, a 3 x 3 matrix to capture the covariance between three variables, and so on. So the first step in PCA is to calculate this covariance matrix.
  • An eigenvector is a vector that is specific to a dataset and a linear transformation. Specifically, it is a vector whose direction does not change when that transformation is applied. To get a better feeling for how this works, imagine that you're holding a rubber band, straight, between both hands, and that you stretch the band out until it is taut. The eigenvector is the vector whose direction does not change from before the stretch to during it; in this case, it's the vector running directly through the center of the band from one hand to the other.
  • Orthogonalization is the process of transforming vectors so that they are orthogonal (at right angles) to one another. In an n-dimensional data space, orthogonalization takes a set of vectors and yields a set of mutually orthogonal vectors.
  • Orthonormalization is an orthogonalization process that also normalizes the product.
  • The eigenvalue (roughly corresponding to the length of its eigenvector) is used to calculate the proportion of variance represented by each eigenvector. This is done by dividing the eigenvalue of each eigenvector by the sum of the eigenvalues of all the eigenvectors.

In summary, the covariance matrix is used to calculate eigenvectors. An orthonormalization process is undertaken that produces orthogonal, normalized vectors from the eigenvectors. The eigenvector with the greatest eigenvalue is the first principal component, with successive components having progressively smaller eigenvalues. In this way, the PCA algorithm re-expresses the dataset in a new coordinate system; by retaining only the leading components, we obtain a lower-dimensional representation that preserves most of the dataset's variance.
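To make these steps concrete, the following is a minimal sketch of PCA from first principles using numpy. The function and variable names are illustrative only; in practice we will rely on scikit-learn's PCA implementation, as we do in the next section:

import numpy as np

def pca_sketch(data, n_components):
    # Step 1: center the dataset around its mean point
    centered = data - data.mean(axis=0)

    # Step 2: calculate the covariance matrix of the centered data
    cov = np.cov(centered, rowvar=False)

    # Steps 3 and 4: eigendecompose the covariance matrix; for a symmetric
    # matrix, eigh returns an orthonormal set of eigenvectors
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # Order the components by descending eigenvalue
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

    # Step 5: proportion of variance represented by each eigenvector
    explained_variance_ratio = eigenvalues / eigenvalues.sum()

    # Project the data onto the leading components
    transformed = centered @ eigenvectors[:, :n_components]
    return transformed, explained_variance_ratio[:n_components]

# Example usage on random data:
# reduced, ratios = pca_sketch(np.random.rand(200, 6), n_components=2)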

Employing PCA

Now that we've reviewed the PCA algorithm at a high level, we're going to jump straight in and apply PCA to a key Python dataset—the UCI handwritten digits dataset, distributed as part of scikit-learn.

This dataset is composed of 1,797 instances of handwritten digits gathered from 44 different writers. The input (pressure and location) from these writers is resampled twice across an 8 x 8 grid so as to yield maps of the kind shown in the following image:

[Image: sample 8 x 8 handwritten digit maps]

These maps can be transformed into feature vectors of length 64, which are then readily usable as analysis input. With an input dataset of 64 features, there is an immediate appeal to using a technique like PCA to reduce the set of variables to a manageable number. As it currently stands, we cannot effectively explore the dataset through visualization!

We will begin applying PCA to the handwritten digits dataset with the following code:

import numpy as np
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
import matplotlib.cm as cm

digits = load_digits()
data = digits.data

n_samples, n_features = data.shape
n_digits = len(np.unique(digits.target))
labels = digits.target

This code does several things for us:

  1. First, it loads up a set of necessary libraries: numpy, several components from scikit-learn (the digits dataset itself, PCA, and data scaling functions), and the plotting capability of matplotlib.
  2. The code then begins preparing the digits dataset. It does several things in order:
    • First, it loads the dataset before creating helpful variables
    • The data variable is created for subsequent use, and the number of distinct digits in the target vector (0 through to 9, so n_digits = 10) is saved as a variable that we can easily access for subsequent analysis
    • The target vector is also saved as labels for later use
    • All of this variable creation is intended to simplify subsequent analysis
  3. With the dataset ready, we can initialize our PCA algorithm and apply it to the dataset:
    pca = PCA(n_components=10)
    data_r = pca.fit(data).transform(data)
    
    print('explained variance ratio (first ten components): %s' % str(pca.explained_variance_ratio_))
    print('sum of explained variance (first ten components): %s' % str(sum(pca.explained_variance_ratio_)))
  4. This code outputs the variance explained by each of the first ten principal components ordered by explanatory power.

In the case of this set of 10 principal components, they collectively explain 0.589 of the overall dataset variance. This isn't actually too bad, considering that it's a reduction from 64 variables to 10 components. It does, however, illustrate the potential lossiness of PCA. The key question, though, is whether this reduced set of components makes subsequent analysis or classification easier to achieve; that is, whether the variance discarded along with the remaining components was largely noise that would only have disrupted classification attempts.
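Continuing with the variables defined in the earlier listing, one simple way of judging how many components are necessary is to inspect the cumulative explained variance as more components are retained. The 90% threshold below is an arbitrary choice for illustration:

# Fit PCA with all 64 components to examine the full variance profile
pca_full = PCA(n_components=64)
pca_full.fit(data)

cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Number of components needed to explain at least 90% of the variance
n_needed = np.argmax(cumulative >= 0.9) + 1
print('components needed for 90%% of variance: %d' % n_needed)

plt.figure()
plt.plot(range(1, len(cumulative) + 1), cumulative)
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.show()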

Having created a data_r object containing the output of pca performed over the digits dataset, let's visualize the output. To do so, we'll first create a vector of colors for class coloration. We then simply create a scatterplot with colorized classes:

plt.figure()
colors = cm.rainbow(np.linspace(0, 1, n_digits))
for c, i in zip(colors, range(n_digits)):
    # Plot each digit class in the first two principal components
    plt.scatter(data_r[labels == i, 0], data_r[labels == i, 1],
                color=c, alpha=0.4, label=str(i))
plt.legend()
plt.title('Scatterplot of Points plotted in first 10 Principal Components')
plt.show()

The resulting scatterplot looks as follows:

[Image: scatterplot of the digits data in the first two principal components, colored by class]

This plot shows us that, while there is some separation between classes in the first two principal components, it may be tricky to achieve high classification accuracy with this dataset. However, the classes do appear to form clusters, so we may be able to get reasonably good results by employing a clustering analysis. In this way, PCA has given us some insight into how the dataset is structured and has informed our subsequent analysis.

At this point, let's take this insight and move on to examine clustering by the application of the k-means clustering algorithm.
