6

Clustering and Unsupervised Models

In this chapter, we are going to introduce some fundamental clustering algorithms and discuss their strengths and weaknesses. The field of unsupervised learning, like any other machine learning approach, must always be guided by the concept of Occam's razor. Simplicity must always be preferred, so long as the performance of the model meets your requirements.

However, in this case, the ground truth can be unknown. When a clustering algorithm is adopted as an exploratory tool, we can only assume that the dataset represents a precise data-generating process. If this assumption is correct, the best strategy is to determine the number of clusters to maximize the internal cohesion (denseness) and the external separation. This means that we expect to find blobs (or isles) whose samples share some common and partially unique features.

In particular, the algorithms and the topics we are going to analyze are:

  • K-Nearest Neighbors (KNN), based on k-dimensional (k-d) trees and ball trees
  • K-means and k-means++
  • Evaluation of clustering models

We can now start analyzing one of the simplest approaches to clustering data, considering the pros and cons and how it's possible to improve your results.

K-nearest neighbors

This algorithm belongs to a particular family called instance-based algorithms (the methodology is called instance-based learning).

It differs from other approaches because it doesn't work with an actual mathematical model. On the contrary, the inference is performed by direct comparison of new samples with existing ones (which are defined as instances). KNN is an approach that can be easily employed to solve clustering, classification, and regression problems (even though, in this case, we are going to consider only the first technique). The main idea behind the clustering algorithm is very simple. Let's consider a data-generating process $p_{data}$ and a finite dataset drawn from this distribution:

$X = \{\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_M\} \quad \text{where } \bar{x}_i \in \mathbb{R}^N$

Each point has a dimensionality equal to N. We can now introduce a distance function $d_p(\bar{x}_1, \bar{x}_2)$, which in most cases can be generalized with the Minkowski distance:

$d_p(\bar{x}_1, \bar{x}_2) = \left( \sum_{j=1}^{N} \left| x_1^{(j)} - x_2^{(j)} \right|^p \right)^{\frac{1}{p}} \quad \text{with } p \ge 1$

When p = 2, $d_p$ represents the classical Euclidean distance, which is normally the default choice in almost any scenario. In particular cases, it can be helpful to employ other variants, such as p = 1 (which is also known as the Manhattan distance) or p > 2. Even if all the properties of a metric function remain unchanged, different values of p yield results that can be semantically diverse. As an example, we can consider the distance between the points (0, 0) and (15, 10) as a function of p:

Minkowski distance between (0, 0) and (15, 10) as a function of parameter p

The distance decreases monotonically with p and converges to the largest component absolute difference when $p \to \infty$. Therefore, whenever it's important to weight all the components in the same way in order to have a consistent metric, small values of p are preferable (for example, p = 1 or 2). This result has also been studied and formalized by Aggarwal, Hinneburg, and Keim (in Aggarwal C. C., Hinneburg A., Keim D. A., On the Surprising Behavior of Distance Metrics in High Dimensional Space, ICDT, 2001), who proved a fundamental inequality.

If we consider a generic distribution G of M points $\bar{x}_i \in \mathbb{R}^d$, a distance function based on the $L_p$ norm, and the maximum and minimum distances, $D_{max}^p$ and $D_{min}^p$ (computed using the $L_p$ norm), between the points drawn from G and the origin, the following inequality holds:

$C_p \le \lim_{d \to \infty} E\left[ \frac{D_{max}^p - D_{min}^p}{d^{\frac{1}{p} - \frac{1}{2}}} \right] \le (M - 1)\,C_p$

It's clear that when the input dimensionality is very high and p >> 2, the expected value of $D_{max}^p - D_{min}^p$ becomes bounded between $C_p\, d^{\frac{1}{p}-\frac{1}{2}}$ and $(M-1)\,C_p\, d^{\frac{1}{p}-\frac{1}{2}}$, a quantity that vanishes as d grows, reducing the actual discriminative effect of almost any distance. In fact, given two generic pairs of points $(\bar{x}_a, \bar{x}_b)$ and $(\bar{x}_c, \bar{x}_d)$ drawn from G, the natural consequence of the previous inequality is that $d_p(\bar{x}_a, \bar{x}_b) \approx d_p(\bar{x}_c, \bar{x}_d)$ when $d \to \infty$, independently of their relative positions. This important result confirms the importance of choosing the right metric according to the dimensionality of the dataset: p = 1 is the best choice when d >> 1, while p >> 1 can produce inconsistent results due to the ineffectiveness of the metric. To see direct confirmation of this phenomenon, it's possible to run the following snippet, which computes the average difference between the maximum and minimum distances considering 100 sets containing 100 data points drawn from a uniform distribution, $U(0, 1)$. In the snippet, the cases d = 5, 10, 15, 20 are analyzed with Minkowski metrics with p = 1, 2, 5, 10 (the final values depend on the random seed and on how many times the experiment is repeated):

import numpy as np
from scipy.spatial.distance import pdist
nb_samples = 100
nb_bins = 100
def max_min_mean(p=1.0, d=2):
    # Generate nb_bins sets, each with nb_samples d-dimensional
    # points drawn from a uniform distribution U(0, 1)
    Xs = np.random.uniform(0.0, 1.0,
                           size=(nb_bins, nb_samples, d))
    pd_max = np.zeros(shape=(nb_bins,))
    pd_min = np.zeros(shape=(nb_bins,))
    for i in range(nb_bins):
        # Pairwise Minkowski distances within each set
        pd = pdist(Xs[i], metric='minkowski', p=p)
        pd_max[i] = np.max(pd)
        pd_min[i] = np.min(pd)
    # Average difference between the maximum and minimum distances
    return np.mean(pd_max - pd_min)
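
The function above only defines the experiment. A possible driver loop reproducing the cases mentioned in the text (assuming the previous snippet has been executed; the seed is an arbitrary choice) is:

np.random.seed(1000)

for d in (5, 10, 15, 20):
    for p in (1.0, 2.0, 5.0, 10.0):
        print("d={}, p={}: avg(max - min) = {:.3f}".format(
            d, p, max_min_mean(p=p, d=d)))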

The result for different values grouped by metric value p is shown in the following figure:

Average difference between the maximum and minimum Minkowski distances grouped by p

A particular case, which is a direct consequence of the previous inequality, arises when the largest absolute difference between components must determine the most important factor of a distance; in this situation, large values of p can be employed. For example, if we consider the three points $\bar{x}_1 = (0, 0)$, $\bar{x}_2 = (15, 10)$, and $\bar{x}_3 = (15, 0)$, then $d_2(\bar{x}_1, \bar{x}_2) \approx 18$ and $d_2(\bar{x}_1, \bar{x}_3) = 15$. So, if we set a threshold at d = 16 centered at $\bar{x}_1$, the point $\bar{x}_2$ is outside the boundaries. If instead p = 15, both distances become close to 15 and the two points $\bar{x}_2$ and $\bar{x}_3$ are inside the boundaries. A particular use of large values of p is when it's important to take into account the inhomogeneity among components.

For example, some feature vectors can represent the age and height of a set of people. Considering a test person $\bar{x} = (30, 175)$, with large p values, the distances between $\bar{x}$ and the two samples (35, 150) and (25, 151) are almost identical (about 25.0), and the only dominant factor becomes the height difference (independent of the age).
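
As a quick check, the following sketch computes the two distances for increasing values of p using SciPy (the test person (30, 175) is the illustrative value used above):

import numpy as np
from scipy.spatial.distance import minkowski

# Test person (age, height) and the two samples from the example
x = np.array([30.0, 175.0])
a = np.array([35.0, 150.0])
b = np.array([25.0, 151.0])

for p in (1, 2, 10, 50):
    print("p={}: d(x, a)={:.2f}, d(x, b)={:.2f}".format(
        p, minkowski(x, a, p), minkowski(x, b, p)))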

The KNN algorithm determines the k closest samples to each training point. When a new sample is presented, the procedure is repeated with two possible variants:

  • With a predefined value of k, the k nearest neighbors are computed.
  • With a predefined radius/threshold r, all the neighbors whose distance is less than or equal to the radius are computed.

The philosophy of KNN is that similar samples can share their features. For example, a recommendation system can cluster users using this algorithm and, given a new user, find the most similar ones (based, for example, on the products they bought) to recommend the same category of items. In general, a similarity function is defined as the reciprocal of a distance (there are some exceptions, such as the cosine similarity, which doesn't depend on p):

$s(\bar{x}_i, \bar{x}_j) = \frac{1}{d_p(\bar{x}_i, \bar{x}_j)}$

Two different users, A and B, who are classified as neighbors, will differ under some viewpoints, but at the same time, they will share some peculiar features. This observation allows us to increase the homogeneity of the neighborhood by suggesting the differences to each member. For example, if A liked book b1 and B liked b2, we can recommend b1 to B and b2 to A. If our hypothesis was correct, the similarity between A and B will be increased; otherwise, the two users will move toward other clusters that better represent their behavior.

Unfortunately, the vanilla algorithm (in scikit-learn, it is called the brute-force algorithm) can become extremely slow with a large number of samples because it's necessary to compute all the pairwise distances in order to answer any query. With M points, this number is proportional to M² (if M = 1,000, about a million distances are needed), which is often unacceptable. More precisely, as the computation of a distance in an N-dimensional space requires N operations, the total complexity becomes O(NM²), which can be reasonable only for small values of both M and N. That's why some important strategies, such as k-d trees and ball trees, have been implemented to reduce the computational complexity.

K-d trees

As all KNN queries can be considered search problems, one of the most efficient ways to reduce the overall complexity is to reorganize the dataset into a tree structure. In a binary tree (one-dimensional data), the average computational complexity of a query is O(log M) because we assume we have almost the same number of elements in each branch (if the tree is completely unbalanced, all the elements are inserted sequentially and the resulting structure has a single branch, so the complexity becomes O(M)). In general, the real complexity is slightly higher than O(log M) because of the unbalancing of the tree, but the operation is always much more efficient than a vanilla search over all the pairwise distances, which is O(M²).

However, we normally work with N-dimensional data and the previous structure cannot be immediately employed. K-d trees extend the concept of binary trees for N > 1. In this case, a split cannot be immediately performed, and a different strategy must be chosen. The easiest way to solve this problem is to select a feature at each level (1, 2, …, N) and repeat the process until the desired depth is reached. In the following diagram, there's an example of k-d trees with three-dimensional points:

Example of a three-dimensional k-d tree

The root is point (5, 3, 7). The first split is performed considering the first feature, so two children are (2, 1, 1) and (8, 4, 3). The second split operates on the second feature, and so on. The average computational complexity is O(N log M), but if the distribution is very asymmetric, the probability that the tree will become unbalanced is very high. To mitigate this issue, it's possible to split at the median of the (sub-)dataset along the selected feature and to continue with this criterion. In this way, the tree is guaranteed to be balanced. However, the average complexity is always proportional to the dimensionality and this can dramatically affect the performance.

For example, if M = 10,000 and N = 10, using the base-10 logarithm, O(N log M) = O(40), while with N = 1,000, the complexity becomes O(4,000). Generally, k-d trees suffer from the curse of dimensionality and when N becomes large, the average complexity is about O(MN), which is always better than the vanilla algorithm, but often too expensive for real-life applications. Therefore, k-d trees are only really effective when the dimensionality is not very high. In all other cases, the probability of having an unbalanced tree and the resulting computational complexity suggest employing a different method.
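
Scikit-learn exposes this structure directly through the KDTree class. The following minimal sketch (with randomly generated data, purely for illustration) builds a tree and queries the 10 nearest neighbors of a sample:

import numpy as np
from sklearn.neighbors import KDTree

# Random dataset: 10,000 points with 10 features
X = np.random.uniform(0.0, 1.0, size=(10000, 10))

kdt = KDTree(X, leaf_size=30)

# Distances and indices of the 10 nearest neighbors of the first sample
distances, neighbors = kdt.query(X[0].reshape(1, -1), k=10)
print(neighbors[0])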

Ball trees

An alternative to k-d trees is provided by ball trees. The idea is to rearrange the dataset in a way that is almost insensitive to high-dimensional samples. A ball is defined as a set of points whose distance from a center sample is less than or equal to a fixed radius:

$B_R(\bar{c}) = \{\bar{x} : d(\bar{x}, \bar{c}) \le R\}$

Starting from the first main ball, it's possible to build smaller ones nested into the parent ball and stop the process when the desired depth has been reached. A fundamental condition is that a point can always belong to a single ball. In this way, considering the cost of the N-dimensional distance, the computational complexity is O(N log M) and doesn't suffer from the curse of dimensionality in the way that k-d trees do. The structure is based on hyperspheres whose boundaries are defined by the following condition (given a center point $\bar{c}_i$ and a radius $R_i$):

$B_i = \{\bar{x} : d(\bar{x}, \bar{c}_i) \le R_i\}$

Therefore, the only operation needed to find the right ball is measuring the distance between a sample and the centers starting from the smallest balls. If a point is outside the ball, it's necessary to move upward and check the parents, until the ball containing the sample is found.

In the following diagram, there's an example of ball trees with two levels:

Example of ball trees with seven bidimensional points and two levels

In this example, the seven bidimensional points are split first into two balls containing three and four points. At the second level, the second ball is split again into two smaller balls containing two points each. This procedure can be repeated until a fixed depth is reached or by imposing a maximum number of elements that a leaf can contain (in this case, it can be set equal to 3).
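
Scikit-learn also exposes this structure through the BallTree class. A minimal sketch (random data and an illustrative radius) that performs a radius query is:

import numpy as np
from sklearn.neighbors import BallTree

# Random dataset: 10,000 points with 64 features
X = np.random.uniform(0.0, 1.0, size=(10000, 64))

bt = BallTree(X, leaf_size=25)

# Indices of all the samples whose distance from the first point
# is less than or equal to r = 3.5
neighbors = bt.query_radius(X[0].reshape(1, -1), r=3.5)
print(len(neighbors[0]))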

Fitting a KNN model

Both k-d trees and ball trees can be efficient structures to reduce the complexity of KNN queries. However, when fitting a model, it's important to consider both the k parameter (which normally represents the average or the standard number of neighbors computed in a query) and the maximum tree depth. These particular structures are not employed for common tasks (such as sorting) and their efficiency is maximized when all the requested neighbors can be found in the same sub-structure (with a size of K << M to avoid an implicit fallback to the vanilla algorithm). In other words, the tree has the role of reducing the dimensionality of the search space by partitioning it into reasonably small regions.

At the same time, if the number of samples contained in a leaf is small, the number of tree nodes grows and the complexity is subsequently increased. The negative impact is doubled because on average it's necessary to explore more nodes, and if k is much greater than the number of elements contained in a node, it's necessary to merge the samples belonging to different nodes. On the other hand, a very large number of samples per node leads to a condition that is close to the vanilla algorithm.

For example, if M = 1,000 and each node contains 250 elements, once the right node is found, the number of distances to compute is comparable with the initial dataset size and no real advantage is achieved by employing a tree structure. An acceptable practice is to set the size of a leaf equal to 5 to 10 times the average value of k to maximize the probability of finding all the neighbors inside the same leaf. However, every specific problem must be analyzed (while also benchmarking the performance) in order to find the most appropriate value. If different values of k are necessary, it's important to consider the relative frequencies of the queries. For example, if a program needs 10 5-NN queries and 1 50-NN query, it's probably better to set a leaf size equal to 25, even if the 50-NN query will be more expensive. In fact, setting a leaf size tailored to the second query (for example, 200) will dramatically increase the complexity of the first 10 queries, leading to a loss of performance.

In this context, we are discussing KNN as an unsupervised algorithm. However, it can also be employed both in regression and classification scenarios. Most of the concepts discussed in the previous sections also apply when the problem needs a supervised approach. In particular, as the neighborhoods represent uniform domains, small numbers of neighbors lead to very low biases because, given a test sample, the values employed in computing the label (either categorical or continuous) are the ones of the most similar sample points. Obviously, such a low bias is the consequence of intrinsic overfitting, which leads to a naturally large variance. The problem can be mitigated by choosing larger neighborhoods.

This solution is a sort of regularization because the loss of precision is directly connected to an implicit and controlled linearization of the dataset. On the other hand, larger neighborhoods are more computationally expensive; therefore, their size is often (and wrongly) reduced. In our context, we are always making our decisions without considering the bias-variance trade-off because the scenarios are unsupervised. However, it's helpful for the reader to keep in mind that instance-based methods are generally harder to manage than parametric ones, as the advantage of obtaining a synthetic model is absent and the predictions are strongly influenced by the structure (including noisy points and outliers) of the dataset.
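
As a brief supervised aside, the following sketch (a synthetic dataset and arbitrary values of k, purely for illustration) shows how the neighborhood size acts as a regularizer in classification:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification problem
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=5, random_state=1000)

# Small k: low bias, high variance; large k: smoother decision regions
for k in (1, 5, 25, 100):
    knc = KNeighborsClassifier(n_neighbors=k)
    score = np.mean(cross_val_score(knc, X, y, cv=10))
    print("k={}: CV accuracy = {:.3f}".format(k, score))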

We can now create a complete Python example using the scikit-learn API.

Example of KNN with scikit-learn

In order to test the KNN algorithm, we are going to use the MNIST handwritten digit dataset provided directly by scikit-learn. It is made up of 1,797 8 × 8 grayscale images representing the digits from 0 to 9.

The first step is loading it and normalizing all the values to be bounded between 0 and 1:

import numpy as np
from sklearn.datasets import load_digits
digits = load_digits()
X_train = digits['data'] / np.max(digits['data'])

The dictionary "digits" contains both the images, digits['images'], and the flattened 64-dimensional arrays, digits['data']. Scikit-learn implements different classes (for example, it's possible to work directly with k-d trees and ball trees using the KDTree and BallTree classes) that can be used in the context of KNN (as clustering, classification, and regression algorithms). However, we're going to employ the main class, NearestNeighbors, which allows us to perform clustering and queries based either on the number of neighbors or on the radius of a ball centered on a sample:

from sklearn.neighbors import NearestNeighbors
knn = NearestNeighbors(n_neighbors=50,
                       leaf_size=30, 
                       algorithm='ball_tree')
knn.fit(X_train)

We have chosen to have a default number of neighbors equal to 50 and an algorithm based on a ball tree. The leaf size (leaf_size) parameter has been kept to its default value, equal to 30. We have also employed the default metric (Euclidean), but it's possible to change it using the metric and p parameters (where p is the order of the Minkowski metric). Scikit-learn supports all the metrics implemented by SciPy in the scipy.spatial.distance package (as not all metrics are compatible with k-d trees and ball trees, I invite the reader to check the official scikit-learn documentation). However, in the majority of cases, it's sufficient to use a Minkowski metric and adjust the value of p if the results are not acceptable with any number of neighbors. Other metrics, such as the cosine distance, can be employed when the similarity must not be affected by the Euclidean distance, but only by the angle between two vectors pointing at the samples. Applications that use this metric include, for example, deep learning models for natural language processing, where the words are embedded in feature vectors whose semantic similarity is measured through their cosine similarity.

In general, when the dimensionality is high, the cosine distance is an effective choice, but I invite the reader to carefully evaluate every scenario to make the most appropriate decision.
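
For example, a cosine-based model can be configured as follows (a sketch based on the X_train array defined above; note that, as ball trees and k-d trees don't support this metric, the brute-force algorithm must be selected):

from sklearn.neighbors import NearestNeighbors

knn_cos = NearestNeighbors(n_neighbors=50,
                           metric='cosine',
                           algorithm='brute')
knn_cos.fit(X_train)

In the rest of the example, we keep using the Euclidean-based knn model defined previously.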

We can now query the model in order to find 50 neighbors of a sample. For our purposes, we have selected the sample with index 100, which represents a 4 (the images have a very low resolution, but it's always possible to distinguish the digit):

Sample digit used to query the KNN model

The query can be performed using the instance method kneighbors, which allows specifying the number of neighbors (the n_neighbors parameter – the default is the value selected during the instantiation of the class) and whether we want to also get the distances of each neighbor (the return_distance parameter). In this example, we are also interested in evaluating how far the neighbors are from the center, so we set return_distance=True:

distances, neighbors = knn.kneighbors(
    X_train[100].reshape(1, -1),
    return_distance=True)
print(distances[0])

The output of the previous snippet is:

[ 0. 0.91215747  1.16926793  1.22633855 ...]

The first neighbor is always the center, so its distance is 0. The other ones range from 0.9 to 1.9. Considering that, in this case, the maximum possible distance is 8 (between a 64-dimensional vector and the null vector), the result could be acceptable. In order to get confirmation, we can plot the neighbors as bidimensional 8 × 8 arrays (the returned array, neighbors, contains the indexes of the samples).
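
A possible plotting snippet (the grid layout is an arbitrary choice) is:

import matplotlib.pyplot as plt

# Plot the 50 neighbors as 8 x 8 grayscale images
fig, ax = plt.subplots(5, 10, figsize=(10, 6))

for i, idx in enumerate(neighbors[0]):
    ax[i // 10, i % 10].imshow(X_train[idx].reshape(8, 8),
                               cmap='gray')
    ax[i // 10, i % 10].axis('off')

plt.show()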

The result is shown in the following figure:

50 neighbors selected by the KNN model

As it's possible to see, there are no errors, but all the shapes are slightly different. In particular, the last one, which is also the farthest, has a lot of white pixels (corresponding to the value 1.0), explaining the reason for a distance equal to about 2.0. I invite the reader to test the radius_neighbors method until spurious values appear among the results. It's also interesting to try this algorithm with the Olivetti faces dataset, whose complexity is higher, and many more geometrical parameters can influence the similarity.

In this section, we have discussed the main concepts related to KNN, focusing our attention on pros and cons. We can now move on to another common clustering algorithm, K-means. We'll discuss its limitations and how to tune up the hyperparameters to obtain optimal performance.

K-means

When we discussed the Gaussian mixture algorithm, we defined it as soft K-means. The reason is that each cluster was represented by three elements: mean, variance, and weight. Each sample always belongs to all clusters with a probability provided by the Gaussian distributions. This approach can be very useful when it's possible to manage the probabilities as weights, but in many other situations, it's preferable to determine a single cluster per sample.

Such an approach is called hard clustering and K-means can be considered the hard version of a Gaussian mixture. In fact, when all variances $\Sigma_i \to 0$, the distributions degenerate to Dirac deltas $\delta(\bar{x} - \bar{\mu}_i)$, which represent perfect spikes centered at a specific point (even if they are not real functions but distributions). In this scenario, the only possibility to determine the most appropriate cluster is to find the shortest distance between a sample point and all the centers (from now on, we are going to call them centroids). This approach is also based on an important double principle that should be taken into account in every clustering algorithm. The clusters must be set up to maximize:

  • The intra-cluster cohesion
  • The inter-cluster separation

This means that we expect to label high-density regions that are well separated from each other. When this is not possible, the criterion must try to minimize the intra-cluster average distance between samples and centroid. This quantity is also called inertia and it's defined as:

$S = \sum_{j=1}^{k} \sum_{\bar{x}_i \in C_j} \left\| \bar{x}_i - \bar{\mu}_j \right\|_2^2$

Large levels of inertia imply low cohesion because there are probably too many points belonging to clusters whose centroids are too far away. The problem can be solved by minimizing the previous quantity. However, the computational complexity needed to find the global minimum is exponential (K-means belongs to the class of NP-hard problems). The alternative approach employed by the K-means algorithm, also known as Lloyd's algorithm, is iterative and starts by selecting k random centroids (in the next section, we're going to analyze a more efficient method) and adjusting them until their configuration becomes stable.

The dataset to cluster (with M data points) is represented as:

$X = \{\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_M\} \quad \text{where } \bar{x}_i \in \mathbb{R}^N$

An initial guess for the centroids could be:

$C^{(0)} = \{\bar{\mu}_1^{(0)}, \bar{\mu}_2^{(0)}, \ldots, \bar{\mu}_k^{(0)}\}$

There are no specific restrictions on the initial values. However, the choice can influence both the convergence speed and the minimum that is found.

The iterative procedure will loop over the dataset, computing the Euclidean distance between each sample $\bar{x}_i$ and each centroid $\bar{\mu}_j^{(t)}$ and assigning a cluster based on the criterion:

$c(\bar{x}_i) = \underset{j}{\operatorname{argmin}} \left\| \bar{x}_i - \bar{\mu}_j^{(t)} \right\|_2$

Once all the points have been clustered, the new centroids are computed:

$\bar{\mu}_j^{(t+1)} = \frac{1}{N_j} \sum_{\bar{x}_i \in C_j} \bar{x}_i$

The quantity $N_j$ represents the number of points belonging to cluster j. At this point, the inertia is recomputed, and the new value is compared with the previous one. The procedure will stop either after a fixed number of iterations or when the variations in the inertia become smaller than a predefined threshold. Lloyd's algorithm is very similar to a particular case of the EM algorithm. In fact, the first step of each iteration is the computation of an expectation (the centroid configuration), while the second step maximizes the intra-cluster cohesion by minimizing the inertia.

Before moving forward, it's important to understand the structure of the inertia and the limitations that K-means inherits from it. Let's suppose we have a very dense blob and another less dense (and possibly with larger variance) one. The internal summation of S is limited to the points assigned to each cluster. In our case, the formula becomes:

$S = \sum_{\bar{x}_i \in C_{dense}} \left\| \bar{x}_i - \bar{\mu}_{dense} \right\|_2^2 + \sum_{\bar{x}_i \in C_{sparse}} \left\| \bar{x}_i - \bar{\mu}_{sparse} \right\|_2^2$

As $C_{dense}$ contains a (much) larger set of points than $C_{sparse}$, the first term dominates the sum. Therefore, when S is minimized, there's a high chance of finding the optimal position of $\bar{\mu}_{dense}$, but a very low probability of obtaining a global optimum. In fact, the modifications to S due to the sparse blob become almost negligible and the algorithm may stop before having explored all the possibilities. When dealing with such scenarios, a possible solution is to up-sample the sparse blob by introducing n copies of the same points. This approach is equivalent to introducing a set of class weights in the computation of S to encode the prior knowledge about the geometrical structure:

$S_w = w_1 \sum_{\bar{x}_i \in C_{dense}} \left\| \bar{x}_i - \bar{\mu}_{dense} \right\|_2^2 + w_2 \sum_{\bar{x}_i \in C_{sparse}} \left\| \bar{x}_i - \bar{\mu}_{sparse} \right\|_2^2$

For example, in the previous scenario, w1 = 1 and w2 can be chosen as the ratio of points belonging to Cdense and Csparse (of course, if this is not known before the first training, it can be included after an analysis of the results).
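
Instead of physically duplicating the points, scikit-learn's KMeans also accepts a sample_weight parameter in fit(). The following sketch (synthetic blobs and purely illustrative weights) shows the idea:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# A dense blob (1,000 points) and a sparse one (100 points)
X_dense, _ = make_blobs(n_samples=1000, centers=[[0.0, 0.0]],
                        cluster_std=0.5, random_state=1000)
X_sparse, _ = make_blobs(n_samples=100, centers=[[5.0, 5.0]],
                         cluster_std=2.0, random_state=1000)
X = np.vstack([X_dense, X_sparse])

# Weight the sparse blob with the cardinality ratio (w1 = 1, w2 = 10)
weights = np.ones(shape=(X.shape[0],))
weights[1000:] = 1000.0 / 100.0

km = KMeans(n_clusters=2, random_state=1000)
km.fit(X, sample_weight=weights)
print(km.cluster_centers_)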

K-means can also be implemented with an incremental approach (formally known as mini-batch K-means). When the datasets are too large to fit into memory and no other solutions are feasible (such as a Dask or Spark cluster), it's possible to apply the same strategy to smaller batches with a slight modification of the algorithm. We are not covering all the details in this book (they are also available in Bonaccorso G., Hands-On Unsupervised Learning with Python, Packt Publishing, 2019), but it's not difficult to understand that the main problem is the incorrect allocation due to the fact that a part of the sample is not immediately available. For this reason, mini-batch K-means introduces a parametrized reallocation strategy to allow a point to be reassigned with a predefined sensitivity threshold (small values cause fluctuations, while larger ones yield sub-optimal final configurations). Even if the algorithm is less precise than standard K-means, it's possible to prove that the actual performance loss is generally extremely small, and it can therefore be employed in production scenarios without large risks.
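
A minimal sketch based on scikit-learn's MiniBatchKMeans (with a hypothetical synthetic dataset; batch_size and reassignment_ratio are illustrative values) is shown below:

import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# Hypothetical large dataset (in a real scenario, the chunks would
# be loaded incrementally from disk or from a stream)
X_large, _ = make_blobs(n_samples=100000, centers=10,
                        n_features=64, random_state=1000)

mbkm = MiniBatchKMeans(n_clusters=10,
                       batch_size=1024,
                       reassignment_ratio=0.01,
                       random_state=1000)

# partial_fit() processes one chunk at a time, so the whole dataset
# never needs to fit into memory at once
for i in range(0, X_large.shape[0], 1024):
    mbkm.partial_fit(X_large[i:i + 1024])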

The complete vanilla K-means algorithm (that is, the standard algorithm without any optimization or improvement) is as follows (a compact code sketch is provided after the list):

  • Set a maximum number of iterations Nmax
  • Set a tolerance Thr
  • Set the value of k (the number of expected clusters)
  • Initialize vector C(0) with random values. They can be points belonging to the dataset or sampled from a suitable distribution
  • Compute the initial inertia S(0)
  • Set N = 0
  • While $N < N_{max}$ and $|S^{(t)} - S^{(t-1)}| > Thr$:
    • N = N + 1
    • For each $\bar{x}_i \in X$:

      Assign $\bar{x}_i$ to a cluster using the shortest distance between $\bar{x}_i$ and the centroids $\bar{\mu}_j^{(t-1)}$

    • Recompute the centroid vector C(t) considering the new assignments
    • Recompute the inertia S(t)
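
The steps listed above can be condensed into a short NumPy sketch (a simplified illustration: empty clusters and other edge cases are not handled, and scikit-learn's KMeans should always be preferred in practice):

import numpy as np

def vanilla_kmeans(X, k, n_max=100, thr=1e-4, seed=1000):
    rs = np.random.RandomState(seed)
    # Random initial centroids drawn from the dataset
    C = X[rs.choice(X.shape[0], size=k, replace=False)]
    S_prev = np.inf

    for _ in range(n_max):
        # Assignment step: shortest Euclidean distance
        D = np.linalg.norm(X[:, np.newaxis, :] - C, axis=2)
        labels = np.argmin(D, axis=1)
        # Update step: recompute the centroids
        # (empty clusters are not handled in this sketch)
        C = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Recompute the inertia and check the tolerance
        S = np.sum((X - C[labels]) ** 2)
        if np.abs(S_prev - S) < thr:
            break
        S_prev = S

    return C, labels, S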

The algorithm is quite simple and intuitive, and many real-life applications are based on it. However, there are two important elements to consider. The first one is the convergence speed.

It's easy to show that every initial guess leads to a convergence point, but the number of iterations is dramatically influenced by this choice and there's no guarantee of finding the global minimum. If the initial centroids are close to the final ones, the algorithm needs only a few steps to correct the values, but when the choice is totally random, it's not uncommon to need a very high number of iterations. If there are N data points and k centroids, Nk distances must be computed at each iteration, leading to a very inefficient process. In the next section, we'll show how it's possible to initialize the centroids to minimize the convergence time.

Another important aspect is that, contrary to KNN, K-means needs the number of expected clusters to be defined in advance. In some cases, this is a secondary problem because we already know the most appropriate value for k (for example, if the number of clusters is defined by external constraints, like in market segmentation). However, when the dataset is high-dimensional and our knowledge is limited, this choice could be hazardous. A good approach to solve the issue is to analyze the final inertia for different numbers of clusters. As we expect to maximize the intra-cluster cohesion, a smaller number of clusters leads to larger inertia, so we try to pick the smallest value of k whose inertia falls below a maximum tolerable value. Theoretically, we could also pick k = N. In this case, the inertia becomes zero because each point represents the centroid of its cluster, but a large value of k transforms the clustering scenario into a fine-grained partitioning that might not be the best strategy to capture the features of a consistent group. It's impossible to define a rule for the upper bound kmax, but we assume that this value is always much less than N. The best choice is the value of k that minimizes the inertia, selected from a bounded set, for example, between 2 and kmax.

Even if the vanilla algorithm is quite effective, the optimal choice for the initial positions of the clusters can speed up the computation. This is the goal achieved by the variant called K-means++.

K-means++

We have said that a good choice for the initial centroids can improve the convergence speed and lead to a minimum that is closer to the global optimum of the inertia. Arthur and Vassilvitskii (in Arthur D., Vassilvitskii S., k-means++: The Advantages of Careful Seeding, Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2006) proposed a method called K-means++, which allows us to increase the accuracy of the initial centroid guess considering the most likely final configuration.

In order to explore the algorithm, it's useful to introduce a function, $D(\bar{x})$, which is defined as:

$D(\bar{x}) = \min_{j} d(\bar{x}, \bar{\mu}_j) \quad \text{for all the centroids } \bar{\mu}_j \text{ already selected}$

$D(\bar{x})$ defines the shortest distance between each sample and the closest of the centroids already selected. As the process is incremental, this function must be recomputed after each step. For our purposes, let's also define an auxiliary probability distribution (we omit the index variable for simplicity):

$G(\bar{x}) = \frac{D(\bar{x})^2}{\sum_{\bar{x}' \in X} D(\bar{x}')^2}$

The first centroid is sampled from X using a uniform distribution. The next steps are:

  • Compute $D(\bar{x}_i)$ for all $\bar{x}_i \in X$ considering the centroids already selected
  • Compute the distribution $G(\bar{x}_i)$
  • Select the next centroid from X with a probability $G(\bar{x}_i)$ (a minimal code sketch of this procedure follows the list)
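
The following NumPy sketch implements the seeding procedure described above (an illustrative, non-optimized version):

import numpy as np

def kmeans_pp_init(X, k, seed=1000):
    rs = np.random.RandomState(seed)
    # The first centroid is sampled uniformly from X
    centroids = [X[rs.randint(X.shape[0])]]

    for _ in range(k - 1):
        # D(x): shortest distance between each sample and the
        # centroids already selected
        D = np.min(np.linalg.norm(
            X[:, np.newaxis, :] - np.array(centroids), axis=2), axis=1)
        # G(x): probability proportional to D(x)^2
        G = D ** 2 / np.sum(D ** 2)
        # Sample the next centroid according to G(x)
        centroids.append(X[rs.choice(X.shape[0], p=G)])

    return np.array(centroids)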

In the aforementioned paper, the authors showed a very important property. If we define S* as the global optimum of S, a K-means++ initialization determines an upper bound for the expected value of the actual inertia:

$E[S] \le 8(\ln k + 2)\,S^*$

This condition is often expressed by saying that K-means++ is O(log k)-competitive. When k is sufficiently small, the probability of finding a local minimum close to the global one increases. However, K-means++ is still a probabilistic approach and different initializations on the same dataset lead to different initial configurations. A good practice is to run a limited number of initializations (for example, 10) and pick the one associated with the smallest inertia. When training complexity is not a primary issue, this number can be increased, but different experiments showed that the improvement achievable with a very large number of trials is negligible when compared to the actual computational cost. The default value in scikit-learn is 10 and it's advisable to keep this value in most cases. If the result continues to be poor, it's preferable to pick another method.

Moreover, there are problems that cannot be solved using K-means (even with the best possible initialization) because one of the assumptions of the algorithm is that each cluster is a hypersphere and the distances are measured using a Euclidean function. In the next chapter, we're going to analyze other algorithms that are not constrained to working with such limitations and can easily solve clustering problems using asymmetric cluster geometries.

Example of K-means with scikit-learn

In this example, we continue using the MNIST dataset (the X_train array is the same as was defined in the earlier section dedicated to KNN), but we also want to analyze different clustering evaluation methods. The first step is visualizing the inertia corresponding to different numbers of clusters. We are going to use the KMeans class, which accepts the n_clusters parameter and employs the K-means++ initialization as the default method (as explained in the previous section, in order to find the best initial configuration, scikit-learn performs several attempts and selects the configuration with the lowest inertia; it's possible to change the number of attempts through the n_init parameter):

import numpy as np
from sklearn.cluster import KMeans
min_nb_clusters = 2
max_nb_clusters = 20
inertias = np.zeros(shape=(max_nb_clusters - min_nb_clusters + 1,))
for i in range(min_nb_clusters, max_nb_clusters + 1):
    km = KMeans(n_clusters=i, random_state=1000)
    km.fit(X_train)
    inertias[i - min_nb_clusters] = km.inertia_

We are going to analyze the range [2, 20]. After each training session, the final inertia can be retrieved using the inertia_ instance variable. The following graph shows the plot of the gradient of the inertia as a function of the number of clusters. The plot is obtained using the NumPy function np.gradient():
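
A possible snippet for this plot (styling choices are arbitrary) is:

import matplotlib.pyplot as plt

nb_clusters = np.arange(min_nb_clusters, max_nb_clusters + 1)

fig, ax = plt.subplots(figsize=(12, 6))
# Numerical gradient of the inertia with respect to the number of clusters
ax.plot(nb_clusters, np.gradient(inertias))
ax.set_xlabel('Number of clusters')
ax.set_ylabel('Gradient of the inertia')
ax.set_xticks(nb_clusters)
ax.grid(True)
plt.show()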

Gradient of the inertia as a function of the number of clusters

As expected, the function is decreasing (the gradient is negative), but the slope tends to reach 0 as the number of clusters increases. In this case, we know that the real number of clusters is 10, but it's also possible to discover this by observing the trend. The absolute value of the slope is quite high before 10, but it starts decreasing more and more slowly after this threshold. This is a signal that informs us that some clusters are not well separated, even if their internal cohesion is high. In order to confirm this hypothesis, we can set n_clusters=10 and, first of all, check the centroids at the end of the training process:

km = KMeans(n_clusters=10, random_state=1000)
Y = km.fit_predict(X_train)

The centroids are available through the cluster_centers_ instance variable. In the following screenshot, there's a plot of the corresponding bidimensional arrays:

K-means centroids at the end of the training process

All the digits are present and there are no duplicates. This confirms that the algorithm has successfully separated the sets, but the final inertia (which is about 4,500) informs us that there are probably incorrect assignments. To obtain confirmation, we can plot the dataset using a dimensionality reduction method, such as t-SNE (see Chapter 5, Graph-Based Semi-Supervised Learning, for further details):

from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, perplexity=10.0, 
            random_state=1000)
X_tsne = tsne.fit_transform(X_train)

At this point, we can plot the bidimensional dataset with the corresponding cluster labels:
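
A possible plotting snippet (colors and marker sizes are arbitrary choices) is:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(12, 8))

# Scatter plot of the t-SNE projection, colored by cluster label
for c in range(10):
    ax.scatter(X_tsne[Y == c, 0], X_tsne[Y == c, 1],
               s=10, label='Cluster {}'.format(c))

ax.legend()
plt.show()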

t-SNE representation of the MNIST dataset; the labels correspond to the clusters

The plot confirms that the dataset is made up of well separated blobs, but a few data points are assigned to the wrong cluster (this is not surprising considering the similarity between some pairs of digits). An important observation can further explain the trend of the inertia. In fact, the point where the slope changes almost abruptly corresponds to about 10 clusters. Observing the t-SNE plot, we can immediately discover the reason: the cluster corresponding to the digit 7 is indeed split into 3 blocks. The main one contains the majority of samples, but there are another 2 smaller blobs that are wrongly attached to clusters 1 and 9. This is not surprising, considering that the digit 7 can appear very similar to a distorted 1 or 9. However, these two spurious blobs are always at the boundaries of the wrong clusters (remember that the geometric structures are hyperspheres), confirming that the metric has successfully detected a low similarity. If a group of wrongly assigned samples were in the middle of a cluster, it would mean that the separation failed dramatically, and another method should be employed.

At this point, it's helpful to introduce some common evaluation metrics that can be employed both when the ground truth is known and when it's not.

Evaluation metrics

In many cases, it's impossible to evaluate the performance of a clustering algorithm using only a visual inspection. Moreover, it's important to use standard objective metrics that allow us to compare different approaches.

We are now going to introduce some methods based on the knowledge of the ground truth (the correct assignment for each data point) and one common strategy employed when the true labels are unknown.

Before discussing the scoring functions, we need to introduce a standard notation. If there are k clusters, we define the true labels as:

$Y_{true} = \{y_1^{true}, y_2^{true}, \ldots, y_M^{true}\} \quad \text{where } y_i^{true} \in \{1, 2, \ldots, k\}$

In the same way, we can define the predicted labels:

$Y_{pred} = \{y_1^{pred}, y_2^{pred}, \ldots, y_M^{pred}\} \quad \text{where } y_i^{pred} \in \{1, 2, \ldots, k\}$

Both sets can be considered as sampled from two discrete random variables (for simplicity, we denote them with the same names), whose probability mass functions are $P(y^{true} = y_i)$ and $P(y^{pred} = y_i)$, with a generic $y_i \in \{1, 2, \ldots, k\}$ ($y_i$ represents the index of the ith cluster). These two probabilities can be approximated with a frequency count; so, for example, the probability $P(y^{true} = 1)$ is computed as the number of data points whose true label is 1 divided by the total number of data points M. In this way, we can define the entropies:

$H(Y_{true}) = -\sum_{i} P(y^{true} = y_i) \log P(y^{true} = y_i)$

$H(Y_{pred}) = -\sum_{i} P(y^{pred} = y_i) \log P(y^{pred} = y_i)$

These quantities describe the intrinsic uncertainty of the random variables. They are maximized when all the classes have the same probability, while, for example, they are null if all the samples belong to a single class (minimum uncertainty). We also need to know the uncertainty of a random variable Y given another one X. This can be achieved using the conditional entropy H(Y|X). In this case, we need to compute the joint probability p(x, y) because the definition of H(Y|X) is:

$H(Y|X) = -\sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)}$

In order to approximate the previous expression, we can define the function n(itrue, jpred), which counts the number of samples with the true label i assigned to cluster j.

In this way, if there are M samples, the approximated conditional entropies become:

$H(Y_{true}|Y_{pred}) = -\sum_{i, j} \frac{n(i, j)}{M} \log \frac{n(i, j)}{\sum_{i} n(i, j)}$

$H(Y_{pred}|Y_{true}) = -\sum_{i, j} \frac{n(i, j)}{M} \log \frac{n(i, j)}{\sum_{j} n(i, j)}$
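
As a sketch (not the official scikit-learn implementation), these quantities can be approximated directly from the contingency matrix n(i, j), which is exposed by the contingency_matrix() function in sklearn.metrics.cluster (digits['target'] and Y are the arrays used throughout this example):

import numpy as np
from sklearn.metrics.cluster import contingency_matrix

# n(i, j): number of samples with true label i assigned to cluster j
n_ij = contingency_matrix(digits['target'], Y).astype(np.float64)
M = n_ij.sum()

def conditional_entropy(n):
    # H(A | B) with rows indexed by A and columns by B
    p_ab = n / M
    p_b = n.sum(axis=0) / M
    mask = n > 0.0
    return -np.sum(p_ab[mask] *
                   np.log(p_ab[mask] /
                          np.broadcast_to(p_b, n.shape)[mask]))

print(conditional_entropy(n_ij))    # H(Y_true | Y_pred)
print(conditional_entropy(n_ij.T))  # H(Y_pred | Y_true)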

Using these measures, we can now compute some scores that cover different aspects of the clustering result. They are often computed together because each of them has a peculiar feature.

Homogeneity score

This score is useful to check whether the clustering algorithm meets an important requirement: a cluster should contain only samples belonging to a single class. It's defined as:

$h = 1 - \frac{H(Y_{true} | Y_{pred})}{H(Y_{true})}$

It's bounded between 0 and 1, with low values indicating low homogeneity. In fact, when the knowledge of Ypred reduces the uncertainty of Ytrue, H(Ytrue | Ypred) becomes smaller (and $h \to 1$) and vice versa. For our example, the homogeneity score can be computed as:

from sklearn.metrics import homogeneity_score
print(homogeneity_score(digits['target'], Y))

The output is:

0.739148799605

The digits['target'] array contains the true labels, while Y contains the predictions (all the functions we are going to use accept the true labels as the first parameter and the predictions as the second one). The homogeneity score confirms that the clusters are rather homogeneous, but there's still a moderate level of uncertainty because some clusters contain incorrect assignments.

This method, together with the other ones, can be used to search for the right number of clusters and tune up all supplementary hyperparameters (such as the number of iterations or the metric function).

Completeness score

This score is complementary to the homogeneity score. Its purpose is to provide a piece of information about the assignment of samples belonging to the same class. More precisely, a good clustering algorithm should assign all samples with the same true label to the same cluster. From our previous analysis, we know that, for example, the digit 7 has been wrongly assigned to both clusters 9 and 1; therefore, we expect a non-perfect completeness score. The definition is symmetric to the homogeneity score:

$c = 1 - \frac{H(Y_{pred} | Y_{true})}{H(Y_{pred})}$

The rationale is very intuitive. When H(Ypred | Ytrue) is low (and $c \to 1$), it means that the knowledge of the ground truth reduces the uncertainty about the predictions. Therefore, if we know that all the samples of subset A have the same label yi, we are quite sure that all the corresponding predictions have been assigned to the same cluster. The completeness score for our example is:

from sklearn.metrics import completeness_score
print(completeness_score(digits['target'], Y))

The output is:

0.747718831945

Again, the value confirms our hypothesis. The residual uncertainty is due to a lack of completeness because a few samples with the same label have been split into blocks that are assigned to wrong clusters. It's obvious that a perfect scenario is characterized by having both homogeneity and completeness scores equal to 1.

Adjusted Rand index

This score is useful to compare the original label distribution with the clustering prediction. Ideally, we'd like to reproduce the exact ground truth distribution, but in general, this is very difficult in real-life scenarios. A way to measure the discrepancy is provided by the adjusted Rand index.

In order to compute this score, we need to define the auxiliary variables:

  • a: Number of sample pairs that have the same true label and that are assigned to the same cluster
  • b: Number of sample pairs that have a different true label and that are assigned to different clusters

The Rand index is defined as:

$R = \frac{a + b}{\binom{M}{2}}$

The adjusted Rand index is the Rand index corrected for chance and it's defined as:

$R_A = \frac{R - E[R]}{\max(R) - E[R]}$

The $R_A$ measure is bounded between -1 and 1. A value close to -1 indicates a prevalence of wrong assignments, while a value close to 1 indicates that the clustering algorithm is correctly reproducing the ground truth distribution. The adjusted Rand score for our example is:

from sklearn.metrics import adjusted_rand_score
print(adjusted_rand_score(digits['target'], Y))

The output is:

0.666766395716

This value confirms that the algorithm is working well (because it's positive), but it can be further optimized by trying to reduce the number of incorrect assignments. The adjusted Rand score is a very powerful tool when the ground truth is known and can be employed as a single method to optimize all the hyperparameters.

Silhouette score

This measure doesn't need to know the ground truth and can be used to check, at the same time, the intra-cluster cohesion and the inter-cluster separation. In order to define the Silhouette score, we need to introduce two auxiliary functions. The first one is the average intra-cluster distance of a point $\bar{x}_i$ belonging to a cluster Cj:

$a(\bar{x}_i) = \frac{1}{n(j)} \sum_{\bar{x}_t \in C_j} d(\bar{x}_i, \bar{x}_t)$

In the previous expression, n(j) is the number of samples assigned to the cluster Cj and $d(\bullet)$ is a standard distance function (in the majority of cases, the Euclidean distance is the most reasonable choice). We also need to define the lowest inter-cluster distance, which can be interpreted as the average nearest-cluster distance. For the sample $\bar{x}_i$, let's call Ct the nearest cluster; therefore, the function is defined as:

$b(\bar{x}_i) = \frac{1}{n(t)} \sum_{\bar{x}_u \in C_t} d(\bar{x}_i, \bar{x}_u)$

The Silhouette score for sample $\bar{x}_i$ is:

$s(\bar{x}_i) = \frac{b(\bar{x}_i) - a(\bar{x}_i)}{\max\left(a(\bar{x}_i), b(\bar{x}_i)\right)}$

The value of $s(\bar{x}_i)$, like the adjusted Rand index, is bounded between -1 and 1. A value close to -1 indicates that $b(\bar{x}_i) \ll a(\bar{x}_i)$, so the average intra-cluster distance is greater than the average nearest-cluster distance and the sample $\bar{x}_i$ is wrongly assigned. Conversely, a value close to 1 indicates that the algorithm achieved a very good level of internal cohesion and inter-cluster separation (because $a(\bar{x}_i) \ll b(\bar{x}_i)$). Contrary to the other measures, the Silhouette score isn't a cumulative function and must be computed for each sample. A feasible strategy is to analyze the average value, but in this way, it's not possible to determine which clusters have the highest impact on the result. Another approach (the most common) is based on Silhouette plots, which display the score for each cluster in descending order. In the following snippet, we create plots for four different values of n_clusters (3, 5, 10, and 12):

import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn as sns
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples
sns.set()
fig, ax = plt.subplots(2, 2, figsize=(15, 10))
nb_clusters = [3, 5, 10, 12]
mapping = [(0, 0), (0, 1), (1, 0), (1, 1)]
for i, n in enumerate(nb_clusters):
    km = KMeans(n_clusters=n, random_state=1000)
    Y = km.fit_predict(X_train)
    silhouette_values = silhouette_samples(X_train, Y)
    ax[mapping[i]].set_xticks(
              [-0.15, 0.0, 0.25, 0.5, 0.75, 1.0])
    ax[mapping[i]].set_yticks([])
    ax[mapping[i]].set_title("{} clusters".format(n), 
                             fontsize=16)
    ax[mapping[i]].set_xlim([-0.15, 1])
    ax[mapping[i]].grid(True)
    y_lower = 20
    for t in range(n):
        ct_values = silhouette_values[Y == t]
        ct_values.sort()
        y_upper = y_lower + ct_values.shape[0]
        color = cm.Accent(float(t) / n)
        ax[mapping[i]].fill_betweenx(
            np.arange(y_lower, y_upper), 0, 
            ct_values, 
            facecolor=color, 
            edgecolor=color)
        y_lower = y_upper + 20

The result is shown in the following figure:

Silhouette plots for different numbers of clusters

The analysis of a Silhouette plot should follow some common guidelines:

  • The width of each block must be proportional to the number of samples that are expected to belong to the corresponding cluster. If the label distribution is uniform, all the blocks must have a similar width. If the cluster distribution is originally balanced, any asymmetry indicates incorrect assignments. Of course, this is not true if the classes are intrinsically unbalanced. For example, in our case, we know that the right number of clusters is 10, but a couple of blocks are thinner than the other ones. This means that a cluster contains fewer samples than expected and the remaining ones have been assigned to incorrect partitions. On the contrary, if, for example, 50% of our dataset was zeros, a larger silhouette for this class would be perfectly fine. The correct interpretation of this plot requires background knowledge about the data-generating process. If such knowledge is missing (because, for example, the problem has never been studied), looking for symmetric silhouettes is generally good practice (particularly when the other scores confirm the result to be valid).
  • The shape of a block shouldn't be sharp and peaked (like a knife) because it means that many samples have a low Silhouette score. The ideal (realistic) scenario is made up of shapes similar to cigars with a minimal difference between the highest and lowest values. Unfortunately, this is not always possible to achieve, but it's always preferable to tune up the algorithm if the shapes are like the ones plotted in the first diagram (three clusters).
  • The maximum Silhouette score should be close to 1. Lower values (like in our example) indicate the presence of partial overlaps and wrong assignments. Negative values must be absolutely avoided (or limited to a very small number of samples) because they show a failure in the clustering process. Moreover, it's possible to prove that convex clusters (such as K-means hyperspheres) lead to higher values. This is due to the properties of the common distance functions (such as the Euclidean distance) that can suggest a low internal cohesion whenever the shape of a cluster is concave (think about a circle compared to a half-moon). In this case, the process of embedding the shape into a convex geometry leads to a lower density, and this negatively affects the Silhouette score.

In our particular case, we cannot accept having anything other than 10 clusters. However, the corresponding Silhouette plot is not perfect. We know the reasons for such imperfections (the structure of the samples and the high similarity of different digits) and it's quite difficult to avoid them using an algorithm such as K-means. The reader can try to improve the performance by increasing the number of iterations, but in these cases, if the result doesn't meet the requirements, it's preferable to adopt another method (such as the spectral clustering method described in the next chapter, which can manage asymmetric clusters and more complex geometries).

Summary

In this chapter, we presented some fundamental clustering algorithms. We started with KNN, which is an instance-based method that restructures the dataset to find the most similar samples to a given query point. We discussed three approaches: a naive one, which is also the most expensive in terms of computational complexity, and two strategies based respectively on the construction of a k-d tree and a ball tree. These two data structures can dramatically improve performance even when the number of samples is very large.

The next topic was a classic algorithm: K-means, which is a symmetric partitioning strategy, comparable to a Gaussian mixture with variances close to zero, that can solve many real-life problems. We discussed both the vanilla algorithm, whose purely random initialization can slow down convergence and lead to poor sub-optimal solutions, and an optimized initialization method, called K-means++, which is able to speed up the convergence toward solutions quite close to the global minimum.

We also presented some evaluation methods that can be employed to assess the performance of a generic clustering algorithm. These metrics include the homogeneity and completeness scores, which measure how well the clusters match the ground-truth classes. We also discussed a more complete measure, the adjusted Rand index, and a very practical graphical tool, Silhouette plots, which show the structure of the clustering result and help the data scientist identify anomalies and overlapping clusters.

In the next chapter, Chapter 7, Advanced Clustering and Unsupervised Models, we're going to introduce more complex methodologies, like spectral and density-based clustering, that can easily solve problems where algorithms like K-means fail.

Further reading

  • Aggarwal C. C., Hinneburg A., Keim D. A., On the Surprising Behavior of Distance Metrics in High Dimensional Space, ICDT, 2001
  • Arthur D., Vassilvitskii S., k-means++: The Advantages of Careful Seeding, Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2006
  • Pedrycz W., Gomide F., An Introduction to Fuzzy Sets, The MIT Press, 1998
  • Shi J., Malik J., Normalized Cuts and Image Segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, 08, 2000
  • Gelfand I. M., Glagoleva E. G., Shnol E. E., Functions and Graphs Vol. 2, The MIT Press, 1969
  • Biyikoglu T., Leydold J., Stadler P. F., Laplacian Eigenvectors of Graphs, Springer, 2007
  • Ester M., Kriegel H. P., Sander J., Xu X., A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, AAAI Press, pp. 226-231, 1996
  • Kluger Y., Basri R., Chang J. T., Gerstein M., Spectral Biclustering of Microarray Cancer Data: Co-clustering Genes and Conditions, Genome Research, 13, 2003
  • Huang, S., Wang, H., Li, D., Yang, Y., Li, T., Spectral co-clustering ensemble. Knowledge-Based Systems, 84, 46-55, 2015
  • Bichot, C., Co-clustering Documents and Words by Minimizing the Normalized Cut Objective Function. Journal of Mathematical Modelling and Algorithms, 9(2), 131-147, 2010
  • Agrawal R., Srikant R., Fast Algorithms for Mining Association Rules, Proceedings of the 20th VLDB Conference, 1994
  • Li, Y., The application of Apriori algorithm in the area of association rules. Proceedings of SPIE, 8878, 88784H-88784H-5, 2013
  • Bonaccorso G., Hands-On Unsupervised Learning with Python, Packt Publishing, 2019