Organizing clusters as a hierarchical tree

In this section, we will take a look at an alternative approach to prototype-based clustering: hierarchical clustering. One advantage of hierarchical clustering algorithms is that they allow us to plot dendrograms (visualizations of a binary hierarchical clustering), which can help with the interpretation of the results by creating meaningful taxonomies. Another useful advantage of this hierarchical approach is that we do not need to specify the number of clusters upfront.

The two main approaches to hierarchical clustering are agglomerative and divisive hierarchical clustering. In divisive hierarchical clustering, we start with one cluster that encompasses all our samples, and we iteratively split the cluster into smaller clusters until each cluster only contains one sample. In this section, we will focus on agglomerative clustering, which takes the opposite approach. We start with each sample as an individual cluster and merge the closest pairs of clusters until only one cluster remains.

The two standard algorithms for agglomerative hierarchical clustering are single linkage and complete linkage. Using single linkage, we compute the distances between the most similar members for each pair of clusters and merge the two clusters for which the distance between the most similar members is the smallest. The complete linkage approach is similar to single linkage but, instead of comparing the most similar members in each pair of clusters, we compare the most dissimilar members to perform the merge. This is shown in the following diagram:

[Figure: comparison of the single linkage and complete linkage merging criteria]

Note

Other commonly used algorithms for agglomerative hierarchical clustering include average linkage and Ward's linkage. In average linkage, we merge the cluster pairs based on the minimum average distance between all group members in the two clusters. In Ward's method, those two clusters that lead to the minimum increase of the total within-cluster SSE are merged.
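
As a small supplementary illustration (not part of the original walkthrough), the following snippet computes the single, complete, and average linkage distances between two hypothetical clusters using SciPy's cdist function; the cluster coordinates are made up for this example:

>>> import numpy as np
>>> from scipy.spatial.distance import cdist
>>> # two hypothetical clusters with two and three 2D members, respectively
>>> cluster_a = np.array([[1.0, 1.0], [1.5, 2.0]])
>>> cluster_b = np.array([[5.0, 5.0], [6.0, 4.5], [5.5, 6.0]])
>>> d = cdist(cluster_a, cluster_b, metric='euclidean')
>>> d.min()   # single linkage: distance between the most similar members
>>> d.max()   # complete linkage: distance between the most dissimilar members
>>> d.mean()  # average linkage: average distance over all member pairs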

In this section, we will focus on agglomerative clustering using the complete linkage approach. This is an iterative procedure that can be summarized by the following steps (a short illustrative sketch of the procedure follows the list):

  1. Compute the distance matrix of all samples.
  2. Represent each data point as a singleton cluster.
  3. Merge the two closest clusters based on the distance of the most dissimilar (distant) members.
  4. Update the distance matrix.
  5. Repeat steps 3 and 4 until only one cluster remains.
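
The following is a minimal, unoptimized sketch of this procedure, written directly from the steps above as a standalone function. It is an illustrative example of mine rather than the implementation we will use; for the rest of the section, we rely on SciPy's optimized routines.

import numpy as np
from scipy.spatial.distance import cdist

def naive_complete_linkage(X):
    """Illustrative complete linkage clustering; returns the merge history."""
    # step 2: start with every sample as its own (singleton) cluster
    clusters = [[i] for i in range(X.shape[0])]
    merges = []
    # step 5: repeat until only a single cluster remains
    while len(clusters) > 1:
        best = None
        # steps 1/4: (re)compute the complete linkage distance for each cluster pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # complete linkage: distance between the most dissimilar members
                d = cdist(X[clusters[i]], X[clusters[j]], metric='euclidean').max()
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        # step 3: merge the two closest clusters
        merges.append((clusters[i], clusters[j], d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

Calling naive_complete_linkage(df.values) on the sample data generated below should reproduce the same sequence of merges as SciPy's linkage function, only far less efficiently.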

Now we will discuss how to compute the distance matrix (step 1). But first, let's generate some random sample data to work with. The rows represent different observations (IDs 0 to 4), and the columns are the different features (X, Y, Z) of those samples:

>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(123)
>>> variables = ['X', 'Y', 'Z']
>>> labels = ['ID_0','ID_1','ID_2','ID_3','ID_4']
>>> X = np.random.random_sample([5,3])*10
>>> df = pd.DataFrame(X, columns=variables, index=labels)
>>> df

After executing the preceding code, we should now see the following DataFrame containing the randomly generated samples:

[Output: DataFrame of the five randomly generated samples with features X, Y, and Z]

Performing hierarchical clustering on a distance matrix

To calculate the distance matrix as input for the hierarchical clustering algorithm, we will use the pdist function from SciPy's spatial.distance submodule:

>>> from scipy.spatial.distance import pdist, squareform
>>> row_dist = pd.DataFrame(squareform(
...            pdist(df, metric='euclidean')), 
...            columns=labels, index=labels)
>>> row_dist

Using the preceding code, we calculated the Euclidean distance between each pair of sample points in our dataset based on the features X, Y, and Z. We provided the condensed distance matrix—returned by pdist—as input to the squareform function to create a symmetrical matrix of the pair-wise distances, as shown here:

[Output: symmetric matrix of the pairwise Euclidean distances between the five samples]
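
As a quick sanity check (my own addition), note that the condensed form returned by pdist holds only the n(n-1)/2 unique pairwise distances, and that squareform converts back and forth between the two representations:

>>> dist_condensed = pdist(df, metric='euclidean')
>>> dist_condensed.shape   # 5 samples yield 5*4/2 = 10 pairwise distances
(10,)
>>> np.allclose(squareform(squareform(dist_condensed)), dist_condensed)
True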

Next, we will apply complete linkage agglomerative clustering to our clusters using the linkage function from SciPy's cluster.hierarchy submodule, which returns a so-called linkage matrix.

However, before we call the linkage function, let's take a careful look at the function documentation:

>>> from scipy.cluster.hierarchy import linkage
>>> help(linkage)
[…]
Parameters:
  y : ndarray
    A condensed or redundant distance matrix. A condensed 
    distance matrix is a flat array containing the upper 
    triangular of the distance matrix. This is the form 
    that pdist returns. Alternatively, a collection of m 
    observation vectors in n dimensions may be passed as 
    an m by n array.

  method : str, optional
    The linkage algorithm to use. See the Linkage Methods 
    section below for full descriptions.

  metric : str, optional
    The distance metric to use. See the distance.pdist 
    function for a list of valid distance metrics.

 Returns:
  Z : ndarray
    The hierarchical clustering encoded as a linkage matrix.
[…]

Based on the function description, we conclude that we can use a condensed distance matrix (the upper triangular part) returned by the pdist function as an input argument. Alternatively, we could provide the initial data array and use 'euclidean' as the metric argument of linkage. However, we should not use the squareform distance matrix that we defined earlier, since linkage would interpret that square matrix as a collection of observation vectors and compute the distances between its rows, which would yield different distance values from those expected. To sum it up, the three possible scenarios are listed here:

  • Incorrect approach: In this approach, we use the squareform distance matrix. The code is as follows:
    >>> from scipy.cluster.hierarchy import linkage
    >>> row_clusters = linkage(row_dist, 
    ...                        method='complete', 
    ...                        metric='euclidean')
  • Correct approach: In this approach, we use the condensed distance matrix. The code is as follows:
    >>> row_clusters = linkage(pdist(df, metric='euclidean'),
    ...                        method='complete')
  • Correct approach: In this approach, we use the input sample matrix. The code is as follows:
    >>> row_clusters = linkage(df.values, 
    ...                        method='complete', 
    ...                        metric='euclidean')
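
To convince ourselves that the two correct approaches really are equivalent, we can compare the resulting linkage matrices (a quick consistency check of my own):

>>> Z_condensed = linkage(pdist(df, metric='euclidean'), method='complete')
>>> Z_raw = linkage(df.values, method='complete', metric='euclidean')
>>> np.allclose(Z_condensed, Z_raw)
True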

To take a closer look at the clustering results, we can turn them into a pandas DataFrame (best viewed in IPython Notebook) as follows:

>>> pd.DataFrame(row_clusters, 
...      columns=['row label 1', 
...               'row label 2', 
...               'distance', 
...               'no. of items in clust.'],
...      index=['cluster %d' %(i+1) for i in
...             range(row_clusters.shape[0])])

As shown in the following table, the linkage matrix consists of several rows where each row represents one merge. The first and second columns denote the indices of the clusters that were merged in that step, the third column reports the distance between those clusters (under complete linkage, the distance between their most dissimilar members), and the last column gives the number of members in the newly formed cluster.

[Output: the linkage matrix displayed as a pandas DataFrame]
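
To make the encoding of the first two columns explicit (a supplementary note of mine): indices 0 to 4 refer to the original samples, while an index of 5 or above refers to the cluster that was formed in an earlier merge step (index minus 5 in our example with five samples). We can print the merge history accordingly:

>>> for step, (idx1, idx2, dist, size) in enumerate(row_clusters):
...     print('step %d: merged %d and %d at distance %.2f (new size: %d)'
...           % (step, int(idx1), int(idx2), dist, int(size)))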

Now that we have computed the linkage matrix, we can visualize the results in the form of a dendrogram:

>>> import matplotlib.pyplot as plt
>>> from scipy.cluster.hierarchy import dendrogram
>>> # make dendrogram black (part 1/2)
>>> # from scipy.cluster.hierarchy import set_link_color_palette
>>> # set_link_color_palette(['black'])
>>> row_dendr = dendrogram(row_clusters,
...                        labels=labels,
...                        # make dendrogram black (part 2/2)
...                        # color_threshold=np.inf
...                        )
>>> plt.tight_layout()
>>> plt.ylabel('Euclidean distance')
>>> plt.show()

If you are executing the preceding code or reading the e-book version of this book, you will notice that the branches in the resulting dendrogram are shown in different colors. The coloring scheme is derived from a list of matplotlib colors that are cycled for the distance thresholds in the dendrogram. For example, to display the dendrograms in black, you can uncomment the respective sections that I inserted in the preceding code.

[Figure: dendrogram produced by complete linkage hierarchical clustering]

Such a dendrogram summarizes the different clusters that were formed during the agglomerative hierarchical clustering; for example, we can see that the samples ID_0 and ID_4, followed by ID_1 and ID_2, are the most similar ones based on the Euclidean distance metric.
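
If we want to extract flat cluster assignments from this hierarchy without rerunning the clustering, we can cut the linkage matrix with SciPy's fcluster function (a small supplementary example; the requested number of clusters is arbitrary here):

>>> from scipy.cluster.hierarchy import fcluster
>>> # cut the tree so that at most two flat clusters remain
>>> fcluster(row_clusters, t=2, criterion='maxclust')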

Attaching dendrograms to a heat map

In practical applications, hierarchical clustering dendrograms are often used in combination with a heat map, which allows us to represent the individual values in the sample matrix with a color code. In this section, we will discuss how to attach a dendrogram to a heat map plot and order the rows in the heat map correspondingly.

However, attaching a dendrogram to a heat map can be a little bit tricky, so let's go through this procedure step by step:

  1. We create a new figure object and define the x axis position, y axis position, width, and height of the dendrogram via the add_axes method. Furthermore, we rotate the dendrogram 90 degrees counter-clockwise. The code is as follows:
    >>> fig = plt.figure(figsize=(8,8), facecolor='white')
    >>> axd = fig.add_axes([0.09,0.1,0.2,0.6])
    >>> row_dendr = dendrogram(row_clusters, orientation='right')
    # note: for matplotlib >= v1.5.1, please use orientation='left'
  2. Next we reorder the data in our initial DataFrame according to the clustering labels that can be accessed from the dendrogram object, which is essentially a Python dictionary, via the leaves key. The code is as follows:
    >>> df_rowclust = df.iloc[row_dendr['leaves'][::-1]]
  3. Now we construct the heat map from the reordered DataFrame and position it right next to the dendrogram:
    >>> axm = fig.add_axes([0.23,0.1,0.6,0.6])
    >>> cax = axm.matshow(df_rowclust, 
    ...              interpolation='nearest', cmap='hot_r')
  4. Finally we will modify the aesthetics of the heat map by removing the axis ticks and hiding the axis spines. Also, we will add a color bar and assign the feature and sample names to the x and y axis tick labels, respectively. The code is as follows:
    >>> axd.set_xticks([])
    >>> axd.set_yticks([])
    >>> for i in axd.spines.values():
    ...     i.set_visible(False)
    >>> fig.colorbar(cax)
    >>> axm.set_xticklabels([''] + list(df_rowclust.columns))
    >>> axm.set_yticklabels([''] + list(df_rowclust.index))
    >>> plt.show()

After following the previous steps, the heat map should be displayed with the dendrogram attached:

[Figure: heat map of the samples with the row dendrogram attached]

As we can see, the row order in the heat map reflects the clustering of the samples in the dendrogram. In addition to a simple dendrogram, the color-coded values of each sample and feature in the heat map provide us with a nice summary of the dataset.
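
As an aside that is not part of the original walkthrough: the seaborn library wraps this entire dendrogram-plus-heat-map construction in its clustermap function, which can be convenient when we do not need fine-grained control over the layout:

>>> import seaborn as sns
>>> # cluster only the rows to mirror the plot we built manually
>>> sns.clustermap(df, method='complete', metric='euclidean',
...                col_cluster=False)
>>> plt.show()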

Applying agglomerative clustering via scikit-learn

In this section, we saw how to perform agglomerative hierarchical clustering using SciPy. However, there is also an AgglomerativeClustering implementation in scikit-learn, which allows us to choose the number of clusters that we want to return. This is useful if we want to prune the hierarchical cluster tree. By setting the n_clusters parameter to 2, we will now cluster the samples into two groups using the same complete linkage approach based on the Euclidean distance metric as before (note that in more recent scikit-learn versions, the affinity parameter used in the following code has been renamed to metric):

>>> from sklearn.cluster import AgglomerativeClustering
>>> ac = AgglomerativeClustering(n_clusters=2,
...                              affinity='euclidean', 
...                              linkage='complete')
>>> labels = ac.fit_predict(X)
>>> print('Cluster labels: %s' % labels)
Cluster labels: [0 1 1 0 0]

Looking at the predicted cluster labels, we can see that the first, fourth, and fifth samples (ID_0, ID_3, and ID_4) were assigned to one cluster (0), and the samples ID_1 and ID_2 were assigned to a second cluster (1), which is consistent with the results that we observed in the dendrogram.
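
Because the cuts of a hierarchical clustering are nested, requesting more clusters simply splits an existing group further. As a supplementary example of mine (output not shown), pruning the tree into three clusters should separate ID_3 into its own cluster while keeping ID_0/ID_4 and ID_1/ID_2 together, consistent with the dendrogram above:

>>> ac = AgglomerativeClustering(n_clusters=3,
...                              affinity='euclidean',
...                              linkage='complete')
>>> labels = ac.fit_predict(X)
>>> print('Cluster labels: %s' % labels)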
