Showing hierarchy in multivariate data with clustermap

A heatmap can be hard to interpret when it contains too many alternating color bands. This usually happens because the rows and columns are not ordered by similarity. In that case, we need to group similar data together to reveal the structure.

For this purpose, Seaborn offers the clustermap API, which combines a heatmap with a dendrogram. A dendrogram is a tree-shaped graph that places similar variables under the same branches. Drawing a dendrogram generally involves unsupervised hierarchical clustering, which runs in the background by default when we call the clustermap() function.
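To make the background step concrete, here is a minimal sketch of hierarchical clustering with SciPy on a small, made-up array. The toy data and variable names are assumptions for illustration only. The linkage settings (average linkage, Euclidean distance) are the defaults clustermap() uses when no others are given. The leaf order of the resulting tree is what places similar rows next to each other.

```python
import numpy as np
from scipy.cluster import hierarchy

# Hypothetical toy data: six samples (rows) with two properties (columns).
# Rows 0, 2, 4 resemble each other, and so do rows 1, 3, 5.
data = np.array([
    [1.0, 1.1],
    [9.0, 9.2],
    [1.2, 0.9],
    [9.1, 8.8],
    [0.9, 1.0],
    [8.9, 9.0],
])

# Agglomerative clustering of the rows (average linkage, Euclidean distance)
row_linkage = hierarchy.linkage(data, method='average', metric='euclidean')

# Leaf order of the dendrogram: similar rows become neighbours
row_order = hierarchy.leaves_list(row_linkage)
reordered = data[row_order]

print(row_order)
print(reordered)
```

After reordering, the alternating low/high rows of the input form two contiguous blocks, which is the effect that removes the alternating color bands from a heatmap.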

Besides unsupervised clustering, if we have a priori knowledge of certain labels, we can also show it in colors with the row_colors keyword argument.

Here, we extend the preceding heatmap example of CPU models, draw a clustered heatmap, and label the generation with row colors. Let's look at the code:

import seaborn as sns

# Map each generation label to a color
row_colors = cpuspec['Gen'].map({7: '#a2ecec', 8: '#ecaabb'})
# standard_scale=1 rescales each column to the 0-1 range before plotting
sns.clustermap(cpuspec.drop(['Gen'], axis=1), standard_scale=1, cmap='Blues', row_colors=row_colors);

Again, calling the API is just as simple as the earlier heatmap, and we have generated the following figure:

Besides showing multiple properties of a large number of samples, clustermap can, with some tweaking, also be used for pairwise clustering, which shows the similarity among samples with all the available properties considered together.

To draw a pairwise clustering heatmap, we have to first calculate the correlation between samples from the various property values, convert the correlation matrix into a distance matrix, and then perform hierarchical clustering to generate linkage values for dendrogram plotting. We use the scipy package for this purpose. To understand more about linkage calculation methods, please refer to the SciPy documentation.
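The three steps can be sketched on their own before they are wrapped in a function. This is an illustrative sketch under stated assumptions, not the book's implementation. The toy table is made up, and 1 - r is used as the correlation-to-distance conversion, which is one common choice. The function in this section takes a different route and applies pdist() to the rows of the correlation matrix. Note also that DataFrame.corr() correlates columns, so samples stored as rows would need to be transposed first.

```python
import numpy as np
import pandas as pd
from scipy.cluster import hierarchy
from scipy.spatial import distance

# Hypothetical toy table: each column is one item to compare.
df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [1.1, 2.1, 2.9, 4.2, 5.0],  # nearly the same trend as 'a'
    'c': [5.0, 4.1, 3.0, 2.2, 1.0],  # opposite trend
    'd': [4.9, 4.0, 3.2, 2.0, 1.1],  # nearly the same trend as 'c'
})

# Step 1: pairwise Pearson correlation between the columns
corr = df.corr()

# Step 2: turn correlation into distance (assumption: d = 1 - r)
dist = 1.0 - corr.to_numpy()
np.fill_diagonal(dist, 0.0)  # remove floating-point noise on the diagonal

# Step 3: condense the square matrix and compute the linkage
condensed = distance.squareform(dist, checks=False)
linkage = hierarchy.linkage(condensed, method='average')

# Cut the tree into two flat clusters to inspect the grouping
labels = hierarchy.fcluster(linkage, t=2, criterion='maxclust')
print(dict(zip(df.columns, labels)))
```

A linkage computed this way could be passed to clustermap() through the row_linkage and col_linkage arguments, in the same way as in the function that follows.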

We will provide the user-defined function here:

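import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial import distance
import seaborn as sns

def pairwise_clustermap(df, method='average', metric='cityblock', figsize=(13,13), cmap='viridis', **kwargs):
    # Pairwise correlation matrix; DataFrame.corr() correlates the columns of df
    correlations = df.corr()
    correlations_array = np.asarray(correlations)

    # Hierarchical clustering on the rows of the correlation matrix
    row_linkage = hierarchy.linkage(
        distance.pdist(correlations_array), method=method)

    # Hierarchical clustering on the columns of the correlation matrix
    col_linkage = hierarchy.linkage(
        distance.pdist(correlations_array.T), method=method)

    # Draw the clustered heatmap using the precomputed linkages
    g = sns.clustermap(correlations, row_linkage=row_linkage, col_linkage=col_linkage,
                       method=method, metric=metric, figsize=figsize, cmap=cmap, **kwargs)
    return g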

Here is the result of the pairwise clustering plot:

From both heatmaps, we can observe that, based on these four properties, the CPUs seem to be better clustered by product line suffix, such as U, K, and Y, than by brand modifier, such as i5 and i7. Spotting this kind of similarity within a large group is one of the analytical skills we need when approaching data.
