Clustering

Another technique used in data mining is clustering. SciPy has two modules to deal with problems in this field, each addressing a different clustering tool: scipy.cluster.vq for k-means and scipy.cluster.hierarchy for hierarchical clustering.

Vector quantization and k-means

We have two routines to divide data into clusters using the k-means technique—kmeans and kmeans2. They correspond to two different implementations. The former has a very simple syntax:

kmeans(obs, k_or_guess, iter=20, thresh=1e-05)

The obs parameter is an ndarray with the data we wish to cluster. If the dimensions of the array are m x n, the algorithm interprets this data as m points in n-dimensional Euclidean space. If we know the number of clusters into which the data should be divided, we indicate it with the k_or_guess option. The output is a tuple with two elements. The first is an ndarray of dimension k x n, representing a collection of points, as many as the clusters indicated; each of these locations is the centroid of one of the clusters found. The second entry of the tuple is a floating-point value indicating the distortion between the passed points and the generated centroids (the mean Euclidean distance from each observation to its closest centroid).

If we wish to impose an initial guess for the centroids of the clusters, we may do so with the k_or_guess parameter again, by sending a k x n ndarray.

The data we pass to kmeans needs to be normalized with the whiten routine, which rescales each column of the observation array to unit variance.
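
A minimal sketch of this normalization step follows; the two-column array obs is synthetic data made up purely for illustration:

>>> import numpy as np
>>> from scipy.cluster.vq import whiten, kmeans
>>> obs = np.random.rand(100,2)*np.array([1.0,50.0])  # features on very different scales
>>> obs_w = whiten(obs)                       # each column rescaled to unit variance
>>> centroids, distortion = kmeans(obs_w, 2)  # cluster the whitened observations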

The second option is much more flexible, as its syntax indicates:

kmeans2(data, k, iter=10, thresh=1e-05,
minit='random', missing='warn')

The data and k parameters are the same as obs and k_or_guess, respectively. The difference in this routine is the possibility of choosing among different initialization algorithms, which gives us the chance to speed up the process and use fewer resources if we know some properties of our data. We do so by passing to the minit parameter one of the strings 'random' (the initial centroids are constructed randomly using a Gaussian), 'points' (initialization is done by choosing points belonging to our data), or 'uniform' (if we prefer a uniform distribution to a Gaussian).

If we would like to provide the initial centroids ourselves through the k parameter, we must indicate this choice to the algorithm by passing 'matrix' to the minit option as well.
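
The following short sketch illustrates both initialization styles; the array data and the initial guess are synthetic values made up for this illustration:

>>> import numpy as np
>>> from scipy.cluster.vq import kmeans2
>>> data = np.random.rand(200,2)                          # synthetic two-dimensional data
>>> centroids, labels = kmeans2(data, 3, minit='points')  # seed with points taken from the data
>>> guess = np.array([[0.2,0.2],[0.5,0.5],[0.8,0.8]])     # our own initial centroids
>>> centroids, labels = kmeans2(data, guess, minit='matrix')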

In any case, if we wish to classify the original data by assigning to each point the cluster to which it belongs, we do so with the vq routine (for vector quantization). The syntax is pretty simple as well:

vq(obs, centroids)

The output is a tuple with two entries. The first entry is a one-dimensional ndarray of size m holding, for each point in obs, the index of the cluster to which it belongs. The second entry is another one-dimensional ndarray of the same size, but containing floating-point values indicating the distance from each point to the centroid of its cluster.

Let us illustrate with a classical example, the mouse dataset. We will create a big dataset with randomly generated points in three disks, as follows:

>>> import numpy
>>> from scipy.stats import norm
>>> from numpy import array,vstack
>>> data=norm.rvs(0,0.3,size=(10000,2))
>>> inside_ball=numpy.hypot(data[:,0],data[:,1])<1.0
>>> data=data[inside_ball]
>>> data = vstack((data, data+array([1,1]),data+array([-1,1])))

Once created, we will request the data to be separated into three clusters:

>>> from scipy.cluster.vq import *
>>> centroids, distortion = kmeans(data,3)
>>> cluster_assignment, distances = vq(data,centroids)

Let us present the results:

>>> import matplotlib.pyplot as plt
>>> plt.plot(data[cluster_assignment==0,0],
...          data[cluster_assignment==0,1], 'ro')
>>> plt.plot(data[cluster_assignment==1,0],
...          data[cluster_assignment==1,1], 'b+')
>>> plt.plot(data[cluster_assignment==2,0],
...          data[cluster_assignment==2,1], 'k.')
>>> plt.show()

This gives the following plot showing the mouse dataset with three clusters, from left to right: red (circles), blue (pluses), and black (dots):


Hierarchical clustering

There are several different algorithms to perform hierarchical clustering. SciPy has routines for the following methods:

  • Single/min/nearest method: single
  • Complete/max/farthest method: complete
  • Average/UPGMA method: average
  • Weighted/WPGMA method: weighted
  • Centroid/UPGMC method: centroid
  • Median/WPGMC method: median
  • Ward's linkage method: ward

In any of the previous cases, the syntax is the same; the only input is the dataset, which can be either an m x n ndarray representing m points in the n-dimensional Euclidean space, or a condensed distance matrix obtained from the previous data using the pdist routine from scipy.spatial. The output is always an ndarray representing the corresponding linkage matrix of the clustering obtained.

Alternatively, we may call the clustering with the generic routine linkage. This routine accepts a dataset/distance matrix, and a string indicating the method to use. The strings coincide with the names introduced. The advantage of linkage over the previous routines is that we are also allowed to indicate a different metric than the usual Euclidean distance. The complete syntax for linkage is then as follows:

linkage(data, method='single', metric='euclidean')
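
Both calling conventions are shown in the brief sketch below; the ten random points are made up purely for illustration:

>>> import numpy as np
>>> from scipy.spatial.distance import pdist
>>> from scipy.cluster.hierarchy import linkage
>>> pts = np.random.rand(10,3)                                # ten synthetic points in 3D
>>> Z1 = linkage(pts, method='complete', metric='cityblock')  # raw data, Manhattan distance
>>> D = pdist(pts, metric='cityblock')                        # condensed distance matrix
>>> Z2 = linkage(D, method='complete')                        # same clustering, computed from the distances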

Different statistics on the resulting linkage matrices may be computed with routines such as cophenet (cophenetic distances between observations), inconsistent (inconsistency statistics), maxdists (the maximum linkage distance within each non-singleton cluster and its descendants), and maxRstat (the maximum value of a given inconsistency statistic over each non-singleton cluster and its descendants).
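
A minimal sketch of these statistics, computed on a small synthetic dataset, could look as follows:

>>> import numpy as np
>>> from scipy.spatial.distance import pdist
>>> from scipy.cluster.hierarchy import linkage, cophenet, inconsistent, maxdists
>>> pts = np.random.rand(10,3)        # synthetic points for illustration
>>> D = pdist(pts)                    # condensed Euclidean distance matrix
>>> Z = linkage(D, method='average')
>>> c, coph_dists = cophenet(Z, D)    # cophenetic correlation coefficient and distances
>>> R = inconsistent(Z)               # inconsistency statistics, one row per merge
>>> md = maxdists(Z)                  # maximum distance within each non-singleton cluster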

It is customary to use binary trees to represent linkage matrices, and the scipy.cluster.hierarchy submodule has a large number of routines to manipulate and extract information from these trees. The most useful of these is the visualization of the trees, usually called dendrograms. The corresponding routine in SciPy is dendrogram, and it has the following imposing syntax:

dendrogram(Z, p=30, truncate_mode=None, color_threshold=None, 
get_leaves=True, orientation='top', labels=None, 
count_sort=False, distance_sort=False, 
show_leaf_counts=True, no_plot=False, no_labels=False, 
color_list=None, leaf_font_size=None, 
leaf_rotation=None, leaf_label_func=None, 
no_leaves=False, show_contracted=False,
link_color_func=None)

The first obvious parameter, Z, is a linkage matrix. This is the only non-optional variable. The other options control the style of the output (colors, labels, rotation, and so on), and since they are technically nonmathematical in nature, we will not explore them in detail in this monograph, other than through the simple application to animal clustering shown next.

Clustering mammals by their dentition

Mammals' teeth are divided into four groups: incisors, canines, premolars, and molars. The dentition of several mammals has been collected, and is available for download at http://www.uni-koeln.de/themen/statistik/data/cluster/dentitio.dat.

This file presents the name of the mammal, together with the number of top incisors, bottom incisors, top canines, bottom canines, top premolars, bottom premolars, top molars, and bottom molars.

We wish to use hierarchical clustering on that dataset to assess which species are closer to each other by these features.

We will start by preparing the dataset and storing the relevant data in ndarrays. The original data is given as a text file, where each line represents a different mammal. The first four lines are as follows:

OPOSSUM                    54113344
HAIRY TAIL MOLE            33114433
COMMON MOLE                32103333
STAR NOSE MOLE             33114433

The first 27 characters of each line hold the name of the animal. The characters in positions 28 to 35 are the counts of the respective kinds of teeth. We need to prepare this data into something that SciPy can handle. We will collect the names separately, since we will use them as labels in the dendrogram. The rest of the data will be parsed as integers and stored in an ndarray:

>>> import numpy
>>> file=open("dentitio.dat","r") # open the file
>>> lines=file.readlines() # read each line in memory
>>> file.close() # close the file
>>> mammals=[] # this stores the names
>>> dataset=numpy.zeros((len(lines),8)) # this stores the data
>>> for index,line in enumerate(lines):
...     mammals.append( line[0:27].rstrip(" ").capitalize() )
...     for tooth in range(8):
...         dataset[index,tooth]=int(line[27+tooth])

We will proceed to compute the linkage matrix and its corresponding dendrogram, making sure to use the Python list, mammals, as labels:

>>> import matplotlib.pyplot as plt
>>> from scipy.cluster.hierarchy import linkage, dendrogram
>>> Z=linkage(dataset)
>>> dendrogram(Z, labels=mammals, orientation="right")
>>> plt.show()

This gives us the following dendrogram showing clustering of mammals according to their dentition:


Note how all the bats are clustered together. The mice are also clustered together, but far from the bats. Sheep, goats, antelopes, deer, and moose have similar dentition too, and they appear clustered at the bottom of the tree, next to the opossum and the armadillo. Note how all the felines are also clustered together, at the top of the tree.

Experts in data analysis can obtain more information from dendrograms; they are able to interpret the lengths of the branches or the different colors used in the composition, and give us more insightful explanations about the way the clusters differ from each other.
