Spectral clustering

An interesting application of eigenvectors is for clustering data. Using the eigenvectors of a matrix derived from a distance matrix, unlabelled data can be separated into groups. Spectral clustering methods get their name from the use of the spectrum of this matrix. A distance matrix for n elements (for example, the pairwise distance between data points) is an n × n symmetric matrix. Given such an n × n distance matrix M with distance values m_ij, we can create the Laplacian matrix of the data points as follows:

    L = I − D^(−1/2) M D^(−1/2)

Here, I is the identity matrix and D is the diagonal matrix containing the row sums of M,

 

    D = diag(d_1, …, d_n),   d_i = Σ_j m_ij
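As a quick numerical check of this construction, the following sketch builds D^(−1/2) and L for a small, made-up 3 × 3 distance matrix; since m_ii = 0 for a distance matrix, the resulting L is symmetric with ones on the diagonal:

```python
import numpy as np

# a small, made-up symmetric distance matrix with zero diagonal
M = np.array([[0.0, 1.0, 4.0],
              [1.0, 0.0, 3.0],
              [4.0, 3.0, 0.0]])

d = M.sum(axis=0)                      # row sums d_i
D_inv_sqrt = np.diag(1 / np.sqrt(d))   # D^(-1/2)
L = np.identity(3) - D_inv_sqrt @ M @ D_inv_sqrt
```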

The data clusters are obtained from the eigenvectors of L. In the simplest case of data points with only two classes, the first eigenvector (that is, the one corresponding to the eigenvalue of largest magnitude) is often enough to separate the data.

Here is an example for simple two-class clustering. The following code creates some 2D data points and clusters them based on the first eigenvector of the Laplacian matrix:

from numpy import *
import scipy.linalg as sl

# create some data points
n = 100
x1 = 1.2 * random.randn(n, 2)
x2 = 0.8 * random.randn(n, 2) + tile([7, 0], (n, 1))
x = vstack((x1, x2))

# pairwise distance matrix
M = array([[sqrt(sum((x[i] - x[j])**2)) for i in range(2 * n)]
                                        for j in range(2 * n)])

# create the Laplacian matrix
D = diag(1 / sqrt(M.sum(axis=0)))
L = identity(2 * n) - dot(D, dot(M, D))

# compute eigenvectors of L
S, V = sl.eig(L)
# as L is symmetric, the imaginary parts of the
# eigenvalues are only due to negligible numerical errors
S = S.real
V = V.real

The eigenvector corresponding to the eigenvalue of largest magnitude gives the grouping (for example, by thresholding at 0) and can be plotted with:

largest = abs(S).argmax()
plot(V[:, largest])
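The thresholding step itself can be sketched in a self-contained version of this example (the data here is generated anew, and the sign of an eigenvector is arbitrary, so which cluster receives label 0 may vary):

```python
import numpy as np
import scipy.linalg as sl

np.random.seed(0)
n = 50
# two well-separated point clouds
x = np.vstack((1.2 * np.random.randn(n, 2),
               0.8 * np.random.randn(n, 2) + np.array([7.0, 0.0])))

# pairwise distance matrix and normalized Laplacian, as above
M = np.sqrt(((x[:, None, :] - x[None, :, :])**2).sum(axis=-1))
D = np.diag(1 / np.sqrt(M.sum(axis=0)))
L = np.identity(2 * n) - D @ M @ D

S, V = sl.eig(L)
S, V = S.real, V.real

# threshold the eigenvector of the largest-magnitude eigenvalue at 0
labels = (V[:, abs(S).argmax()] > 0).astype(int)
```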

The following figure (Figure 14.2) shows the result of spectral clustering of a simple two-class dataset:


Figure 14.2: The result of simple two-class clustering

For more difficult datasets and more classes, one usually takes the k eigenvectors corresponding to the k largest eigenvalues and then clusters the data with some other method, using the eigenvectors instead of the original data points. A common choice is the k-means clustering algorithm, which is the topic of the next example.

There, the eigenvectors are used as input to k-means clustering, as follows:

from numpy import *
import scipy.linalg as sl
import scipy.cluster.vq as sc

# simple four-class data
x = random.rand(1000, 2)
ndx = ((x[:, 0] < 0.4) | (x[:, 0] > 0.6)) & \
      ((x[:, 1] < 0.4) | (x[:, 1] > 0.6))
x = x[ndx]
n = x.shape[0]

# pairwise distance matrix
M = array([[sqrt(sum((x[i] - x[j])**2)) for i in range(n)]
                                        for j in range(n)])

# create the Laplacian matrix
D = diag(1 / sqrt(M.sum(axis=0)))
L = identity(n) - dot(D, dot(M, D))

# compute eigenvectors of L
_, _, V = sl.svd(L)

k = 4
# take the first k eigenvectors
eigv = V[:k, :].T

# k-means
centroids, dist = sc.kmeans(eigv, k)
clust_id = sc.vq(eigv, centroids)[0]
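The two scipy.cluster.vq calls can also be illustrated in isolation. In this toy sketch (made-up points; the numbering of the clusters is arbitrary), kmeans finds the centroids and vq assigns each point to its nearest centroid:

```python
import numpy as np
import scipy.cluster.vq as sc

np.random.seed(1)
# two made-up, well-separated groups of points
pts = np.array([[0.0, 0.0], [0.1, 0.0],
                [5.0, 5.0], [5.1, 5.0]])

centroids, dist = sc.kmeans(pts, 2)   # find 2 centroids
labels = sc.vq(pts, centroids)[0]     # nearest-centroid assignment
```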

Note that here we computed the eigenvectors using the singular value decomposition, sl.svd. As L is symmetric, the result is the same as if we had used sl.eig, but sl.svd returns the eigenvectors already ordered by decreasing magnitude of the eigenvalues. We also used throw-away variables. svd returns a tuple of three arrays: the left singular vectors U, the singular values S, and the right singular vectors V (as the rows of the returned array), as follows:

U, S, V = sl.svd(L)

As we do not need U and S here, we can throw them away when unpacking the return value of svd:

_, _, V = sl.svd(L)
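As a small sanity check of this ordering claim, the following sketch (with a made-up symmetric 3 × 3 matrix) compares the singular values from sl.svd with the eigenvalue magnitudes from sl.eig:

```python
import numpy as np
import scipy.linalg as sl

# a made-up symmetric matrix with eigenvalues 4, 2, 1
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

U, S, V = sl.svd(A)             # S is sorted in decreasing order
eigvals = sl.eig(A)[0].real     # eigenvalues in no particular order
```

For a symmetric matrix, the singular values are exactly the absolute values of the eigenvalues.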

The result can be plotted using:

for i in range(k):
    ndx = where(clust_id == i)[0]
    plot(x[ndx, 0], x[ndx, 1], 'o')
axis('equal')

The following figure shows the result of spectral clustering of a simple multiclass dataset:


Figure 14.3: An example of spectral clustering of a simple four-class dataset.
