An interesting application of eigenvectors is the clustering of data. Using the eigenvectors of a matrix derived from a distance matrix, unlabelled data can be separated into groups. Spectral clustering methods get their name from the use of the spectrum of this matrix. A distance matrix for n elements (for example, the pairwise distances between data points) is an n × n symmetric matrix. Given such an n × n distance matrix M with distance values mij, we can create the Laplacian matrix of the data points as follows:

    L = I − D^(−1/2) M D^(−1/2)
Here, I is the identity matrix and D is the diagonal matrix containing the row sums of M:

    D = diag(d1, …, dn),   di = Σj mij
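Before moving to a full example, this construction can be sanity-checked on a small hand-made distance matrix (the values below are made up for illustration only):

```python
import numpy as np

# a hand-made symmetric 3 x 3 distance matrix for three points;
# the values are chosen arbitrarily for illustration
M = np.array([[0.0, 1.0, 4.0],
              [1.0, 0.0, 4.0],
              [4.0, 4.0, 0.0]])

# D^(-1/2): diagonal matrix built from the row sums of M
D = np.diag(1 / np.sqrt(M.sum(axis=0)))

# Laplacian L = I - D^(-1/2) M D^(-1/2)
L = np.identity(3) - D @ M @ D

# L inherits the symmetry of M, and its diagonal is all ones
# because the diagonal of a distance matrix is zero
print(L)
```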
The data clusters are obtained from the eigenvectors of L. In the simplest case of data points with only two classes, the first eigenvector (that is, the one corresponding to the largest eigenvalue) is often enough to separate the data.
Here is an example for simple two-class clustering. The following code creates some 2D data points and clusters them based on the first eigenvector of the Laplacian matrix:
import scipy.linalg as sl
from numpy import *

# create some data points
n = 100
x1 = 1.2 * random.randn(n, 2)
x2 = 0.8 * random.randn(n, 2) + tile([7, 0], (n, 1))
x = vstack((x1, x2))

# pairwise distance matrix
M = array([[sqrt(sum((x[i] - x[j])**2)) for i in range(2*n)]
                                        for j in range(2*n)])

# create the Laplacian matrix
D = diag(1 / sqrt(M.sum(axis=0)))
L = identity(2*n) - dot(D, dot(M, D))

# compute eigenvectors of L
S, V = sl.eig(L)
# as L is symmetric, the imaginary parts of the eigenvalues
# are due only to negligible numerical errors
S = S.real
V = V.real
The eigenvector corresponding to the eigenvalue that is largest in absolute value gives the grouping (for example, by thresholding at 0) and can be shown with:
largest = abs(S).argmax()
plot(V[:, largest])
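For a self-contained illustration of this thresholding step, here is a sketch on made-up one-dimensional data (the points and the threshold at zero are assumptions for illustration, not the dataset above):

```python
import numpy as np
import scipy.linalg as sl

# two well-separated one-dimensional groups (made-up toy data)
x = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])

# pairwise distance matrix and Laplacian, built as above
M = np.abs(x[:, None] - x[None, :])
D = np.diag(1 / np.sqrt(M.sum(axis=0)))
L = np.identity(len(x)) - D @ M @ D

S, V = sl.eig(L)
S, V = S.real, V.real

# the eigenvector of the eigenvalue largest in absolute value,
# thresholded at zero, yields the grouping
largest = abs(S).argmax()
labels = V[:, largest] > 0
print(labels)
```

The two groups come out with opposite signs in the eigenvector; which group gets the sign True is arbitrary.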
The following figure (Figure 14.2) shows the result of spectral clustering of a simple two-class dataset:
Figure 14.2: The result of spectral clustering of a simple two-class dataset
For more difficult datasets and more classes, one usually takes the k eigenvectors corresponding to the k largest eigenvalues and then clusters the data with some other method, but using the eigenvectors instead of the original data points. A common choice is the k-means clustering algorithm, which is the topic of the next example:
The eigenvectors are used as input to k-means clustering, as follows:
import scipy.linalg as sl
import scipy.cluster.vq as sc
from numpy import *

# simple 4-class data
x = random.rand(1000, 2)
ndx = ((x[:, 0] < 0.4) | (x[:, 0] > 0.6)) & \
      ((x[:, 1] < 0.4) | (x[:, 1] > 0.6))
x = x[ndx]
n = x.shape[0]

# pairwise distance matrix
M = array([[sqrt(sum((x[i] - x[j])**2)) for i in range(n)]
                                        for j in range(n)])

# create the Laplacian matrix
D = diag(1 / sqrt(M.sum(axis=0)))
L = identity(n) - dot(D, dot(M, D))

# compute eigenvectors of L
_, _, V = sl.svd(L)

k = 4
# take the k first eigenvectors
eigv = V[:k, :].T

# k-means
centroids, dist = sc.kmeans(eigv, k)
clust_id = sc.vq(eigv, centroids)[0]
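The two scipy.cluster.vq calls at the end work in tandem: kmeans computes the centroids (and the mean distortion), and vq assigns each observation the index of its nearest centroid. A minimal sketch on made-up one-dimensional data, with an explicit initial guess for the centroids to keep the result deterministic:

```python
import numpy as np
import scipy.cluster.vq as sc

# two obvious groups of one-dimensional observations (made-up data)
obs = np.array([[0.0], [0.1], [0.2], [9.0], [9.1], [9.2]])

# kmeans returns the centroids and the mean distortion; passing an
# array instead of k uses it as the initial guess for the centroids
centroids, dist = sc.kmeans(obs, np.array([[0.0], [9.0]]))

# vq assigns each observation the index of its nearest centroid
clust_id = sc.vq(obs, centroids)[0]
print(clust_id)
```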
Note that we computed the eigenvectors here using the singular value decomposition, sl.svd. As L is symmetric, the result is the same as if we had used sl.eig, but the eigenvectors come already ordered corresponding to the ordering of the singular values. We also used throw-away variables. svd returns a tuple with three arrays: the left singular vectors U, the singular values S, and the right singular vectors V, as follows:
U, S, V = sl.svd(L)
As we do not need U and S here, we can throw them away when unpacking the return value of svd:
_, _, V = sl.svd(L)
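The ordering claim can be checked directly on a small symmetric matrix (the matrix below is made up for illustration): for a symmetric matrix, the singular values are the absolute values of the eigenvalues, and sl.svd returns them in decreasing order:

```python
import numpy as np
import scipy.linalg as sl

# a small symmetric test matrix, chosen for illustration;
# its eigenvalues are 4, 2, and 1
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

U, S, V = sl.svd(A)
eigenvalues = sl.eig(A)[0].real

# the singular values come sorted in decreasing order and, for a
# symmetric matrix, equal the absolute values of the eigenvalues
print(S)
print(np.sort(abs(eigenvalues))[::-1])
```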
The result can be plotted using:
for i in range(k):
    ndx = where(clust_id == i)[0]
    plot(x[ndx, 0], x[ndx, 1], 'o')
axis('equal')
The following figure shows the result of spectral clustering of a simple multiclass dataset:
Figure 14.3: An example of spectral clustering of a simple four-class dataset