The third caveat – cluster boundaries are linear

The k-means algorithm is based on a simple assumption: every point is closer to its own cluster center than to any other. Consequently, k-means always draws linear boundaries between clusters, which means it will fail whenever the geometry of the clusters is more complicated than that.
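
To see why this assumption leads to linear boundaries, consider the assignment step: a point gets the label of whichever cluster center is closest, so the border between two clusters is the set of points equidistant from the two centers, which is a straight line (a hyperplane in higher dimensions). The following minimal sketch, using two made-up centers rather than any data from this chapter, spells this out:

import numpy as np

# Two hypothetical cluster centers, chosen only for illustration.
c1 = np.array([0.0, 0.0])
c2 = np.array([2.0, 1.0])

# k-means assigns a point to whichever center is closer ...
x = np.array([1.5, -0.5])
label = int(np.linalg.norm(x - c2) < np.linalg.norm(x - c1))

# ... so the boundary between the two clusters is the set of points with
# ||x - c1|| == ||x - c2||: a straight line, no matter how the true
# clusters are shaped.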

We can see this limitation for ourselves by generating a slightly more complicated dataset. Instead of drawing data points from Gaussian blobs, we will organize the data into two overlapping half circles. We can do this using scikit-learn's make_moons. Here, we generate 200 data points belonging to the two half circles and add some Gaussian noise:

In [14]: from sklearn.datasets import make_moons
...      X, y = make_moons(200, noise=.05, random_state=12)

This time, we tell k-means to look for two clusters:

In [15]: criteria = (cv2.TERM_CRITERIA_EPS +
...                  cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
...      flags = cv2.KMEANS_RANDOM_CENTERS
...      compactness, labels, centers = cv2.kmeans(X.astype(np.float32),
...                                                2, None, criteria,
...                                                10, flags)
...      plt.scatter(X[:, 0], X[:, 1], c=labels, s=100, cmap='viridis');

The resulting scatter plot looks like this diagram:

The preceding diagram shows an example of k-means finding linear boundaries in nonlinear data. As is evident from the plot, k-means failed to identify the two half circles and instead split the data with what looks like a diagonal straight line (from bottom-left to top-right).

This scenario should ring a bell. We had the same problem when we talked about linear SVMs in Chapter 6, Detecting Pedestrians with Support Vector Machines. The idea there was to use the kernel trick to transform the data into a higher-dimensional feature space. Can we do the same here?

We most certainly can. There is a kernelized version of k-means, called spectral clustering, that works much like the kernel trick for SVMs. Unfortunately, OpenCV does not provide an implementation of spectral clustering. Fortunately, scikit-learn does:

In [16]: from sklearn.cluster import SpectralClustering

The algorithm uses the same API as all other statistical models: we set optional arguments in the constructor and then call fit_predict on the data. Here, we want to use the graph of the nearest neighbors to compute a higher-dimensional representation of the data and then assign labels using k-means:

In [17]: model = SpectralClustering(n_clusters=2,
...                                 affinity='nearest_neighbors',
...                                 assign_labels='kmeans')
...      labels = model.fit_predict(X)
...      plt.scatter(X[:, 0], X[:, 1], c=labels, s=100, cmap='viridis');

The output of spectral clustering looks like this:

We see that spectral clustering gets the job done. Alternatively, we could have transformed the data into a more suitable representation ourselves and then applied OpenCV's linear k-means to it, as sketched below. The lesson here is that, once again, feature engineering saved the day.
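
As a rough sketch of that alternative route (our own illustration, not code from the book or from OpenCV), we could compute a nearest-neighbor-based spectral embedding of the data with scikit-learn's SpectralEmbedding and then hand the embedded points to OpenCV's ordinary k-means:

import numpy as np
import cv2
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.manifold import SpectralEmbedding

X, y = make_moons(200, noise=.05, random_state=12)

# Map the two half circles into a space where they become
# (approximately) linearly separable.
embedding = SpectralEmbedding(n_components=2, affinity='nearest_neighbors')
X_embedded = embedding.fit_transform(X).astype(np.float32)

# The usual OpenCV k-means call, but on the embedded data.
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
compactness, labels, centers = cv2.kmeans(X_embedded, 2, None, criteria,
                                          10, cv2.KMEANS_RANDOM_CENTERS)

# Plot the original points, colored by the labels found in the embedded space.
plt.scatter(X[:, 0], X[:, 1], c=labels.ravel(), s=100, cmap='viridis');

Whether this recovers the two half circles exactly depends on the noise level and the embedding parameters, but it illustrates the point: we do the nonlinear work in the feature transformation and keep the clustering step itself linear.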
