The second caveat – we must select the number of clusters beforehand

Another potential limitation is that k-means cannot learn the number of clusters from the data. Instead, we must tell it how many clusters we expect beforehand. You can see how this could be problematic for complicated real-world data that you don't fully understand yet.

From the viewpoint of k-means, there is no wrong or nonsensical number of clusters. For example, if we ask the algorithm to identify six clusters in the dataset generated in the preceding section, it will happily proceed and find the best six clusters:

In [10]: criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER,
...                  10, 1.0)
...      flags = cv2.KMEANS_RANDOM_CENTERS
...      compactness, labels, centers = cv2.kmeans(X.astype(np.float32),
...                                                6, None, criteria,
...                                                10, flags)

Here, we used the same code as before and only changed the number of clusters from four to six. We can plot the data points again and color them according to the cluster labels found by k-means:

In [11]: plt.scatter(X[:, 0], X[:, 1], c=labels.ravel(), s=100, cmap='viridis');

The output looks like this:

The preceding diagram shows an example of k-means discovering more clusters than there actually are. Since we generated the data ourselves, we know that each data point comes from a total of four distinct clusters. For more complicated data where we are uncertain about the right number of clusters, there are a few things we could try.

First and foremost, there is the elbow method, which asks us to repeat the clustering for a whole range of k values and record the compactness value:

In [12]: kvals = np.arange(2, 10)
...      compactness = []
...      for k in kvals:
...          c, _, _ = cv2.kmeans(X.astype(np.float32), k, None,
...                               criteria, 10, flags)
...          compactness.append(c)

Then, we will plot the compactness as a function of k:

In [13]: plt.plot(kvals, compactness, 'o-', linewidth=4,
...               markersize=12)
...      plt.xlabel('number of clusters')
...      plt.ylabel('compactness')

This will result in a plot that looks like an arm. The point that lies at the elbow is the number of clusters to pick:

In our case, we would pick the number four. The reasoning behind this is that, moving from left to right, four is the smallest number of clusters that gives a very compact representation. Three clusters give us a representation that is only about half as compact as four clusters, and choosing five clusters or more does not improve the compactness much further. Hence, four should be our best guess as to the number of clusters.
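
If we would rather not eyeball the plot, we could also automate the choice. The following is only a rough sketch, not part of the book's code: it reuses the kvals and compactness variables from In [12], and the 10 percent threshold is an arbitrary value picked purely for illustration:

    import numpy as np

    comp = np.asarray(compactness)
    # improvement in compactness gained by adding one more cluster
    drops = comp[:-1] - comp[1:]
    # rule of thumb: stop once the improvement falls below 10% of the largest drop
    elbow = kvals[int(np.argmax(drops < 0.1 * drops.max()))]
    print('suggested number of clusters:', elbow)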

If the plot is less clear-cut than that, we might want to consider a more sophisticated method. Some of the most common ones include the following:

  • Silhouette analysis allows us to study the separation between the resulting clusters. If many of the points in one cluster are closer to a neighboring cluster than to their own, we might be better off putting them all in a single cluster.
  • Gaussian mixture models use something called the Bayesian information criterion (BIC) to assess the right number of clusters for us (a minimal sketch of this approach follows this list).
  • Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and affinity propagation are two more sophisticated clustering algorithms that can choose a suitable number of clusters on their own.
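
To give a rough idea of what the Gaussian mixture approach from the preceding list could look like in practice, here is a minimal sketch. It assumes scikit-learn is installed (which the preceding code does not require) and reuses X and kvals from earlier; we fit one model per candidate k and keep the one with the lowest BIC:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # fit one Gaussian mixture model per candidate number of clusters
    bics = []
    for k in kvals:
        gmm = GaussianMixture(n_components=k, random_state=42).fit(X)
        bics.append(gmm.bic(X))

    # the candidate with the lowest Bayesian information criterion wins
    best_k = kvals[int(np.argmin(bics))]
    print('suggested number of clusters:', best_k)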

In addition, we can study the separation distance between the resulting clusters using what is known as a silhouette plot. This plot displays a measure of how close each point in one cluster is to points in the neighboring clusters and thus provides a way to visually assess parameters such as the number of clusters. This measure has a range of [-1, 1].
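
As a rough sketch of how such a measure could be computed (again assuming scikit-learn is available, and reusing X, kvals, criteria, and flags from the preceding code), we could print the mean silhouette coefficient for each candidate k; values close to +1 indicate well-separated clusters, while values near 0 or below suggest overlapping or misassigned clusters:

    from sklearn.metrics import silhouette_score

    for k in kvals:
        # cluster the data with k-means for the current candidate k
        _, labels_k, _ = cv2.kmeans(X.astype(np.float32), k, None,
                                    criteria, 10, flags)
        # mean silhouette coefficient over all data points, in [-1, 1]
        score = silhouette_score(X, labels_k.ravel())
        print('k = %d, mean silhouette coefficient = %.3f' % (k, score))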
