Choosing the value of k

Let's return to our earlier question: what is the right value for k? In the preceding example, it was intuitive to set it to 3 since we know there are three classes in total. In most cases, however, we don't know in advance how many groups there should be, yet the algorithm needs a specific value of k to start with. So, how can we choose the value for k? A well-known approach is the Elbow method.

In the Elbow method, models are trained for a range of values of k; for each trained model, the sum of squared errors, or SSE (also called the sum of within-cluster distances), to the centroids is calculated and plotted against k. Note that for a single cluster, the squared error (or the within-cluster distance) is computed as the sum of the squared distances from individual samples in the cluster to the centroid. The optimal k is the point where the marginal drop in SSE slows down dramatically: beyond it, further splitting into more clusters does not provide any substantial gain.
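To make the per-cluster squared error concrete, here is a minimal sketch on a tiny made-up two-dimensional cluster (the sample values are purely illustrative):

>>> import numpy as np
>>> cluster_samples = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]])
>>> centroid = cluster_samples.mean(axis=0)
>>> # Sum of squared Euclidean distances from each sample to the centroid
>>> np.sum((cluster_samples - centroid) ** 2)
4.0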

Let's apply the Elbow method to the iris example we covered in the previous section. We perform k-means clustering under different values of k on the iris data:

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from sklearn import datasets
>>> from sklearn.cluster import KMeans
>>> iris = datasets.load_iris()
>>> X = iris.data
>>> y = iris.target
>>> k_list = list(range(1, 7))
>>> sse_list = [0] * len(k_list)

Here, we use the whole feature space, and k ranges from 1 to 6. We then train individual models and record the resulting SSE for each:

>>> for k_ind, k in enumerate(k_list):
...     kmeans = KMeans(n_clusters=k, random_state=42)
...     kmeans.fit(X)
...     clusters = kmeans.labels_
...     centroids = kmeans.cluster_centers_
...     sse = 0
...     for i in range(k):
...         # Samples assigned to cluster i
...         cluster_i = np.where(clusters == i)
...         # Accumulate the within-cluster distance to the centroid
...         sse += np.linalg.norm(X[cluster_i] - centroids[i])
...     print('k={}, SSE={}'.format(k, sse))
...     sse_list[k_ind] = sse
k=1, SSE=26.103076447039722
k=2, SSE=16.469773740281195
k=3, SSE=15.089477089696558
k=4, SSE=15.0307321707491
k=5, SSE=14.858930749063735
k=6, SSE=14.883090350867239
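
As a side note, scikit-learn's KMeans also exposes the inertia_ attribute, which holds the sum of squared distances of samples to their closest centroid; it can serve the same purpose without the manual loop. A minimal sketch (inertia_list is an illustrative name; note that inertia_ sums the squared distances, so its absolute values differ from the norm-based figures printed above, but it can be plotted against k in exactly the same way):

>>> inertia_list = [KMeans(n_clusters=k, random_state=42).fit(X).inertia_
...                 for k in k_list]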

Finally, plot SSE versus k as follows:

>>> plt.plot(k_list, sse_list)
>>> plt.xlabel('k')
>>> plt.ylabel('SSE')
>>> plt.show()

This will result in the following plot of SSE versus k:

As the plot shows, the elbow point is at k=3, since the drop in SSE slows down dramatically right after 3. Hence, k=3 is an optimal solution in this case, which is consistent with the fact that the dataset contains three classes.
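
If you prefer a numerical check alongside the visual one, you can look at how much SSE drops with each increment of k. From the values printed above, the drops down to k=3 are sizeable (about 9.63 and then 1.38), while every drop beyond k=3 is under 0.2; this is just the elbow expressed in numbers. A small sketch, reusing the k_list and sse_list variables from earlier:

>>> drops = [sse_list[i] - sse_list[i + 1] for i in range(len(sse_list) - 1)]
>>> for k, drop in zip(k_list[1:], drops):
...     print('k={}: drop in SSE={:.3f}'.format(k, drop))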
