Implementing k-means from scratch

We use the iris dataset from scikit-learn as an example. Let's first load the data and visualize it. For simplicity, we only use two of the original four features, namely petal length and petal width:

>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> X = iris.data[:, 2:4]
>>> y = iris.target
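
As a quick sanity check (our addition, not part of the original recipe), you can confirm that we are left with 150 samples and two features:

>>> print(X.shape)
(150, 2)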

Since the dataset contains three iris classes, we plot it in three different colors, as follows:

>>> import numpy as np
>>> from matplotlib import pyplot as plt
>>> y_0 = np.where(y==0)
>>> plt.scatter(X[y_0, 0], X[y_0, 1])
>>> y_1 = np.where(y==1)
>>> plt.scatter(X[y_1, 0], X[y_1, 1])
>>> y_2 = np.where(y==2)
>>> plt.scatter(X[y_2, 0], X[y_2, 1])
>>> plt.show()

This will give you the following plot of the original data:
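
As an aside, the same scatter plot can be produced more compactly by passing the labels as colors in a single call; this is an equivalent alternative, not how the original code does it:

>>> plt.scatter(X[:, 0], X[:, 1], c=y)
>>> plt.show()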

Assuming we know nothing about the label y, we try to cluster the data into three groups, as there seem to be three clusters in the preceding plot (or you may say two, which we will come back to later). Let's perform step 1, specifying k, and step 2, initializing the centroids, by randomly selecting three samples as the initial centroids:

>>> k = 3
>>> random_index = np.random.choice(range(len(X)), k)
>>> centroids = X[random_index]
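
Note that the exact iterations you see later depend on this random pick. If you want reproducible runs, you can fix NumPy's random seed (an optional addition on our part) before the np.random.choice call above:

>>> np.random.seed(42)    # any fixed seed makes the sampling repeatable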

We visualize the data (this time without the labels) along with the initial random centroids:

>>> def visualize_centroids(X, centroids):
...     plt.scatter(X[:, 0], X[:, 1])
...     plt.scatter(centroids[:, 0], centroids[:, 1], marker='*',
...                 s=200, c='#050505')
...     plt.show()
>>> visualize_centroids(X, centroids)

Refer to the following screenshot for the data, along with the initial random centroids:

Now we perform step 3, which entails assigning clusters based on the nearest centroids. First, we need to define a function that computes the Euclidean distance between a sample and the centroids:

>>> def dist(a, b):
...     return np.linalg.norm(a - b, axis=1)
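
As a quick illustration (our addition), dist computes one distance per centroid thanks to NumPy broadcasting, so for a single sample it returns an array of k values:

>>> print(dist(X[0], centroids))    # three distances; exact values depend on the random centroids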

Then, we develop a function that assigns a sample to the cluster of the nearest centroid:

>>> def assign_cluster(x, centroids):
...     distances = dist(x, centroids)
...     cluster = np.argmin(distances)
...     return cluster
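
For example (again an illustrative addition), the first sample is assigned to whichever centroid it is closest to:

>>> print(assign_cluster(X[0], centroids))    # an index in {0, 1, 2}, depending on initialization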

With the clusters assigned, we perform step 4, which involves updating the centroids to the mean of all samples in the individual clusters:

>>> def update_centroids(X, centroids, clusters):
...     for i in range(k):
...         cluster_i = np.where(clusters == i)
...         centroids[i] = np.mean(X[cluster_i], axis=0)
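
Note that update_centroids modifies the centroids array in place rather than returning a new one. One edge case worth knowing: if a cluster ends up with no samples, np.mean over the empty slice yields NaN. A defensive variant (a sketch we add here, not part of the original recipe) leaves an empty cluster's centroid untouched:

>>> def update_centroids_safe(X, centroids, clusters):
...     for i in range(k):
...         cluster_i = np.where(clusters == i)[0]
...         # Our hypothetical guard: skip empty clusters to avoid NaN centroids
...         if len(cluster_i) > 0:
...             centroids[i] = np.mean(X[cluster_i], axis=0)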

Finally, we have step 5, which involves repeating step 3 and step 4 until the model converges, that is, until either of the following occurs:

  • The centroids move less than a preset tolerance
  • The maximum number of iterations has been reached

We set the tolerance of the first condition and the maximum number of iterations as follows:

>>> tol = 0.0001
>>> max_iter = 100

We initialize the iteration counter and the centroid movement, along with the starting clusters for all samples, as follows:

>>> iter = 0
>>> centroids_diff = 100000
>>> clusters = np.zeros(len(X))

With all the components ready, we can now train the model iteration by iteration: the loop first checks for convergence, then performs step 3 and step 4, and visualizes the latest centroids:

>>> from copy import deepcopy
>>> while iter < max_iter and centroids_diff > tol:
...     # Step 3: assign each sample to its nearest centroid
...     for i in range(len(X)):
...         clusters[i] = assign_cluster(X[i], centroids)
...     # Step 4: move each centroid to the mean of its cluster
...     centroids_prev = deepcopy(centroids)
...     update_centroids(X, centroids, clusters)
...     iter += 1
...     centroids_diff = np.linalg.norm(centroids - centroids_prev)
...     print('Iteration:', str(iter))
...     print('Centroids: ', centroids)
...     print('Centroids move: {:5.4f}'.format(centroids_diff))
...     visualize_centroids(X, centroids)

Let's take a look at the following outputs generated from the preceding commands:

  • Iteration 1: Take a look at the following output of iteration 1:
Iteration: 1
Centroids:
[[5.01827957 1.72258065]
[3.41428571 1.05714286]
[1.464 0.244 ]]
Centroids move: 0.8274

The plot of centroids after iteration 1 is as follows:

  • Iteration 2: Take a look at the following output of iteration 2:
Iteration: 2
Centroids:
[[5.20897436 1.81923077]
[3.83181818 1.16818182]
[1.464 0.244 ]]
Centroids move: 0.4820

The plot of centroids after iteration 2 is as follows:

  • Iteration 3: Take a look at the following output of iteration 3:
Iteration: 3
Centroids:
[[5.3796875 1.9125 ]
[4.06388889 1.25555556]
[1.464 0.244 ]]
Centroids move: 0.3152

The plot of centroids after iteration 3 is as follows:

  • Iteration 4: Take a look at the following output of iteration 4:
Iteration: 4
Centroids:
[[5.51481481 1.99444444]
[4.19130435 1.30217391]
[1.464 0.244 ]]
Centroids move: 0.2083

The plot of centroids after iteration 4 is as follows:

  • Iteration 5: Take a look at the following output of iteration 5:
Iteration: 5
Centroids:
[[5.53846154 2.01346154]
[4.22083333 1.31041667]
[1.464 0.244 ]]
Centroids move: 0.0431

The plot of centroids after iteration 5 is as follows:

  • Iteration 6: Take a look at the following output of iteration 6:
Iteration: 6
Centroids:
[[5.58367347 2.02653061]
[4.25490196 1.33921569]
[1.464 0.244 ]]
Centroids move: 0.0648

The plot of centroids after iteration 6 is as follows:

  • Iteration 7: Take a look at the following output of iteration 7:
Iteration: 7
Centroids:
[[5.59583333 2.0375 ]
[4.26923077 1.34230769]
[1.464 0.244 ]]
Centroids move: 0.0220

The plot of centroids after iteration 7 is as follows:

  • Iteration 8: Take a look at the following output of iteration 8:
Iteration: 8
Centroids:
[[5.59583333 2.0375 ]
[4.26923077 1.34230769]
[1.464 0.244 ]]
Centroids move: 0.0000

The plot of centroids after iteration 8 is as follows:

The model converges after eight iterations. The resulting centroids look promising, and we can also plot the clusters:

>>> for i in range(k):
...     cluster_i = np.where(clusters == i)
...     plt.scatter(X[cluster_i, 0], X[cluster_i, 1])
>>> plt.scatter(centroids[:, 0], centroids[:, 1], marker='*',
...             s=200, c='#050505')
>>> plt.show()

Refer to the following screenshot for the end result:
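
As an optional cross-check (our addition, not part of the from-scratch walkthrough), you can verify that scikit-learn's KMeans finds essentially the same centroids, up to the ordering of the clusters:

>>> from sklearn.cluster import KMeans
>>> kmeans_sk = KMeans(n_clusters=3, n_init=10, random_state=42)
>>> kmeans_sk.fit(X)
>>> print(kmeans_sk.cluster_centers_)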
