Working mechanics of the k-means algorithm

The execution of the k-means algorithm involves the following steps:

  1. Randomly select k observations from the dataset as the initial cluster centroids.
  2. For each observation in the dataset, perform the following:
    1. Compute the distance between the observation and each of the cluster centroids.
    2. Identify the cluster centroid that is at the minimum distance from the observation.
    3. Assign the observation to that closest cluster centroid.
  3. With all observations assigned to one of the cluster centroids, compute new cluster centroids by taking the mean of all the points assigned to each cluster.
  4. Repeat steps 2 and 3 until the cluster centroids (means) no longer change or until a user-defined number of iterations is reached.
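
To make these steps concrete, the following is a minimal sketch of the k-means loop in R, written for illustration only; R's built-in kmeans function should be used in practice. It assumes a numeric matrix x and a cluster count k, and for simplicity it does not handle edge cases such as a cluster becoming empty:

# minimal k-means sketch: x is a numeric matrix, k is the number of clusters
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # step 1: randomly pick k observations as the initial centroids
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- integer(nrow(x))
  for (i in seq_len(max_iter)) {
    # step 2: assign every observation to its closest centroid (Euclidean)
    d <- as.matrix(dist(rbind(centroids, x)))[-(1:k), 1:k]
    assignment <- apply(d, 1, which.min)
    # step 3: recompute each centroid as the mean of its assigned points
    new_centroids <- t(sapply(seq_len(k), function(j)
      colMeans(x[assignment == j, , drop = FALSE])))
    # step 4: stop once the centroids no longer change
    if (all(abs(new_centroids - centroids) < 1e-9)) break
    centroids <- new_centroids
  }
  list(cluster = assignment, centers = centroids)
}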

One key thing to note about k-means is that the initial cluster centroids are selected randomly, and the initial cluster assignments are made based on the distances between the observations and these randomly picked centroids. This means that if we were to pick a different set of observations as the initial centroids, we would likely obtain different clusters from the ones we obtained. In technical terms, the algorithm converges to a locally optimal solution rather than a globally optimal one. R's kmeans function (part of the base stats package) has the nstart option, which works around this problem of non-globally-optimal solutions.

The nstart option makes the algorithm try several random starts (instead of just one) by drawing multiple sets of initial centroids from the dataset. For each start, it evaluates the total within-cluster sum of squares and keeps the best solution, resulting in a more stable output. In our case, we set the nstart value to 50, so k-means picks the best result after evaluating 50 random initial sets of cluster centroids. The following diagram depicts the high-level steps involved in the k-means clustering algorithm:

Steps in k-means clustering
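
As an illustration of nstart in action, a call along the following lines (the exact call used earlier in the chapter may differ) runs k-means with 50 random starts on the customer spend data. Here, kmeans comes from R's base stats package, and set.seed is used only to make the random starts reproducible:

# running k-means with 50 random starts; the start with the lowest total
# within-cluster sum of squares is retained
set.seed(100)
kmout <- kmeans(cust_data, centers = 3, nstart = 50)
# total within-cluster sum of squares of the retained solution
kmout$tot.withinss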

In supervised ML methods, such as classification, we have ground truth, so we can compare our predictions against it and report the performance of our classifier. Unlike supervised ML methods, clustering has no ground truth. Therefore, measuring the performance of a clustering solution is a challenge.

As an alternative to performance measurement, we use a pseudo-measure called cluster quality. The cluster quality is generally computed through measures known as intra-cluster distance and inter-cluster distance, which are illustrated in the following diagram:

Intra-cluster distance and inter-cluster distance defined

The goal of the clustering task is to obtain good-quality clusters. Clusters are considered high quality if the distance between observations within each cluster is small and the distance between the clusters themselves is large.

There are multiple ways inter-cluster and intra-cluster distances can be measured (a small computational sketch follows this list):

  • Intra-cluster: This distance can be measured as the sum, minimum, maximum, or mean of the (absolute or squared) distances between all pairs of points in the cluster, or as the diameter (the distance between the two farthest points), or as the distance between the centroid and all points in the cluster.
  • Inter-cluster: This distance is measured as the sum of the (squared) distances between all pairs of clusters, where the distance between two clusters is itself computed as one of the following:
    • Distance between their centroids
    • Distance between farthest pair of points
    • Distance between the closest pair of points belonging to the clusters
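
As a sketch of how such measures might be computed, the following snippet calculates one variant of each: the mean distance from each point to its own centroid (intra-cluster) and the pairwise distances between the centroids (inter-cluster). It assumes the kmout object and an all-numeric cust_data data frame, as used elsewhere in this chapter:

# one variant of each distance measure, computed from the k-means output
centroids <- kmout$centers
# intra-cluster: mean Euclidean distance from each point to its own centroid
intra <- sapply(seq_len(nrow(centroids)), function(j) {
  pts <- as.matrix(cust_data[kmout$cluster == j, , drop = FALSE])
  ctr <- matrix(centroids[j, ], nrow(pts), ncol(pts), byrow = TRUE)
  mean(sqrt(rowSums((pts - ctr)^2)))
})
# inter-cluster: Euclidean distance between every pair of centroids
inter <- dist(centroids)
intra
inter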

Unfortunately, there is no single preferred choice of inter-cluster and intra-cluster distance measures. The Silhouette index is one metric based on both inter-cluster and intra-cluster distance that can be readily computed and easily interpreted.

The Silhouette index is computed using the mean intra-cluster distance, a, and the mean nearest-cluster distance, b, for each of the observations participating in the clustering exercise. The Silhouette index for an observation is given by the following formula:

s = (b - a) / max(a, b)

Here, b is the mean distance between an observation and the points of the nearest cluster that the observation is not a part of.

The Silhouette index ranges over [-1, 1]. A value of +1 for an observation indicates that the observation is far away from its neighboring cluster and very close to the cluster it is assigned to. Similarly, a value of -1 tells us that the observation is closer to its neighboring cluster than to the cluster it is assigned to. A value of 0 means the observation lies on the boundary between the two clusters. A value of +1 is ideal and -1 is the least preferred. Hence, the higher the value, the better the quality of the cluster.

The cluster library offers the silhouette function, which can be readily used on our k-means clustering output to assess the quality of the clusters that were formed. The following code computes the Silhouette index for our three clusters:

# loading the cluster library, which provides the silhouette function
library(cluster)
# computing the silhouette index for the clusters
si <- silhouette(kmout$cluster, dist(cust_data, "euclidean"))
# printing the summary of the computed silhouette index
print(summary(si))

This will give us the following output:

Silhouette of 440 units in 3 clusters from silhouette.default(x = kmout$cluster, dist = dist(cust_data, "euclidean")) :
Cluster sizes and average silhouette widths:
       60        50       330
0.2524346 0.1800059 0.5646307
Individual silhouette widths:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-0.1544  0.3338  0.5320  0.4784  0.6743  0.7329

As we have seen, the Silhouette index can range from -1 to +1, and the latter is preferred. From the output, we see that all three clusters are of reasonable quality: their average silhouette widths are positive, and the overall mean width (0.4784) is closer to +1 than to -1.

In fact, the Silhouette index can be used not just to measure the quality of the clusters formed but also to choose the k value. Similar to the Elbow method, we can iterate through multiple values of k and identify the k that yields the maximum average Silhouette width across the clusters. Clustering can then be performed using the k that was identified.
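
A sketch of this selection procedure, assuming the cust_data data frame from this chapter and an arbitrary candidate range of 2 to 10, might look as follows:

# average silhouette width for each candidate k
library(cluster)
set.seed(100)
avg_sil <- sapply(2:10, function(k) {
  km <- kmeans(cust_data, centers = k, nstart = 50)
  mean(silhouette(km$cluster, dist(cust_data, "euclidean"))[, "sil_width"])
})
# the k with the highest average silhouette width
best_k <- (2:10)[which.max(avg_sil)]
best_k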

There are numerous cluster-quality measures described in the literature. The Silhouette index is just one measure we covered in this chapter because of its popularity in the ML community. The clusterCrit library offers a wide range of indices to measure the quality of clusters. We are not going to explore the other cluster-quality metrics here, but interested readers should refer to this library for further information on how to compute cluster quality.

So far, we have covered the k-means clustering algorithm to identify the clusters, but the original segmentation task we started with does not end here. Segmentation extends to understanding what each of the clusters formed from the clustering exercise means to the business. For example, we can take the cluster centroids obtained from k-means and attempt to interpret what each one represents.
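
The centroids can be read directly from the k-means output; assuming the kmout object from the earlier call, the following line prints the per-cluster mean spend in each product category:

# cluster centroids: per-cluster mean spend in each category
kmout$centers

This gives us the following output: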

Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
1 8253.47 3824.603 5280.455 2572.661 1773.058 1137.497
2 8000.04 18511.420 27573.900 1996.680 12407.360 2252.020
3 35941.40 6044.450 6288.617 6713.967 1039.667 3049.467

Here are some sample insights into each cluster:

  • Cluster 1 comprises low spenders (average total spending: 22,841.744), with the majority of spending in the fresh category
  • Cluster 2 comprises high spenders (average total spending: 70,741.42), with the majority of spending in the grocery category
  • Cluster 3 comprises medium spenders (average total spending: 59,077.568), with the majority of spending in the fresh category

Now, based on the business objective, one or more clusters may be selected for targeting. For example, if the objective is to get high spenders to spend more, promotions may be rolled out to the cluster 2 individuals whose spending on Frozen and Delicatessen products is below the centroid values (that is, Frozen: 1,996.680 and Delicatessen: 2,252.020), as in the sketch that follows.
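
As a closing sketch of how such a targeting rule might be expressed (assuming kmout, cust_data, and column names matching the centroid table above), the cluster 2 customers below both centroid values can be selected as follows:

# cluster 2 customers whose Frozen and Delicatessen spend is below the
# cluster 2 centroid values
target <- cust_data[kmout$cluster == 2 &
                      cust_data$Frozen < kmout$centers[2, "Frozen"] &
                      cust_data$Delicatessen < kmout$centers[2, "Delicatessen"], ]
# number of customers eligible for the promotion
nrow(target)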
