k-Means clustering

k-Means is the most well-known clustering algorithm and was first proposed by Stuart Lloyd at Bell Labs in 1957.

The algorithm finds K centroids and assigns each data point to exactly one cluster with the goal of minimizing the within-cluster variance (called inertia). It typically uses Euclidean distance, but other metrics can also be used. k-Means assumes that clusters are spherical and of equal size, and ignores the covariance among features.
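
Formally, with centroids μ_1, …, μ_K and clusters C_1, …, C_K, the inertia can be written as

    J = Σ_{k=1..K} Σ_{x ∈ C_k} ‖x − μ_k‖²

where C_k is the set of observations assigned to centroid μ_k, so the objective is the sum of squared Euclidean distances between each observation and its assigned centroid.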

The problem is computationally difficult (NP-hard) because there are K^N ways to partition the N observations into K clusters. The standard iterative algorithm delivers a local optimum for a given K and proceeds as follows:

  1. Randomly define K cluster centers and assign each point to the nearest centroid.
  2. Repeat the following steps:
    • For each cluster, compute the centroid as the average of its members' features
    • Reassign each observation to the closest centroid
  3. Stop at convergence, that is, when the assignments (or the within-cluster variation) no longer change.
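
A minimal NumPy sketch of this loop (a simplified illustration that ignores edge cases such as empty clusters, and assumes a feature matrix X and a cluster count K) could look as follows:

    import numpy as np

    def k_means(X, K, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # step 1: pick K random observations as the initial centroids
        centroids = X[rng.choice(len(X), size=K, replace=False)]
        for _ in range(n_iter):
            # assign each observation to the closest centroid (Euclidean distance)
            distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = distances.argmin(axis=1)
            # recompute each centroid as the average of its assigned points
            new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
            # convergence: stop once the centroids (and hence assignments) no longer change
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return centroids, labels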

The kmeans_implementation notebook shows how to code the algorithm in Python and visualizes its iterative optimization. The following screenshot highlights how the resulting centroids partition the feature space into areas called Voronoi cells, which delineate the clusters:

The result is optimal for the given initialization, but alternative starting positions will produce different results. Hence, we compute multiple clusterings from different initial values and select the solution that minimizes within-cluster variance.
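
In practice, libraries handle these restarts automatically. A minimal sketch with scikit-learn (K = 5 and the feature matrix X are assumed purely for illustration) runs several random initializations and keeps the solution with the lowest inertia:

    from sklearn.cluster import KMeans

    # n_init controls how many random initializations are tried; the fitted
    # model keeps the run with the lowest within-cluster variance (inertia)
    kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
    labels = kmeans.fit_predict(X)
    print(kmeans.inertia_)  # within-cluster sum of squares of the best run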

k-Means requires continuous variables or one-hot encoded categorical variables. Distance metrics are typically sensitive to scale, so standardizing the features is necessary to make sure they receive equal weight.
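
A minimal preprocessing sketch (assuming a hypothetical pandas DataFrame df with one categorical column 'sector' and numeric columns otherwise) could look like this:

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    # one-hot encode the categorical column, then standardize all features
    features = pd.get_dummies(df, columns=['sector'])
    X = StandardScaler().fit_transform(features)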

The strengths of k-Means include its wide range of applicability, fast convergence, and linear scalability to large datasets, while producing clusters of relatively even size.

The weaknesses include:

  • The need to tune the hyperparameter k (a common heuristic, the elbow method, is sketched after this list)
  • The lack of a guarantee to find a global optimum
  • The restrictive assumptions that clusters are spherical and features are uncorrelated
  • Sensitivity to outliers
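
The elbow method mentioned in the first point fits the model for a range of candidate values of k and looks for the point where the inertia stops falling sharply. A brief sketch, again assuming scikit-learn and a feature matrix X:

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    # fit k-Means for a range of k and record the within-cluster variance
    inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
                for k in range(2, 11)}
    plt.plot(list(inertias), list(inertias.values()), marker='o')
    plt.xlabel('k')
    plt.ylabel('inertia')  # look for the "elbow" where the curve flattens
    plt.show()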
