Robust clustering

Given a set of variables, we often want to find clusters of observations. These clusters should be as different from each other as possible, while the observations inside each cluster should be as similar as possible. Suppose we had the following pairs of values [height=170, weight=50], [height=180, weight=70], [height=190, weight=90], and [height=200, weight=100] and we wanted to cluster them. A reasonable two-cluster configuration would have the following centroids: [height=175, weight=60] and [height=195, weight=95]. Obviously, the first two observations would fall under the first cluster, and the other two would fall under the second cluster. The simplest and most widely used algorithm for clustering is k-means. It works by picking k centroids at random and assigning each observation to its nearest centroid. Each centroid is then updated to the mean of the observations assigned to it, and the assignment step is repeated. The algorithm finishes when no observation changes cluster.
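The following is a minimal sketch of this in R, using the base kmeans() function on the four observations above (the nstart argument simply reruns the random initialization a few times to avoid a poor local solution):

# kmeans() is part of base R (the stats package)
data <- data.frame(
  height = c(170, 180, 190, 200),
  weight = c(50, 70, 90, 100)
)

set.seed(1)                                   # k-means starts from random centroids
fit <- kmeans(data, centers = 2, nstart = 10)

fit$centers   # close to (175, 60) and (195, 95), as in the example above
fit$cluster   # which cluster each observation was assigned to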

The problem with k-means is that it uses the mean of each cluster to identify its center. But the mean is not a robust statistic, so the algorithm is badly affected by outliers.
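A quick numerical illustration: suppose the second cluster from the example above (weights 90 and 100) picks up a single extreme observation with weight 500 (an arbitrary value chosen purely for illustration). The cluster's mean weight jumps from 95 to 230 and no longer represents the group:

mean(c(90, 100))        # 95
mean(c(90, 100, 500))   # 230 -- one outlier drags the centroid far away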

The tclust package can be used to fix this issue: it implements trimmed clustering, which works by intelligently discarding a small proportion of the most extreme observations before estimating the clusters.
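The following is a minimal sketch of how this looks in R. It assumes the standard tclust() interface, where k is the number of clusters and alpha is the proportion of observations to trim (see ?tclust for the exact arguments); the simulated data and parameter values here are only illustrative:

library(tclust)

set.seed(1)
# Two well-separated groups plus a handful of extreme observations
group1   <- cbind(height = rnorm(50, 175, 3), weight = rnorm(50, 60, 3))
group2   <- cbind(height = rnorm(50, 195, 3), weight = rnorm(50, 95, 3))
outliers <- cbind(height = rnorm(5, 210, 10), weight = rnorm(5, 500, 10))
x        <- rbind(group1, group2, outliers)

# k = 2 clusters, trimming roughly 5% of the most extreme observations
fit_robust <- tclust(x, k = 2, alpha = 0.05)

fit_robust$centers          # centroids estimated after discarding the outliers
table(fit_robust$cluster)   # cluster 0 holds the trimmed observations

A plain kmeans() fit on the same data would either pull a centroid toward the outlying weights or dedicate a whole cluster to them, while the trimmed solution keeps both centroids near the true group centers.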
