How to do it...

We will work with the geyser2 dataset, which contains two features (eruption length and previous eruption length). The objective is to cluster the data into three groups while accounting for a few outliers that shouldn't be assigned to any group. Note that the number of clusters is usually chosen by the modeler. In general, we don't want too many clusters, as interpreting more than 5-8 clusters is very hard.

We first load the tclust library (for the robust clustering and the data), the ggplot2 library (for the plots), and the geyser2 dataset:

library(tclust)
library(ggplot2)
data("geyser2")

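Now, we run the standard kmeans and the robust tkmeans, setting alpha=0.05 for the latter so that 5% of the observations are trimmed (left out of every cluster). We then add the k-means cluster labels to the geyser2 dataset as an extra variable, which we will use in ggplot: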

clus_kmeans <- kmeans(geyser2, centers = 3)
clus_tkmeans <- tkmeans(geyser2, k = 3, alpha = 0.05)
geyser2$cluster <- as.factor(clus_kmeans$cluster)
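To make the role of alpha more concrete, here is a minimal base R sketch of the trimming idea. It is a simplified illustration and not the tclust implementation: the function name trimmed_kmeans_sketch, its n_iter argument, and the simulated toy data are all assumptions made for this example. At each iteration, every point is matched to its nearest center, the floor(n * alpha) points farthest from their nearest center are labeled 0 (trimmed), and the centers are recomputed from the remaining points only.

```r
# Simplified trimmed k-means, for illustration only (hypothetical helper,
# not the algorithm as implemented in tclust).
trimmed_kmeans_sketch <- function(x, k, alpha, n_iter = 20) {
  x <- as.matrix(x)
  n <- nrow(x)
  n_trim <- floor(n * alpha)                   # points left out of every cluster
  centers <- x[sample(n, k), , drop = FALSE]   # single random start
  cluster <- integer(n)
  for (iter in seq_len(n_iter)) {
    # Squared distance from every point to every center (n x k matrix)
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    nearest <- max.col(-d2, ties.method = "first")
    dist_nearest <- d2[cbind(seq_len(n), nearest)]
    # Label the n_trim points farthest from their nearest center as 0
    cluster <- nearest
    if (n_trim > 0) {
      cluster[order(dist_nearest, decreasing = TRUE)[seq_len(n_trim)]] <- 0L
    }
    # Update each center using its untrimmed members only
    for (j in seq_len(k)) {
      members <- cluster == j
      if (any(members)) centers[j, ] <- colMeans(x[members, , drop = FALSE])
    }
  }
  list(cluster = cluster, centers = centers)
}

# Simulated toy data: three tight groups plus three far-away points
set.seed(1)
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 4, sd = 0.3), ncol = 2),
  cbind(rnorm(30, 0, 0.3), rnorm(30, 4, 0.3)),
  matrix(c(10, 10, -6, 9, 11, -5), ncol = 2, byrow = TRUE)
)
toy_fit <- trimmed_kmeans_sketch(toy, k = 3, alpha = 0.05)
table(toy_fit$cluster)   # label 0 counts the trimmed points
```

Because this sketch uses a single random start, it can settle on a poor solution. In practice, tkmeans tries several random starts (see its nstart argument) and keeps the best one.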

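We now plot the tkmeans results using the plot method provided by tclust, and the kmeans results using ggplot:

# Robust (trimmed) k-means: trimmed observations are drawn separately from the clusters
plot(clus_tkmeans, main = "Robust k-means", xlab = "Eruption length", ylab = "Previous eruption length")
# Standard k-means: color each observation by its assigned cluster
ggplot(geyser2, aes(x = `Eruption length`, y = `Previous eruption length`, color = cluster)) +
  geom_point(size = 3, alpha = 0.2) +
  labs(title = "Standard k-means", x = "Eruption length", y = "Previous eruption length") +
  theme(legend.position = "none")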

The following screenshot shows the standard k-means results: the green observations near (2,2) should probably have been flagged as outliers. Instead, they were assigned to the green cluster:

The following screenshot shows the robust k-means plot: the observations in the lower-left part of the plot were correctly flagged as outliers. A few other outliers fall in between the clusters:
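The flagged observations can also be inspected directly instead of being read off the plot. The following sketch relies on the fact that tkmeans labels trimmed observations with cluster 0. It refits the model on a fresh copy of the data so that it runs on its own; the names geyser_raw, fit_trim, and trimmed, as well as the seed, are chosen for this example only.

```r
library(tclust)

# Load a fresh copy of geyser2 into its own environment, so that any
# columns added to geyser2 elsewhere are left untouched
env <- new.env()
data("geyser2", package = "tclust", envir = env)
geyser_raw <- env$geyser2

set.seed(123)  # arbitrary seed, only to make the random starts repeatable
fit_trim <- tkmeans(geyser_raw, k = 3, alpha = 0.05)

# Cluster 0 marks the trimmed observations (the flagged outliers)
trimmed <- fit_trim$cluster == 0
table(fit_trim$cluster)   # cluster sizes, including the trimmed group
geyser_raw[trimmed, ]     # the observations left out of every cluster
```

Looking at these rows is a quick way to check whether the trimmed points are the ones we would regard as outliers, or whether alpha needs to be adjusted.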
