Clustering

An alternative approach to PCA is k-means clustering, an unsupervised method that partitions the data into k clusters so that each observation belongs to the cluster with the nearest mean, which serves as the prototype of that cluster. We can perform k-means clustering with the kmeans() function and plot the results with plot3d() (from the rgl package) as follows:

> set.seed(44)
> cl <- kmeans(fish.data[,1:3],5)
> fish.data$cluster <- as.factor(cl$cluster)
> plot3d(fish.log.pca$x[,1:3], col=fish.data$cluster, main="k-means clusters")
[Figure: 3D plot of the k-means clusters]

Note

The color scheme used for the groups is different from the 3D plot of the PCA results. However, the overall distribution of the groups is similar.
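
Because kmeans() starts from randomly chosen centers, the set.seed() call is what makes the clustering above reproducible. The following is a minimal sketch of that behavior on synthetic data, not the fish data. The object names toy and cl.toy, the three well-separated groups, and the choice of nstart = 25 are all assumptions made for illustration.

```r
# Synthetic stand-in for three numeric measurements (NOT the fish data):
# three groups of 20 observations centered at 0, 5, and 10
set.seed(44)
toy <- rbind(
  matrix(rnorm(60, mean = 0),  ncol = 3),
  matrix(rnorm(60, mean = 5),  ncol = 3),
  matrix(rnorm(60, mean = 10), ncol = 3)
)

# nstart = 25 tries 25 random sets of starting centers and keeps the best fit
cl.toy <- kmeans(toy, centers = 3, nstart = 25)

cl.toy$size          # number of observations in each cluster
cl.toy$centers       # the cluster means (the prototypes)
cl.toy$tot.withinss  # total within-cluster sum of squares
```

Using several random starts is one common way to make the result less sensitive to the initial centers; whether it would change the fish clustering is not something the output above can tell us.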

Let's now evaluate how well k-means categorizes the fish by tabulating the clusters against the species as follows:

> with(fish.data, table(cluster, fish))
       fish
cluster Bluegill Bowfin Carp Goldeye Largemouth_Bass
      1        0      0   14      39              18
      2        0     27   12       0              22
      3        0     23   13       0               2
      4        0      0   11       0               0
      5       50      0    0      11               8

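As you can see, all 50 Bluegill fall into a single cluster (cluster 5), although that cluster also picks up 11 Goldeye and 8 Largemouth Bass. The other species are spread across several clusters, so k-means had a much harder time placing them in the right group.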

To help improve the classification of the fish into the five groups, we can perform hierarchical clustering as follows:

> di <- dist(fish.data[,1:3], method="euclidean")
> tree <- hclust(di, method="ward.D")

> fish.data$hcluster <- as.factor(cutree(tree, k=5) - 2)
> plot(tree, xlab="", cex=0.2)

Let's draw a red box around each of the five hierarchical clusters as follows:

> rect.hclust(tree, k=5, border="red") 

The result is shown in the following plot:

[Figure: dendrogram with the five clusters outlined in red]
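
To try the dist(), hclust(), and cutree() sequence without the fish data, here is a minimal sketch on R's built-in iris data set. The data set, the object names d.iris, tree.iris, and groups, and the choice of k = 3 are assumptions made for illustration. The method name "ward.D" is what R 3.1.0 and later call the option previously named "ward".

```r
# Built-in iris data used as a stand-in (NOT the fish data)
d.iris <- dist(iris[, 1:4], method = "euclidean")
tree.iris <- hclust(d.iris, method = "ward.D")

# cutree() returns one integer label per observation, numbered 1..k
groups <- cutree(tree.iris, k = 3)

# Cross-tabulate the cluster labels against the known species
table(groups, iris$Species)
```

The point to notice is that cutree() labels the clusters 1 through k, so any arithmetic applied to those labels before as.factor() changes how the rows of the resulting table are named.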

Now, let's create a table to determine how well we can group the fish based on the hierarchical clustering as follows:

> with(fish.data, table(hcluster, fish))
        fish
hcluster Bluegill Bowfin Carp Goldeye Largemouth_Bass
      -1       50      0    0       0               4
      0         0     35    8       0              20
      1         0     15   23       0               0
      2         0      0   10       9              14
      3         0      0    9      41              12

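As you can see, hierarchical clustering didn't drastically improve the classification of the fish. Bluegill are now almost cleanly isolated (cluster -1 holds all 50 Bluegill and only 4 Largemouth Bass), but Bowfin, Carp, and Largemouth Bass are still split across several clusters. Therefore, you might consider collecting additional measurements to help classify the fish, since the information provided by the length, weight, and speed is insufficient.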
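
One way to put a number on this comparison is cluster purity: the share of fish that belong to the majority species of their cluster. The sketch below is not part of the original analysis. It retypes the counts from the two tables above, and the helper purity() and the object names km and hc are hypothetical names introduced for illustration.

```r
species <- c("Bluegill", "Bowfin", "Carp", "Goldeye", "Largemouth_Bass")

# Counts copied from the k-means table above (rows = clusters 1..5)
km <- matrix(c( 0,  0, 14, 39, 18,
                0, 27, 12,  0, 22,
                0, 23, 13,  0,  2,
                0,  0, 11,  0,  0,
               50,  0,  0, 11,  8),
             nrow = 5, byrow = TRUE, dimnames = list(1:5, species))

# Counts copied from the hierarchical table above (rows = hcluster -1..3)
hc <- matrix(c(50,  0,  0,  0,  4,
                0, 35,  8,  0, 20,
                0, 15, 23,  0,  0,
                0,  0, 10,  9, 14,
                0,  0,  9, 41, 12),
             nrow = 5, byrow = TRUE, dimnames = list(-1:3, species))

# Hypothetical helper: sum of each row's largest count over the total count
purity <- function(tab) sum(apply(tab, 1, max)) / sum(tab)

purity(km)  # 0.6
purity(hc)  # 0.652
```

By this measure, 150 of the 250 fish sit in their cluster's majority species under k-means and 163 of 250 under hierarchical clustering, which is consistent with the modest improvement described above.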
