An alternative approach to PCA is k-means (unsupervised) clustering, which partitions the data into k clusters such that each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. We can perform k-means clustering with the kmeans() function and plot the results with plot3d() (from the rgl package) as follows:
> set.seed(44)
> cl <- kmeans(fish.data[,1:3], 5)
> fish.data$cluster <- as.factor(cl$cluster)
> plot3d(fish.log.pca$x[,1:3], col=fish.data$cluster, main="k-means clusters")
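Since fish.data is not reproducible here, a minimal sketch using the built-in iris dataset (a stand-in for fish.data, with centers = 3 matching its three species) shows the same workflow, plus one step worth considering: standardizing the columns with scale() so that variables measured in different units contribute equally to the Euclidean distances that kmeans() uses.

```r
# A minimal sketch of k-means with standardized inputs; iris stands in
# for fish.data, and centers = 3 matches its three species.
set.seed(44)
scaled <- scale(iris[, 1:4])                   # center and rescale each column
cl <- kmeans(scaled, centers = 3, nstart = 25) # nstart restarts avoid poor local optima

# Cross-tabulate the cluster labels against the known species
table(cl$cluster, iris$Species)
```

Whether scaling helps depends on the data; if length, weight, and speed are on very different scales, the largest-variance variable will otherwise dominate the clustering.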
Let's now evaluate how well it categorizes the data with a table as follows:
> with(fish.data, table(cluster, fish))
       fish
cluster Bluegill Bowfin Carp Goldeye Largemouth_Bass
      1        0      0   14      39              18
      2        0     27   12       0              22
      3        0     23   13       0               2
      4        0      0   11       0               0
      5       50      0    0      11               8
As you can see, it nicely groups all the Bluegill fish together but has a much harder time placing the other fish in the right groups.
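One way to put a number on how well the clusters match the true labels is cluster purity: for each cluster, take the count of its most common true label, then divide the sum of those majority counts by the total number of observations. A sketch, again using iris as a stand-in for fish.data:

```r
# Cluster purity: the fraction of observations that share their cluster's
# majority label; 1.0 means every cluster contains a single species.
set.seed(44)
cl <- kmeans(iris[, 1:4], centers = 3, nstart = 25)
tab <- table(cl$cluster, iris$Species)
purity <- sum(apply(tab, 1, max)) / sum(tab)
purity
```

The same calculation applied to the table above would score the k-means solution for the fish data.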
To help improve the classification of the fish into the five groups, we can perform hierarchical clustering as follows:
> di <- dist(fish.data[,1:3], method="euclidean")
> tree <- hclust(di, method="ward.D")
> fish.data$hcluster <- as.factor(cutree(tree, k=5))
> plot(tree, xlab="", cex=0.2)
Let's add a red box around the five hierarchical clusters as follows:
> rect.hclust(tree, k=5, border="red")
The result is shown in the following plot:
Now, let's create a table to determine how well we can group the fish based on the hierarchical clustering as follows:
> with(fish.data, table(hcluster, fish))
        fish
hcluster Bluegill Bowfin Carp Goldeye Largemouth_Bass
       1       50      0    0       0               4
       2        0     35    8       0              20
       3        0     15   23       0               0
       4        0      0   10       9              14
       5        0      0    9      41              12
As you can see, hierarchical clustering didn't drastically improve the classification of the fish. Therefore, you might consider collecting additional measurements, since the information provided by the length, weight, and speed alone is insufficient to separate these species.
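To compare the two partitions directly (rather than each against the species labels), you can compute the Rand index: the fraction of observation pairs on which the two clusterings agree, with 1 meaning perfect agreement. A self-contained base-R sketch, using iris as a stand-in for fish.data:

```r
# Rand index between two partitions, computed from their contingency table.
rand_index <- function(a, b) {
  tab <- table(a, b)
  n <- sum(tab)
  sum_comb <- function(x) sum(choose(x, 2))
  agree   <- sum_comb(tab)          # pairs placed together in both partitions
  a_pairs <- sum_comb(rowSums(tab)) # pairs placed together in partition a
  b_pairs <- sum_comb(colSums(tab)) # pairs placed together in partition b
  total   <- choose(n, 2)
  (total + 2 * agree - a_pairs - b_pairs) / total
}

set.seed(44)
km <- kmeans(iris[, 1:4], 3)$cluster
hc <- cutree(hclust(dist(iris[, 1:4]), method = "ward.D"), k = 3)
rand_index(km, hc)
```

A high index would tell you the two methods are carving up the same structure, so the problem lies in the features rather than the algorithm.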