In this example, we use R's cluster analysis functions to determine the clustering in the wheat dataset from http://www.ics.uci.edu/.
The R script we want to use in Jupyter is the following:
# load the wheat data set from uci.edu (the file is tab-delimited)
wheat <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt", sep="\t")
# define useful column names
colnames(wheat) <- c("area", "perimeter", "compactness", "length", "width", "asymmetry", "groove", "undefined")
# exclude incomplete cases from the data
wheat <- wheat[complete.cases(wheat),]
# calculate the clusters
fit <- kmeans(wheat, 5)
fit
Once entered into a notebook, we have something like this:
The output reports a K-means clustering with five clusters of sizes 29, 57, 65, 15, and 32. (Note that, since I did not set a seed for the random number generator, your results may vary.)
Cluster means are:
      area perimeter compactness   length    width asymmetry
1 16.45345  15.35310   0.8768000 5.882655 3.462517  3.913207
2 14.06456  14.17175   0.8788158 5.463825 3.211526  2.496354
3 11.91292  13.26692   0.8496292 5.237477 2.857908  4.844477
4 19.58333  16.64600   0.8877267 6.315867 3.835067  5.081533
5 18.95781  16.39563   0.8862125 6.250469 3.742781  2.719813
Clustering vectors are:
 1  2  3  4  5  6  9 10 11 12 13 14 15 16 17 18 19 20 21 22
 2  2  2  2  2  2  1  1  1  2  2  2  2  2  2  2  2  2  2  2 ...
Within cluster sum of squares by cluster are:
[1]  54.16095 146.71080 147.29278  25.81297  30.06596
 (between_SS / total_SS =  85.0 %)
The available components are:
[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"
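The components listed above can be read directly from the fitted object with the `$` operator. The following is a minimal sketch, assuming a small synthetic data frame (the hypothetical `toy`, with made-up `area` and `width` columns) in place of the wheat data so that it runs without a network connection. It also shows how the between_SS / total_SS percentage in the printout can be recomputed from the `betweenss` and `totss` components.

```r
# Hypothetical stand-in for the cleaned wheat data frame: two numeric
# columns drawn around three different centers, so the example runs offline.
set.seed(1)
toy <- data.frame(
  area  = c(rnorm(20, 12), rnorm(20, 15), rnorm(20, 19)),
  width = c(rnorm(20, 2.8, 0.1), rnorm(20, 3.2, 0.1), rnorm(20, 3.8, 0.1))
)

fit <- kmeans(toy, 3)

fit$size           # number of rows assigned to each cluster
fit$centers        # one row of column means per cluster
head(fit$cluster)  # cluster label for the first few rows

# Recompute the percentage printed as between_SS / total_SS
ratio <- fit$betweenss / fit$totss
round(100 * ratio, 1)

# The total sum of squares splits into between- and within-cluster parts
all.equal(fit$totss, fit$betweenss + fit$tot.withinss)
```

A ratio close to 100 % means most of the variation in the data is explained by differences between clusters rather than by spread within them.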
So, we generated information about five clusters (the number passed as the second argument to kmeans() in the fit assignment).
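Because kmeans() picks its starting centers at random, the cluster sizes reported earlier will differ from run to run unless a seed is fixed. The sketch below illustrates one way to make a run repeatable. It assumes a synthetic matrix (the hypothetical `toy`) rather than the wheat data, and the seed value 42 and `nstart = 25` are arbitrary choices for illustration, not values from the original script.

```r
# Hypothetical stand-in data (not the wheat set), so this runs offline
set.seed(1)
toy <- matrix(rnorm(200), ncol = 2)

# Setting the same seed before each call gives the same random
# starting centers, and therefore the same clustering
set.seed(42)
fit_a <- kmeans(toy, 5)
set.seed(42)
fit_b <- kmeans(toy, 5)
identical(fit_a$cluster, fit_b$cluster)

# nstart tries several random starts and keeps the solution with the
# lowest total within-cluster sum of squares
set.seed(42)
fit_multi <- kmeans(toy, centers = 5, nstart = 25)
fit_multi$tot.withinss
```

Calling set.seed() once at the top of the notebook cell, before kmeans(), should be enough to make the printed cluster sizes stable across reruns of that cell.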