Data understanding and preparation

Let's start by loading magrittr, installing the R packages needed for this chapter if you have not done so already, and turning off scientific notation in printed output:

> library(magrittr)

> install.packages("cluster")

> install.packages("dendextend")

> install.packages("ggthemes")

> install.packages("HDclassif")

> install.packages("NbClust")

> install.packages("tidyverse")

> options(scipen=999)
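If some of these packages are already installed, reinstalling them is unnecessary. As a minimal sketch (the names pkgs and missing_pkgs are ours, not from the original text), you could first check which packages are absent from your library and install only those:

```r
# Packages used in this chapter (taken from the commands above)
pkgs <- c("magrittr", "cluster", "dendextend", "ggthemes",
          "HDclassif", "NbClust", "tidyverse")

# Names of the packages that are not yet in the local library
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
print(missing_pkgs)

# Install only what is missing; uncomment to run (needs internet access)
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```

The install line is left commented out so that the check itself can be run safely without changing your library.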

The dataset is in the HDclassif package. Load the data and examine the structure with the str() function:

> library(HDclassif)

> data(wine)

> str(wine)
'data.frame': 178 obs. of 14 variables:
$ class: int 1 1 1 1 1 1 1 1 1 1 ...
$ V1 : num 14.2 13.2 13.2 14.4 13.2 ...
$ V2 : num 1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
$ V3 : num 2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
$ V4 : num 15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
$ V5 : int 127 100 101 113 118 112 96 121 97 98 ...
$ V6 : num 2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
$ V7 : num 3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
$ V8 : num 0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
$ V9 : num 2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
$ V10 : num 5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
$ V11 : num 1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
$ V12 : num 3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
$ V13 : int 1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...

The data consists of 178 wines with 13 variables describing their chemical composition and one variable, class, which labels the cultivar, or plant variety. We won't use the label in the clustering; instead, we will hold it back as a test of model performance. The variables V1 through V13 measure the chemical composition, as follows:

  • V1: alcohol
  • V2: malic acid
  • V3: ash
  • V4: alkalinity of ash
  • V5: magnesium
  • V6: total phenols
  • V7: flavonoids
  • V8: non-flavonoid phenols
  • V9: proanthocyanidins
  • V10: color intensity
  • V11: hue
  • V12: OD280/OD315
  • V13: proline

The variables are all quantitative. We should rename them to something meaningful for our analysis. This is easily done with the colnames() function:

> colnames(wine) <- c(
"Class",
"Alcohol",
"MalicAcid",
"Ash",
"Alk_ash",
"magnesium",
"T_phenols",
"Flavanoids",
"Non_flav",
"Proantho",
"C_Intensity",
"Hue",
"OD280_315",
"Proline"
)

As the variables are measured on different scales, we will need to standardize them using the scale() function. This first centers the data by subtracting the column mean from each value in the column. The centered values are then divided by the corresponding column's standard deviation. In the same step, we can keep only columns 2 through 14, dropping Class, and store the result in a data frame. This can all be done with one line of code:

> wine_df <- as.data.frame(scale(wine[, -1]))
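To make the centering and scaling steps concrete, here is a small sketch on a made-up data frame (toy, and its columns label, a, and b, are hypothetical stand-ins for the wine data). It compares scale() with the same calculation done by hand:

```r
# Toy data frame standing in for the wine measurements (values are made up)
toy <- data.frame(
  label = c(1, 1, 2, 2),
  a = c(10, 12, 14, 20),
  b = c(0.5, 0.7, 0.2, 0.6)
)

# Drop the label column, then standardize the remaining columns
scaled <- as.data.frame(scale(toy[, -1]))

# Manual version: subtract the column mean, divide by the column standard deviation
manual <- as.data.frame(
  lapply(toy[, -1], function(x) (x - mean(x)) / sd(x))
)

# Compare the two approaches column by column
print(all.equal(scaled$a, manual$a))
print(all.equal(scaled$b, manual$b))

# After standardizing, each column has mean ~0 and standard deviation 1
print(round(colMeans(scaled), 10))
print(sapply(scaled, sd))
```

Note that scale() uses the sample standard deviation (the same one sd() computes), which is why the manual version matches.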

Before moving on, and out of curiosity, let's tabulate Class to see how the wines are distributed across the cultivars:

> table(wine$Class)

 1  2  3 
59 71 48 
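The counts are easier to compare as proportions. As a small sketch, the following rebuilds the label vector from the counts shown above (class_labels is a stand-in we construct here, not the wine data itself) and converts the table to shares with prop.table():

```r
# Rebuild the label vector from the counts reported above (59, 71, 48)
class_labels <- rep(1:3, times = c(59, 71, 48))

# Counts per cultivar, then the share of each cultivar rounded to 3 decimals
class_counts <- table(class_labels)
class_share <- round(prop.table(class_counts), 3)
print(class_share)
```

With the wine data loaded, prop.table(table(wine$Class)) should give the same shares. The three cultivars are somewhat unbalanced, with the second making up roughly 40 percent of the wines.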

We can now move on to the unsupervised learning models.
