Data understanding and preparation

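Let's start by loading magrittr and installing the other R packages needed for this chapter, if you have not done so already. The final line, options(scipen=999), discourages R from printing numbers in scientific notation: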

> library(magrittr)

> install.packages("cluster")

> install.packages("dendextend")

> install.packages("ggthemes")

> install.packages("HDclassif") 

> install.packages("NbClust") 

> install.packages("tidyverse")

> options(scipen=999)
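
If some of these packages are already installed, reinstalling them is unnecessary. The following is a minimal sketch of one way to install only what is missing; missing_packages() is a hypothetical helper written for this illustration, not part of any package, and the install call is left commented out so that nothing is downloaded unless you choose to:

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed.
# requireNamespace() returns TRUE/FALSE without attaching the package.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# The packages used in this chapter
chapter_pkgs <- c("magrittr", "cluster", "dendextend", "ggthemes",
                  "HDclassif", "NbClust", "tidyverse")

to_install <- missing_packages(chapter_pkgs)
to_install

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```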

The dataset is in the HDclassif package. Load the data and examine the structure with the str() function:

> library(HDclassif)

> data(wine)

> str(wine)
'data.frame': 178 obs. of 14 variables:
 $ class: int 1 1 1 1 1 1 1 1 1 1 ...
 $ V1 : num 14.2 13.2 13.2 14.4 13.2 ...
 $ V2 : num 1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
 $ V3 : num 2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
 $ V4 : num 15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
 $ V5 : int 127 100 101 113 118 112 96 121 97 98 ...
 $ V6 : num 2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
 $ V7 : num 3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
 $ V8 : num 0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
 $ V9 : num 2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
 $ V10 : num 5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
 $ V11 : num 1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
 $ V12 : num 3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
 $ V13 : int 1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...

The data consists of 178 wines with 13 variables describing their chemical composition and one variable, class, which labels the cultivar (plant variety). We won't use class in the clustering; instead, we will hold it back to check how well the clusters recover the cultivars. The variables V1 through V13 measure the chemical composition, as follows:

V1: alcohol
V2: malic acid
V3: ash
V4: alkalinity of ash
V5: magnesium
V6: total phenols
V7: flavonoids
V8: non-flavonoid phenols
V9: proanthocyanidins
V10: color intensity
V11: hue
V12: OD280/OD315
V13: proline

The variables are all quantitative. We should rename them to something meaningful for our analysis. This is easily done with the colnames() function, which assigns the new names by position, so the vector must list all 14 names in column order:

> colnames(wine) <- c(
    "Class",
    "Alcohol",
    "MalicAcid",
    "Ash",
    "Alk_ash",
    "magnesium",
    "T_phenols",
    "Flavanoids",
    "Non_flav",
    "Proantho",
    "C_Intensity",
    "Hue",
    "OD280_315",
    "Proline"
 )
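
Because colnames() matches names purely by position, a misplaced entry would silently mislabel a column. As an alternative sketch, the renaming can be driven by a lookup vector keyed on the old names, so the result does not depend on column order. The toy data frame and the lookup object below are hypothetical and cover only the first three columns for brevity:

```r
# Toy data frame mimicking the first three columns of wine (first two rows)
toy <- data.frame(class = c(1L, 1L),
                  V1 = c(14.2, 13.2),
                  V2 = c(1.71, 1.78))

# Lookup vector: names are the old column names, values are the new ones
lookup <- c(class = "Class", V1 = "Alcohol", V2 = "MalicAcid")

# Fail early if any column has no entry in the lookup
stopifnot(all(names(toy) %in% names(lookup)))

# Rename by matching on the old names rather than on position
names(toy) <- unname(lookup[names(toy)])
names(toy)
```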

As the variables are not on a common scale, we will standardize them with the scale() function. This first centers the data by subtracting the column mean from each value in the column, and then divides the centered values by the corresponding column's standard deviation. In the same step, we can drop the Class column, keeping only columns 2 through 14, and convert the result, which scale() returns as a matrix, to a data frame. This can all be done with one line of code:

> wine_df <- as.data.frame(scale(wine[, -1]))
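
To make the centering and scaling concrete, here is a small sketch on a toy data frame (hypothetical values, not taken from the wine data). It standardizes the columns by hand and checks that the result agrees with scale(), which by default divides by the sample standard deviation as computed by sd():

```r
# Toy data (hypothetical values for illustration only)
toy <- data.frame(a = c(1, 2, 3, 4), b = c(10, 20, 30, 60))

# Manual standardization: subtract the column mean, divide by the column sd
manual <- sapply(toy, function(x) (x - mean(x)) / sd(x))

# scale() does the same and returns a matrix, hence as.data.frame()
scaled <- as.data.frame(scale(toy))

# The two approaches agree (ignoring row/column name attributes)
all.equal(as.matrix(scaled), manual, check.attributes = FALSE)

# After standardizing, every column has mean 0 and standard deviation 1
round(colMeans(scaled), 10)
apply(scaled, 2, sd)
```

Standardizing matters for distance-based clustering because a variable measured in large units, such as Proline, would otherwise dominate the distances.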

Before moving on, and out of curiosity, let's tabulate Class to see how the wines are distributed across the cultivars:

> table(wine$Class)

 1  2  3 
59 71 48
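
The counts are easier to compare as proportions. As a small sketch, the following re-enters the counts printed above as a named vector (counts and props are illustrative names) and converts them with prop.table(); on the real data, prop.table(table(wine$Class)) would give the same result:

```r
# Counts per cultivar, copied from the table() output above
counts <- c("1" = 59, "2" = 71, "3" = 48)

# Share of the 178 wines in each cultivar
props <- prop.table(counts)
round(props, 3)
```

The cultivars are moderately unbalanced, which is worth keeping in mind when the cluster assignments are later compared against Class.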

We can now move on to the unsupervised learning models.
