Understanding the wholesale customer dataset and the segmentation problem

The UCI Machine Learning Repository offers the wholesale customer dataset at https://archive.ics.uci.edu/ml/datasets/wholesale+customers. The dataset refers to clients of a wholesale distributor and records the annual spending in monetary units (m.u.) on diverse product categories. The goal of these projects is to apply clustering techniques to identify customer segments that are relevant for certain business activities, such as rolling out a marketing campaign. Before we apply any clustering algorithm, let's read the data and perform some exploratory data analysis (EDA) to understand it, using the following code block:

# setting the working directory to a folder where dataset is located
setwd('/home/sunil/Desktop/chapter5/')
# reading the dataset to cust_data dataframe
cust_data = read.csv(file='Wholesale_customers_data.csv', header = TRUE)
# knowing the dimensions of the dataframe
print(dim(cust_data))
Output:
[1] 440   8
# printing the data structure
str(cust_data)
'data.frame': 440 obs. of 8 variables:
$ Channel : int 2 2 2 1 2 2 2 2 1 2 ...
$ Region : int 3 3 3 3 3 3 3 3 3 3 ...
$ Fresh : int 12669 7057 6353 13265 22615 9413 12126 7579...
$ Milk : int 9656 9810 8808 1196 5410 8259 3199 4956...
$ Grocery : int 7561 9568 7684 4221 7198 5126 6975 9426...
$ Frozen : int 214 1762 2405 6404 3915 666 480 1669...
$ Detergents_Paper: int 2674 3293 3516 507 1777 1795 3140 3321...
$ Delicassen : int 1338 1776 7844 1788 5185 1451 545 2566...
# Viewing the data to get an intuition of the data
View(cust_data)

The View() command opens the dataset in a spreadsheet-style data viewer, giving us a quick look at the rows and columns.
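
If you are working in a plain R console without a data viewer, a quick alternative (a minimal sketch using base R's head()) is to print the first few rows instead:

# printing the first few rows of the dataset for a quick look
print(head(cust_data))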

Now let's check whether there are any entries with missing fields in our dataset:

# checking if there are any NAs in data
print(apply(cust_data, 2, function (x) sum(is.na(x))))
Output:
         Channel           Region            Fresh             Milk
               0                0                0                0
         Grocery           Frozen Detergents_Paper       Delicassen
               0                0                0                0
# printing the summary of the dataset
print(summary(cust_data))

This will give the following output:

Channel Region Fresh Milk 
Min. :1.000 Min. :1.000 Min. : 3 Min. : 55
1st Qu.:1.000 1st Qu.:2.000 1st Qu.: 3128 1st Qu.: 1533
Median :1.000 Median :3.000 Median : 8504 Median : 3627
Mean :1.323 Mean :2.543 Mean : 12000 Mean : 5796
3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.: 16934 3rd Qu.: 7190
Max. :2.000 Max. :3.000 Max. :112151 Max. :73498
    Grocery          Frozen         Detergents_Paper    Delicassen
 Min.   :    3   Min.   :   25.0   Min.   :    3.0   Min.   :    3.0
 1st Qu.: 2153   1st Qu.:  742.2   1st Qu.:  256.8   1st Qu.:  408.2
 Median : 4756   Median : 1526.0   Median :  816.5   Median :  965.5
 Mean   : 7951   Mean   : 3071.9   Mean   : 2881.5   Mean   : 1524.9
 3rd Qu.:10656   3rd Qu.: 3554.2   3rd Qu.: 3922.0   3rd Qu.: 1820.2
 Max.   :92780   Max.   :60869.0   Max.   :40827.0   Max.   :47943.0

From the EDA, we see that the dataset has 440 observations and eight variables, with no missing values. The last six variables record the annual spend on the various product categories that clients bought from the wholesale distributor, while the first two variables, Channel and Region, are factors (categorical variables) representing the channel and region of purchase. In our projects, we intend to identify segments based on the sales of the different products, so the channel and region variables are not very useful. Let's delete them from the dataset using the following code:

# excluding the non-useful columns from the dataset
cust_data<-cust_data[,c(-1,-2)]
# verifying the dataset post columns deletion
dim(cust_data)

This gives us the following output:

[1] 440   6

We see that only six columns are retained, confirming that the deletion of the non-required columns was successful. From the summary output in the EDA code, we can also observe that all the retained columns are measured in the same unit (annual spend in m.u.), so we do not have to explicitly normalize the data.

It may be noted that most clustering algorithms involve computing distances of some form (such as Euclidean, Manhattan, or Gower distance). It is important that the data is scaled across the columns of the dataset so that a variable does not dominate the distance computation simply because it is measured on a larger scale. If the columns of the data are on different scales, we can rely on techniques such as the z-score transform or the min-max transform; applying one of these ensures that the columns are scaled comparably, leaving no dominating variables when the data is fed to the clustering algorithms. Fortunately, we do not have this issue, so we can continue with the dataset as it is.
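
Although we do not need it for this dataset, the following is a minimal sketch of how the columns could be scaled if required; scale() is base R, and min_max() is a small helper written here purely for illustration:

# z-score transform: each column gets mean 0 and standard deviation 1
cust_data_z <- as.data.frame(scale(cust_data))
# min-max transform: each column is rescaled to the [0, 1] range
min_max <- function(x) { (x - min(x)) / (max(x) - min(x)) }
cust_data_mm <- as.data.frame(lapply(cust_data, min_max))
# verifying the column ranges after scaling
print(summary(cust_data_z))
print(summary(cust_data_mm))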

Clustering algorithms will partition the input dataset into subgroups even if no natural clusters are present. To ensure that we get meaningful clusters as output, it is therefore important to check whether clusters exist in the data at all. Assessing clustering tendency, or the feasibility of the clustering analysis, is the process of determining whether the dataset contains clusters; that is, whether it has a non-random, non-uniform structure that will lead to meaningful clusters. The Hopkins statistic is a measure of clustering tendency. It takes a value between 0 and 1; a value close to 0 (far below 0.5) indicates the existence of valid clusters in the dataset, whereas a value close to 1 indicates a random structure with no meaningful clusters.
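
To make the measure concrete, here is a minimal, hand-rolled sketch of the computation, following the same convention as the output we obtain later in this section (values near 0 suggest clusterable data). The function name hopkins_sketch, the sample size n = 50, the seed, and the uniform sampling over each column's range are illustrative assumptions, not the library's implementation:

# a rough, illustrative Hopkins-style computation (not the factoextra version)
hopkins_sketch <- function(df, n = 50, seed = 123) {
  set.seed(seed)
  df <- as.matrix(df)
  # n real points sampled from the data
  real_idx <- sample(nrow(df), n)
  # n artificial points drawn uniformly within the range of each column
  unif <- sapply(1:ncol(df), function(j) runif(n, min(df[, j]), max(df[, j])))
  # nearest-neighbour distance from each sampled real point to the rest of the data
  w <- sapply(real_idx, function(i) {
    diffs <- sweep(df[-i, , drop = FALSE], 2, df[i, ])
    min(sqrt(rowSums(diffs^2)))
  })
  # nearest-neighbour distance from each artificial point to the real data
  u <- apply(unif, 1, function(p) {
    diffs <- sweep(df, 2, p)
    min(sqrt(rowSums(diffs^2)))
  })
  # with this convention, values close to 0 suggest the data is clusterable
  sum(w) / (sum(w) + sum(u))
}
print(hopkins_sketch(cust_data))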

The factoextra library has a built-in get_clust_tendency() function that computes the Hopkins statistic on the input dataset. Let's apply this function to our wholesale dataset to determine whether it is suitable for clustering at all. The following code accomplishes the computation of the Hopkins statistic:

# setting the working directory to a folder where dataset is located
setwd('/home/sunil/Desktop/chapter18/')
# reading the dataset to cust_data dataframe
cust_data = read.csv(file='Wholesale_customers_data.csv', header = TRUE)
# removing the non-required columns
cust_data<-cust_data[,c(-1,-2)]
# including the factoextra library
library(factoextra)
# computing and printing the Hopkins statistic
print(get_clust_tendency(cust_data, graph = FALSE, n = 50, seed = 123))

This will give the following output:

$hopkins_stat
[1] 0.06354846

The Hopkins statistic for our dataset is very close to 0, so we can conclude that the dataset is a good candidate for our clustering exercise.
