Akshay Kulkarni1, Adarsha Shivananda2, Anoosh Kulkarni3 and V Adithya Krishnan4
(1)
Bangalore, Karnataka, India
(2)
Hosanagara tq, Shimoga dt, Karnataka, India
(3)
Bangalore, India
(4)
Navi Mumbai, India
Recommender systems based on unsupervised machine learning algorithms are very popular because they overcome many challenges that collaborative, hybrid, and classification-based systems face. A clustering technique recommends products/items based on the patterns and behaviors captured within each segment/cluster. This technique works well when data is limited and there is no labeled data to work with.
Unsupervised learning is a machine learning category in which labeled data is not leveraged, but inferences are still drawn from the data at hand. The goal is to find patterns, without a dependent variable, that help solve business problems. Figure 7-1 shows the clustering outcome.
Grouping similar things into segments is called clustering; in our terms, “things” are not individual data points but collections of observations. These observations are
Similar to each other in the same group
Dissimilar to the observations in other groups
Two clustering algorithms are mainly used in the industry: k-means clustering and hierarchical clustering. Before getting into the projects, let’s briefly examine how these algorithms work.
Approach
The following basic steps build a model that generates recommendations based on similar users.
1.
Data collection
2.
Data preprocessing
3.
Exploratory data analysis
4.
Model building
5.
Recommendations
Figure 7-2 shows the steps to build the clustering-based model.
Implementation
Let’s install and import the required libraries.
#Importing the libraries
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import KMeans
import seaborn as sns
import os
from sklearn import preprocessing
Data Collection and Downloading Required Word Embeddings
Let’s consider an e-commerce dataset. Download the dataset from the GitHub link of this book.
Let’s look at the DataFrame after encoding the values.
df_customer.iloc[:,6:]
Figure 7-10 shows the output of the DataFrame after encoding the values.
Model Building
This phase builds clusters using k-means clustering. To choose an optimal number of clusters, you can use the elbow method or the dendrogram method.
K-Means Clustering
k-means clustering is an efficient and widely used technique that groups the data based on the distance between the points. Its objective is to minimize total variance within the cluster, as shown in Figure 7-11.
The following steps generate clusters.
1.
Use the elbow method to identify the optimum number of clusters. This acts as k.
2.
Select random k points as cluster centers from the overall observations or points.
3.
Calculate the distance between these centers and every other point in the data, and assign each point to the cluster of its closest center, using any of the following distance metrics.
Euclidean distance
Manhattan distance
Cosine distance
Hamming distance
4.
Recalculate the cluster center or centroid for each cluster.
Repeat steps 3 and 4 until the same points are assigned to each cluster and the cluster centroids stabilize.
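The steps above can be sketched with scikit-learn, which runs the assign-and-update loop internally. The two-group synthetic data here is illustrative only, not the book’s dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D data (illustrative only): two loose groups of points
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# k=2: scikit-learn repeats the assignment and centroid-update steps
# until the assignments stop changing (Euclidean distance by default)
km = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = km.fit_predict(X)

print(km.cluster_centers_)  # final centroids after convergence
print(labels[:5])           # cluster index assigned to the first five points
```

`n_init=10` reruns the random initialization (step 2) several times and keeps the best result, which guards against a poor choice of starting centers.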
The Elbow Method
The elbow method checks the consistency of clusters and finds the ideal number of clusters in the data. When the percentage of explained variance is plotted against the number of clusters, the first few clusters add a lot of information, but at some point the marginal gain drops sharply, producing an angle (the “elbow”) in the graph. The number of clusters is selected at that point.
The elbow method runs k-means clustering on the dataset for a range of values of k (e.g., from 1 to 10) and then, for each value of k, computes an average score across all clusters.
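As a minimal sketch on synthetic data (not the book’s dataset), the elbow curve can be built from the inertia (within-cluster sum of squares) that scikit-learn reports for each k:

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative synthetic data with two well-separated groups
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(6, 1, (60, 2))])

# Inertia (within-cluster sum of squares) for k = 1..10;
# the "elbow" is where the curve stops dropping sharply
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

print(inertias)
```

Plotting `inertias` against `range(1, 11)` with `plt.plot` shows the elbow visually; inertia always decreases as k grows, so the bend, not the minimum, is what matters.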
Hierarchical Clustering
Hierarchical clustering is another type of clustering technique that also uses distance to create the groups. The following steps generate clusters.
1.
Hierarchical clustering starts by creating each observation or point as a single cluster.
2.
It identifies the two observations or points that are closest together based on the distance metrics discussed earlier.
3.
Combine these two most similar points and form one cluster.
4.
This continues until all the clusters are merged and form a final single cluster.
5.
Finally, using a dendrogram, decide the ideal number of clusters.
The tree is cut to decide the number of clusters. The cut is made where there is a maximum jump from one level to the next, as shown in Figure 7-12.
Usually, the distance between two clusters is computed as the Euclidean distance, but many other distance metrics can be leveraged to do the same.
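The merge-and-cut procedure above can be sketched with SciPy’s hierarchy functions on toy data (illustrative only; the linkage method and cut level are assumptions, not the book’s exact settings):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative synthetic data: two separated groups of points
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(6, 1, (30, 2))])

# Ward linkage starts with each point as its own cluster and repeatedly
# merges the two closest clusters (Euclidean distance)
Z = linkage(X, method="ward")

# Cut the tree into two clusters; plotting dendrogram(Z) would show
# the merge heights used to choose this cut
labels = fcluster(Z, t=2, criterion="maxclust")
print(np.unique(labels))  # → [1 2]
```

The linkage matrix `Z` records every merge and its height, so the same `Z` can be cut at different levels without re-clustering.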
Let’s build a k-means model for this use case. Before building the model, let’s execute the elbow method and the dendrogram method to find the optimal clusters.
The optimal or least number of clusters for both methods is two. But let’s consider 15 clusters for this use case.
Note
You can choose any number of clusters for implementation, but it should be greater than the optimal, or least, number of clusters suggested by the elbow method or the dendrogram.
Let’s build a k-means algorithm considering 15 clusters.
# K-means
# Perform k-means clustering with 15 clusters
km = KMeans(n_clusters=15)
# Fit on the encoded feature columns (from column 6 onward)
clusters = km.fit_predict(df_customer.iloc[:,6:])
# Save the cluster labels back to the raw dataset
df_customer['cluster'] = clusters
df_customer
Figure 7-15 shows the output of df_customer after creating the clusters.
Figure 7-22 shows the output after creating score_df.
The score_df data is ready for recommending new products to a customer: the recommended products were bought by other customers in the same cluster. This is recommendation based on similar users.
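The same-cluster recommendation logic can be sketched as follows. The toy DataFrame and its column names (`customer_id`, `product`, `cluster`) are illustrative assumptions, not the book’s actual schema or score_df.

```python
import pandas as pd

# Toy purchase data; column names are illustrative, not the book's schema
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3, 3],
    "product":     ["A", "B", "B", "C", "A", "D"],
    "cluster":     [0, 0, 0, 0, 1, 1],
})

def recommend(df, customer_id, top_n=3):
    """Recommend the most popular products in the customer's cluster
    that the customer has not already bought."""
    cluster = df.loc[df["customer_id"] == customer_id, "cluster"].iloc[0]
    bought = set(df.loc[df["customer_id"] == customer_id, "product"])
    peers = df[(df["cluster"] == cluster) & (df["customer_id"] != customer_id)]
    popular = peers["product"].value_counts()
    return [p for p in popular.index if p not in bought][:top_n]

print(recommend(orders, 1))  # → ['C']  (customer 2, same cluster, bought C)
```

Products the customer already owns are filtered out, so only new items from same-cluster peers are returned.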
Let’s focus on product data to recommend products based on similarity.
The same preprocessing function used for the customer data is applied to check for missing values.
Figure 7-26 highlights the final recommendations to customer 13137.
The first set highlights similar user recommendations. The second set highlights similar item recommendations.
Summary
In this chapter, you learned how to build a recommendation engine using an unsupervised ML algorithm: clustering. Customer and order data were used to recommend products/items based on similar users, and product data was used to recommend products/items based on similar items.