A classification algorithm-based recommender system is also known as a buying propensity model. The goal is to predict a customer's propensity to buy a product using historical behavior and purchases.
The more accurately you predict future purchases, the better the recommendations and, in turn, the sales. This kind of recommender system is used to push conversion as close to 100% as possible among users who are already likely to purchase with a certain probability. Promotions are offered on those products, enticing users to make a purchase.
Approach
1. Data collection
2. Data preprocessing and cleaning
3. Feature engineering
4. Exploratory data analysis
5. Model building
6. Evaluation
7. Predictions and recommendations
Implementation
Data Collection
Let's consider an e-commerce dataset. Download the dataset from the GitHub link.
Importing the Data as a DataFrame (pandas)
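The data can be imported with pandas. The following is a minimal sketch; the file names are placeholders for the actual files in the GitHub repository:

```python
import pandas as pd

# Placeholder file names; substitute the actual files from the repo.
orders = pd.read_csv("orders.csv")
customers = pd.read_csv("customers.csv")
products = pd.read_csv("products.csv")

print(orders.head())
```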
Preprocessing the Data
Before building any model, the initial step is to clean and preprocess the data.
Analyze, clean, and merge the three datasets so that the merged DataFrame can be used to build the ML models.
Most of the columns contain no null values, so dropping or treating them is not required. However, as you can see, null values are present in the Quantity column, and they must be treated.
All the datasets have been merged, and the required data preprocessing and cleaning are complete.
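A sketch of the merge and cleaning steps, assuming CustomerID and ProductID are the join keys (the actual column names may differ):

```python
# Merge the three datasets into one modeling DataFrame.
df = orders.merge(customers, on="CustomerID", how="left")
df = df.merge(products, on="ProductID", how="left")

# Inspect nulls; here only Quantity is expected to contain them.
print(df.isnull().sum())

# Treat the nulls in Quantity, e.g., by filling with 0 (no purchase).
df["Quantity"] = df["Quantity"].fillna(0)
```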
Feature Engineering
Once the data is preprocessed and cleaned, the next step is to perform feature engineering.
Let’s create a flag column, using the Quantity column, that indicates whether the customer has bought the product or not.
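A one-line sketch of the flag; the column name purchased is an assumption:

```python
# 1 if the customer bought the product (Quantity > 0), else 0.
df["purchased"] = (df["Quantity"] > 0).astype(int)
print(df["purchased"].value_counts())
```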
Exploratory Data Analysis
Feature engineering is a must for preparing model data, but exploratory data analysis (EDA) also plays a vital role: you can gain business insights just by looking at the historical data.
The key insight from this chart is that the Mightyskins brand has the highest sales.
The key takeaway from this chart is that low-income customers buy more products. However, there is not a major difference between medium- and high-income customers.
Let's look at a few charts here. For more information, please refer to the notebook.
It looks like this particular use case has a class imbalance in the target. Let's build the model after sampling the data, as sketched below.
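The following is a minimal sketch of one way to check and correct the imbalance, assuming the purchased flag created earlier; upsampling the minority class with scikit-learn's resample is only one option, and the notebook may use a different approach:

```python
from sklearn.utils import resample

# Check the class balance of the target flag.
print(df["purchased"].value_counts(normalize=True))

# One simple remedy (an assumption, not necessarily the notebook's
# choice): upsample the minority class to the majority class size.
majority = df[df["purchased"] == 0]
minority = df[df["purchased"] == 1]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
df_balanced = pd.concat([majority, minority_up]).reset_index(drop=True)
```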
Model Building
Train-Test Split
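A minimal train-test split sketch, assuming the balanced DataFrame from the sampling step and that the remaining feature columns are numeric or already encoded; the identifier columns are set aside for the recommendation step later:

```python
from sklearn.model_selection import train_test_split

# Keep identifier columns out of the features; they are looked up
# again later when generating recommendations.
X = df_balanced.drop(columns=["purchased", "CustomerID", "ProductID"])
y = df_balanced["purchased"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
```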
Logistic Regression
Linear regression is used to predict a numerical value. But you also encounter classification problems where the dependent variable is binary: yes or no, 1 or 0, true or false, and so on. In that case, logistic regression is needed. It is a classification algorithm and an extension of linear regression. Here, the log odds are used to restrict the dependent variable between 0 and 1.
$$\log\left(\frac{P}{1-P}\right) = \beta_0 + \beta X$$

where $P/(1-P)$ is the odds ratio, $\beta_0$ is the constant, and $\beta$ is the coefficient.
Accuracy is the number of correct predictions divided by the total number of predictions. The value lies between 0 and 1; to convert it into a percentage, multiply by 100. But considering only accuracy as the evaluation parameter is not ideal: if the data is imbalanced, you can obtain very high accuracy without a useful model. For example, if 95% of customers don't buy, a model that always predicts "no purchase" is 95% accurate yet recommends nothing.
The crosstab between the actual and predicted classes is called a confusion matrix. It is not only for binary classification; you can also use it for multiclass classification. Figure 8-22 represents a confusion matrix.
The ROC (receiver operating characteristic) curve is an evaluation metric for classification tasks. It plots the false positive rate on the x-axis against the true positive rate on the y-axis, showing how well the classes are distinguished as the threshold is varied. The higher the area under the ROC curve, the higher the predictive power. Figure 8-23 shows the ROC curve.
Statistical models must satisfy the assumptions discussed previously. If they are not satisfied, the models won't be reliable and will produce essentially random predictions.
These algorithms face challenges when the relationship between the data and the target feature is nonlinear; complex patterns are hard to capture.
The data should be clean (missing values and outliers should be treated).
Advanced machine learning algorithms like decision trees, random forests, SVMs, and neural networks can be used to overcome these limitations.
Implementation
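A minimal sketch of fitting logistic regression and computing the metrics discussed above (accuracy, confusion matrix, and ROC AUC); the notebook's actual implementation may differ:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# max_iter is raised so the solver converges on larger feature sets.
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)
y_prob = logreg.predict_proba(X_test)[:, 1]  # probability of purchase

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print("ROC AUC:", roc_auc_score(y_test, y_prob))
```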
Decision Tree
A decision tree is a type of supervised learning in which the data is repeatedly split into similar groups, from the most important variable to the least. When all the variables are split, the result looks like a tree-shaped structure, hence the name tree-based models.
Let's examine how tree splitting happens, which is the key concept in decision trees. The core of the decision tree algorithm is the process of splitting a node; different splitting criteria are used, and they differ for classification and regression problems.
The Gini index is a probabilistic way of splitting the trees. It uses the sum of the squared probabilities of success and failure to decide the purity of a node. CART (classification and regression tree) uses the Gini index to create splits.
Chi-square measures the statistical significance of the difference between a subnode and its parent node to decide the split: $\chi = \sqrt{\frac{(\text{actual}-\text{expected})^2}{\text{expected}}}$. CHAID (Chi-square Automatic Interaction Detector) is an example of this.
Reduction in variance splits a tree based on how much a candidate split on an independent feature reduces the variance of the target feature; it is used for regression problems.
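To make the splitting criteria concrete, here is a small illustrative sketch (not from the book's notebook) that computes the Gini impurity of a node, i.e., one minus the purity sum described above:

```python
import numpy as np

def gini_impurity(labels):
    # Gini impurity = 1 - sum(p_i^2) over the class probabilities;
    # 0 means a perfectly pure node.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity(np.array([1, 1, 1, 1])))  # 0.0 (pure node)
print(gini_impurity(np.array([0, 1, 0, 1])))  # 0.5 (50/50 split)
```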
Overfitting occurs when an algorithm fits the training data tightly but is inaccurate in predicting outcomes on unseen test data. This is the case with decision trees as well: it occurs when the tree is grown to perfectly fit all samples in the training dataset, hurting test accuracy.
Implementation
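A minimal sketch of the decision tree implementation; max_depth is an assumed hyperparameter that limits tree growth to counter the overfitting just discussed:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Limiting depth is one way to keep the tree from perfectly
# fitting every training sample.
dt = DecisionTreeClassifier(max_depth=5, random_state=42)
dt.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, dt.predict(X_test)))
```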
Random Forest
Random forest is one of the most widely used machine learning algorithms because of its flexibility and its ability to overcome the overfitting problem. A random forest is an ensemble of multiple decision trees; generally, the higher the number of trees, the better the accuracy.
It is insensitive to missing values and outliers.
It prevents the algorithm from overfitting.
- Randomly selects the square root of the m features and a bootstrap sample of about two-thirds of the data (with replacement) to train each decision tree, then predicts the outcome
- Builds n trees until the out-of-bag error rate is minimized and stabilized
- Computes the votes for each predicted target and takes the mode as the final prediction for classification
Implementation
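A minimal random forest sketch; n_estimators is an assumed value, and oob_score=True reports the out-of-bag score (one minus the out-of-bag error mentioned above):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                            random_state=42)
rf.fit(X_train, y_train)
print("OOB score:", rf.oob_score_)  # stabilizes as trees are added
print("Accuracy:", accuracy_score(y_test, rf.predict(X_test)))
```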
KNN
For more information on the algorithm, please refer to Chapter 4.
Implementation
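A minimal KNN sketch; k=5 is an assumed default that should be tuned, for example, by cross-validation:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, knn.predict(X_test)))
```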
Naive Bayes and XGBoost implementations are also in the notebooks.
Among the preceding models, logistic regression performs better than all the other models.
These are the product IDs that should be recommended for customer 17315.
You can also generate these recommendations by sorting the probability output from the model, as sketched below.
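A hypothetical sketch of that ranking, assuming the identifier columns set aside earlier and the fitted logistic regression model; the candidate-building logic here is illustrative, not the notebook's exact code:

```python
# Look up the identifiers for the test rows and attach the
# predicted purchase probabilities.
scored = df_balanced.loc[X_test.index, ["CustomerID", "ProductID"]].copy()
scored["prob"] = logreg.predict_proba(X_test)[:, 1]

# Top five products for customer 17315, ranked by probability.
top5 = (scored[scored["CustomerID"] == 17315]
        .sort_values("prob", ascending=False)
        .head(5)["ProductID"])
print(top5.tolist())
```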
Summary
In this chapter, you learned how to recommend a product/item to customers using various classification algorithms, from data cleaning to model building. These types of recommendations are an add-on to the e-commerce platform. With the output of a classification-based algorithm, you can show hidden products that the user is more likely to be interested in. The conversion rate of these recommendations is high compared to other recommender techniques.