© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2023
A. Kulkarni et al., Applied Recommender Systems with Python, https://doi.org/10.1007/978-1-4842-8954-9_8

8. Classification Algorithm–Based Recommender Systems

Akshay Kulkarni¹, Adarsha Shivananda², Anoosh Kulkarni³, and V Adithya Krishnan⁴
(1) Bangalore, Karnataka, India
(2) Hosanagara tq, Shimoga dt, Karnataka, India
(3) Bangalore, India
(4) Navi Mumbai, India

A classification algorithm-based recommender system is also known as a buying propensity model. The goal is to predict a customer's propensity to buy a product using historical behavior and purchases.

The more accurately you predict future purchases, the better the recommendations and, in turn, the sales. This kind of recommender system is often used to maximize conversion among users who are predicted to purchase with high probability: promotions are offered on those products, enticing the users to complete the purchase.

Approach

The following basic steps build a classification algorithm-based recommender engine.
  1. Data collection

  2. Data preprocessing and cleaning

  3. Feature engineering

  4. Exploratory data analysis

  5. Model building

  6. Evaluation

  7. Predictions and recommendations

Figure 8-1 shows the steps for building a classification algorithm-based model.

Figure 8-1 Classification-based model

Implementation

Let’s install and import the required libraries.
#Importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Image
import os
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report
from sklearn.linear_model import LogisticRegression
from imblearn.combine import SMOTETomek
from collections import Counter
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.naive_bayes import GaussianNB
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

Data Collection

Let’s consider an e-commerce dataset. Download the dataset from the book’s GitHub link.

Importing the Data as a DataFrame (pandas)

Import the records, customers, and product data.
# read Record dataset
record_df = pd.read_excel("Rec_sys_data.xlsx")
#read Customer Dataset
customer_df = pd.read_excel("Rec_sys_data.xlsx", sheet_name = 'customer')
# read product dataset
prod_df = pd.read_excel("Rec_sys_data.xlsx", sheet_name = 'product')
Print the top five rows of each DataFrame.
#Viewing Top 5 Rows
print(record_df.head())
print(customer_df.head())
print(prod_df.head())
Figure 8-2 shows the output of the first five rows of records data.

Figure 8-2 The output (first five rows of the records data: InvoiceNo, StockCode, Quantity, InvoiceDate, ShippingCost, and CustomerID)

Figure 8-3 shows the output of the first five rows of customer data.

Figure 8-3 The output (first five rows of the customer data: CustomerID, Gender, Age, Income, Zipcode, and Customer Segment)

Figure 8-4 shows the output of the first five rows of product data.

Figure 8-4 The output (first five rows of the product data: StockCode, Product Name, Description, Category, Brand, and Unit Price)

Preprocessing the Data

Before building any model, the initial step is to clean and preprocess the data.

Analyze, clean, and merge the three datasets so that the merged DataFrame can build ML models.

Now, let’s compute the total quantity of each product purchased by each customer.
# group By Stockcode and CustomerID and sum the Quantity
group = pd.DataFrame(record_df.groupby(['StockCode', 'CustomerID']).Quantity.sum())
print(group.shape)
group.head()
Figure 8-5 shows the output of grouping by StockCode and CustomerID and summing the Quantity.

Figure 8-5 The output (Quantity summed by StockCode and CustomerID)

Next, check for null values for customers and records datasets.
#Check for null values
print(record_df.isnull().sum())
print("-------------- ")
print(customer_df.isnull().sum())
The following is the output.
InvoiceNo       0
StockCode       0
Quantity        0
InvoiceDate     0
DeliveryDate    0
Discount%       0
ShipMode        0
ShippingCost    0
CustomerID      0
dtype: int64
--------------
CustomerID          0
Gender              0
Age                 0
Income              0
Zipcode             0
Customer Segment    0
dtype: int64

There are no null values present in the datasets. So, dropping or treating them is not required.

Let’s load CustomerID and StockCode into different variables and create a cross-product for further usage.
#Loading the CustomerID and StockCode into different variable d1, d2
d2 = customer_df['CustomerID']
d1 = record_df["StockCode"]
# Taking the sample of data and storing into two variables
row = d1.sample(n= 900)
row1 = d2.sample(n=900)
# Cross product of row and row1
index = pd.MultiIndex.from_product([row, row1])
a = pd.DataFrame(index = index).reset_index()
a.head()
Figure 8-6 shows the output.

Figure 8-6 The output (cross product of the sampled StockCode and CustomerID values)

Now, let’s merge group and a on the CustomerID and StockCode columns.
#merge customerID and StockCode
data = pd.merge(group,a, on = ['CustomerID', 'StockCode'], how = 'right')
data.head()
Figure 8-7 shows the output.

Figure 8-7 The output (merged data with NaN values in the Quantity column)

As you can see, null values are present in the Quantity column.

Let’s check for nulls.
#check total number of null values in quantity column
print(data['Quantity'].isnull().sum())
# check the shape of data that is number of rows and columns
print(data.shape)
The following is the output.
779771
(810000, 3)
Let’s treat missing values by replacing null with zeros and checking for unique values.
#replacing nan values with 0
data['Quantity'] = data['Quantity'].replace(np.nan, 0).astype(int)
# Check all unique value of quantity column
print(data['Quantity'].unique())
Figure 8-8 shows the output.

Figure 8-8 The output (unique values of the Quantity column)

Let’s now drop unnecessary columns from the product table.
## drop the Product Name and Description columns
product_data = prod_df.drop(['Product Name', 'Description'], axis = 1)
# preview the first hierarchy level of the Category column (result is not assigned here)
product_data['Category'].str.split('::').str[0]
product_data.head()
Figure 8-9 shows the output of the first five rows.

Figure 8-9 The output (first five rows: StockCode, Category, Brand, and Unit Price)

Let’s extract the first hierarchy level from the category column and join the product_data table.
# extract the first word of the Category column
cate = product_data['Category'].str.extract(r"(\w+)", expand=True)
# join cat column with original dataset
df2 = product_data.join(cate, lsuffix="_left")
df2.drop(['Category'], axis = 1, inplace = True)
# rename column to Category
df2 = df2.rename(columns = {0: 'Category'})
print(df2.shape)
df2.head()
Figure 8-10 shows the output.

Figure 8-10 The output (StockCode, Brand, Unit Price, and the extracted Category)

Let’s check and drop null values, if any, after joining.
#check for null values and drop it
df2.isnull().sum()
df2.dropna(inplace = True)
df2.isnull().sum()
The following is the output.
StockCode     0
Brand         0
Unit Price    0
Category      0
dtype: int64
Save the preprocessed file and read it again.
## save to csv file
df2.to_csv("Products.csv")
# Load product dataset
product = pd.read_csv("/content/Products.csv")
Merge the data, product, and customer tables.
## Merge data and product dataset
final_data = pd.merge(data, product, on= 'StockCode')
# create final dataset by merging customer & final data
final_data1 = pd.merge(customer_df, final_data, on = 'CustomerID')
# Drop Unnamed and zipcode column
final_data1.drop(['Unnamed: 0', 'Zipcode'], axis = 1, inplace = True)
final_data1.head()
Figure 8-11 shows the output of the first five rows after merging.

Figure 8-11 The output (first five rows of the merged customer, record, and product data)

Check for null values in the final table.
print(final_data1.shape)
# Check for null values in each columns
final_data1.isnull().sum()
The following is the output.
(61200, 10)
CustomerID          0
Gender              0
Age                 0
Income              0
Customer Segment    0
StockCode           0
Quantity            0
Brand               0
Unit Price          0
Category            0
dtype: int64
Check the unique categories in each column.
#Check for unique value in each categorical columns
print(final_data1['Category'].unique())
print('------------ ')
print(final_data1['Income'].unique())
print('------------ ')
print(final_data1['Brand'].unique())
print('------------ ')
print(final_data1['Customer Segment'].unique())
print('------------ ')
print(final_data1['Gender'].unique())
print('------------ ')
print(final_data1['Quantity'].unique())
The following is the output.
['Electronics' 'Clothing' 'Sports' 'Health' 'Beauty' 'Jewelry' 'Home'
 'Office' 'Auto' 'Cell' 'Pets' 'Food' 'Household' 'Shop']
------------
['Low' 'Medium' 'High']
------------
['Mightyskins' 'Dr. Comfort' 'Mediven' 'Tom Ford' 'Eye Buy Express'
 'MusicBoxAttic' 'Duda Energy' 'Business Essentials' 'Medi'
 'Seat Belt Extender Pros' 'Boss (hub)' 'Ishow Hair' 'Ekena Milwork'
 'JustVH' 'UNOTUX' 'Envelopes.com' 'Auburn Leathercrafters'
 'Style & Apply' 'Edwards' 'Larissa Veronica' 'Awkward Styles' 'New Way'
 'McDonalds' 'Ekena Millwork' 'Omega' "Medaglia D'Oro" 'allwitty' 'Prop?t'
 'Unique Bargains' 'CafePress' "Ron's Optical" 'Wrangler' 'AARCO']
------------
['Small Business' 'Middle class' 'Corporate']
------------
['male' 'female']
------------
[   0    1    3    5   15    2    4    8    6   24    7   30    9   10
   62   20   18   12   72   50  400   36   27  242   58   25   60   48
   22  148   16  152   11   31   64  147   42   23   43   26   14   21
 1200  500   28  112   90  128   44  200   34   96  140   19  160   17
  100  320  370  300  350   32   78  101   66   29]
From this output, you can see some special characters in the brand column. Let’s remove them.
## text cleaning
final_data1['Brand'] = final_data1['Brand'].str.replace('?', '', regex=False)
final_data1['Brand'] = final_data1['Brand'].str.replace('&', 'and', regex=False)
final_data1['Brand'] = final_data1['Brand'].str.replace('(', '', regex=False)
final_data1['Brand'] = final_data1['Brand'].str.replace(')', '', regex=False)
print(final_data1['Brand'].unique())
The following is the output.
['Mightyskins' 'Dr. Comfort' 'Mediven' 'Tom Ford' 'Eye Buy Express'
 'MusicBoxAttic' 'Duda Energy' 'Business Essentials' 'Medi'
 'Seat Belt Extender Pros' 'Boss hub' 'Ishow Hair' 'Ekena Milwork'
 'JustVH' 'UNOTUX' 'Envelopes.com' 'Auburn Leathercrafters'
 'Style and Apply' 'Edwards' 'Larissa Veronica' 'Awkward Styles' 'New Way'
 'McDonalds' 'Ekena Millwork' 'Omega' "Medaglia D'Oro" 'allwitty' 'Propt'
 'Unique Bargains' 'CafePress' "Ron's Optical" 'Wrangler' 'AARCO']

All the datasets have been merged, and the required data preprocessing and cleaning are complete.

Feature Engineering

Once the data is preprocessed and cleaned, the next step is to perform feature engineering.

Let’s create a flag column, using the Quantity column, that indicates whether the customer has bought the product or not.

If the Quantity column is 0, the customer has not bought the product.
# creating the flag_buy column
final_data1.loc[final_data1.Quantity == 0 ,"flag_buy" ] = 0
final_data1.loc[final_data1.Quantity != 0 ,"flag_buy" ] = 1
# Converting the values of flag_buy column into integer
final_data1['flag_buy'] = final_data1.flag_buy.astype(int)
final_data1.tail()
Figure 8-12 shows the last five rows after creating the target column.

Figure 8-12 The output (data with the new flag_buy column)

A new flag_buy column is created. Let’s do some basic exploration of that column.
#Check for the unique value in flag buy column
print(final_data1['flag_buy'].unique())
# Gives the description of columns
print(final_data1.describe())
##Information about the data
print(final_data1.info())
The unique values of flag_buy are 0 and 1. Figure 8-13 shows the output of describe().

Figure 8-13 The output (describe() showing the count, mean, minimum, and maximum of the numeric columns)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 61200 entries, 0 to 61199
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   CustomerID        61200 non-null  int64
 1   Gender            61200 non-null  object
 2   Age               61200 non-null  int64
 3   Income            61200 non-null  object
 4   Customer Segment  61200 non-null  object
 5   StockCode         61200 non-null  object
 6   Quantity          61200 non-null  int64
 7   Brand             61200 non-null  object
 8   Unit Price        61200 non-null  float64
 9   Category          61200 non-null  object
 10  flag_buy          61200 non-null  int64
dtypes: float64(1), int64(4), object(6)
memory usage: 5.6+ MB

Exploratory Data Analysis

Feature engineering and data preprocessing are essential steps before model building. However, exploratory data analysis (EDA) also plays a vital role.

You can get more business insights by looking at the historical data itself.

Let’s start exploring the data. Plot a chart of the brand column.
plt.figure(figsize=(50,20))
sns.set_theme(style="darkgrid")
sns.countplot(x = 'Brand', data = final_data1)
Figure 8-14 shows the count plot of the Brand column.

Figure 8-14 The output (count plot of the Brand column; Mightyskins has the highest count)

The key insight from this chart is that the Mightyskins brand has the highest sales.

Let’s plot the Income column.
# Count of Income Category
plt.figure(figsize=(10,5))
sns.set_theme(style="darkgrid")
sns.countplot(x = 'Income', data = final_data1)
Figure 8-15 shows the count plot of the Income column.

Figure 8-15 The output (count plot of the Income column; the Low income group has the highest count)

The key takeaway from this chart is that low-income customers buy more products. However, there is no major difference between the medium- and high-income customers.

A few more charts are shown here. For more details, please refer to the notebook.

Plot a histogram to show age distribution.
# histogram plot to show distribution age
plt.figure(figsize=(10,5))
sns.set_theme(style="darkgrid")
sns.histplot(data=final_data1, x="Age", kde = True)
Figure 8-16 shows the age distribution output.

Figure 8-16 The output (age distribution histogram)

Plot an area chart to show age distribution with hue by category.
plt.figure(figsize=(10,5))
sns.set_theme(style="darkgrid")
sns.histplot(data=final_data1, x="Age", hue="Category", element= "poly")
Figure 8-17 shows the age distributions by category.

Figure 8-17 The output (age distribution by category)

Create a bar plot to check the target distribution.
# Count plot to show number of customer bought the product
plt.figure(figsize=(10,5))
sns.set_theme(style="darkgrid")
sns.countplot(x = 'flag_buy', data = final_data1)
Figure 8-18 is the target distribution bar plot.

Figure 8-18 The output (target distribution; class 0 has the highest count)

It looks like this particular use case has a data imbalance. Let’s build the model after sampling the data.

Model Building

Let’s encode all the categorical variables before building the model. Also, store the StockCode label mapping for later use when decoding the recommendations.
#Encoding categorical variable using Label Encoder
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
final_data1['StockCode'] = label_encoder.fit_transform(final_data1['StockCode'])
mappings = {}
mappings['StockCode'] = dict(zip(label_encoder.classes_,range(len(label_encoder.classes_))))
final_data1['Gender'] = label_encoder.fit_transform(final_data1['Gender'])
final_data1['Customer Segment'] = label_encoder.fit_transform(final_data1['Customer Segment'])
final_data1['Brand'] = label_encoder.fit_transform(final_data1['Brand'])
final_data1['Category'] = label_encoder.fit_transform(final_data1['Category'])
final_data1['Income'] = label_encoder.fit_transform(final_data1['Income'])
final_data1.head()
Figure 8-19 shows the first five rows after encoding.

Figure 8-19 The output (first five rows after encoding)

Train-Test Split

The data is split into two parts: one for training the model (the training set) and another for evaluating it (the test set). The train_test_split function from sklearn.model_selection is used to split the DataFrame into these two parts.
## separating dependent and independent variables
x = final_data1.drop(['flag_buy'], axis = 1)
y = final_data1['flag_buy']
# check the shape of dependent and independent variable
print((x.shape, y.shape))
# splitting data into train and test
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size = 0.6, random_state = 42)

Logistic Regression

Linear regression predicts a numerical value. But you also encounter classification problems where the dependent variable is binary, such as yes or no, 1 or 0, true or false. In that case, logistic regression is needed. It is a classification algorithm that extends linear regression: the log odds are modeled linearly so that the predicted probability is restricted between 0 and 1.

Figure 8-20 shows the logistic regression formula, which can be written as

log(P / (1 − P)) = β0 + β1x1 + β2x2 + … + βnxn

Figure 8-20 The logistic regression formula

where P / (1 − P) is the odds ratio, β0 is the constant (intercept), and the other β values are the coefficients.
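To make the formula concrete, the following is a minimal, illustrative sketch (not from the chapter's notebook; the coefficient and feature values are made up) that converts a linear combination of features into a probability with the sigmoid function.
import numpy as np
# a made-up model: intercept b0 and weights b1, b2 for two features
b0, b1, b2 = -2.0, 0.8, 1.5
x1, x2 = 1.2, 0.5                      # feature values for one customer
log_odds = b0 + b1 * x1 + b2 * x2      # the linear part: log(P / (1 - P))
prob = 1 / (1 + np.exp(-log_odds))     # sigmoid maps log odds to a probability in (0, 1)
print(round(prob, 3))                  # the customer's propensity to buy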

Figure 8-21 shows how logistic regression works.

Figure 8-21 Logistic regression (the sigmoid curve of probability P(y = 1) versus x, rising from 0.0 to 1.0)

Now let’s look at how to evaluate the classification model.
  • Accuracy is the number of correct predictions divided by the total number of predictions. The values lie between 0 and 1; multiply by 100 to convert to a percentage. But accuracy alone is not an ideal evaluation metric: if the data is imbalanced, you can obtain very high accuracy while still misclassifying the minority class (see the sketch after this list).

  • The crosstab between the actual and the predicted classes is called a confusion matrix. It applies not only to binary but also to multiclass classification. Figure 8-22 represents a confusion matrix.

Figure 8-22 Confusion matrix (actual vs. predicted classes: true positives, false positives, false negatives, and true negatives)

  • The ROC (receiver operating characteristic) curve is an evaluation metric for classification tasks. It plots the false positive rate on the x-axis against the true positive rate on the y-axis and shows how well the classes are separated as the decision threshold is varied. The higher the area under the ROC curve (AUC), the higher the predictive power. Figure 8-23 shows the ROC curve.

Figure 8-23 ROC curve (false positive rate on the x-axis, true positive rate on the y-axis)
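To see why accuracy alone can mislead and how the confusion matrix and ROC AUC complement it, here is a minimal, illustrative sketch (not part of the chapter's pipeline; the toy labels and scores are made up).
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
# toy imbalanced problem: 95 negatives and only 5 positives (made-up data)
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)      # a lazy model that always predicts class 0
y_score = np.random.rand(100)          # random scores with no real signal
print(accuracy_score(y_true, y_pred))    # 0.95 -- looks impressive but is useless
print(confusion_matrix(y_true, y_pred))  # reveals that all 5 positives are missed
print(roc_auc_score(y_true, y_score))    # ~0.5 -- no better than random guessing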

Linear and logistic regression are traditional, statistics-based ways of predicting the dependent variable. But these algorithms have a few drawbacks.
  • Statistical modeling must satisfy the assumptions discussed previously. If they are not satisfied, the models won’t be reliable and will effectively make random predictions.

  • These algorithms struggle when the relationship between the features and the target is nonlinear; complex patterns are hard to capture.

  • Data should be clean (missing values and outliers should be treated).

Advanced machine learning concepts like decision tree, random forest, SVM, and neural networks can be used to overcome these limitations.

Implementation
##training using logistic regression
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report
from sklearn.linear_model import LogisticRegression
logistic = LogisticRegression()
logistic.fit(x_train, y_train)
# calculate score
pred=logistic.predict(x_test)
print(confusion_matrix(y_test, pred))
print(accuracy_score(y_test, pred))
print(classification_report(y_test, pred))
The following is the output.
[[23633     0]
 [    2   845]]
0.9999183006535948
              precision    recall  f1-score   support
           0       1.00      1.00      1.00     23633
           1       1.00      1.00      1.00       847
    accuracy                           1.00     24480
   macro avg       1.00      1.00      1.00     24480
weighted avg       1.00      1.00      1.00     24480
This chapter’s “Exploratory Data Analysis” section discussed the target distribution and its imbalance. Let’s apply a sampling technique to balance the data and then rebuild the model.
# Sampling technique to handle imbalanced data
smk = SMOTETomek(sampling_strategy=0.50)
X_res,y_res=smk.fit_resample(x_train,y_train)
# Count the number of classes
from collections import Counter
print("The number of classes before fit {}".format(Counter(y)))
print("The number of classes after fit {}".format(Counter(y_res)))
The following is the output.
The number of classes before fit Counter({0: 59129, 1: 2071})
The number of classes after fit Counter({0: 35428, 1: 17680})
Build the same model after sampling.
## Training model with Logistics Regression
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report
from sklearn.linear_model import LogisticRegression
logistic = LogisticRegression()
logistic.fit(X_res, y_res)
# Calculate Score
y_pred=logistic.predict(x_test)
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))
The following is the output.
[[23633     0]
 [    0   847]]
1.0
              precision    recall  f1-score   support
           0       1.00      1.00      1.00     23633
           1       1.00      1.00      1.00       847
    accuracy                           1.00     24480
   macro avg       1.00      1.00      1.00     24480
weighted avg       1.00      1.00      1.00     24480

Decision Tree

A decision tree is a type of supervised learning in which the data is repeatedly split into similar groups, starting with the most important variable and moving toward less important ones. When all the variables are split, the result looks like a tree-shaped structure, hence the name tree-based models.

The tree comprises a root node, decision nodes, and leaf nodes. A decision node can have two or more branches, and a leaf node represents a decision. Decision trees handle any type of data, quantitative or qualitative. Figure 8-24 shows how the decision tree works; a small code sketch follows it.

Figure 8-24 Decision tree (example: “Is a person fit?”, with a root node, decision nodes, and yes/no leaf nodes)
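As a small, illustrative sketch (not from the chapter's notebook), you can fit a deliberately shallow tree on the x_train and y_train created in the "Train-Test Split" section and print its root, decision, and leaf nodes with scikit-learn's export_text; the depth chosen here is arbitrary.
from sklearn.tree import DecisionTreeClassifier, export_text
# fit a shallow tree so the printed structure stays readable
small_tree = DecisionTreeClassifier(max_depth=2, random_state=42)
small_tree.fit(x_train, y_train)
# the printout shows the root node, the decision nodes, and the leaf nodes
print(export_text(small_tree, feature_names=list(x_train.columns)))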

Let’s examine how tree splitting happens, which is the key concept in decision trees. Splitting is the core of the decision tree algorithm, and different splitting criteria are used for classification and regression problems.

The following are for classification problems.
  • The Gini index is a probabilistic way of splitting the trees. It uses the sum of the squared probabilities of success and failure to decide the purity of a node. CART (classification and regression tree) uses the Gini index to create splits (see the sketch after these bullets).

  • Chi-square measures the statistical significance of the difference between the subnodes and the parent node to decide the split: Chi-square = √((actual − expected)² / expected). CHAID (Chi-square Automatic Interaction Detector) is an example of this.
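To make the Gini criterion concrete, here is a minimal, illustrative sketch (not from the chapter's notebook; the class counts are made up) that computes the standard Gini impurity, one minus the sum of squared class probabilities, for a parent node and a candidate split.
# Gini impurity of a node given class counts, e.g., [buyers, non-buyers]
def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)
parent = [40, 60]                 # hypothetical parent node: 40 buyers, 60 non-buyers
left, right = [30, 10], [10, 50]  # a candidate split into two child nodes
weighted = (sum(left) * gini(left) + sum(right) * gini(right)) / sum(parent)
print(round(gini(parent), 3), round(weighted, 3))  # a good split lowers the weighted impurity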

The following pertains to regression problems.
  • Reduction in variance uses the variance of the target within the candidate subnodes to split a tree; the split that most reduces the variance is selected.

Overfitting occurs when the algorithm fits the given training data tightly but is inaccurate in predicting the outcomes of unseen or test data. The same is the case with decision trees: it occurs when the tree is grown to perfectly fit all samples in the training dataset, which hurts accuracy on the test data.

Implementation
##Training model using decision tree
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(X_res, y_res)
y_pred = dt.predict(x_test)
print(dt.score(x_train, y_train))
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))
The following is the output.
1.0
[[23633     0]
 [    0   847]]
1.0
              precision    recall  f1-score   support
           0       1.00      1.00      1.00     23633
           1       1.00      1.00      1.00       847
    accuracy                           1.00     24480
   macro avg       1.00      1.00      1.00     24480
weighted avg       1.00      1.00      1.00     24480

Random Forest

Random forest is one of the most widely used machine learning algorithms because of its flexibility and its ability to overcome the overfitting problem. A random forest is an ensemble of multiple decision trees. In general, the higher the number of trees, the better the accuracy.

The random forest can perform both classification and regression tasks. The following are some of its advantages.
  • It is insensitive to missing values and outliers.

  • It prevents the algorithm from overfitting.

How does it work? It works on bagging and bootstrap sampling techniques (see the sketch after this list).
  • Trains each decision tree on a bootstrap sample (roughly two-thirds of the data, drawn with replacement) and a random subset of features (typically the square root of m features), and predicts the outcome

  • Builds n trees until the out-of-bag error rate is minimized and stabilizes

  • Counts the votes for each predicted class and takes the mode (majority vote) as the final prediction for classification
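As a minimal sketch of how these ideas map to scikit-learn (the parameter values here are illustrative, not the book's settings), the bagging behavior is controlled through a few RandomForestClassifier arguments.
from sklearn.ensemble import RandomForestClassifier
rf_sketch = RandomForestClassifier(
    n_estimators=200,      # number of trees in the forest
    max_features='sqrt',   # random subset of features considered at each split
    bootstrap=True,        # each tree is trained on a bootstrap sample of the data
    oob_score=True,        # score the model on the out-of-bag rows of each tree
    random_state=42)
# after fitting (e.g., rf_sketch.fit(X_res, y_res)), rf_sketch.oob_score_ reports the OOB accuracy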

Figure 8-25 shows the working of the random forest model.

Figure 8-25 Random forest (the input X is passed to multiple trees, and their predictions are combined by voting or averaging to produce the final output k)

Implementation
##Training model using Random forest
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_res, y_res)
# Calculate Score
y_pred=rf.predict(x_test)
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))
The following is the output.
[[23633     0]
 [    0   847]]
1.0
              precision    recall  f1-score   support
           0       1.00      1.00      1.00     23633
           1       1.00      1.00      1.00       847
    accuracy                           1.00     24480
   macro avg       1.00      1.00      1.00     24480
weighted avg       1.00      1.00      1.00     24480

KNN

For more information on the algorithm, please refer to Chapter 4.

Implementation
#Training model using KNN
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.neighbors import KNeighborsClassifier
model1 = KNeighborsClassifier(n_neighbors=3)
model1.fit(X_res,y_res)
y_predict = model1.predict(x_test)
# Calculate Score
print(model1.score(x_train, y_train))
print(confusion_matrix(y_test,y_predict))
print(accuracy_score(y_test,y_predict))
print(classification_report(y_test,y_predict))
# plot AUROC curve
r_auc = roc_auc_score(y_test, y_predict)
r_fpr, r_tpr, _ = roc_curve(y_test, y_predict)
plt.plot(r_fpr, r_tpr, linestyle='--', label='KNN prediction (AUROC = %0.3f)' % r_auc)
plt.title('ROC Plot')
# Axis labels
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
# Show legend
plt.legend()
# Show plot
plt.show()
Figure 8-26 shows the ROC plot for the KNN model.

Figure 8-26 The output (ROC plot for the KNN prediction, AUROC = 0.852)

Note

Naive Bayes and XGBoost implementations are also in the notebooks.
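If you want to try them inline, here is a minimal sketch (the notebook implementation may differ) that trains Gaussian Naive Bayes and XGBoost on the same resampled data and scores them on the test set.
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier
# Gaussian Naive Bayes on the resampled training data
nb = GaussianNB()
nb.fit(X_res, y_res)
print(accuracy_score(y_test, nb.predict(x_test)))
# XGBoost classifier with default settings
xgb = XGBClassifier()
xgb.fit(X_res, y_res)
print(classification_report(y_test, xgb.predict(x_test)))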

Of the preceding models, logistic regression performs as well as or better than the others.

So, using that model, let’s recommend products to one customer.
# x_test has all the features; let us take a copy of it
test_data = x_test.copy()
# store the predictions in a new column
test_data['predictions'] = pred
# filter the data and recommend
recomm_one_cust = test_data[(test_data['CustomerID']== 17315) & (test_data['predictions']== 1)]
# the StockCode column was label-encoded to build the model; decode it before recommending
items = []
for item_id in recomm_one_cust['StockCode'].unique().tolist():
    prod =  {v: k for k, v in mappings['StockCode'].items()}[item_id]
    items.append(str(prod))
items
The following is the output.
['85123A', '85099C', '84970L', 'POST', '84970S', '82494L', '48173C', '85099B']

These are the product IDs that should be recommended for customer 17315.

If you want recommendations with product names, filter these IDs in the product table.
recommendations = []
for i in items:
    recommendations.append(prod_df[prod_df['StockCode']== i]['Product Name'])
recommendations
The following is the output.
[135    Mediven Sheer and Soft 15-20 mmHg Thigh w/ Lac...
 Name: Product Name, dtype: object,
 551    Mediven Sheer and Soft 15-20 mmHg Thigh w/ Lac...
 Name: Product Name, dtype: object,
 1282    Eye Buy Express Kids Childrens Reading Glasses...
 Name: Product Name, dtype: object,
 7    MightySkins Skin Decal Wrap Compatible with Ot...
 Name: Product Name, dtype: object,
 160    Union 3" Female Ports Stainless Steel Pipe Fit...
 Name: Product Name, dtype: object,
 179    AARCO Enclosed Wall Mounted Bulletin Board
 Name: Product Name, dtype: object,
 287    Mediven Sheer and Soft 15-20 mmHg Thigh w/ Lac...
 Name: Product Name, dtype: object,
 77    Ebe Women Reading Glasses Reader Cheaters Anti...
 Name: Product Name, dtype: object]

You can also make these recommendations using the model’s probability output, sorting the products by predicted probability, as in the following sketch.
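Here is a minimal sketch of that idea (illustrative, not from the chapter's notebook): score every row in the test set with predict_proba and recommend the stock codes with the highest predicted purchase probability for a customer.
# probability of class 1 (buy) for every row in the test set
proba = logistic.predict_proba(x_test)[:, 1]
scored = x_test.copy()
scored['buy_proba'] = proba
# top five stock codes for one customer, ranked by predicted purchase probability
top = scored[scored['CustomerID'] == 17315].sort_values('buy_proba', ascending=False).head(5)
print([{v: k for k, v in mappings['StockCode'].items()}[i] for i in top['StockCode']])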

Summary

In this chapter, you learned how to recommend a product or item to customers using various classification algorithms, from data cleaning to model building. These recommendations are an add-on to an e-commerce platform. With the output of a classification-based algorithm, you can surface products the user has not yet seen and is likely to be interested in. The conversion rate of these recommendations is high compared to other recommender techniques.
