A classification algorithm-based recommender system is also known as a buying propensity model. The goal is to predict a customer's propensity to buy a product using historical behavior and purchases.
The more accurately you predict future purchases, the better the recommendations and, in turn, the sales. This kind of recommender system is used to push conversion as close to 100% as possible among users who are already likely to purchase with a certain probability. Promotions are offered on those products, enticing users to make a purchase.
Approach
1. Data collection
2. Data preprocessing and cleaning
3. Feature engineering
4. Exploratory data analysis
5. Model building
6. Evaluation
7. Predictions and recommendations
Implementation
Data Collection
Let's consider an e-commerce dataset. Download the dataset from the GitHub link.
Importing the Data as a DataFrame (pandas)
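The data can be imported with pandas. The following is a minimal sketch; the file names are placeholders for the actual files in the GitHub repository:

```python
import pandas as pd

# Placeholder file names; substitute the actual files from the repo.
orders = pd.read_csv("orders.csv")
customers = pd.read_csv("customers.csv")
products = pd.read_csv("products.csv")

print(orders.head())
```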
Preprocessing the Data
Before building any model, the initial step is to clean and preprocess the data.
Analyze, clean, and merge the three datasets so that the merged DataFrame can be used to build the ML models.
Most of the columns contain no null values, so dropping or treating them is not required. However, as you can see, null values are present in the Quantity column, and they must be treated.
All the datasets have been merged, and the required data preprocessing and cleaning are complete.
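A sketch of the merge and cleaning steps, assuming CustomerID and ProductID are the join keys (the actual column names may differ):

```python
# Merge the three datasets into one modeling DataFrame.
df = orders.merge(customers, on="CustomerID", how="left")
df = df.merge(products, on="ProductID", how="left")

# Inspect nulls; here only Quantity is expected to contain them.
print(df.isnull().sum())

# Treat the nulls in Quantity, e.g., by filling with 0 (no purchase).
df["Quantity"] = df["Quantity"].fillna(0)
```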
Feature Engineering
Once the data is preprocessed and cleaned, the next step is to perform feature engineering.
Let’s create a flag column, using the Quantity column, that indicates whether the customer has bought the product or not.
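A one-line sketch of the flag; the column name purchased is an assumption:

```python
# 1 if the customer bought the product (Quantity > 0), else 0.
df["purchased"] = (df["Quantity"] > 0).astype(int)
print(df["purchased"].value_counts())
```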
Exploratory Data Analysis
Feature engineering is a must for preparing model data, but exploratory data analysis (EDA) also plays a vital role: you can gain business insights just by looking at the historical data.
The key insight from this chart is that the Mightyskins brand has the highest sales.
The key takeaway from this chart is that low-income customers buy more products. However, there is not a major difference between medium- and high-income customers.
Let's look at a few charts here. For more information, please refer to the notebook.
It looks like this particular use case has a class imbalance in the target. Let's build the model after sampling the data, as sketched below.
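The following is a minimal sketch of one way to check and correct the imbalance, assuming the purchased flag created earlier; upsampling the minority class with scikit-learn's resample is only one option, and the notebook may use a different approach:

```python
from sklearn.utils import resample

# Check the class balance of the target flag.
print(df["purchased"].value_counts(normalize=True))

# One simple remedy (an assumption, not necessarily the notebook's
# choice): upsample the minority class to the majority class size.
majority = df[df["purchased"] == 0]
minority = df[df["purchased"] == 1]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
df_balanced = pd.concat([majority, minority_up]).reset_index(drop=True)
```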
Model Building
Train-Test Split
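A minimal train-test split sketch, assuming the balanced DataFrame from the sampling step and that the remaining feature columns are numeric or already encoded; the identifier columns are set aside for the recommendation step later:

```python
from sklearn.model_selection import train_test_split

# Keep identifier columns out of the features; they are looked up
# again later when generating recommendations.
X = df_balanced.drop(columns=["purchased", "CustomerID", "ProductID"])
y = df_balanced["purchased"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
```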
Logistic Regression
Linear regression is used to predict a numerical value. But you also encounter classification problems where the dependent variable is binary: yes or no, 1 or 0, true or false, and so on. In that case, logistic regression is needed. It is a classification algorithm and an extension of linear regression. Here, the log odds are used to restrict the dependent variable between 0 and 1.
$$\log\left(\frac{P}{1-P}\right) = \beta_0 + \beta X$$

where $P/(1-P)$ is the odds ratio, $\beta_0$ is the constant, and $\beta$ is the coefficient.
Accuracy is the number of correct predictions divided by the total number of predictions. The value lies between 0 and 1; to convert it into a percentage, multiply by 100. But considering only accuracy as the evaluation parameter is not ideal: if the data is imbalanced, you can obtain very high accuracy without a useful model. For example, if 95% of customers don't buy, a model that always predicts "no purchase" is 95% accurate yet recommends nothing.
The crosstab between the actual and predicted classes is called a confusion matrix. It is not only for binary classification; you can also use it for multiclass classification. Figure 8-22 represents a confusion matrix.
The ROC (receiver operating characteristic) curve is an evaluation metric for classification tasks. It plots the false positive rate on the x-axis against the true positive rate on the y-axis, showing how well the classes are distinguished as the threshold is varied. The higher the area under the ROC curve, the higher the predictive power. Figure 8-23 shows the ROC curve.
Statistical models must satisfy the assumptions discussed previously. If they are not satisfied, the models won't be reliable and will produce essentially random predictions.
These algorithms face challenges when the relationship between the data and the target feature is nonlinear; complex patterns are hard to capture.
The data should be clean (missing values and outliers should be treated).
Advanced machine learning algorithms like decision trees, random forests, SVMs, and neural networks can be used to overcome these limitations.
Implementation
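A minimal sketch of fitting logistic regression and computing the metrics discussed above (accuracy, confusion matrix, and ROC AUC); the notebook's actual implementation may differ:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# max_iter is raised so the solver converges on larger feature sets.
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)
y_prob = logreg.predict_proba(X_test)[:, 1]  # probability of purchase

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print("ROC AUC:", roc_auc_score(y_test, y_prob))
```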
Decision Tree
A decision tree is a type of supervised learning in which the data is repeatedly split into similar groups, from the most important variable to the least. When all the variables are split, the result looks like a tree-shaped structure, hence the name tree-based models.
Let's examine how tree splitting happens, which is the key concept in decision trees. The core of the decision tree algorithm is the process of splitting a node; different splitting criteria are used, and they differ for classification and regression problems.
The Gini index is a probabilistic way of splitting the trees. It uses the sum of the squared probabilities of success and failure to decide the purity of a node. CART (classification and regression tree) uses the Gini index to create splits.
Chi-square measures the statistical significance of the difference between a subnode and its parent node to decide the split: $\chi = \sqrt{\frac{(\text{actual}-\text{expected})^2}{\text{expected}}}$. CHAID (Chi-square Automatic Interaction Detector) is an example of this.
Reduction in variance splits a tree based on how much a candidate split on an independent feature reduces the variance of the target feature; it is used for regression problems.
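To make the splitting criteria concrete, here is a small illustrative sketch (not from the book's notebook) that computes the Gini impurity of a node, i.e., one minus the purity sum described above:

```python
import numpy as np

def gini_impurity(labels):
    # Gini impurity = 1 - sum(p_i^2) over the class probabilities;
    # 0 means a perfectly pure node.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity(np.array([1, 1, 1, 1])))  # 0.0 (pure node)
print(gini_impurity(np.array([0, 1, 0, 1])))  # 0.5 (50/50 split)
```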
Overfitting occurs when an algorithm fits the training data tightly but is inaccurate in predicting outcomes on unseen test data. This is the case with decision trees as well: it occurs when the tree is grown to perfectly fit all samples in the training dataset, hurting test accuracy.
Implementation
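A minimal sketch of the decision tree implementation; max_depth is an assumed hyperparameter that limits tree growth to counter the overfitting just discussed:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Limiting depth is one way to keep the tree from perfectly
# fitting every training sample.
dt = DecisionTreeClassifier(max_depth=5, random_state=42)
dt.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, dt.predict(X_test)))
```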
Random Forest
Random forest is one of the most widely used machine learning algorithms because of its flexibility and its ability to overcome the overfitting problem. A random forest is an ensemble of multiple decision trees; generally, the higher the number of trees, the better the accuracy.
It is insensitive to missing values and outliers.
It prevents the algorithm from overfitting.
- Randomly selects the square root of the m features and a bootstrap sample of about two-thirds of the data (with replacement) to train each decision tree, then predicts the outcome
- Builds n trees until the out-of-bag error rate is minimized and stabilized
- Computes the votes for each predicted target and takes the mode as the final prediction for classification
Implementation
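A minimal random forest sketch; n_estimators is an assumed value, and oob_score=True reports the out-of-bag score (one minus the out-of-bag error mentioned above):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                            random_state=42)
rf.fit(X_train, y_train)
print("OOB score:", rf.oob_score_)  # stabilizes as trees are added
print("Accuracy:", accuracy_score(y_test, rf.predict(X_test)))
```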
KNN
For more information on the algorithm, please refer to Chapter 4.
Implementation
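A minimal KNN sketch; k=5 is an assumed default that should be tuned, for example, by cross-validation:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, knn.predict(X_test)))
```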
Naive Bayes and XGBoost implementations are also in the notebooks.
Among the preceding models, logistic regression performs better than all the other models.
These are the product IDs that should be recommended for customer 17315.
You can also generate these recommendations by sorting the probability output from the model, as sketched below.
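A hypothetical sketch of that ranking, assuming the identifier columns set aside earlier and the fitted logistic regression model; the candidate-building logic here is illustrative, not the notebook's exact code:

```python
# Look up the identifiers for the test rows and attach the
# predicted purchase probabilities.
scored = df_balanced.loc[X_test.index, ["CustomerID", "ProductID"]].copy()
scored["prob"] = logreg.predict_proba(X_test)[:, 1]

# Top five products for customer 17315, ranked by probability.
top5 = (scored[scored["CustomerID"] == 17315]
        .sort_values("prob", ascending=False)
        .head(5)["ProductID"])
print(top5.tolist())
```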
Summary
In this chapter, you learned how to recommend a product/item to customers using various classification algorithms, from data cleaning to model building. These types of recommendations are an add-on to the e-commerce platform. With the output of a classification-based algorithm, you can show hidden products that the user is more likely to be interested in. The conversion rate of these recommendations is high compared to other recommender techniques.