Collaborative filtering is one of the most popular methods in recommendation engines. It is the predictive process behind the suggestions these systems provide: it analyzes customers’ information and suggests items they are likely to appreciate.
Collaborative filtering algorithms use a customer’s purchase history and ratings to find similar customers and then suggest the items those customers liked.
For example, to find a new movie or show to watch, you can ask your friends for suggestions since you all share similar tastes in content. The same concept is used in collaborative filtering, where user-user similarity finds similar users to get recommendations based on each other’s likes.
There are two types of collaborative filtering methods: user-to-user and item-to-item. They are explored in the upcoming sections. This chapter implements both methods using cosine similarity before moving on to the more widely used KNN-based algorithm for collaborative filtering.
Implementation
Data Collection
This chapter uses a custom dataset that has been masked. Download the dataset from the GitHub link.
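As a sketch of the loading step (the file name here is hypothetical; substitute the name of the CSV downloaded from the GitHub link), the data can be read with pandas.

```python
import pandas as pd

# Hypothetical file name; substitute the CSV downloaded from the GitHub link.
data = pd.read_csv("transactions.csv")

print(data.shape)  # expect (272404, 9)
data.head()
```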
About the Dataset
InvoiceNo: The invoice number of a particular transaction
StockCode: The unique identifier for a particular item
Quantity: The quantity of that item bought by the customer
InvoiceDate: The date and time when the transaction was made
DeliveryDate: The date and time when the delivery happened
Discount%: Percentage of discount on the purchased item
ShipMode: Mode of shipping
ShippingCost: Cost of shipping that item
CustomerID: The unique identifier of a particular customer
The dataset has nine columns and a total of 272,404 unique transactions.
The data is clean with no nulls in any columns. Further preprocessing is not required in this case.
There aren’t any negative values in the Quantity column, but if there were, those records would need to be dropped since it’s a data abnormality.
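A quick sanity check along these lines, assuming the data DataFrame from the loading sketch above, might look like this.

```python
# Count negative quantities; the check should return 0 for this dataset.
print((data["Quantity"] < 0).sum())

# If any appeared, dropping those records would look like this.
data = data[data["Quantity"] >= 0]
```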
Memory-Based Approach
Let’s examine the most basic approach to implementing collaborative filtering: the memory-based approach. This approach uses simple arithmetic operations or metrics to calculate the similarities between two users or two items in order to group them. For example, to find user-user relations, the items both users historically liked are used to compute a similarity metric that measures how similar the two users are.
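To make the metric concrete, here is a minimal NumPy sketch of cosine similarity between two toy binary purchase vectors (the vectors are illustrative, not from the dataset).

```python
import numpy as np

def cosine_sim(u, v):
    # Cosine similarity: dot product divided by the product of the vector norms.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy purchase vectors over five items (1 = bought, 0 = not bought).
user_a = np.array([1, 0, 1, 1, 0])
user_b = np.array([1, 0, 1, 0, 0])
print(cosine_sim(user_a, user_b))  # ~0.816
```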
This approach is easy to implement and understand because no model training or heavy optimization algorithms are involved. However, its performance degrades when the data is sparse. For this method to work precisely, huge amounts of clean data on many users and items are required, which hinders the scalability of this approach for most real-world applications.
The memory-based approach is further divided into user-to-user-based and item-to-item-based collaborative filtering.
The implementation of both methods is explored in this chapter.
User-to-User Collaborative Filtering
User-to-user collaborative filtering recommends items that a particular user might like by finding similar users, based on purchase history or ratings of various items, and then suggesting the items those similar users liked.
Here, a matrix is formed to describe the behavior of all the users (purchase history in our example) corresponding to all the items. Using this matrix, you can calculate the similarity metrics (cosine similarity) between users to formulate user-user relations. These relations help find users similar to a given user and recommend the items bought by these similar users.
Implementation
The data matrix shown in Figure 4-6 reveals the total quantity purchased by each user against each item. Only whether the user bought the item is needed, not the quantity.
Thus, an encoding of 0 or 1 is used, where 0 is not purchased, and 1 is purchased.
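The book’s listing is not reproduced here, but a sketch of this step, using the CustomerID, StockCode, and Quantity columns described earlier, could be:

```python
# User-item matrix of total quantities, then binarized:
# 1 = the customer bought the item at least once, 0 = never.
purchase_df = (
    data.pivot_table(index="CustomerID", columns="StockCode",
                     values="Quantity", aggfunc="sum")
        .fillna(0)
)
purchase_df = (purchase_df > 0).astype(int)
```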
The purchase data matrix captures the behavior of customers across all items. From this matrix, the user similarity score matrix is computed, using cosine similarity as the metric. The user similarity score matrix holds the user-to-user similarity for each pair of users.
The similarity score values are between 0 and 1, where values closer to 0 represent less similar customers and values closer to 1 represent more similar customers.
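One way to compute this matrix, sketched with scikit-learn’s cosine_similarity (the user_sim name is mine, not the book’s):

```python
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

# Pairwise cosine similarity between every pair of user rows.
user_sim = pd.DataFrame(
    cosine_similarity(purchase_df),
    index=purchase_df.index,
    columns=purchase_df.index,
)
```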
Using this user similarity scores data, let’s get recommendations for a given user.
This function separates the selected user from all other users, computes the cosine similarity of the selected user with every other user, and returns the top k similar users (by CustomerID).
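A sketch consistent with that description (the function and variable names are mine, not the book’s listing):

```python
def similar_users(user_id, k=5):
    # Separate the selected user's row from every other user's row.
    user_row = purchase_df.loc[[user_id]]
    others = purchase_df.drop(index=user_id)

    # Cosine similarity of the selected user against all other users.
    sims = cosine_similarity(user_row, others)[0]

    # Rank and return the CustomerIDs of the top k most similar users.
    ranked = pd.Series(sims, index=others.index).sort_values(ascending=False)
    return ranked.head(k).index.tolist()

print(similar_users(12347))
```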
As expected, the default of five similar users is returned for user 12347.
Now, let’s get the recommendations by showing the items bought by similar users.
This function gets the similar users for the given customer ID and collects the items bought by those users. The list is then flattened into a final list of unique items, from which ten randomly chosen items are shown as recommendations for the given user.
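Sketched under the same assumptions as the previous block:

```python
import random

def recommend_items(user_id, k=5, n=10):
    # Every item bought by at least one of the top-k similar users.
    bought = purchase_df.loc[similar_users(user_id, k)]
    candidates = bought.columns[bought.sum(axis=0) > 0].tolist()

    # n randomly chosen items from the flattened, unique candidate list.
    return random.sample(candidates, min(n, len(candidates)))

print(recommend_items(12347))
```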
User 12347 had ten suggestions from the items bought by similar users.
Item-to-Item Collaborative Filtering
Item-to-item collaborative filtering recommends items that a particular user might like by finding items similar to the ones they have already purchased. As before, purchase history or user ratings are used to profile each item.
A matrix is formed to describe the behavior of all the users (purchase history in our example) corresponding to all the items. This matrix helps calculate the similarity metrics (cosine similarity) between items to formulate the item-item relations. This relation is used to recommend items similar to those previously purchased by the selected user.
Implementation
As in the user-to-user case, the data matrix shows the total quantity purchased by each user against each item. But the only information needed is whether the user bought the item.
Thus, an encoding of 0 or 1 is used, where 0 is not purchased, and 1 is purchased.
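Since the same purchase information is needed, only viewed item-first, one way to build this matrix is to transpose the earlier one (items_purchase_df is my name for it):

```python
# Rows are now items (StockCode) and columns are users (CustomerID).
items_purchase_df = purchase_df.T
```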
The item purchase data matrix captures the behavior of customers across all items. Let’s use this matrix to find item similarity scores with the cosine similarity metric. The item similarity score matrix holds the item-to-item similarity for each pair of items.
The similarity score values are between 0 and 1, where values closer to 0 represent less similar items and values closer to 1 represent more similar items.
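A sketch of the item similarity computation, mirroring the user-to-user version (item_sim is my name):

```python
# Pairwise cosine similarity between every pair of item rows.
item_sim = pd.DataFrame(
    cosine_similarity(items_purchase_df),
    index=items_purchase_df.index,
    columns=items_purchase_df.index,
)
```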
Using this item similarity score data, let’s get recommendations for a given user.
This function separates the selected item from all other items, computes the cosine similarity of the selected item with every other item, and returns the top k similar items (by StockCode).
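A sketch consistent with that description (names are mine; whether StockCode is stored as a string or an integer depends on how the file was read):

```python
def similar_items(stock_code, k=10):
    # Separate the selected item's row from every other item's row.
    item_row = items_purchase_df.loc[[stock_code]]
    others = items_purchase_df.drop(index=stock_code)

    # Cosine similarity of the selected item against all other items.
    sims = cosine_similarity(item_row, others)[0]

    # Rank and return the StockCodes of the top k most similar items.
    ranked = pd.Series(sims, index=others.index).sort_values(ascending=False)
    return ranked.head(k).index.tolist()

print(similar_items("10002"))  # may be an int depending on the column dtype
```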
As expected, the default of ten similar items is returned for item 10002.
Now let’s get the recommendations by showing similar items to those bought by a particular user.
This function gets the list of similar items for every item previously bought by the given customer ID. The list is then flattened into a final list of unique items, from which ten randomly chosen items are shown as recommendations for the given user.
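Sketched under the same assumptions as the earlier blocks:

```python
def recommend_similar_items(user_id, k=10, n=10):
    # Items the given customer has already bought.
    bought = purchase_df.columns[purchase_df.loc[user_id] > 0]

    # Top-k similar items for each purchased item, flattened and deduplicated.
    candidates = set()
    for code in bought:
        candidates.update(similar_items(code, k))

    # n randomly chosen items from the unique candidate list.
    return random.sample(list(candidates), min(n, len(candidates)))

print(recommend_similar_items(12347))
```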
User 12347 has ten suggestions that are similar to items previously bought.
KNN-based Approach
You have learned the basics of collaborative filtering and implementing user-to-user and item-to-item filtering. Now let’s dive into machine learning-based approaches, which are more robust and popular in building recommendation systems.
Machine Learning
Machine learning is a machine’s capability to learn from experience (data) and make meaningful predictions without being explicitly programmed. It is a subfield of artificial intelligence that deals with building systems that can learn from data. The objective is to make computers learn on their own without any intervention from humans.
There are three primary machine learning categories.
Supervised Learning
In supervised learning, labeled training data is leveraged to derive the pattern or function that makes a model or machine learn. The data consists of a dependent variable (the target label) and the independent variables, or predictors. The machine tries to learn the function from the labeled data and predict the output for unseen data.
Unsupervised Learning
In unsupervised learning, a machine learns hidden patterns without leveraging labeled data, so there is no target to train against. These algorithms capture patterns based on similarities or distances between data points.
Reinforcement Learning
Reinforcement learning is the process of maximizing a reward by taking action. The algorithms learn how to reach a goal through experience.
Supervised Learning
There are two types of supervised learning: regression and classification.
Regression
Regression is a statistical predictive modeling technique that finds the relationship between a dependent variable and one or more independent variables. Regression is used when the dependent variable is continuous; prediction can take any numerical value.
Popular regression algorithms include linear regression, decision tree, random forest, SVM, LightGBM, and XGBoost.
Classification
Classification is used when the dependent variable is categorical. In binary classification, the label is one of two classes: yes or no, with no third option. For example, a customer either churns or does not churn from a given business.
In multiclass classification, the labeled variable can be multiclass, for example, product categorization of an e-commerce website.
Logistic regression, k-nearest neighbor, decision tree, random forest, SVM, LightGBM, and XGBoost are popular classification algorithms.
K-Nearest Neighbor
The k-nearest neighbor (KNN) algorithm is a supervised machine learning model that is used for both classification and regression problems. It is a robust algorithm that is easy to implement and interpret, and it requires no explicit training phase. Labeled data is needed since it’s a supervised learning algorithm.
Now let’s try implementing a simple KNN model on purchase_df, created in user-to-user filtering. This approach follows steps similar to those you have seen before (i.e., recommendations are based on the list of items purchased by similar users). The difference is that a KNN model finds the similar users (for a given user).
Implementation
Before passing our sparse matrix (i.e., purchase_df) into KNN, it must be converted into a CSR (compressed sparse row) matrix. The CSR format stores the matrix as three arrays: the nonzero values, the extent of each row, and the column index of each value.
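A sketch of the conversion and model fit, using SciPy’s csr_matrix and scikit-learn’s NearestNeighbors with cosine distance (the metric the earlier sections used; the purchase_csr and knn names are mine):

```python
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Convert the user-item purchase matrix into CSR format.
purchase_csr = csr_matrix(purchase_df.values)

# Unsupervised nearest-neighbors model over the user rows.
knn = NearestNeighbors(metric="cosine", algorithm="brute")
knn.fit(purchase_csr)
```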
This function first calculates the distances and indices of the five nearest neighbors using the kneighbors function of our KNN model. The output is then processed so that only the list of similar users is returned. Note that instead of user_id, the input is the user’s row index in the DataFrame.
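A sketch consistent with that description (the self-match handling is my assumption: the nearest neighbor of a user is typically the user itself, so one extra neighbor is requested and dropped):

```python
def knn_similar_users(row_index, k=5):
    # kneighbors returns the distances and row indices of the nearest users;
    # k + 1 are requested because the closest match is the user itself.
    distances, indices = knn.kneighbors(
        purchase_csr[row_index], n_neighbors=k + 1
    )

    # Drop the user itself and map row indices back to CustomerIDs.
    return purchase_df.index[indices[0][1:]].tolist()
```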
Now that we have similar users, let’s get the recommendations by showing the items bought by these similar users.
This function replicates the logic used in user-to-user filtering: it gets the final list of items that similar users purchased and recommends ten random items from it.
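Sketched under the same assumptions, with a usage line that maps a CustomerID to its row index:

```python
def knn_recommend(row_index, k=5, n=10):
    # Items bought by the k most similar users, flattened and deduplicated.
    bought = purchase_df.loc[knn_similar_users(row_index, k)]
    candidates = bought.columns[bought.sum(axis=0) > 0].tolist()

    # n randomly chosen items as recommendations.
    return random.sample(candidates, min(n, len(candidates)))

# The function takes a row index, so look up the position of CustomerID 14729.
print(knn_recommend(purchase_df.index.get_loc(14729)))
```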
User 14729 has ten suggestions from the products bought by similar users.
Summary
This chapter covered collaborative filtering-based recommendation engines and implementing the two types of filtering methods—user-to-user and item-to-item—using basic arithmetic operations. The chapter also explored the k-nearest neighbor algorithm (along with some machine learning basics). It ended by implementing user-to-user-based collaborative filtering using the KNN approach. The next chapter explores other popular methods to implement collaborative filtering-based recommendation engines.