5. Collaborative Filtering Using Matrix Factorization, Singular Value Decomposition, and Co-Clustering
Akshay Kulkarni1, Adarsha Shivananda2, Anoosh Kulkarni3 and V Adithya Krishnan4
(1)
Bangalore, Karnataka, India
(2)
Hosanagara tq, Shimoga dt, Karnataka, India
(3)
Bangalore, India
(4)
Navi Mumbai, India
Chapter 4 explored collaborative filtering using the KNN method. This chapter covers a few more important methods: matrix factorization (MF), singular value decomposition (SVD), and co-clustering. These methods (along with KNN) fall into the model-based collaborative filtering approach, whereas the basic arithmetic method of calculating cosine similarity to find similar users falls into the memory-based approach. Each approach has pros and cons, and you must select the one that suits your use case.
Figure 5-1 explains the two types of approaches in collaborative filtering.
The memory-based approach is much easier to implement and explain, but its performance often suffers when the data is sparse. Model-based approaches, like MF, handle sparse data well but are usually less intuitive, harder to explain, and more complex to implement. However, the model-based approach performs better on large datasets and is therefore far more scalable.
This chapter focuses on implementing a few popular model-based approaches: matrix factorization, SVD, and co-clustering, using the same data as in Chapter 4.
Implementation
Matrix Factorization, Co-Clustering, and SVD
The following implementation is a continuation of Chapter 4 and uses the same dataset.
Let’s reuse items_purchase_df from Chapter 4. It is the matrix of items and customers containing the information on whether each customer bought each item.
items_purchase_df.head()
Figure 5-3 shows the item purchase DataFrame/matrix.
This chapter uses the Python package called surprise for modeling purposes. It has implementations of popular methods in collaborative filtering, like matrix factorization, SVD, co-clustering, and even KNN.
First, let’s get the data into the format required by the surprise package.
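The stacking itself can be done with pandas; the following is a minimal sketch, where the column names (StockCode, CustomerID, Quantity) are assumptions based on the columns used later in this chapter.
# stack the wide items x customers matrix into (item, customer, value) triples
data3 = items_purchase_df.stack().reset_index()
# assumed column names; the stacked values are the purchase quantities
data3.columns = ['StockCode', 'CustomerID', 'Quantity']
data3.head()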
Figure 5-4 shows the output DataFrame after stacking.
print(items_purchase_df.shape)
print(data3.shape)
The following is the output.
(3538, 3647)
(12903086, 3)
As you can see, items_purchase_df has 3538 unique items (rows) and 3647 unique users (columns). The stacked DataFrame is 3538 × 3647 = 12,903,086 rows, which is too big to pass into any algorithm.
Let’s shortlist some customers and items based on the number of orders.
First, put all the IDs in a list.
# Storing all customer IDs in customer_ids
customer_ids = data1['CustomerID']
# Storing all item stock codes in item_ids
item_ids = data1['StockCode']
The following imports Counter, which counts the number of orders made by each customer and for each item.
from collections import Counter
Count the number of orders by each customer and store that information in a DataFrame.
# counting no. of orders made by each customer
count_orders = Counter(customer_ids)
# storing the count and customer id in a dataframe
# (count_df is an illustrative name; the original variable name is not shown)
import pandas as pd
count_df = pd.DataFrame(list(count_orders.items()), columns=['CustomerID', 'Orders'])
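The shortlisting step itself is elided here; the following is a minimal sketch of the idea. The cutoff values below are illustrative assumptions, not the thresholds actually used to reach 385,672 records.
# counting no. of orders for each item as well
item_count_orders = Counter(item_ids)
# keeping only customers and items with enough orders (cutoffs are assumed)
active_customers = [c for c, n in count_orders.items() if n >= 600]
popular_items = [i for i, n in item_count_orders.items() if n >= 120]
data3 = data3[data3['CustomerID'].isin(active_customers)
              & data3['StockCode'].isin(popular_items)]
print(data3.shape)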
You can see from the output that the record count has been significantly reduced from 12,903,086 to 385,672. This DataFrame must still be converted, using built-in functions from the surprise package, into a format the library supports.
Read the data in a format supported by the surprise library.
from surprise import Reader
reader = Reader(rating_scale=(0, 5095))
The rating scale is set to (0, 5095) because the maximum quantity value is 5095.
Load the dataset in a format supported by the surprise library.
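The loading and splitting can be sketched as follows, assuming data3 holds the shortlisted (CustomerID, StockCode, Quantity) records; the 80/20 split ratio is an assumption.
from surprise import Dataset
from surprise.model_selection import train_test_split
# load the (user, item, rating) triples into a surprise Dataset
data = Dataset.load_from_df(data3[['CustomerID', 'StockCode', 'Quantity']], reader)
# hold out 20% of the records for testing
train_set, test_set = train_test_split(data, test_size=0.2)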
Matrix factorization is a popular method for building collaborative filtering-based recommendation systems. It is a basic embedding model: the full user-item interaction matrix is factored into two lower-dimensional matrices of latent/hidden features (embeddings), one for users and one for items, whose product approximates the original matrix. This reduces the dimensionality of the input and yields a compact representation, which improves scalability and performance. The latent features are learned by fitting an optimization problem (usually minimizing an error equation) and are then used to compute the predictions.
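As a sketch of that error equation, non-negative matrix factorization (NMF) learns non-negative user factors $p_u$ and item factors $q_i$ by minimizing the squared error over the set $\mathcal{K}$ of observed ratings (surprise's implementation also adds regularization and optional bias terms):
$$\min_{P,\,Q \,\ge\, 0} \; \sum_{(u,i) \in \mathcal{K}} \left( r_{ui} - q_i^\top p_u \right)^2$$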
# importing and defining the model
from surprise import NMF
algo1 = NMF()
# model fitting
algo1.fit(train_set)
# model prediction
pred1 = algo1.test(test_set)
Using built-in functions, you can calculate performance metrics like RMSE (root-mean-square error) and MAE (mean absolute error).
from surprise import accuracy
# RMSE
accuracy.rmse(pred1)
# MAE
accuracy.mae(pred1)
The following is the output.
RMSE: 428.3167
MAE: 272.6909
The RMSE and MAE are moderately high for this model, so let’s try the other two and compare them at the end.
You can also cross-validate (using built-in functions) to further validate these values.
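A minimal sketch of that call, assuming the full data object loaded earlier (cross_validate refits the model on each fold, so the numbers differ slightly from the single train/test split):
from surprise.model_selection import cross_validate
# 5-fold cross-validation, reporting RMSE and MAE for each fold
cross_validate(algo1, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)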
Figure 5-10 shows the cross-validation output for NMF.
The cross-validation shows that the average RMSE is 427.774, and MAE is approximately 272.627, which is moderately high.
Implementing Co-Clustering
Co-clustering (also known as bi-clustering) is commonly used in collaborative filtering. It is a data-mining technique that simultaneously clusters the rows and columns of a DataFrame/matrix. It differs from normal clustering, where each object is compared to other objects along a single entity/type of comparison. In co-clustering, two different entities (here, users and items) are grouped simultaneously, based on their pairwise interactions.
Let’s try modeling with the co-clustering method.
# importing and defining the model
from surprise import CoClustering
algo2 = CoClustering()
# model fitting
algo2.fit(train_set)
# model prediction
pred2 = algo2.test(test_set)
Calculate the RMSE and MAE performance metrics using built-in functions.
# RMSE
accuracy.rmse(pred2)
# MAE
accuracy.mae(pred2)
The following is the output.
RMSE: 6.7877
MAE: 5.8950
The RMSE and MAE are very low for this model. So far, it has performed the best (better than NMF).
Cross-validate (using built-in functions) to further validate these values.
Figure 5-11 shows the cross-validation output for co-clustering.
The cross-validation shows that the average RMSE is 14.031, and MAE is approximately 6.135, which is quite low.
Implementing SVD
Singular value decomposition is a linear algebra concept generally used for dimensionality reduction, and it is also a type of matrix factorization. It works similarly in collaborative filtering: the matrix whose rows and columns are users and items is reduced to latent feature matrices, and an error equation is minimized to get to the predictions.
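For reference, surprise's SVD algorithm is the biased matrix factorization popularized during the Netflix Prize. It predicts a rating as
$$\hat{r}_{ui} = \mu + b_u + b_i + q_i^\top p_u$$
where $\mu$ is the global mean, $b_u$ and $b_i$ are the user and item biases, and $p_u$ and $q_i$ are the latent factors, all learned by minimizing the regularized squared error with stochastic gradient descent.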
Let’s try modeling using the SVD method.
# importing and defining the model
from surprise import SVD
algo3 = SVD()
# model fitting
algo3.fit(train_set)
# model prediction
pred3 = algo3.test(test_set)
Calculate the RMSE and MAE performance metrics using built-in functions.
# RMSE
accuracy.rmse(pred3)
# MAE
accuracy.mae(pred3)
The following is the output.
RMSE: 4827.6830
MAE: 4815.8341
The RMSE and MAE are significantly high for this model. So far, it has performed the worst (worse than both NMF and co-clustering).
Cross-validate (using built-in functions) to further validate these values.
Figure 5-12 shows the cross-validation output for SVD.
The cross-validation shows that the average RMSE is 4831.928 and MAE is approximately 4821.549, which is very high.
Getting the Recommendations
The co-clustering model has performed better than the NMF and SVD models. But let's validate it once more before using its predictions.
For validating the model, let’s use item 47590B and customer 15738.
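A minimal sketch of that single-pair check, assuming the fitted co-clustering model algo2 from above (the uid/iid values must match the raw IDs and types used in the training data):
# predicting the quantity for one (customer, item) pair
single_pred = algo2.predict(uid=15738, iid='47590B')
print(single_pred.est)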
You can now use the predictions to build the recommendations. First, find the customers who have bought the same items as a given user; then, from the other items those customers bought, fetch the top items and recommend them.
Let’s again use customer 12347 and create a list of the items this user bought.
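The helper code is sketched below, assuming data3 holds the shortlisted (CustomerID, StockCode, Quantity) records; the top-10 cutoff is an assumption.
# items customer 12347 has already bought
user_items = set(data3.loc[data3['CustomerID'] == 12347, 'StockCode'])
# customers who bought at least one of the same items
similar_customers = data3.loc[data3['StockCode'].isin(user_items), 'CustomerID'].unique()
# other items those customers bought, ranked by total quantity
candidates = data3[data3['CustomerID'].isin(similar_customers)
                   & ~data3['StockCode'].isin(user_items)]
recommendations = (candidates.groupby('StockCode')['Quantity']
                   .sum()
                   .sort_values(ascending=False)
                   .head(10))
print(recommendations)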
This produces the recommended list of items for user 12347.
Summary
This chapter continued the discussion of collaborative filtering-based recommendation engines. Popular methods like matrix factorization, SVD, and co-clustering were explored, with a focus on implementing all three models. For the given data, the co-clustering method performed best, but you should try all the available methods to see which best fits your data and use case when building a recommendation system.