© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2023
A. Kulkarni et al., Applied Recommender Systems with Python, https://doi.org/10.1007/978-1-4842-8954-9_5

5. Collaborative Filtering Using Matrix Factorization, Singular Value Decomposition, and Co-Clustering

Akshay Kulkarni (Bangalore, Karnataka, India), Adarsha Shivananda (Hosanagara tq, Shimoga dt, Karnataka, India), Anoosh Kulkarni (Bangalore, India), and V Adithya Krishnan (Navi Mumbai, India)

Chapter 4 explored collaborative filtering using the KNN method. This chapter covers a few more important methods: matrix factorization (MF), singular value decomposition (SVD), and co-clustering. These methods (along with KNN) fall into the model-based collaborative filtering approach, whereas the basic arithmetic method of calculating cosine similarity to find similar users falls into the memory-based approach. Each approach has pros and cons, and you must select the one that suits your use case.

Figure 5-1 explains the two types of approaches in collaborative filtering.

Figure 5-1 The two approaches of collaborative filtering: the memory-based and model-based techniques, with their definitions, advantages, and disadvantages

The memory-based approach is much easier to implement and explain, but its performance often suffers with sparse data. Model-based approaches, like MF, handle sparse data well and scale better to large datasets, but they are usually less intuitive, harder to explain, and more complex to implement.
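As a reminder of the memory-based side, the cosine-similarity computation from Chapter 4 boils down to something like the following toy sketch (illustrative only; the purchase vectors are made up).
import numpy as np
# toy purchase vectors for two users (1 = bought the item, 0 = didn't)
user_a = np.array([1, 0, 1, 1, 0])
user_b = np.array([1, 1, 1, 0, 0])
# cosine similarity = dot product / product of the vector norms
cosine_sim = user_a.dot(user_b) / (np.linalg.norm(user_a) * np.linalg.norm(user_b))
print(cosine_sim)  # ~0.67; the closer to 1, the more similar the two users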

This chapter focuses on a few popular model-based approaches—matrix factorization, SVD, and co-clustering—implemented using the same data as Chapter 4.

Implementation

Matrix Factorization, Co-Clustering, and SVD

The following implementation is a continuation of Chapter 4 and uses the same dataset.

Let’s look at the data.
data1.head()
Figure 5-2 shows the DataFrame from Chapter 4.

Figure 5-2 Input data: a DataFrame with invoice number, stock code, quantity, invoice date, delivery date, discount percentage, ship mode, shipping cost, and customer ID columns

Let’s reuse items_purchase_df from Chapter 4. It is the matrix containing the items and the information on whether customers bought them.
items_purchase_df.head()
Figure 5-3 shows the item purchase DataFrame/matrix.

Figure 5-3 Item purchase DataFrame/matrix, with customer IDs and stock codes

This chapter uses the Python package called surprise for modeling purposes. It implements popular collaborative filtering methods such as matrix factorization, SVD, co-clustering, and even KNN.
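The rest of the chapter assumes the following setup (a minimal sketch; surprise is installed via pip install scikit-surprise).
import pandas as pd
import numpy as np
# core surprise classes: data loading, algorithms, and metrics
from surprise import Dataset, Reader, NMF, SVD, CoClustering, accuracy
# train/test splitting and cross-validation utilities
from surprise.model_selection import train_test_split, cross_validate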

First, let’s format the data into the proper format required by the surprise package.

Start by stacking the DataFrame/matrix.
data3 = items_purchase_df.stack().to_frame()
#Renaming the column as Quantity
data3 = data3.reset_index().rename(columns={0:"Quantity"})
data3
Figure 5-4 shows the output DataFrame after stacking.

Figure 5-4 Stacked item purchase DataFrame/matrix, with stock code, customer ID, and quantity columns (all displayed quantity values are 0)

print(items_purchase_df.shape)
print(data3.shape)
The following is the output.
(3538, 3647)
(12903086, 3)

As you can see, items_purchase_df has 3538 unique items (rows) and 3647 unique users (columns). The stacked DataFrame has 3538 × 3647 = 12,903,086 rows, which is too big to pass into any algorithm.

Let’s shortlist some customers and items based on the number of orders.

First, put all the IDs in a list.
# storing all customer IDs in customer_ids
customer_ids = data1['CustomerID']
# storing all item stock codes in item_ids
item_ids = data1['StockCode']
The following imports Counter to count the number of orders made by each customer and for each item.
from collections import Counter
Count the number of orders by each customer and store that information in a DataFrame.
# counting no. of orders made by each customer
count_orders = Counter(customer_ids)
# storing the count and customer id in a dataframe
customer_count_df = pd.DataFrame.from_dict(count_orders, orient='index').reset_index().rename(columns={0:"Quantity"})
Drop all customer IDs with 120 or fewer orders.
customer_count_df = customer_count_df[customer_count_df["Quantity"]>120]
Rename the index column as 'CustomerID' for the inner join.
customer_count_df.rename(columns={'index':'CustomerID'},inplace=True)
customer_count_df
Figure 5-5 shows the customer count DataFrame output.

Figure 5-5 Customer count DataFrame, with customer ID and quantity columns (568 rows × 2 columns)

Similarly, repeat the same process for items (i.e., counting the number of orders placed per item and storing it in a DataFrame).
# counting no. of times an item was ordered
count_items = Counter(item_ids)
# storing the count and item description in a dataframe
item_count_df = pd.DataFrame.from_dict(count_items, orient='index').reset_index().rename(columns={0:"Quantity"})
Drop all items that were ordered 120 or fewer times.
item_count_df = item_count_df[item_count_df["Quantity"]>120]
Rename the index column as 'StockCode' for the inner join.
item_count_df.rename(columns={'index':'StockCode'},inplace=True)
item_count_df
Figure 5-6 shows the output item count DataFrame.

Figure 5-6 Item count DataFrame, with stock code and quantity columns (679 rows × 2 columns)

Next, apply a join on both DataFrames with stacked data to create the shortlisted final DataFrame.
#Merging stacked df with item count df
data4 = pd.merge(data3, item_count_df, on='StockCode', how='inner')
#Merging with customer count df
data4 = pd.merge(data4, customer_count_df, on='CustomerID', how='inner')
# dropping columns which are not necessary
data4.drop(['Quantity_y','Quantity_x'],axis=1,inplace=True)
data4
Figure 5-7 shows the shortlisted DataFrame output.

Figure 5-7 The final shortlisted DataFrame, with stock code, customer ID, and quantity columns (385,672 rows × 3 columns)

Now that the size of the data has been reduced, let’s describe it and view the stats.
data4.describe()
Figure 5-8 describes the shortlisted DataFrame.

Figure 5-8 Summary statistics of the shortlisted DataFrame (count, mean, minimum, and maximum values for customer ID and quantity)

You can see from the output that the record count has been significantly reduced from 12,903,086 to 385,672. But this DataFrame must be formatted further, using built-in functions from the surprise package, before it can be used.

Read the data in a format supported by the surprise library.
reader = Reader(rating_scale=(0,5095))

The range is set to (0, 5095) because the maximum quantity value is 5095.
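You can confirm that bound directly (a quick check, assuming the Quantity column in data4 holds the purchase counts).
# maximum quantity across all shortlisted records
data4['Quantity'].max()   # 5095, per the stats above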

Load the dataset in a format supported by the surprise library.
formated_data = Dataset.load_from_df(data4, reader)

The final formatted data is ready. Note that load_from_df reads the three columns in (user, item, rating) order; since data4 is ordered as (StockCode, CustomerID, Quantity), surprise internally treats stock codes as its "users" and customer IDs as its "items". That is why the predictions later in this chapter show stock codes in the uid field.

Now, let’s split the data to train and test for validating the models.
# performing train test split on the dataset
train_set, test_set = train_test_split(formated_data, test_size= 0.2)

Implementing NMF

Let’s start by modeling the non-negative matrix factorization method.

Figure 5-9 explains matrix factorization (multiplication).

Figure 5-9 Matrix factorization: the purchase matrix equals the user matrix multiplied by the item matrix (users A, B, C, and D; items W, X, Y, and Z)

Matrix factorization is a popular method for building collaborative filtering-based recommendation systems. It is a basic embedding model in which the full user-item interaction matrix is approximated by the product of two smaller matrices of latent/hidden features (embeddings), one for users and one for items. This reduces the dimensionality of the input matrix into a compact representation, which increases scalability and performance. The latent features are learned by fitting an optimization problem (usually minimizing an error equation), and predictions are read off the reconstructed matrix.
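As a toy illustration of the idea (not the internals of surprise; all numbers here are made up), the following sketch reconstructs a small purchase matrix as the product of two low-rank latent matrices.
import numpy as np
# 4 users x 2 latent features
user_factors = np.array([[1.2, 0.3],
                         [0.1, 1.5],
                         [0.9, 0.8],
                         [0.2, 0.1]])
# 2 latent features x 4 items
item_factors = np.array([[1.0, 0.2, 0.8, 0.0],
                         [0.1, 1.1, 0.3, 0.9]])
# each predicted purchase is the dot product of a user vector and an item vector
predicted_matrix = user_factors @ item_factors
print(predicted_matrix.round(2))
In NMF specifically, both latent matrices are constrained to be non-negative, which suits purchase counts. Now let's fit NMF on the real data.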
# defining the model
algo1 = NMF()
# model fitting
algo1.fit(train_set)
# model prediction
pred1 = algo1.test(test_set)
Using built-in functions, you can calculate performance metrics like RMSE (root-mean-square error) and MAE (mean absolute error).
# RMSE
accuracy.rmse(pred1)
#MAE
accuracy.mae(pred1)
The following is the output.
RMSE: 428.3167
MAE:  272.6909

The RMSE and MAE are moderately high for this model, so let’s try the other two and compare them at the end.

You can also cross-validate (using built-in functions) to further validate these values.
cross_validate(algo1, formated_data, verbose=True)
Figure 5-10 shows the cross-validation output for NMF.

Figure 5-10 Cross-validation output for NMF: RMSE and MAE evaluated over five splits

The cross-validation shows that the average RMSE is 427.774, and MAE is approximately 272.627, which is moderately high.

Implementing Co-Clustering

Co-clustering (also known as bi-clustering) is commonly used in collaborative filtering. It is a data-mining technique that simultaneously clusters the rows and columns of a DataFrame/matrix. It differs from normal clustering, in which each object is compared with other objects on a single entity/type of comparison; in co-clustering, two different entities/types of comparison are grouped simultaneously for each object, as a pairwise interaction.
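For reference, surprise's CoClustering algorithm (based on George and Merugu's 2005 work) predicts roughly as follows: the predicted value for a (user, item) pair is the average of their co-cluster, adjusted by how far the user and the item each deviate from their own clusters' averages,

r̂(u, i) = avg(C(u, i)) + (μ(u) − avg(C(u))) + (μ(i) − avg(C(i)))

where C(u, i) is the co-cluster containing u and i, C(u) and C(i) are u's and i's individual clusters, and μ(u) and μ(i) are their mean ratings.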

Let’s try modeling with the co-clustering method.
# defining the model
algo2 = CoClustering()
# model fitting
algo2.fit(train_set)
# model prediction
pred2 = algo2.test(test_set)
Calculate the RMSE and MAE performance metrics using built-in functions.
# RMSE
accuracy.rmse(pred2)
#MAE
accuracy.mae(pred2)
The following is the output.
RMSE: 6.7877
MAE:  5.8950

The RMSE and MAE are very low for this model. So far, it has performed the best (better than NMF).

Cross-validate (using built-in functions) to further validate these values.
cross_validate(algo2, formated_data, verbose=True)
Figure 5-11 shows the cross-validation output for co-clustering.

Figure 5-11 Cross-validation output for co-clustering: RMSE and MAE evaluated over five splits

The cross-validation shows that the average RMSE is 14.031, and MAE is approximately 6.135, which is quite low.

Implementing SVD

Singular value decomposition is a linear algebra concept generally used for dimensionality reduction, and it is also a type of matrix factorization. In collaborative filtering, it works similarly: a matrix whose rows and columns are users and items is reduced to latent feature matrices, and an error equation is minimized to arrive at the prediction.
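For reference, surprise's SVD algorithm (popularized by Simon Funk during the Netflix Prize) predicts

r̂(u, i) = μ + b(u) + b(i) + q(i)ᵀ · p(u)

where μ is the global mean, b(u) and b(i) are the user and item biases, and p(u) and q(i) are the latent user and item vectors. The parameters are learned with stochastic gradient descent by minimizing the regularized squared error on the training set.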

Let’s try modeling using the SVD method.
# defining the model (SVD was imported from surprise earlier)
algo3 = SVD()
# model fitting
algo3.fit(train_set)
# model prediction
pred3 = algo3.test(test_set)
Calculate the RMSE and MAE performance metrics using built-in functions.
# RMSE
accuracy.rmse(pred3)
#MAE
accuracy.mae(pred3)
The following is the output.
RMSE: 4827.6830
MAE:  4815.8341

The RMSE and MAE are significantly high for this model. So far, it has performed the worst (worse than both NMF and co-clustering).

Cross-validate (using built-in functions) to further validate these values.
cross_validate(algo3, formated_data, verbose=True)
Figure 5-12 shows the cross-validation output for SVD.

Figure 5-12 Cross-validation output for SVD: RMSE and MAE evaluated over five splits

The cross-validation shows that the average RMSE is 4831.928 and MAE is approximately 4821.549, which is very high.

Getting the Recommendations

The co-clustering model performed better than the NMF and SVD models. But let's validate it once more before using its predictions.

For validating the model, let’s use item 47590B and customer 15738.
data1[(data1['StockCode']=='47590B')&(data1['CustomerID']==15738)].Quantity.sum()
The following is the output.
78
Let’s get the prediction for the same combination to see the estimation or prediction.
algo2.test([['47590B',15738,78]])
The following is the output.
[Prediction(uid='47590B', iid=15738, r_ui=78, est=133.01087456331527, details={'was_impossible': False})]

The predicted value given by the model is 133.01, while the actual was 78. It is reasonably close to the actual value, which validates the model's performance further.
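Note that surprise also offers a predict() method for scoring a single pair directly, which is slightly more convenient than wrapping it in a test list. It takes the raw IDs in (user, item) order, which here means (stock code, customer ID).
# equivalent single prediction; r_ui is optional reference info
algo2.predict('47590B', 15738, r_ui=78)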

The predictions are from the co-clustering model.
pred2
The following is the output.
[Prediction(uid='85014B', iid=17228, r_ui=130.0, est=119.18329013727276, details={'was_impossible': False}),
 Prediction(uid='84406B', iid=16520, r_ui=156.0, est=161.85867140088936, details={'was_impossible': False}),
 Prediction(uid='47590B', iid=17365, r_ui=353.0, est=352.7773176826455, details={'was_impossible': False}),
...,
 Prediction(uid='85049G', iid=16755, r_ui=170.0, est=159.5403752414615, details={'was_impossible': False}),
 Prediction(uid='16156S', iid=14895, r_ui=367.0, est=368.129814201444, details={'was_impossible': False}),
 Prediction(uid='47566B', iid=17238, r_ui=384.0, est=393.60123986750034, details={'was_impossible': False})]
Now let’s use these predictions to find the best and the worst ones. First, get the final output into a DataFrame.
predictions_data = pd.DataFrame(pred2, columns=['item_id', 'customer_id', 'quantity', 'prediction', 'details'])
Next, add important information, such as the number of item orders and customer orders for each record, using the following functions. (Recall that surprise's internal "users" are stock codes and its "items" are customer IDs here, so the uid lookups count item orders and the iid lookups count customer orders.)
def get_item_orders(stock_code):
    try:
        # stock codes are surprise's internal "users" here, so
        # ur/to_inner_uid returns the ratings recorded for this item
        return len(train_set.ur[train_set.to_inner_uid(stock_code)])
    except ValueError:
        # item not present in the train set
        return 0
def get_customer_orders(customer_id):
    try:
        # customer IDs are surprise's internal "items" here, so
        # ir/to_inner_iid returns the ratings recorded for this customer
        return len(train_set.ir[train_set.to_inner_iid(customer_id)])
    except ValueError:
        # customer not present in the train set
        return 0
The following calls these functions.
predictions_data['item_orders'] = predictions_data.item_id.apply(get_item_orders)
predictions_data['customer_orders'] = predictions_data.customer_id.apply(get_customer_orders)
Calculate the error component to get the best and worst predictions.
predictions_data['error'] = abs(predictions_data.prediction - predictions_data.quantity)
predictions_data
Figure 5-13 shows the prediction DataFrame.

Figure 5-13 Prediction DataFrame, with item ID, customer ID, quantity, prediction, details, item orders, customer orders, and error columns

The following gets the best predictions.
best_predictions = predictions_data.sort_values(by='error')[:10]
best_predictions
Figure 5-14 shows the best predictions.

Figure 5-14 Best predictions

The following gets the worst predictions.
worst_predictions = predictions_data.sort_values(by='error')[-10:]
worst_predictions
Figure 5-15 shows the worst predictions.

Figure 5-15 Worst predictions

You can now use the prediction data to generate recommendations. First, find the customers who bought the same items as a given user; then, from the other items those customers bought, fetch the top items (by predicted value) and recommend them.

Let’s again use customer 12347 and create a list of the items this user bought.
# Getting item list for user 12347
item_list = predictions_data[predictions_data['customer_id']==12347]['item_id'].values.tolist()
item_list
The following is the output.
['82494L',
'84970S',
'47599A',
'84997B',
'85123A',
'84997C',
'85049A']
Get the list of customers who bought the same items as user 12347.
# Getting list of unique customers who also bought same items (item_list)
customer_list = predictions_data[predictions_data['item_id'].isin(item_list)]['customer_id'].values
customer_list = np.unique(customer_list).tolist()
customer_list
The following is the output.
[12347,
 12362,
 12370,
 12378,
 ...,
 12415,
 12417,
 12428]
Now let’s filter these customers (customer_list) from predictions data, remove the items already bought, and recommend the top items (prediction).
# filtering those customers from predictions data
filtered_data = predictions_data[predictions_data['customer_id'].isin(customer_list)]
# removing the items already bought
filtered_data = filtered_data[~filtered_data['item_id'].isin(item_list)]
# getting the top items (prediction)
recommended_items = filtered_data.sort_values('prediction',ascending=False).reset_index(drop=True).head(10)['item_id'].values.tolist()
recommended_items
The following is the output.
['16156S',
 '85049E',
 '47504K',
 '85099C',
 '85049G',
 '85014B',
 '72351B',
 '84536A',
 '48173C',
 '47590A']

This is the recommended list of items for user 12347.

Summary

This chapter continued the discussion of collaborative filtering-based recommendation engines. Popular methods like matrix factorization, SVD, and co-clustering were explored, with a focus on implementing all three models. For the given data, the co-clustering method performed best, but you should try all the available methods to see which best fits your data and use case when building a recommendation system.
