© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2023
A. Kulkarni et al., Applied Recommender Systems with Python, https://doi.org/10.1007/978-1-4842-8954-9_5

5. Collaborative Filtering Using Matrix Factorization, Singular Value Decomposition, and Co-Clustering

Akshay Kulkarni (Bangalore, Karnataka, India), Adarsha Shivananda (Hosanagara tq, Shimoga dt, Karnataka, India), Anoosh Kulkarni (Bangalore, India), and V Adithya Krishnan (Navi Mumbai, India)

Chapter 4 explored collaborative filtering using the KNN method. This chapter covers a few more important methods: matrix factorization (MF), singular value decomposition (SVD), and co-clustering. These methods (along with KNN) fall into the model-based collaborative filtering approach, whereas the basic arithmetic method of calculating cosine similarity to find similar users falls into the memory-based approach. Each approach has pros and cons, and you must select the one that suits your use case.

Figure 5-1 explains the two types of approaches in collaborative filtering.

Figure 5-1 The two approaches of collaborative filtering: the memory-based and model-based techniques, with their definitions, advantages, and disadvantages

The memory-based approach is much easier to implement and explain, but its performance often suffers with sparse data. Model-based approaches, like MF, handle sparse data well and scale better to large datasets, but they are usually less intuitive, harder to explain, and more complex to implement.
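As a reminder of the memory-based side, the cosine-similarity computation from Chapter 4 boils down to something like the following toy sketch (illustrative only; the purchase vectors are made up).
import numpy as np
# toy purchase vectors for two users (1 = bought the item, 0 = didn't)
user_a = np.array([1, 0, 1, 1, 0])
user_b = np.array([1, 1, 1, 0, 0])
# cosine similarity = dot product / product of the vector norms
cosine_sim = user_a.dot(user_b) / (np.linalg.norm(user_a) * np.linalg.norm(user_b))
print(cosine_sim)  # ~0.67; the closer to 1, the more similar the two users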

This chapter focuses on a few popular model-based approaches—matrix factorization, SVD, and co-clustering—implemented using the same data as Chapter 4.

Implementation

Matrix Factorization, Co-Clustering, and SVD

The following implementation is a continuation of Chapter 4 and uses the same dataset.

Let’s look at the data.
data1.head()
Figure 5-2 shows the DataFrame from Chapter 4.

Figure 5-2 Input data: a DataFrame with invoice number, stock code, quantity, invoice date, delivery date, discount percentage, ship mode, shipping cost, and customer ID columns

Let’s reuse items_purchase_df from Chapter 4. It is the matrix containing the items and the information on whether customers bought them.
items_purchase_df.head()
Figure 5-3 shows the item purchase DataFrame/matrix.

Figure 5-3 Item purchase DataFrame/matrix, with customer IDs and stock codes

This chapter uses the Python package called surprise for modeling purposes. It implements popular collaborative filtering methods such as matrix factorization, SVD, co-clustering, and even KNN.
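The rest of the chapter assumes the following setup (a minimal sketch; surprise is installed via pip install scikit-surprise).
import pandas as pd
import numpy as np
# core surprise classes: data loading, algorithms, and metrics
from surprise import Dataset, Reader, NMF, SVD, CoClustering, accuracy
# train/test splitting and cross-validation utilities
from surprise.model_selection import train_test_split, cross_validate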

First, let’s format the data into the proper format required by the surprise package.

Start by stacking the DataFrame/matrix.
data3 = items_purchase_df.stack().to_frame()
#Renaming the column as Quantity
data3 = data3.reset_index().rename(columns={0:"Quantity"})
data3
Figure 5-4 shows the output DataFrame after stacking.

Figure 5-4 Stacked item purchase DataFrame/matrix, with stock code, customer ID, and quantity columns (all displayed quantity values are 0)

print(items_purchase_df.shape)
print(data3.shape)
The following is the output.
(3538, 3647)
(12903086, 3)

As you can see, items_purchase_df has 3538 unique items (rows) and 3647 unique users (columns). The stacked DataFrame has 3538 × 3647 = 12,903,086 rows, which is too big to pass into any algorithm.

Let’s shortlist some customers and items based on the number of orders.

First, put all the IDs in a list.
# storing all customer IDs in customer_ids
customer_ids = data1['CustomerID']
# storing all item stock codes in item_ids
item_ids = data1['StockCode']
The following imports Counter to count the number of orders made by each customer and for each item.
from collections import Counter
Count the number of orders by each customer and store that information in a DataFrame.
# counting no. of orders made by each customer
count_orders = Counter(customer_ids)
# storing the count and customer id in a dataframe
customer_count_df = pd.DataFrame.from_dict(count_orders, orient='index').reset_index().rename(columns={0:"Quantity"})
Drop all customer IDs with 120 or fewer orders.
customer_count_df = customer_count_df[customer_count_df["Quantity"]>120]
Rename the index column as 'CustomerID' for the inner join.
customer_count_df.rename(columns={'index':'CustomerID'},inplace=True)
customer_count_df
Figure 5-5 shows the customer count DataFrame output.

Figure 5-5 Customer count DataFrame, with customer ID and quantity columns (568 rows × 2 columns)

Similarly, repeat the same process for items (i.e., counting the number of orders placed per item and storing it in a DataFrame).
# counting no. of times an item was ordered
count_items = Counter(item_ids)
# storing the count and item description in a dataframe
item_count_df = pd.DataFrame.from_dict(count_items, orient='index').reset_index().rename(columns={0:"Quantity"})
Drop all items that were ordered 120 or fewer times.
item_count_df = item_count_df[item_count_df["Quantity"]>120]
Rename the index column as 'StockCode' for the inner join.
item_count_df.rename(columns={'index':'StockCode'},inplace=True)
item_count_df
Figure 5-6 shows the output item count DataFrame.

Figure 5-6 Item count DataFrame, with stock code and quantity columns (679 rows × 2 columns)

Next, apply a join on both DataFrames with stacked data to create the shortlisted final DataFrame.
#Merging stacked df with item count df
data4 = pd.merge(data3, item_count_df, on='StockCode', how='inner')
#Merging with customer count df
data4 = pd.merge(data4, customer_count_df, on='CustomerID', how='inner')
# dropping columns which are not necessary
data4.drop(['Quantity_y','Quantity_x'],axis=1,inplace=True)
data4
Figure 5-7 shows the shortlisted DataFrame output.

Figure 5-7 The final shortlisted DataFrame, with stock code, customer ID, and quantity columns (385,672 rows × 3 columns)

Now that the size of the data has been reduced, let’s describe it and view the stats.
data4.describe()
Figure 5-8 describes the shortlisted DataFrame.

Figure 5-8 Summary statistics of the shortlisted DataFrame (count, mean, minimum, and maximum values for customer ID and quantity)

You can see from the output that the record count has been significantly reduced from 12,903,086 to 385,672. But this DataFrame must be formatted further, using built-in functions from the surprise package, before it can be used.

Read the data in a format supported by the surprise library.
reader = Reader(rating_scale=(0,5095))

The range is set to (0, 5095) because the maximum quantity value is 5095.
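You can confirm that bound directly (a quick check, assuming the Quantity column in data4 holds the purchase counts).
# maximum quantity across all shortlisted records
data4['Quantity'].max()   # 5095, per the stats above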

Load the dataset in a format supported by the surprise library.
formated_data = Dataset.load_from_df(data4, reader)

The final formatted data is ready. Note that load_from_df reads the three columns in (user, item, rating) order; since data4 is ordered as (StockCode, CustomerID, Quantity), surprise internally treats stock codes as its "users" and customer IDs as its "items". That is why the predictions later in this chapter show stock codes in the uid field.

Now, let’s split the data to train and test for validating the models.
# performing train test split on the dataset
train_set, test_set = train_test_split(formated_data, test_size= 0.2)

Implementing NMF

Let’s start by modeling the non-negative matrix factorization method.

Figure 5-9 explains matrix factorization (multiplication).

Figure 5-9 Matrix factorization: the purchase matrix equals the user matrix multiplied by the item matrix (users A, B, C, and D; items W, X, Y, and Z)

Matrix factorization is a popular method for building collaborative filtering-based recommendation systems. It is a basic embedding model in which the full user-item interaction matrix is approximated by the product of two smaller matrices of latent/hidden features (embeddings), one for users and one for items. This reduces the dimensionality of the input matrix into a compact representation, which increases scalability and performance. The latent features are learned by fitting an optimization problem (usually minimizing an error equation), and predictions are read off the reconstructed matrix.
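As a toy illustration of the idea (not the internals of surprise; all numbers here are made up), the following sketch reconstructs a small purchase matrix as the product of two low-rank latent matrices.
import numpy as np
# 4 users x 2 latent features
user_factors = np.array([[1.2, 0.3],
                         [0.1, 1.5],
                         [0.9, 0.8],
                         [0.2, 0.1]])
# 2 latent features x 4 items
item_factors = np.array([[1.0, 0.2, 0.8, 0.0],
                         [0.1, 1.1, 0.3, 0.9]])
# each predicted purchase is the dot product of a user vector and an item vector
predicted_matrix = user_factors @ item_factors
print(predicted_matrix.round(2))
In NMF specifically, both latent matrices are constrained to be non-negative, which suits purchase counts. Now let's fit NMF on the real data.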
# defining the model
algo1 = NMF()
# model fitting
algo1.fit(train_set)
# model prediction
pred1 = algo1.test(test_set)
Using built-in functions, you can calculate performance metrics like RMSE (root-mean-square error) and MAE (mean absolute error).
# RMSE
accuracy.rmse(pred1)
#MAE
accuracy.mae(pred1)
The following is the output.
RMSE: 428.3167
MAE:  272.6909

The RMSE and MAE are moderately high for this model, so let’s try the other two and compare them at the end.

You can also cross-validate (using built-in functions) to further validate these values.
cross_validate(algo1, formated_data, verbose=True)
Figure 5-10 shows the cross-validation output for NMF.

Figure 5-10 Cross-validation output for NMF: RMSE and MAE evaluated over five splits

The cross-validation shows that the average RMSE is 427.774, and MAE is approximately 272.627, which is moderately high.

Implementing Co-Clustering

Co-clustering (also known as bi-clustering) is commonly used in collaborative filtering. It is a data-mining technique that simultaneously clusters the rows and columns of a DataFrame/matrix. It differs from normal clustering, in which each object is compared with other objects on a single entity/type of comparison; in co-clustering, two different entities/types of comparison are grouped simultaneously for each object, as a pairwise interaction.
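For reference, surprise's CoClustering algorithm (based on George and Merugu's 2005 work) predicts roughly as follows: the predicted value for a (user, item) pair is the average of their co-cluster, adjusted by how far the user and the item each deviate from their own clusters' averages,

r̂(u, i) = avg(C(u, i)) + (μ(u) − avg(C(u))) + (μ(i) − avg(C(i)))

where C(u, i) is the co-cluster containing u and i, C(u) and C(i) are u's and i's individual clusters, and μ(u) and μ(i) are their mean ratings.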

Let’s try modeling with the co-clustering method.
# defining the model
algo2 = CoClustering()
# model fitting
algo2.fit(train_set)
# model prediction
pred2 = algo2.test(test_set)
Calculate the RMSE and MAE performance metrics using built-in functions.
# RMSE
accuracy.rmse(pred2)
#MAE
accuracy.mae(pred2)
The following is the output.
RMSE: 6.7877
MAE:  5.8950

The RMSE and MAE are very low for this model. So far, it has performed the best (better than NMF).

Cross-validate (using built-in functions) to further validate these values.
cross_validate(algo2, formated_data, verbose=True)
Figure 5-11 shows the cross-validation output for co-clustering.

Figure 5-11 Cross-validation output for co-clustering: RMSE and MAE evaluated over five splits

The cross-validation shows that the average RMSE is 14.031, and MAE is approximately 6.135, which is quite low.

Implementing SVD

Singular value decomposition is a linear algebra concept generally used for dimensionality reduction, and it is also a type of matrix factorization. In collaborative filtering, it works similarly: a matrix whose rows and columns are users and items is reduced to latent feature matrices, and an error equation is minimized to arrive at the prediction.
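For reference, surprise's SVD algorithm (popularized by Simon Funk during the Netflix Prize) predicts

r̂(u, i) = μ + b(u) + b(i) + q(i)ᵀ · p(u)

where μ is the global mean, b(u) and b(i) are the user and item biases, and p(u) and q(i) are the latent user and item vectors. The parameters are learned with stochastic gradient descent by minimizing the regularized squared error on the training set.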

Let’s try modeling using the SVD method.
# defining the model (SVD was imported from surprise earlier)
algo3 = SVD()
# model fitting
algo3.fit(train_set)
# model prediction
pred3 = algo3.test(test_set)
Calculate the RMSE and MAE performance metrics using built-in functions.
# RMSE
accuracy.rmse(pred3)
#MAE
accuracy.mae(pred3)
The following is the output.
RMSE: 4827.6830
MAE:  4815.8341

The RMSE and MAE are significantly high for this model. So far, it has performed the worst (worse than both NMF and co-clustering).

Cross-validate (using built-in functions) to further validate these values.
cross_validate(algo3, formated_data, verbose=True)
Figure 5-12 shows the cross-validation output for SVD.

Figure 5-12 Cross-validation output for SVD: RMSE and MAE evaluated over five splits

The cross-validation shows that the average RMSE is 4831.928 and MAE is approximately 4821.549, which is very high.

Getting the Recommendations

The co-clustering model performed better than the NMF and SVD models. But let's validate it once more before using its predictions.

For validating the model, let’s use item 47590B and customer 15738.
data1[(data1['StockCode']=='47590B')&(data1['CustomerID']==15738)].Quantity.sum()
The following is the output.
78
Let’s get the prediction for the same combination to see the estimation or prediction.
algo2.test([['47590B',15738,78]])
The following is the output.
[Prediction(uid='47590B', iid=15738, r_ui=78, est=133.01087456331527, details={'was_impossible': False})]

The predicted value given by the model is 133.01, while the actual was 78. It is reasonably close to the actual value, which validates the model's performance further.
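Note that surprise also offers a predict() method for scoring a single pair directly, which is slightly more convenient than wrapping it in a test list. It takes the raw IDs in (user, item) order, which here means (stock code, customer ID).
# equivalent single prediction; r_ui is optional reference info
algo2.predict('47590B', 15738, r_ui=78)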

The predictions are from the co-clustering model.
pred2
The following is the output.
[Prediction(uid='85014B', iid=17228, r_ui=130.0, est=119.18329013727276, details={'was_impossible': False}),
 Prediction(uid='84406B', iid=16520, r_ui=156.0, est=161.85867140088936, details={'was_impossible': False}),
 Prediction(uid='47590B', iid=17365, r_ui=353.0, est=352.7773176826455, details={'was_impossible': False}),
...,
 Prediction(uid='85049G', iid=16755, r_ui=170.0, est=159.5403752414615, details={'was_impossible': False}),
 Prediction(uid='16156S', iid=14895, r_ui=367.0, est=368.129814201444, details={'was_impossible': False}),
 Prediction(uid='47566B', iid=17238, r_ui=384.0, est=393.60123986750034, details={'was_impossible': False})]
Now let’s use these predictions to find the best and the worst ones. First, get the final output into a DataFrame.
predictions_data = pd.DataFrame(pred2, columns=['item_id', 'customer_id', 'quantity', 'prediction', 'details'])
Next, add important information, such as the number of item orders and customer orders for each record, using the following functions. (Recall that surprise's internal "users" are stock codes and its "items" are customer IDs here, so the uid lookups count item orders and the iid lookups count customer orders.)
def get_item_orders(stock_code):
    try:
        # stock codes are surprise's internal "users" here, so
        # ur/to_inner_uid returns the ratings recorded for this item
        return len(train_set.ur[train_set.to_inner_uid(stock_code)])
    except ValueError:
        # item not present in the train set
        return 0
def get_customer_orders(customer_id):
    try:
        # customer IDs are surprise's internal "items" here, so
        # ir/to_inner_iid returns the ratings recorded for this customer
        return len(train_set.ir[train_set.to_inner_iid(customer_id)])
    except ValueError:
        # customer not present in the train set
        return 0
The following calls these functions.
predictions_data['item_orders'] = predictions_data.item_id.apply(get_item_orders)
predictions_data['customer_orders'] = predictions_data.customer_id.apply(get_customer_orders)
Calculate the error component to get the best and worst predictions.
predictions_data['error'] = abs(predictions_data.prediction - predictions_data.quantity)
predictions_data
Figure 5-13 shows the prediction DataFrame.

Figure 5-13 Prediction DataFrame, with item ID, customer ID, quantity, prediction, details, item orders, customer orders, and error columns

The following gets the best predictions.
best_predictions = predictions_data.sort_values(by='error')[:10]
best_predictions
Figure 5-14 shows the best predictions.

Figure 5-14 Best predictions

The following gets the worst predictions.
worst_predictions = predictions_data.sort_values(by='error')[-10:]
worst_predictions
Figure 5-15 shows the worst predictions.

Figure 5-15 Worst predictions

You can now use the prediction data to generate recommendations. First, find the customers who bought the same items as a given user; then, from the other items those customers bought, fetch the top items (by predicted value) and recommend them.

Let’s again use customer 12347 and create a list of the items this user bought.
# Getting item list for user 12347
item_list = predictions_data[predictions_data['customer_id']==12347]['item_id'].values.tolist()
item_list
The following is the output.
['82494L',
'84970S',
'47599A',
'84997B',
'85123A',
'84997C',
'85049A']
Get the list of customers who bought the same items as user 12347.
# Getting list of unique customers who also bought same items (item_list)
customer_list = predictions_data[predictions_data['item_id'].isin(item_list)]['customer_id'].values
customer_list = np.unique(customer_list).tolist()
customer_list
The following is the output.
[12347,
 12362,
 12370,
 12378,
 ...,
 12415,
 12417,
 12428]
Now let’s filter these customers (customer_list) from predictions data, remove the items already bought, and recommend the top items (prediction).
# filtering those customers from predictions data
filtered_data = predictions_data[predictions_data['customer_id'].isin(customer_list)]
# removing the items already bought
filtered_data = filtered_data[~filtered_data['item_id'].isin(item_list)]
# getting the top items (prediction)
recommended_items = filtered_data.sort_values('prediction',ascending=False).reset_index(drop=True).head(10)['item_id'].values.tolist()
recommended_items
The following is the output.
['16156S',
 '85049E',
 '47504K',
 '85099C',
 '85049G',
 '85014B',
 '72351B',
 '84536A',
 '48173C',
 '47590A']

This is the recommended list of items for user 12347.

Summary

This chapter continued the discussion of collaborative filtering-based recommendation engines. Popular methods like matrix factorization, SVD, and co-clustering were explored, with a focus on implementing all three models. For the given data, the co-clustering method performed best, but you should try all the available methods to see which best fits your data and use case when building a recommendation system.
