© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2023
A. Kulkarni et al., Applied Recommender Systems with Python, https://doi.org/10.1007/978-1-4842-8954-9_4

4. Collaborative Filtering

Akshay Kulkarni1, Adarsha Shivananda2, Anoosh Kulkarni3 and V Adithya Krishnan4
(1)
Bangalore, Karnataka, India
(2)
Hosanagara tq, Shimoga dt, Karnataka, India
(3)
Bangalore, India
(4)
Navi Mumbai, India
 

Collaborative filtering is a very popular method in recommendation engines. It is the predictive process behind the suggestions provided by these systems. It processes and analyzes customers’ information and suggests items they will likely appreciate.

Collaborative filtering algorithms use a customer’s purchase history and ratings to find similar customers and then suggest items that they liked.

Figure 4-1 explains collaborative filtering at a high level.


Figure 4-1

Collaborative filtering explained

For example, to find a new movie or show to watch, you can ask your friends for suggestions since you all share similar tastes in content. The same concept is used in collaborative filtering, where user-user similarity finds similar users to get recommendations based on each other’s likes.

There are two types of collaborative filtering methods: user-to-user and item-to-item. They are explored in the upcoming sections. This chapter implements both methods using cosine similarity before diving into the more popular KNN-based algorithm for collaborative filtering.

Implementation

The following installs the surprise library.
!pip install scikit-surprise
The following imports basic libraries.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import random
from IPython.display import Image
The following imports the KNN algorithm and csr_matrix for KNN data preparation.
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
The following imports cosine_similarity, which is used to calculate cosine similarity.
from sklearn.metrics.pairwise import cosine_similarity
Let’s import surprise.Reader and surprise.Dataset for surprise data preparation.
from surprise import Reader, Dataset
Next, import surprise.model_selection functions for surprise model customizations.
from surprise.model_selection import train_test_split, cross_validate, GridSearchCV
Then, import algorithms from the surprise package.
from surprise.prediction_algorithms import CoClustering
from surprise.prediction_algorithms import NMF
Finally, import accuracy to get metrics such as root-mean-square error (RMSE) and mean absolute error (MAE).
from surprise import accuracy
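Although these surprise classes are not exercised until later, the following minimal sketch shows how they typically fit together, using a toy ratings DataFrame (the DataFrame and its column names are assumptions for illustration).
ratings_df = pd.DataFrame({'userID': [1, 1, 2, 2], 'itemID': ['A', 'B', 'A', 'B'], 'rating': [5, 3, 4, 2]}) # hypothetical ratings
reader = Reader(rating_scale=(1, 5)) # rating scale of the toy data
surprise_data = Dataset.load_from_df(ratings_df[['userID', 'itemID', 'rating']], reader)
trainset, testset = train_test_split(surprise_data, test_size=0.25)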

Data Collection

This chapter uses a custom dataset that has been masked. Download the dataset from the GitHub link.

The following reads the data.
data = pd.read_excel('Rec_sys_data.xlsx')
data.head()
Figure 4-2 shows the DataFrame.


Figure 4-2

Input data

About the Dataset

The following is the data dictionary for the dataset; it has nine features (columns).
  • InvoiceNo: The invoice number of a particular transaction

  • StockCode: The unique identifier for a particular item

  • Quantity: The quantity of that item bought by the customer

  • InvoiceDate: The date and time when the transaction was made

  • DeliveryDate: The date and time when the delivery happened

  • Discount%: Percentage of discount on the purchased item

  • ShipMode: Mode of shipping

  • ShippingCost: Cost of shipping that item

  • CustomerID: The unique identifier of a particular customer

The following checks the size of the data.
data.shape
(272404, 9)

The dataset has a total of 272,404 transactions across nine columns.

Let’s check if there are any null values because a clean dataset is required for further analysis.
data.isnull().sum().sort_values(ascending=False)
The following is the output.
CustomerID      0
ShippingCost    0
ShipMode        0
Discount%       0
DeliveryDate    0
InvoiceDate     0
Quantity        0
StockCode       0
InvoiceNo       0
dtype: int64

The data is clean with no nulls in any columns. Further preprocessing is not required in this case.

If there were any NaNs or nulls in the data, they could be dropped using the following.
data1 = data.dropna()
Now let’s check for any data abnormalities by describing the data.
data1.describe()
Figure 4-3 describes data1.


Figure 4-3

data1

There aren’t any negative values in the Quantity column, but if there were, those records would need to be dropped since negative quantities are a data abnormality. A sketch of such a filter follows.
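The following one-line filter (not needed for this dataset, shown only for reference) would keep the records with a positive quantity.
data1 = data1[data1['Quantity'] > 0] # drop rows with zero or negative quantity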

Let’s change the StockCode column datatype to string to maintain the same type across all rows.
data1.StockCode = data1.StockCode.astype(str)

Memory-Based Approach

Let’s examine the most basic approach to implementing collaborative filtering: the memory-based approach. This approach uses simple arithmetic operations or metrics to calculate the similarities between two users or two items to group them. For example, to find user-user relations, the items both users have historically liked are used to compute a similarity metric that measures how similar the two users are.

Cosine similarity is a common similarity metric; Euclidean distance and Pearson’s correlation are other popular choices. These metrics are geometric: the row (or column) of a given user (or item) is treated as a vector. In cosine similarity, the similarity of two users is measured as the cosine of the angle between the two users’ vectors. For users A and B, the cosine similarity is given by the formula shown in Figure 4-4.

similarity = cos(θ) = (A · B) / (||A|| ||B||)

Figure 4-4

Cosine similarity formula
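To make the formula concrete, the following is a small sketch that computes the same quantity with NumPy for two made-up purchase vectors (the vectors are assumptions for illustration).
# Two hypothetical user purchase vectors over five items
A = np.array([1, 0, 1, 1, 0])
B = np.array([1, 1, 1, 0, 0])
# Cosine similarity: dot product divided by the product of the vector norms
cos_sim = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
print(cos_sim) # ~0.667, i.e., fairly similar users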

This approach is easy to implement and understand because no model training or heavy optimization algorithms are involved. However, its performance degrades when there is sparse data. For this method to work precisely, huge amounts of clean data on multiple users and items are required, which hinders the scalability of this approach for most real-world applications.

The memory-based approach is further divided into user-to-user-based and item-to-item-based collaborative filtering.

The implementation of both methods is explored in this chapter.

Figure 4-5 illustrates user-based and item-based filtering.


Figure 4-5

User-based and item-based collaborative filtering

User-to-User Collaborative Filtering

User-to-user-based collaborative filtering recommends items that a particular user might like by finding similar users, using purchase history or ratings on various items, and then suggesting the items liked by these similar users.

Here, a matrix is formed to describe the behavior of all the users (purchase history in our example) corresponding to all the items. Using this matrix, you can calculate the similarity metrics (cosine similarity) between users to formulate user-user relations. These relations help find users similar to a given user and recommend the items bought by these similar users.

Implementation

Let’s first create a data matrix covering purchase history. It contains all customer IDs for all items (whether a customer has purchased an item or not).
purchase_df = (data1.groupby(['CustomerID', 'StockCode'])['Quantity'].sum().unstack().reset_index().fillna(0).set_index('CustomerID'))
purchase_df.head()
Figure 4-6 shows the purchase data matrix.


Figure 4-6

Purchase data matrix

The data matrix shown in Figure 4-6 reveals the total quantity purchased by each user against each item. Only information about whether the item was bought or not by the user is needed, not the quantity.

Thus, an encoding of 0 or 1 is used, where 0 is not purchased, and 1 is purchased.

Let’s first write a function for encoding the data matrix.
def encode_units(x):
    if x < 1: # If the quantity is less than 1
        return 0 # Not purchased
    if x >= 1: # If the quantity is greater than or equal to 1
        return 1 # Purchased
Next, apply this function to the data matrix.
purchase_df = purchase_df.applymap(encode_units)
purchase_df.head()
Figure 4-7 shows the purchase data matrix after encoding.


Figure 4-7

Purchase data matrix after encoding

The purchase data matrix captures the behavior of customers across all items. This matrix is used to compute the user similarity scores matrix, with cosine similarity as the metric. The user similarity scores matrix holds the user-to-user similarity for each pair of users.

First, let’s apply cosine_similarity to the purchase data matrix.
user_similarities = cosine_similarity(purchase_df)
Now, let’s store the user similarity scores in a DataFrame (i.e., the similarity scores matrix).
user_similarity_data = pd.DataFrame(user_similarities,index=purchase_df.index,columns=purchase_df.index)
user_similarity_data.head()
Figure 4-8 shows the user similarity scores data matrix.


Figure 4-8

User similarity scores DataFrame

The similarity score values are between 0 and 1, where values closer to 0 represent less similar customers, and values closer to 1 represent more similar customers.
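Individual pair scores can be read directly off this matrix. For example, the following lookup (a sketch, where 12346 and 12347 are assumed to be valid customer IDs in the index) returns the similarity between two customers.
user_similarity_data.loc[12346, 12347] # similarity score for this example pair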

Using this user similarity scores data, let’s get recommendations for a given user.

Create a function for this.
def fetch_similar_users(user_id,k=5):
    # separating data rows for the entered user id
    user_similarity = user_similarity_data[user_similarity_data.index == user_id]
    # a data of all other users
    other_users_similarities = user_similarity_data[user_similarity_data.index != user_id]
    # calculate cosine similarity between the user and each other user
    similarities = cosine_similarity(user_similarity,other_users_similarities)[0].tolist()
    user_indices = other_users_similarities.index.tolist()
    index_similarity_pair = dict(zip(user_indices, similarities))
    # sort by similarity in descending order
    sorted_index_similarity_pair = sorted(index_similarity_pair.items(), key=lambda x: x[1], reverse=True)
    top_k_users_similarities = sorted_index_similarity_pair[:k]
    similar_users = [u[0] for u in top_k_users_similarities]
    print('The users with behavior similar to that of user {0} are:'.format(user_id))
    return similar_users

This function separates the selected user from all other users, computes the cosine similarity of the selected user against every other user, and returns the top k most similar users (by CustomerID).

For example, let’s find the users similar to user 12347.
similar_users = fetch_similar_users(12347)
similar_users
The following is the output.
The users with behavior similar to that of user 12347 are:
[18287, 18283, 18282, 18281, 18280]

As expected, the default five users are similar to user 12347.

Now, let’s get the recommendations by showing the items bought by similar users.

Write another function to get similar user recommendations.
def simular_users_recommendation(userid):
    similar_users = fetch_similar_users(userid)
    #obtaining all the items bought by similar users
    simular_users_recommendation_list = []
    for j in similar_users:
        item_list = data1[data1["CustomerID"]==j]['StockCode'].to_list()
        simular_users_recommendation_list.append(item_list)
    #this gives us multi-dimensional list
    # we need to flatten it
    flat_list = []
    for sublist in simular_users_recommendation_list:
        for item in sublist:
            flat_list.append(item)
    final_recommendations_list = list(dict.fromkeys(flat_list))
    # storing 10 random recommendations in a list
    ten_random_recommendations = random.sample(final_recommendations_list, 10)
    print('Items bought by Similar users based on Cosine Similarity')
    #returning 10 random recommendations
    return ten_random_recommendations

This function gets the similar users for the given customer ID and collects all the items bought by those similar users. The multidimensional list is then flattened into a final list of unique items, from which ten randomly chosen items are returned as recommendations for the given user.

Using this function on user 12347 to get recommendations results in the following suggestions.
simular_users_recommendation(12347)
The following is the output.
The users with behavior similar to that of user 12347 are:
Items bought by Similar users based on Cosine Similarity
['21967', '21908', '21154', '20723', '23296', '22271', '22746', '22355', '22554', '23199']

User 12347 had ten suggestions from the items bought by similar users.

Item-to-Item Collaborative Filtering

Item-to-item based collaborative filtering recommends items that a particular user might like by finding items similar to the ones they have already purchased. As before, purchase history or user ratings are used to profile each item.

A matrix is formed to describe the behavior of all the users (purchase history in our example) corresponding to all the items. This matrix helps calculate the similarity metrics (cosine similarity) between items to formulate the item-item relations. This relation is used to recommend items similar to those previously purchased by the selected user.

Implementation

Following the initial steps used in user-to-user collaborative filtering methods, let’s first create the data matrix, which contains all the item IDs across their purchase history (i.e., quantity purchased by each customer).
items_purchase_df = (data1.groupby(['StockCode','CustomerID'])['Quantity'].sum().unstack().reset_index().fillna(0).set_index('StockCode'))
items_purchase_df.head()

The following is the output.

Figure 4-9 shows the item purchase data matrix.


Figure 4-9

Item purchase data matrix

This data matrix shows the total quantity purchased by each user against each item. But the only information needed is whether the user bought the item.

Thus, an encoding of 0 or 1 is used, where 0 is not purchased, and 1 is purchased.

Use the same encode_units function created earlier.
items_purchase_df = items_purchase_df.applymap(encode_units)

The items purchase data matrix reveals the behavior of customers across all items. Let’s use this matrix to find item similarity scores with the cosine similarity metric. The item similarity score matrix has item-to-item similarity for each item pair.

First, let’s apply cosine_similarity to the item purchase data matrix.
item_similarities = cosine_similarity(items_purchase_df)
Now, let’s store the item similarity scores in a DataFrame (i.e., the similarity scores matrix).
item_similarity_data = pd.DataFrame(item_similarities,index=items_purchase_df.index,columns=items_purchase_df.index)
item_similarity_data.head()
Figure 4-10 shows the item similarity scores data matrix.


Figure 4-10

Item similarity scores DataFrame

The similarity score values are between 0 and 1, where values closer to 0 represent less similarity and values closer to 1 represent more similar items.

Using this item similarity score data, let’s get recommendations for a given user.

The following creates a function for this.
def fetch_similar_items(item_id,k=10):
    # separating data rows of the selected item
    item_similarity = item_similarity_data[item_similarity_data.index == item_id]
    # a data of all other items
    other_items_similarities = item_similarity_data[item_similarity_data.index != item_id]
    # calculate cosine similarity between selected item with other items
    similarities = cosine_similarity(item_similarity,other_items_similarities)[0].tolist()
    # create list of indices of these items
    item_indices = other_items_similarities.index.tolist()
    # create key/values pairs of item index and their similarity
    index_similarity_pair = dict(zip(item_indices, similarities))
    # sort by similarity in descending order
    sorted_index_similarity_pair = sorted(index_similarity_pair.items(), key=lambda x: x[1], reverse=True)
    # grab k items from the top
    top_k_item_similarities = sorted_index_similarity_pair[:k]
    similar_items = [u[0] for u in top_k_item_similarities]
    print('Similar items based on purchase behavior (item-to-item collaborative filtering)')
    return similar_items

This function separates the selected item from all other items, computes the cosine similarity of the selected item against every other item, and returns the top k most similar items (by StockCode).

For example, let’s find the items similar to item 10002.
similar_items = fetch_similar_items('10002')
similar_items
The following is the output.
Similar items based on purchase behavior (item-to-item collaborative filtering)
['10080',
 '10120',
 '10123C',
 '10124A',
 '10124G',
 '10125',
 '10133',
 '10135',
 '11001',
 '15030']

As expected, you see the default ten similar items to item 10002.

Now let’s get the recommendations by showing similar items to those bought by a particular user.

Write another function to get similar item recommendations.
def simular_item_recommendation(userid):
    simular_items_recommendation_list = []
    #obtaining all the similar items to items bought by user
    item_list = data1[data1["CustomerID"]==userid]['StockCode'].to_list()
    for item in item_list:
        similar_items = fetch_similar_items(item)
        simular_items_recommendation_list.append(similar_items)
    #this gives us multi-dimensional list
    # we need to flatten it
    flat_list = []
    for sublist in simular_items_recommendation_list:
        for item in sublist:
            flat_list.append(item)
    final_recommendations_list = list(dict.fromkeys(flat_list))
    # storing 10 random recommendations in a list
    ten_random_recommendations = random.sample(final_recommendations_list, 10)
    print('Similar Items bought by our users based on Cosine Similarity')
    #returning 10 random recommendations
    return ten_random_recommendations

This function gets the list of similar items for every item previously bought by the given customer ID. This multidimensional list is then flattened into a final list of unique items, from which ten randomly chosen items are returned as recommendations for the given user.

Again, trying this function on user 12347 to get the recommendations for that user results in the following suggestions.
simular_item_recommendation(12347)
The following is the output.
Similar Items bought by our users based on Cosine Similarity
['22196',
 '22775',
 '22492',
 '23146',
 '22774',
 '21035',
 '16008',
 '21041',
 '23316',
 '22550']

User 12347 has ten suggestions that are similar to items previously bought.

KNN-based Approach

You have learned the basics of collaborative filtering and implementing user-to-user and item-to-item filtering. Now let’s dive into machine learning-based approaches, which are more robust and popular in building recommendation systems.

Machine Learning

Machine learning is a machine’s capability to learn from experience (data) and make meaningful predictions without being explicitly programmed. It is a subfield of artificial intelligence that deals with building systems that can learn from data. The objective is to make computers learn on their own without any intervention from humans.

There are three primary machine learning categories.

Supervised Learning

In supervised learning, labeled training data is leveraged to learn the underlying pattern or function. The data consists of a dependent variable (the target label) and independent variables (predictors). The model learns from the labeled data and predicts the output for unseen data.

Unsupervised Learning

In unsupervised learning, a machine learns hidden patterns without leveraging labeled data, so no target labels are involved in training. These algorithms capture patterns based on similarities or distances between data points.

Reinforcement Learning

Reinforcement learning is the process of maximizing a reward by taking action. The algorithms learn how to reach a goal through experience.

Figure 4-11 explains all the categories and subcategories.

(The figure shows supervised learning with regression and classification, unsupervised learning with clustering and dimensionality reduction, and reinforcement learning.)

Figure 4-11

Machine learning categories

Supervised Learning

There are two types of supervised learning: regression and classification.

Regression

Regression is a statistical predictive modeling technique that finds the relationship between a dependent variable and one or more independent variables. Regression is used when the dependent variable is continuous; prediction can take any numerical value.

Popular regression algorithms include linear regression, decision tree, random forest, SVM, LightGBM, and XGBoost.

Classification

Classification is a supervised machine learning technique in which the dependent or output variable is categorical; for example, spam/ham, churn/not churned, and so on.
  • In binary classification, it's either yes or no. There is no third option; for example, the customer can churn or not churn from a given business.

  • In multiclass classification, the label can take more than two classes; for example, product categorization on an e-commerce website.

Logistic regression, k-nearest neighbor, decision tree, random forest, SVM, LightGBM, and XGBoost are popular classification algorithms.

K-Nearest Neighbor

The k-nearest neighbor (KNN) algorithm is a supervised machine learning model that is used for both classification and regression problems. It is a very robust algorithm that is easy to implement and interpret and uses less calculation time. Labeled data is needed since it’s a supervised learning algorithm.

Figure 4-12 explains KNN algorithms.

(The figure illustrates KNN in four steps: look at the data, calculate distances, find neighbors, and vote on labels.)

Figure 4-12

KNN algorithm explained
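As a quick illustration of the voting idea (a toy sketch with made-up points, unrelated to the recommendation task), scikit-learn’s KNeighborsClassifier labels a new point by majority vote among its k nearest neighbors.
from sklearn.neighbors import KNeighborsClassifier
# Toy two-dimensional points belonging to two classes
X = [[0, 0], [1, 1], [1, 0], [8, 8], [9, 8], [8, 9]]
y = [0, 0, 0, 1, 1, 1]
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X, y)
# A new point near the second cluster is voted into class 1
print(knn_clf.predict([[7, 8]])) # [1]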

Now let’s try implementing a simple KNN model on purchase_df, created in user-to-user filtering. The approach follows the same steps as before (recommendations are based on the items purchased by similar users); the difference is that a KNN model is used to find the similar users for a given user.

Implementation

Before passing our sparse matrix (i.e., purchase_df) into KNN, it must be converted into a CSR matrix.

CSR (compressed sparse row) format divides a sparse matrix into three separate arrays (see the short sketch after this list).
  • data: the nonzero values

  • indptr: the extent of each row (where each row’s values begin and end)

  • indices: the column index of each value

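The following is a minimal sketch (on a toy matrix, separate from our data) showing the three arrays.
toy = csr_matrix(np.array([[0, 2, 0],
                           [3, 0, 1]]))
print(toy.data)    # nonzero values: [2 3 1]
print(toy.indptr)  # row extents: [0 1 3] (row 0 holds one value, row 1 holds two)
print(toy.indices) # column index of each value: [1 0 2]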
So, let’s convert the sparse matrix into a CSR matrix.
purchase_matrix = csr_matrix(purchase_df.values)
Next, create the KNN model using the Euclidean distance metric.
knn_model = NearestNeighbors(metric = 'euclidean', algorithm = 'brute')
Once the model is created, fit it on the data/matrix.
knn_model.fit(purchase_matrix)
Figure 4-13 shows the fitted KNN model.

NearestNeighbors(algorithm='brute', metric='euclidean')

Figure 4-13

Fitted KNN model

Now that the KNN model is in place, let’s write a function to fetch similar users using the model.
def fetch_similar_users_knn(purchase_df,query_index):
    # Creating empty list where we will store user id of similar users
    simular_users_knn = []
    # Storing the distance and index of nearest neighbor
    distances, indices = knn_model.kneighbors(purchase_df.iloc[query_index,:].values.reshape(1, -1), n_neighbors = 5)
    for i in range(0, len(distances.flatten())):
        if i == 0:
            print('Recommendations for {0}: '.format(purchase_df.index[query_index]))
        else:
            print('{0}: {1}, with distance of {2}:'.format(i, purchase_df.index[indices.flatten()[i]], distances.flatten()[i]))
            simular_users_knn.append(purchase_df.index[indices.flatten()[i]])
    # Return the list of similar user IDs for downstream recommendations
    return simular_users_knn

This function first calculates the distances and indices of the five nearest neighbors using the KNN model’s kneighbors function. That output is then processed, and the list of similar users’ IDs is returned. Note that the input is the user’s positional index in the DataFrame, not the CustomerID.

Let’s test this out for index 1497, storing the returned list.
simular_users_knn = fetch_similar_users_knn(purchase_df,1497)
The following is the output.
Recommendations for 14729:
1: 16917, with distance of 8.12403840463596:
2: 16989, with distance of 8.12403840463596:
3: 15124, with distance of 8.12403840463596:
4: 12897, with distance of 8.246211251235321:
simular_users_knn
The following is the output.
[16917, 16989, 15124, 12897]

Now that we have similar users, let’s get the recommendations by showing the items bought by these similar users.

Write a function to get similar user recommendations.
def knn_recommendation(simular_users_knn):
    #obtaining all the items bought by similar users
    knn_recommnedations = []
    for j in simular_users_knn:
        item_list = data1[data1["CustomerID"]==j]['StockCode'].to_list()
        knn_recommnedations.append(item_list)
    #this gives us multi-dimensional list
    # we need to flatten it
    flat_list = []
    for sublist in knn_recommnedations:
        for item in sublist:
            flat_list.append(item)
    final_recommendations_list = list(dict.fromkeys(flat_list))
    # storing 10 random recommendations in a list
    ten_random_recommendations = random.sample(final_recommendations_list, 10)
    print('Items bought by Similar users based on KNN')
    #returning 10 random recommendations
    return ten_random_recommendations

This function replicates the logic used in user-to-user filtering: it gathers the final list of items purchased by the similar users and recommends ten random items from it.

Using this function on the previously generated similar users list gets the following recommendations.
knn_recommendation(simular_users_knn)
The following is the output using the KNN approach.
Items bought by Similar users based on KNN
['22487',
 '84997A',
 '22926',
 '22921',
 '22605',
 '23298',
 '22916',
 '22470',
 '22927',
 '84978']

User 14729 has ten suggestions from the products bought by similar users.

Summary

This chapter covered collaborative filtering-based recommendation engines and implementing the two types of filtering methods—user-to-user and item-to-item—using basic arithmetic operations. The chapter also explored the k-nearest neighbor algorithm (along with some machine learning basics). It ended by implementing user-to-user-based collaborative filtering using the KNN approach. The next chapter explores other popular methods to implement collaborative filtering-based recommendation engines.
