© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2023
A. Kulkarni et al., Applied Recommender Systems with Python, https://doi.org/10.1007/978-1-4842-8954-9_4

4. Collaborative Filtering

Akshay Kulkarni1, Adarsha Shivananda2, Anoosh Kulkarni3 and V Adithya Krishnan4
(1)
Bangalore, Karnataka, India
(2)
Hosanagara tq, Shimoga dt, Karnataka, India
(3)
Bangalore, India
(4)
Navi Mumbai, India
 

Collaborative filtering is a very popular method in recommendation engines. It is the predictive process behind the suggestions provided by these systems. It processes and analyzes customers’ information and suggests items they will likely appreciate.

Collaborative filtering algorithms use a customer’s purchase history and ratings to find similar customers and then suggest items that they liked.

Figure 4-1 explains collaborative filtering at a high level.


Figure 4-1

Collaborative filtering explained

For example, to find a new movie or show to watch, you can ask your friends for suggestions since you all share similar tastes in content. The same concept is used in collaborative filtering, where user-user similarity finds similar users to get recommendations based on each other’s likes.

There are two types of collaborative filtering methods: user-to-user and item-to-item. They are explored in the upcoming sections. This chapter implements both methods using cosine similarity before diving into the more popular KNN-based algorithm for collaborative filtering.

Implementation

The following installs the surprise library.
!pip install scikit-surprise
The following imports basic libraries.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import random
from IPython.display import Image
The following imports the KNN algorithm and csr_matrix for KNN data preparation.
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
The following imports cosine_similarity, which is used to calculate cosine similarity.
from sklearn.metrics.pairwise import cosine_similarity
Let’s import surprise.Reader and surprise.Dataset for surprise data preparation.
from surprise import Reader, Dataset
Next, import surprise.model_selection functions for surprise model customizations.
from surprise.model_selection import train_test_split, cross_validate, GridSearchCV
Then, import algorithms from the surprise package.
from surprise.prediction_algorithms import CoClustering
from surprise.prediction_algorithms import NMF
Finally, import accuracy to get metrics such as root-mean-square error (RMSE) and mean absolute error (MAE).
from surprise import accuracy
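Although these surprise classes are not exercised until later, the following minimal sketch shows how they typically fit together, using a toy ratings DataFrame (the DataFrame and its column names are assumptions for illustration).
ratings_df = pd.DataFrame({'userID': [1, 1, 2, 2], 'itemID': ['A', 'B', 'A', 'B'], 'rating': [5, 3, 4, 2]}) # hypothetical ratings
reader = Reader(rating_scale=(1, 5)) # rating scale of the toy data
surprise_data = Dataset.load_from_df(ratings_df[['userID', 'itemID', 'rating']], reader)
trainset, testset = train_test_split(surprise_data, test_size=0.25)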

Data Collection

This chapter uses a custom dataset that has been masked. Download the dataset from the GitHub link.

The following reads the data.
data = pd.read_excel('Rec_sys_data.xlsx')
data.head()
Figure 4-2 shows the DataFrame.


Figure 4-2

Input data

About the Dataset

The following is the data dictionary for the dataset; it has nine features (columns).
  • InvoiceNo: The invoice number of a particular transaction

  • StockCode: The unique identifier for a particular item

  • Quantity: The quantity of that item bought by the customer

  • InvoiceDate: The date and time when the transaction was made

  • DeliveryDate: The date and time when the delivery happened

  • Discount%: Percentage of discount on the purchased item

  • ShipMode: Mode of shipping

  • ShippingCost: Cost of shipping that item

  • CustomerID: The unique identifier of a particular customer

The following checks the size of the data.
data.shape
(272404, 9)

The dataset has a total of 272,404 transactions across nine columns.

Let’s check if there are any null values because a clean dataset is required for further analysis.
data.isnull().sum().sort_values(ascending=False)
The following is the output.
CustomerID      0
ShippingCost    0
ShipMode        0
Discount%       0
DeliveryDate    0
InvoiceDate     0
Quantity        0
StockCode       0
InvoiceNo       0
dtype: int64

The data is clean with no nulls in any columns. Further preprocessing is not required in this case.

If there were any NaNs or nulls in the data, they could be dropped using the following.
data1 = data.dropna()
Now let’s check for any data abnormalities by describing the data.
data1.describe()
Figure 4-3 describes data1.


Figure 4-3

data1

There aren’t any negative values in the Quantity column, but if there were, those records would need to be dropped since negative quantities are a data abnormality. A sketch of such a filter follows.
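The following one-line filter (not needed for this dataset, shown only for reference) would keep the records with a positive quantity.
data1 = data1[data1['Quantity'] > 0] # drop rows with zero or negative quantity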

Let’s change the StockCode column datatype to string to maintain the same type across all rows.
data1.StockCode = data1.StockCode.astype(str)

Memory-Based Approach

Let’s examine the most basic approach to implementing collaborative filtering: the memory-based approach. This approach uses simple arithmetic operations or metrics to calculate the similarities between two users or two items to group them. For example, to find user-user relations, the items both users have historically liked are used to compute a similarity metric that measures how similar the two users are.

Cosine similarity is a common similarity metric; Euclidean distance and Pearson’s correlation are other popular choices. These metrics are geometric: the row (or column) of a given user (or item) is treated as a vector. In cosine similarity, the similarity of two users is measured as the cosine of the angle between the two users’ vectors. For users A and B, the cosine similarity is given by the formula shown in Figure 4-4.

similarity = cos(θ) = (A · B) / (||A|| ||B||)

Figure 4-4

Cosine similarity formula
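To make the formula concrete, the following is a small sketch that computes the same quantity with NumPy for two made-up purchase vectors (the vectors are assumptions for illustration).
# Two hypothetical user purchase vectors over five items
A = np.array([1, 0, 1, 1, 0])
B = np.array([1, 1, 1, 0, 0])
# Cosine similarity: dot product divided by the product of the vector norms
cos_sim = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
print(cos_sim) # ~0.667, i.e., fairly similar users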

This approach is easy to implement and understand because no model training or heavy optimization algorithms are involved. However, its performance degrades when there is sparse data. For this method to work precisely, huge amounts of clean data on multiple users and items are required, which hinders the scalability of this approach for most real-world applications.

The memory-based approach is further divided into user-to-user-based and item-to-item-based collaborative filtering.

The implementation of both methods is explored in this chapter.

Figure 4-5 illustrates user-based and item-based filtering.


Figure 4-5

User-based and item-based collaborative filtering

User-to-User Collaborative Filtering

User-to-user-based collaborative filtering recommends items that a particular user might like by finding similar users, using purchase history or ratings on various items, and then suggesting the items liked by these similar users.

Here, a matrix is formed to describe the behavior of all the users (purchase history in our example) corresponding to all the items. Using this matrix, you can calculate the similarity metrics (cosine similarity) between users to formulate user-user relations. These relations help find users similar to a given user and recommend the items bought by these similar users.

Implementation

Let’s first create a data matrix covering purchase history. It contains all customer IDs for all items (whether a customer has purchased an item or not).
purchase_df = (data1.groupby(['CustomerID', 'StockCode'])['Quantity'].sum().unstack().reset_index().fillna(0).set_index('CustomerID'))
purchase_df.head()
Figure 4-6 shows the purchase data matrix.


Figure 4-6

Purchase data matrix

The data matrix shown in Figure 4-6 reveals the total quantity purchased by each user against each item. Only information about whether the item was bought or not by the user is needed, not the quantity.

Thus, an encoding of 0 or 1 is used, where 0 is not purchased, and 1 is purchased.

Let’s first write a function for encoding the data matrix.
def encode_units(x):
    if x < 1: # If the quantity is less than 1
        return 0 # Not purchased
    if x >= 1: # If the quantity is greater than or equal to 1
        return 1 # Purchased
Next, apply this function to the data matrix.
purchase_df = purchase_df.applymap(encode_units)
purchase_df.head()
Figure 4-7 shows the purchase data matrix after encoding.


Figure 4-7

Purchase data matrix after encoding

The purchase data matrix captures the behavior of customers across all items. This matrix is used to compute the user similarity scores matrix, with cosine similarity as the metric. The user similarity scores matrix holds the user-to-user similarity for each pair of users.

First, let’s apply cosine_similarity to the purchase data matrix.
user_similarities = cosine_similarity(purchase_df)
Now, let’s store the user similarity scores in a DataFrame (i.e., the similarity scores matrix).
user_similarity_data = pd.DataFrame(user_similarities,index=purchase_df.index,columns=purchase_df.index)
user_similarity_data.head()
Figure 4-8 shows the user similarity scores data matrix.


Figure 4-8

User similarity scores DataFrame

The similarity score values are between 0 and 1, where values closer to 0 represent less similar customers, and values closer to 1 represent more similar customers.
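Individual pair scores can be read directly off this matrix. For example, the following lookup (a sketch, where 12346 and 12347 are assumed to be valid customer IDs in the index) returns the similarity between two customers.
user_similarity_data.loc[12346, 12347] # similarity score for this example pair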

Using this user similarity scores data, let’s get recommendations for a given user.

Create a function for this.
def fetch_similar_users(user_id,k=5):
    # separating data rows for the entered user id
    user_similarity = user_similarity_data[user_similarity_data.index == user_id]
    # a data of all other users
    other_users_similarities = user_similarity_data[user_similarity_data.index != user_id]
    # calculate cosine similarity between the user and each other user
    similarities = cosine_similarity(user_similarity,other_users_similarities)[0].tolist()
    user_indices = other_users_similarities.index.tolist()
    index_similarity_pair = dict(zip(user_indices, similarities))
    # sort by similarity in descending order
    sorted_index_similarity_pair = sorted(index_similarity_pair.items(), key=lambda x: x[1], reverse=True)
    top_k_users_similarities = sorted_index_similarity_pair[:k]
    similar_users = [u[0] for u in top_k_users_similarities]
    print('The users with behavior similar to that of user {0} are:'.format(user_id))
    return similar_users

This function separates the selected user from all other users, computes the cosine similarity of the selected user against every other user, and returns the top k most similar users (by CustomerID).

For example, let’s find the users similar to user 12347.
similar_users = fetch_similar_users(12347)
similar_users
The following is the output.
The users with behavior similar to that of user 12347 are:
[18287, 18283, 18282, 18281, 18280]

As expected, the default five users are similar to user 12347.

Now, let’s get the recommendations by showing the items bought by similar users.

Write another function to get similar user recommendations.
def simular_users_recommendation(userid):
    similar_users = fetch_similar_users(userid)
    #obtaining all the items bought by similar users
    simular_users_recommendation_list = []
    for j in similar_users:
        item_list = data1[data1["CustomerID"]==j]['StockCode'].to_list()
        simular_users_recommendation_list.append(item_list)
    #this gives us multi-dimensional list
    # we need to flatten it
    flat_list = []
    for sublist in simular_users_recommendation_list:
        for item in sublist:
            flat_list.append(item)
    final_recommendations_list = list(dict.fromkeys(flat_list))
    # storing 10 random recommendations in a list
    ten_random_recommendations = random.sample(final_recommendations_list, 10)
    print('Items bought by Similar users based on Cosine Similarity')
    #returning 10 random recommendations
    return ten_random_recommendations

This function gets the similar users for the given customer ID and collects all the items bought by those similar users. The multidimensional list is then flattened into a final list of unique items, from which ten randomly chosen items are returned as recommendations for the given user.

Using this function on user 12347 to get recommendations results in the following suggestions.
simular_users_recommendation(12347)
The following is the output.
The users with behavior similar to that of user 12347 are:
Items bought by Similar users based on Cosine Similarity
['21967', '21908', '21154', '20723', '23296', '22271', '22746', '22355', '22554', '23199']

User 12347 had ten suggestions from the items bought by similar users.

Item-to-Item Collaborative Filtering

Item-to-item based collaborative filtering recommends items that a particular user might like by finding items similar to the ones they have already purchased. As before, purchase history or user ratings are used to profile each item.

A matrix is formed to describe the behavior of all the users (purchase history in our example) corresponding to all the items. This matrix helps calculate the similarity metrics (cosine similarity) between items to formulate the item-item relations. This relation is used to recommend items similar to those previously purchased by the selected user.

Implementation

Following the initial steps used in user-to-user collaborative filtering methods, let’s first create the data matrix, which contains all the item IDs across their purchase history (i.e., quantity purchased by each customer).
items_purchase_df = (data1.groupby(['StockCode','CustomerID'])['Quantity'].sum().unstack().reset_index().fillna(0).set_index('StockCode'))
items_purchase_df.head()

The following is the output.

Figure 4-9 shows the item purchase data matrix.


Figure 4-9

Item purchase data matrix

This data matrix shows the total quantity purchased by each user against each item. But the only information needed is whether the user bought the item.

Thus, an encoding of 0 or 1 is used, where 0 is not purchased, and 1 is purchased.

Use the same encode_units function created earlier.
items_purchase_df = items_purchase_df.applymap(encode_units)

The items purchase data matrix reveals the behavior of customers across all items. Let’s use this matrix to find item similarity scores with the cosine similarity metric. The item similarity score matrix has item-to-item similarity for each item pair.

First, let’s apply cosine_similarity to the item purchase data matrix.
item_similarities = cosine_similarity(items_purchase_df)
Now, let’s store the item similarity scores in a DataFrame (i.e., the similarity scores matrix).
item_similarity_data = pd.DataFrame(item_similarities,index=items_purchase_df.index,columns=items_purchase_df.index)
item_similarity_data.head()
Figure 4-10 shows the item similarity scores data matrix.


Figure 4-10

Item similarity scores DataFrame

The similarity score values are between 0 and 1, where values closer to 0 represent less similarity and values closer to 1 represent more similar items.

Using this item similarity score data, let’s get recommendations for a given user.

The following creates a function for this.
def fetch_similar_items(item_id,k=10):
    # separating data rows of the selected item
    item_similarity = item_similarity_data[item_similarity_data.index == item_id]
    # a data of all other items
    other_items_similarities = item_similarity_data[item_similarity_data.index != item_id]
    # calculate cosine similarity between selected item with other items
    similarities = cosine_similarity(item_similarity,other_items_similarities)[0].tolist()
    # create list of indices of these items
    item_indices = other_items_similarities.index.tolist()
    # create key/values pairs of item index and their similarity
    index_similarity_pair = dict(zip(item_indices, similarities))
    # sort by similarity in descending order
    sorted_index_similarity_pair = sorted(index_similarity_pair.items(), key=lambda x: x[1], reverse=True)
    # grab k items from the top
    top_k_item_similarities = sorted_index_similarity_pair[:k]
    similar_items = [u[0] for u in top_k_item_similarities]
    print('Similar items based on purchase behavior (item-to-item collaborative filtering)')
    return similar_items

This function separates the selected item from all other items, computes the cosine similarity of the selected item against every other item, and returns the top k most similar items (by StockCode).

For example, let’s find the items similar to item 10002.
similar_items = fetch_similar_items('10002')
similar_items
The following is the output.
Similar items based on purchase behavior (item-to-item collaborative filtering)
['10080',
 '10120',
 '10123C',
 '10124A',
 '10124G',
 '10125',
 '10133',
 '10135',
 '11001',
 '15030']

As expected, you see the default ten similar items to item 10002.

Now let’s get the recommendations by showing similar items to those bought by a particular user.

Write another function to get similar item recommendations.
def simular_item_recommendation(userid):
    simular_items_recommendation_list = []
    #obtaining all the similar items to items bought by user
    item_list = data1[data1["CustomerID"]==userid]['StockCode'].to_list()
    for item in item_list:
        similar_items = fetch_similar_items(item)
        simular_items_recommendation_list.append(similar_items)
    #this gives us multi-dimensional list
    # we need to flatten it
    flat_list = []
    for sublist in simular_items_recommendation_list:
        for item in sublist:
            flat_list.append(item)
    final_recommendations_list = list(dict.fromkeys(flat_list))
    # storing 10 random recommendations in a list
    ten_random_recommendations = random.sample(final_recommendations_list, 10)
    print('Similar Items bought by our users based on Cosine Similarity')
    #returning 10 random recommendations
    return ten_random_recommendations

This function gets the list of similar items for every item previously bought by the given customer ID. This multidimensional list is then flattened into a final list of unique items, from which ten randomly chosen items are returned as recommendations for the given user.

Again, trying this function on user 12347 to get the recommendations for that user results in the following suggestions.
simular_item_recommendation(12347)
The following is the output.
Similar Items bought by our users based on Cosine Similarity
['22196',
 '22775',
 '22492',
 '23146',
 '22774',
 '21035',
 '16008',
 '21041',
 '23316',
 '22550']

User 12347 has ten suggestions that are similar to items previously bought.

KNN-based Approach

You have learned the basics of collaborative filtering and implementing user-to-user and item-to-item filtering. Now let’s dive into machine learning-based approaches, which are more robust and popular in building recommendation systems.

Machine Learning

Machine learning is a machine’s capability to learn from experience (data) and make meaningful predictions without being explicitly programmed. It is a subfield of artificial intelligence that deals with building systems that can learn from data. The objective is to make computers learn on their own without any intervention from humans.

There are three primary machine learning categories.

Supervised Learning

In supervised learning, labeled training data is leveraged to learn the underlying pattern or function. The data consists of a dependent variable (the target label) and independent variables (predictors). The model learns from the labeled data and predicts the output for unseen data.

Unsupervised Learning

In unsupervised learning, a machine learns hidden patterns without leveraging labeled data, so no target labels are involved in training. These algorithms capture patterns based on similarities or distances between data points.

Reinforcement Learning

Reinforcement learning is the process of maximizing a reward by taking action. The algorithms learn how to reach a goal through experience.

Figure 4-11 explains all the categories and subcategories.

(The figure shows supervised learning with regression and classification, unsupervised learning with clustering and dimensionality reduction, and reinforcement learning.)

Figure 4-11

Machine learning categories

Supervised Learning

There are two types of supervised learning: regression and classification.

Regression

Regression is a statistical predictive modeling technique that finds the relationship between a dependent variable and one or more independent variables. Regression is used when the dependent variable is continuous; prediction can take any numerical value.

Popular regression algorithms include linear regression, decision tree, random forest, SVM, LightGBM, and XGBoost.

Classification

Classification is a supervised machine learning technique in which the dependent or output variable is categorical; for example, spam/ham, churn/not churned, and so on.
  • In binary classification, it's either yes or no. There is no third option; for example, the customer can churn or not churn from a given business.

  • In multiclass classification, the label can take more than two classes; for example, product categorization on an e-commerce website.

Logistic regression, k-nearest neighbor, decision tree, random forest, SVM, LightGBM, and XGBoost are popular classification algorithms.

K-Nearest Neighbor

The k-nearest neighbor (KNN) algorithm is a supervised machine learning model that is used for both classification and regression problems. It is a very robust algorithm that is easy to implement and interpret and uses less calculation time. Labeled data is needed since it’s a supervised learning algorithm.

Figure 4-12 explains KNN algorithms.

(The figure illustrates KNN in four steps: look at the data, calculate distances, find neighbors, and vote on labels.)

Figure 4-12

KNN algorithm explained
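As a quick illustration of the voting idea (a toy sketch with made-up points, unrelated to the recommendation task), scikit-learn’s KNeighborsClassifier labels a new point by majority vote among its k nearest neighbors.
from sklearn.neighbors import KNeighborsClassifier
# Toy two-dimensional points belonging to two classes
X = [[0, 0], [1, 1], [1, 0], [8, 8], [9, 8], [8, 9]]
y = [0, 0, 0, 1, 1, 1]
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X, y)
# A new point near the second cluster is voted into class 1
print(knn_clf.predict([[7, 8]])) # [1]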

Now let’s try implementing a simple KNN model on purchase_df, created in user-to-user filtering. The approach follows the same steps as before (recommendations are based on the items purchased by similar users); the difference is that a KNN model is used to find the similar users for a given user.

Implementation

Before passing our sparse matrix (i.e., purchase_df) into KNN, it must be converted into a CSR matrix.

CSR (compressed sparse row) format divides a sparse matrix into three separate arrays (see the short sketch after this list).
  • data: the nonzero values

  • indptr: the extent of each row (where each row’s values begin and end)

  • indices: the column index of each value

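The following is a minimal sketch (on a toy matrix, separate from our data) showing the three arrays.
toy = csr_matrix(np.array([[0, 2, 0],
                           [3, 0, 1]]))
print(toy.data)    # nonzero values: [2 3 1]
print(toy.indptr)  # row extents: [0 1 3] (row 0 holds one value, row 1 holds two)
print(toy.indices) # column index of each value: [1 0 2]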
So, let’s convert the sparse matrix into a CSR matrix.
purchase_matrix = csr_matrix(purchase_df.values)
Next, create the KNN model using the Euclidean distance metric.
knn_model = NearestNeighbors(metric = 'euclidean', algorithm = 'brute')
Once the model is created, fit it on the data/matrix.
knn_model.fit(purchase_matrix)
Figure 4-13 shows the fitted KNN model.

NearestNeighbors(algorithm='brute', metric='euclidean')

Figure 4-13

Fitted KNN model

Now that the KNN model is in place, let’s write a function to fetch similar users using the model.
def fetch_similar_users_knn(purchase_df,query_index):
    # Creating empty list where we will store user id of similar users
    simular_users_knn = []
    # Storing the distance and index of nearest neighbor
    distances, indices = knn_model.kneighbors(purchase_df.iloc[query_index,:].values.reshape(1, -1), n_neighbors = 5)
    for i in range(0, len(distances.flatten())):
        if i == 0:
            print('Recommendations for {0}: '.format(purchase_df.index[query_index]))
        else:
            print('{0}: {1}, with distance of {2}:'.format(i, purchase_df.index[indices.flatten()[i]], distances.flatten()[i]))
            simular_users_knn.append(purchase_df.index[indices.flatten()[i]])
    # Return the list of similar user IDs for downstream recommendations
    return simular_users_knn

This function first calculates the distances and indices of the five nearest neighbors using the KNN model’s kneighbors function. That output is then processed, and the list of similar users’ IDs is returned. Note that the input is the user’s positional index in the DataFrame, not the CustomerID.

Let’s test this out for index 1497, storing the returned list.
simular_users_knn = fetch_similar_users_knn(purchase_df,1497)
The following is the output.
Recommendations for 14729:
1: 16917, with distance of 8.12403840463596:
2: 16989, with distance of 8.12403840463596:
3: 15124, with distance of 8.12403840463596:
4: 12897, with distance of 8.246211251235321:
simular_users_knn
The following is the output.
[16917, 16989, 15124, 12897]

Now that we have similar users, let’s get the recommendations by showing the items bought by these similar users.

Write a function to get similar user recommendations.
def knn_recommendation(simular_users_knn):
    #obtaining all the items bought by similar users
    knn_recommnedations = []
    for j in simular_users_knn:
        item_list = data1[data1["CustomerID"]==j]['StockCode'].to_list()
        knn_recommnedations.append(item_list)
    #this gives us multi-dimensional list
    # we need to flatten it
    flat_list = []
    for sublist in knn_recommnedations:
        for item in sublist:
            flat_list.append(item)
    final_recommendations_list = list(dict.fromkeys(flat_list))
    # storing 10 random recommendations in a list
    ten_random_recommendations = random.sample(final_recommendations_list, 10)
    print('Items bought by Similar users based on KNN')
    #returning 10 random recommendations
    return ten_random_recommendations

This function replicates the logic used in user-to-user filtering: it gathers the final list of items purchased by the similar users and recommends ten random items from it.

Using this function on the previously generated similar users list gets the following recommendations.
knn_recommendation(simular_users_knn)
The following is the output using the KNN approach.
Items bought by Similar users based on KNN
['22487',
 '84997A',
 '22926',
 '22921',
 '22605',
 '23298',
 '22916',
 '22470',
 '22927',
 '84978']

User 14729 has ten suggestions from the products bought by similar users.

Summary

This chapter covered collaborative filtering-based recommendation engines and implementing the two types of filtering methods—user-to-user and item-to-item—using basic arithmetic operations. The chapter also explored the k-nearest neighbor algorithm (along with some machine learning basics). It ended by implementing user-to-user-based collaborative filtering using the KNN approach. The next chapter explores other popular methods to implement collaborative filtering-based recommendation engines.
