Building a recommendation system with a user-based collaborative filtering technique

The jokes recommendation system we built earlier with item-based collaborative filtering (IBCF) uses the powerful recommenderlab library available in R. In this implementation of the user-based collaborative filtering (UBCF) approach, we make use of the same library.

The following diagram shows the working principle of UBCF:

Example depicting the working principle of a user-based collaborative filter

To understand the concept better, let's discuss the preceding diagram in detail. Assume that there are three users: X, Y, and Z. In UBCF, users X and Z are very similar, as both of them like strawberries and watermelons. User X also likes grapes and oranges, so a user-based collaborative filter recommends grapes and oranges to user Z. The idea is that similar people tend to like similar things.

The primary difference between a user-based collaborative filter and an item-based collaborative filter is demonstrated by the following recommendation captions often seen in online retail sites:

  • IBCF: Customers who bought this item also bought
  • UBCF: Customers similar to you bought

A user-based collaborative filter is built upon the following three key steps:

  1. Find the k-nearest neighbors (KNN) of user x, using a similarity function, w, to measure the distance between each pair of users.
  2. Predict the ratings that user x will give to all the items that the k-nearest neighbors have rated, but x has not.
  3. Recommend to user x the top N items with the highest predicted ratings.

In short, a user-item matrix is constructed during the UBCF process, and based on similar users, the ratings of the items a user has not yet seen are predicted. The items with the highest predicted ratings form the final list of recommendations.
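The three steps can be sketched on a toy user-item matrix. This is a hypothetical fruit example matching the earlier diagram, not the Jester data; the ratings, the manual cosine similarity, and the weighted-average prediction are all illustrative assumptions:

```r
# Toy user-item matrix: rows are users, columns are items
# (NA marks an item the user has not rated)
ratings <- matrix(c(
  5, 4, 5, 4,    # user X
  1, 2, 2, 1,    # user Y
  5, 5, NA, NA   # user Z has not rated grape or orange
), nrow = 3, byrow = TRUE,
dimnames = list(c("X", "Y", "Z"),
                c("strawberry", "watermelon", "grape", "orange")))

# Cosine similarity between two users, computed over co-rated items only
cos_sim <- function(a, b) {
  idx <- !is.na(a) & !is.na(b)
  sum(a[idx] * b[idx]) / (sqrt(sum(a[idx]^2)) * sqrt(sum(b[idx]^2)))
}

# Step 1: measure Z's distance to the other users
sim_XZ <- cos_sim(ratings["X", ], ratings["Z", ])
sim_YZ <- cos_sim(ratings["Y", ], ratings["Z", ])

# Step 2: predict Z's rating for "grape" as a similarity-weighted
# average of the neighbours' ratings for that item
w <- c(X = sim_XZ, Y = sim_YZ)
pred_grape <- sum(w * ratings[c("X", "Y"), "grape"]) / sum(w)
# Step 3 would then rank the predicted items and keep the top N
```

Because X's tastes align with Z's more closely than Y's do, the prediction for grape is pulled toward X's rating of 5 rather than Y's rating of 2.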

The implementation of this project is very similar to the IBCF project, as we are using the same library. The only change required in the code is to replace the IBCF method with UBCF. The following code block is the full code of the project implementation with UBCF:

library(recommenderlab)
data(Jester5k)
# split the data into the training and the test set
Jester5k_es <- evaluationScheme(Jester5k, method="split", train=0.8, given=20, goodRating=0)
print(Jester5k_es)
type = "UBCF"
# train UBCF cosine similarity models
# non-normalized
UBCF_N_C <- Recommender(getData(Jester5k_es, "train"), type,
                        param = list(normalize = NULL, method = "Cosine"))
# centered
UBCF_C_C <- Recommender(getData(Jester5k_es, "train"), type,
                        param = list(normalize = "center", method = "Cosine"))
# Z-score normalization
UBCF_Z_C <- Recommender(getData(Jester5k_es, "train"), type,
                        param = list(normalize = "Z-score", method = "Cosine"))
# train UBCF Euclidean distance models
# non-normalized
UBCF_N_E <- Recommender(getData(Jester5k_es, "train"), type,
                        param = list(normalize = NULL, method = "Euclidean"))
# centered
UBCF_C_E <- Recommender(getData(Jester5k_es, "train"), type,
                        param = list(normalize = "center", method = "Euclidean"))
# Z-score normalization
UBCF_Z_E <- Recommender(getData(Jester5k_es, "train"), type,
                        param = list(normalize = "Z-score", method = "Euclidean"))
# train UBCF Pearson correlation models
# non-normalized
UBCF_N_P <- Recommender(getData(Jester5k_es, "train"), type,
                        param = list(normalize = NULL, method = "pearson"))
# centered
UBCF_C_P <- Recommender(getData(Jester5k_es, "train"), type,
                        param = list(normalize = "center", method = "pearson"))
# Z-score normalization
UBCF_Z_P <- Recommender(getData(Jester5k_es, "train"), type,
                        param = list(normalize = "Z-score", method = "pearson"))
# compute predicted ratings from each of the 9 models on the test dataset
pred1 <- predict(UBCF_N_C, getData(Jester5k_es, "known"), type="ratings")
pred2 <- predict(UBCF_C_C, getData(Jester5k_es, "known"), type="ratings")
pred3 <- predict(UBCF_Z_C, getData(Jester5k_es, "known"), type="ratings")
pred4 <- predict(UBCF_N_E, getData(Jester5k_es, "known"), type="ratings")
pred5 <- predict(UBCF_C_E, getData(Jester5k_es, "known"), type="ratings")
pred6 <- predict(UBCF_Z_E, getData(Jester5k_es, "known"), type="ratings")
pred7 <- predict(UBCF_N_P, getData(Jester5k_es, "known"), type="ratings")
pred8 <- predict(UBCF_C_P, getData(Jester5k_es, "known"), type="ratings")
pred9 <- predict(UBCF_Z_P, getData(Jester5k_es, "known"), type="ratings")
# set all predictions that fall outside the valid rating range [-10, 10]
# to the boundary values
pred1@data@x[pred1@data@x[] < -10] <- -10
pred1@data@x[pred1@data@x[] > 10] <- 10
pred2@data@x[pred2@data@x[] < -10] <- -10
pred2@data@x[pred2@data@x[] > 10] <- 10
pred3@data@x[pred3@data@x[] < -10] <- -10
pred3@data@x[pred3@data@x[] > 10] <- 10
pred4@data@x[pred4@data@x[] < -10] <- -10
pred4@data@x[pred4@data@x[] > 10] <- 10
pred5@data@x[pred5@data@x[] < -10] <- -10
pred5@data@x[pred5@data@x[] > 10] <- 10
pred6@data@x[pred6@data@x[] < -10] <- -10
pred6@data@x[pred6@data@x[] > 10] <- 10
pred7@data@x[pred7@data@x[] < -10] <- -10
pred7@data@x[pred7@data@x[] > 10] <- 10
pred8@data@x[pred8@data@x[] < -10] <- -10
pred8@data@x[pred8@data@x[] > 10] <- 10
pred9@data@x[pred9@data@x[] < -10] <- -10
pred9@data@x[pred9@data@x[] > 10] <- 10
# aggregate the performance statistics
error_UBCF <- rbind(
UBCF_N_C = calcPredictionAccuracy(pred1, getData(Jester5k_es, "unknown")),
UBCF_C_C = calcPredictionAccuracy(pred2, getData(Jester5k_es, "unknown")),
UBCF_Z_C = calcPredictionAccuracy(pred3, getData(Jester5k_es, "unknown")),
UBCF_N_E = calcPredictionAccuracy(pred4, getData(Jester5k_es, "unknown")),
UBCF_C_E = calcPredictionAccuracy(pred5, getData(Jester5k_es, "unknown")),
UBCF_Z_E = calcPredictionAccuracy(pred6, getData(Jester5k_es, "unknown")),
UBCF_N_P = calcPredictionAccuracy(pred7, getData(Jester5k_es, "unknown")),
UBCF_C_P = calcPredictionAccuracy(pred8, getData(Jester5k_es, "unknown")),
UBCF_Z_P = calcPredictionAccuracy(pred9, getData(Jester5k_es, "unknown"))
)
library(knitr)
print(kable(error_UBCF))

This will result in the following output:

|         |     RMSE|      MSE|      MAE|
|:--------|--------:|--------:|--------:|
|UBCF_N_C | 4.877935| 23.79425| 3.986170|
|UBCF_C_C | 4.518210| 20.41422| 3.578551|
|UBCF_Z_C | 4.517669| 20.40933| 3.552120|
|UBCF_N_E | 4.644877| 21.57488| 3.778046|
|UBCF_C_E | 4.489157| 20.15253| 3.552543|
|UBCF_Z_E | 4.496185| 20.21568| 3.528534|
|UBCF_N_P | 4.927442| 24.27968| 4.074879|
|UBCF_C_P | 4.487073| 20.13382| 3.553429|
|UBCF_Z_P | 4.484986| 20.11510| 3.525356|

Based on the UBCF output, we observe that Z-score normalized data with Pearson's correlation as the distance measure has yielded the best performance (the lowest RMSE, MSE, and MAE). Furthermore, the UBCF and IBCF results may be compared (testing needs to be done on the same test dataset) to select the best of the 18 models for the final recommendation engine deployment.
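Such a comparison could be sketched as follows. This is a minimal sketch, assuming we pit the best UBCF configuration against its IBCF counterpart on the same evaluationScheme split; the exact numbers will vary with the random split:

```r
library(recommenderlab)
data(Jester5k)
# use one common split so that the accuracies are directly comparable
Jester5k_es <- evaluationScheme(Jester5k, method = "split",
                                train = 0.8, given = 20, goodRating = 0)
# best-performing UBCF configuration versus the equivalent IBCF model
UBCF_Z_P <- Recommender(getData(Jester5k_es, "train"), "UBCF",
                        param = list(normalize = "Z-score", method = "pearson"))
IBCF_Z_P <- Recommender(getData(Jester5k_es, "train"), "IBCF",
                        param = list(normalize = "Z-score", method = "pearson"))
pred_u <- predict(UBCF_Z_P, getData(Jester5k_es, "known"), type = "ratings")
pred_i <- predict(IBCF_Z_P, getData(Jester5k_es, "known"), type = "ratings")
acc <- rbind(
  UBCF_Z_P = calcPredictionAccuracy(pred_u, getData(Jester5k_es, "unknown")),
  IBCF_Z_P = calcPredictionAccuracy(pred_i, getData(Jester5k_es, "unknown"))
)
print(acc)
```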

The key point to observe in the code is the UBCF value that is passed to the method parameter. In the previous project, where we built an item-based collaborative filter, this value was IBCF; all that was needed here was to replace it with UBCF.
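For deployment, we typically want top-N recommendation lists rather than raw predicted ratings. A minimal sketch, assuming the winning Z-score/Pearson UBCF configuration (the n parameter of predict controls the list length):

```r
library(recommenderlab)
data(Jester5k)
Jester5k_es <- evaluationScheme(Jester5k, method = "split",
                                train = 0.8, given = 20, goodRating = 0)
UBCF_Z_P <- Recommender(getData(Jester5k_es, "train"), "UBCF",
                        param = list(normalize = "Z-score", method = "pearson"))
# type = "topNList" (the default) returns the N best unseen items per user
top5 <- predict(UBCF_Z_P, getData(Jester5k_es, "known"), n = 5)
as(top5, "list")[1:2]  # recommended joke IDs for the first two test users
```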