Exploring the data

Before building and evaluating recommender systems using the two data sets we have loaded, it is a good idea to get a feel for the data. For one thing, we can use the getRatings() function to retrieve the ratings from a rating matrix, which is useful for constructing a histogram of item ratings. We can also normalize the ratings with respect to each user, as we discussed earlier. The following code snippet shows how to compute raw and normalized ratings for the Jester data; we then do the same for the MovieLens data and produce histograms of the ratings:

> jester_ratings <- getRatings(jester_rrm)
> jester_normalized_ratings <- getRatings(normalize(jester_rrm, 
                                          method = "Z-score"))

The following plot shows the different histograms:

[Figure: Histograms of raw and normalized ratings for the Jester and MovieLens data]

In the jester data, we can see that ratings above zero are more prominent than ratings below zero, and the most common rating is 10, the maximum rating. The normalized ratings create a more symmetric distribution centered on zero. For the MovieLens data with the 5-point rating scale, 4 is the most prominent rating and higher ratings are far more common than low ratings.

We can also examine the distribution of the number of items rated per user and the average rating per item by looking at the row counts and the column means of the rating matrix, respectively. Again, the following code snippet shows how to compute these for the Jester data, and we follow up with histograms showing the results for both data sets:

> jester_items_rated_per_user <- rowCounts(jester_rrm)
> jester_average_item_rating_per_item <- colMeans(jester_rrm)
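
The MovieLens counterparts and the histograms themselves can be produced along the same lines. The following sketch again assumes that movies_rrm holds the MovieLens rating matrix and uses the base R hist() function:

> movies_items_rated_per_user <- rowCounts(movies_rrm)
> movies_average_item_rating_per_item <- colMeans(movies_rrm)
> par(mfrow = c(2, 2))
> hist(jester_items_rated_per_user, main = "Jester: Items Rated per User", 
       xlab = "Number of items rated")
> hist(jester_average_item_rating_per_item, 
       main = "Jester: Average Item Rating", xlab = "Average rating")
> hist(movies_items_rated_per_user, main = "MovieLens: Items Rated per User", 
       xlab = "Number of items rated")
> hist(movies_average_item_rating_per_item, 
       main = "MovieLens: Average Item Rating", xlab = "Average rating")
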
[Figure: Histograms of the number of items rated per user and the average item rating for the Jester and MovieLens data]

Both data sets show a distribution of the number of items rated per user that resembles a power law. Most users have rated very few items, but a small number of very committed users have rated a very large number of items. In the Jester case, some have rated every joke in the data set; this is an exception and only occurs because the number of items (jokes) in this data set is relatively small. The distribution of average joke ratings lies roughly between -3 and 4, whereas for the movies we see the whole range of the rating scale, indicating that some movies were rated as completely awful or totally great by virtually everyone who rated them. We can take the means of these distributions to determine the average number of items rated per user and the average item rating.

Note that we need to remove NA values from consideration in the Jester data set, as some columns may not have ratings in them:

> (jester_avg_items_rated_per_user <- mean(rowCounts(jester_rrm)))
[1] 34.10493
> (jester_avg_item_rating <- mean(colMeans(jester_rrm), na.rm = T))
[1] 1.633048
> (movies_avg_items_rated_per_user <- mean(rowCounts(movies_rrm)))
[1] 165.5975
> (movies_avg_item_rating <- mean(colMeans(movies_rrm)))
[1] 3.238892 

Evaluating binary top-N recommendations

We now have some sense of what our data looks like for both data sets so we can start building some models. We will begin by looking at the problem of making top-N recommendations for a binary recommender system, which is simpler to do than when we have more granular data for ratings. Recall that a top-N recommendation is nothing but a list of N recommendations that are most likely to interest a user. To do this, we will use the jester data set and create a binary version of our rating matrix. We'll call any rating that is 5 or above a positive rating. As this may result in some users having no positive ratings, we'll also prune the rating matrix and keep only users with at least ten positive ratings under this scheme:

> jester_bn <- binarize(jester_rrm, minRating = 5)
> jester_bn <- jester_bn[rowCounts(jester_bn) >= 10]
> dim(jester_bn)
[1] 13789   150

One of the advantages of the recommenderlab package is that it makes it very easy for us to compare results from several algorithms. The process of training and evaluating multiple algorithms for top-N recommendations begins by creating a list containing the definitions of the algorithms that we want to use. Each element in the list is given a name of our choice but must itself be a list containing a set of parameters for configuring a known algorithm. Concretely, the name parameter of this inner list must be one that the recommenderlab package recognizes. It is possible to create and register one's own algorithm with this package, but our focus will be on existing implementations, which more than suffice for our purposes:

> algorithms <- list(
     "Random" = list(name = "RANDOM", param = NULL),
     "Popular" = list(name = "POPULAR", param = NULL),
     "UserBasedCF_COS" = list(name = "UBCF", 
                 param = list(method = "Cosine", nn = 50)),
     "UserBasedCF_JAC" = list(name = "UBCF", 
                 param = list(method = "Jaccard", nn = 50))
 )

The RANDOM algorithm is a baseline that makes recommendations at random. The POPULAR algorithm is another baseline that can sometimes be tough to beat. It proposes items in descending order of global popularity, so that for a top-1 recommendation it will suggest the most popular item in the training data. We have chosen to try out two variants of user-based collaborative filtering for this example. The first one uses the cosine distance and specifies 50 as the number of nearest neighbors to use. The second one is identical but uses the Jaccard distance instead.
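
If we want to check exactly which algorithm names and parameters the recommenderlab package recognizes for a given type of rating matrix, we can query its recommender registry. A brief sketch for our binary case:

> # Show the recommender methods registered for binary rating matrices
> recommenderRegistry$get_entries(dataType = "binaryRatingMatrix")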

Next, we define an evaluation scheme via the function evaluationScheme(). This function records how we will split our data into training and test sets, the number of ratings we will take as given from our test users via the given parameter, and how many runs we want to execute. We will do a straight 80-20 split for our training and test set, consider 10 ratings from our test users as known ratings, and evaluate over a single run:

> jester_split_scheme <- evaluationScheme(jester_bn, method = 
                         "split", train = 0.8, given = 10, k = 1)

Note that the given parameter can be no larger than the smallest number of items rated by any user in our data set. We previously filtered the data set to ensure that every user has at least ten positive ratings, so we are covered in our case. Finally, we will evaluate our list of algorithms in turn with our evaluation scheme using the evaluate() function. Aside from an evaluation scheme and a list of algorithms, we will also specify the range of N values to use when making top-N recommendations via the n parameter. We will do this for values 1 through 20:

> jester_split_eval <- evaluate(jester_split_scheme, algorithms, 
                                n = 1 : 20)
RANDOM run 
  1  [0.015sec/1.87sec] 
POPULAR run 
  1  [0.006sec/12.631sec] 
UBCF run 
  1  [0.001sec/36.862sec] 
UBCF run 
  1  [0.002sec/36.342sec]

We now have a list of four objects that represent the evaluation results of each algorithm on our data. We can extract important measures such as precision by looking at the confusion matrices. Note that since we ran this experiment for top-N recommendations with N in the range 1-20, we expect 20 such confusion matrices for each algorithm. The function getConfusionMatrix(), when applied to one of these objects, retrieves the confusion matrices for each run in tabular format, so that each row summarizes the confusion matrix for a particular value of N:

> options(digits = 4)
> getConfusionMatrix(jester_split_eval[[4]])
[[1]]
       TP      FP    FN    TN precision  recall     TPR      FPR
1  0.5181  0.4819 18.47 120.5    0.5181 0.06272 0.06272 0.003867
2  1.0261  0.9739 17.96 120.0    0.5131 0.12042 0.12042 0.007790
3  1.4953  1.5047 17.49 119.5    0.4984 0.16470 0.16470 0.012011
4  1.9307  2.0693 17.06 118.9    0.4827 0.20616 0.20616 0.016547
5  2.3575  2.6425 16.63 118.4    0.4715 0.24215 0.24215 0.021118
6  2.7687  3.2313 16.22 117.8    0.4614 0.27509 0.27509 0.025791
7  3.1530  3.8470 15.83 117.2    0.4504 0.30508 0.30508 0.030709
8  3.5221  4.4779 15.46 116.5    0.4403 0.33216 0.33216 0.035735
9  3.8999  5.1001 15.09 115.9    0.4333 0.36069 0.36069 0.040723
10 4.2542  5.7458 14.73 115.3    0.4254 0.38723 0.38723 0.045890
11 4.6037  6.3963 14.38 114.6    0.4185 0.40927 0.40927 0.051036
12 4.9409  7.0591 14.04 114.0    0.4117 0.43368 0.43368 0.056345
13 5.2534  7.7466 13.73 113.3    0.4041 0.45345 0.45345 0.061856
14 5.5638  8.4362 13.42 112.6    0.3974 0.47248 0.47248 0.067360
15 5.8499  9.1501 13.14 111.9    0.3900 0.48907 0.48907 0.073066
16 6.1298  9.8702 12.86 111.1    0.3831 0.50604 0.50604 0.078836
17 6.4090 10.5910 12.58 110.4    0.3770 0.52151 0.52151 0.084592
18 6.6835 11.3165 12.30 109.7    0.3713 0.53664 0.53664 0.090384
19 6.9565 12.0435 12.03 109.0    0.3661 0.55187 0.55187 0.096198
20 7.2165 12.7835 11.77 108.2    0.3608 0.56594 0.56594 0.102095
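
Since we used a single split (k = 1), there is only one such table. Had we evaluated over multiple runs, the avg() function could be used to average the confusion matrices across them; a short sketch:

> # Average the per-run confusion matrices (with k = 1 this returns the single run)
> avg(jester_split_eval[[4]])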

To visualize these data and compare our algorithms, we can try plotting the results directly using the plot() function. For our evaluation results, the default is a plot of the true positive rate (TPR) versus the false positive rate (FPR). This is nothing other than the ROC curve, as we know from Chapter 4, Neural Networks.

> plot(jester_split_eval, annotate = 2, legend = "topright")
> title(main = "TPR vs FPR For Binary Jester Data")

Here is the ROC curve for the binary Jester data:

[Figure: ROC curve (TPR versus FPR) for the binary Jester data]

The graph shows that the user-based collaborative filtering algorithms perform better than the two baseline algorithms, but there is very little to separate these two, with the cosine distance marginally outperforming the Jaccard distance on these data. We can complement this view of our results by also plotting a precision recall curve:

> plot(jester_split_eval, "prec/rec", annotate = 2, 
       legend = "bottomright")
> title(main = "Precision versus Recall Binary Jester Data")

Here is the precision recall curve for the binary Jester data:

[Figure: Precision versus recall for the binary Jester data]

The precision recall curve paints a similar picture, with the user-based collaborative filtering algorithm that uses the cosine distance coming out as the winner. Note that the trade-off between precision and recall surfaces in a top-N recommender system via the number of recommendations that the system makes. The way our evaluation scheme works is that we treat the users in the test data as new users who have just contributed a certain number of ratings. We supply the recommender with as many of their ratings as the given parameter specifies and withhold the rest. Then, we apply our model to see whether the items we recommend agree with the ratings that were withheld. We order our suggestions in descending order of confidence, so that in a top-1 recommendation system, we suggest the item we believe has the best chance of interesting the user. Increasing N is therefore like casting a wider net: we will be less precise in our suggestions, but we are more likely to find something the user will like.
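
To make this concrete, here is a rough sketch of how we might train one of these binary recommenders directly and ask it for a top-5 list for a held-out user; the index split and the choice of user below are arbitrary and purely for illustration:

> # Train a Jaccard-based UBCF model on most users and recommend for one remaining user
> jester_ubcf_bn <- Recommender(jester_bn[1:13000], method = "UBCF", 
                     parameter = list(method = "Jaccard", nn = 50))
> jester_top5 <- predict(jester_ubcf_bn, jester_bn[13001], n = 5)
> as(jester_top5, "list")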

Note

An excellent and freely available resource for recommendation systems is Chapter 9 from the online textbook Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. This book is also an excellent additional reference for working with Big Data. The website is http://www.mmds.org/.

Evaluating non-binary top-N recommendations

In this section, we will use the movies data set to see how we perform in the non-binary scenario. First, we will define our algorithms as before:

> normalized_algorithms <- list(
     "Random" = list(name = "RANDOM", param = list(normalize =  
                      "Z-score")),
     "Popular" = list(name = "POPULAR", param = list(normalize = 
                      "Z-score")),
     "UserBasedCF" = list(name = "UBCF", param = list(normalize = 
                      "Z-score", method = "Cosine", nn = 50)),
     "ItemBasedCF" = list(name = "IBCF", param = list(normalize = 
                      "Z-score")),
     "SVD" = list(name = "SVD", param = list(categories = 30, 
                      normalize = "Z-score", treat_na = "median"))
 )

This time, our algorithms will work with normalized ratings by specifying the normalize parameter. We will only be using the cosine distance for user-based collaborative filtering as the Jaccard distance only applies in the binary setting. Furthermore, we will also try out item-based collaborative filtering as well as SVD-based recommendations. Instead of directly splitting our data, we demonstrate how we can perform ten-fold cross-validation by modifying our evaluation scheme. We will continue to investigate making top-N recommendations in the range of 1 to 20. Evaluating a moderately sized data set with five algorithms using ten-fold cross-validation means that we can expect this process to take quite a long time to finish depending on the computing power we have at our disposal:

> movies_cross_scheme <- evaluationScheme(movies_rrm, method = 
           "cross-validation", k = 10, given = 10, goodRating = 4)
> movies_cross_eval <- evaluate(movies_cross_scheme, 
            normalized_algorithms, n = 1 : 20)

To conserve space, we have truncated the output that shows us the amount of time spent running each iteration for the different algorithms. Note that the most expensive algorithm during training is the item-based collaborative filtering algorithm, as this is building a model and not just performing lazy learning. Once the process terminates, we can plot the results in the same way as we did for our binarized Jester data set in order to compare the performance of our algorithms:

> plot(movies_cross_eval, annotate = 4, legend = "topright")
> title(main = "TPR versus FPR For Movielens Data")

Here is the ROC curve for the MovieLens data:

[Figure: ROC curve (TPR versus FPR) for the MovieLens data]

As we can see, user-based collaborative filtering is the clear winner here. SVD performs in a similar manner to the POPULAR baseline, though the latter starts to become better when N is high. Finally, we see item-based collaborative filtering performing far worse than these, outperforming only the random baseline. What is clear from these experiments is that tuning recommendation systems can often be a very time-consuming, resource-intensive endeavor.

All the algorithms that we specified can be tuned in various ways and we have seen a number of parameters, from the size of the neighborhood to the similarity metric, that will influence the results. In addition, we've seen that even for the top-N scenario alone there are several ways that we can evaluate our recommendation system, so if we want to try out a number of these for comparison, we will again need to spend more time during model training.
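
For instance, a natural follow-up experiment is to compare different neighborhood sizes for user-based collaborative filtering under the same cross-validation scheme. The following sketch reuses movies_cross_scheme; the particular values of nn are arbitrary:

> ubcf_variants <- list(
     "UBCF_nn25" = list(name = "UBCF", param = list(normalize = "Z-score", 
                        method = "Cosine", nn = 25)),
     "UBCF_nn50" = list(name = "UBCF", param = list(normalize = "Z-score", 
                        method = "Cosine", nn = 50)),
     "UBCF_nn100" = list(name = "UBCF", param = list(normalize = "Z-score", 
                         method = "Cosine", nn = 100))
 )
> ubcf_variant_eval <- evaluate(movies_cross_scheme, ubcf_variants, n = 1 : 20)
> plot(ubcf_variant_eval, annotate = TRUE, legend = "topright")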

The reader is encouraged to repeat these experiments using different parameters and evaluation schemes in order to get a feel for the process of designing and training recommendation systems. In addition, by visiting the websites of our two data sets, the reader can find additional links to similar data sets commonly used for learning about recommendation systems, such as the book-crossing data set.

For completeness, we will plot the precision recall curve for the MovieLens data:

> plot(movies_cross_eval, "prec/rec", annotate = 3, 
       legend = "bottomright")
> title(main = "Precision versus Recall For Movielens Data")

Here is the precision recall curve for the MovieLens data:

[Figure: Precision versus recall for the MovieLens data]

Evaluating individual predictions

Another way to evaluate a recommendation system is to ask it to predict the specific values of a portion of the known ratings made by a set of test users, using the remainder of their ratings as given. In this way, we can measure accuracy by taking average distance measures over the predicted ratings. These include the mean squared error (MSE) and the root mean squared error (RMSE), which we have seen before, and the mean absolute error (MAE), which is just the average of the absolute errors. We will do this for the regular (unbinarized) Jester data set. We begin as before by defining an evaluation scheme:

> jester_split_scheme <- evaluationScheme(jester_rrm, method = 
  "split", train = 0.8, given = 5, goodRating = 5)

Next, we will define individual user- and item-based collaborative filtering recommenders using the Recommender() and getData() functions. The logic behind these is that the getData() function will extract the ratings set aside for training by the evaluation scheme and the Recommender() function will use these data to train a model:

> jester_ubcf_srec <- Recommender(getData(jester_split_scheme, 
                                  "train"), "UBCF")
> jester_ibcf_srec <- Recommender(getData(jester_split_scheme, 
                                  "train"), "IBCF")

We can then use these models, together with the ratings classified as known for our test users (there are as many of these per user as the given parameter specifies), to predict the ratings of their remaining items:

> jester_ubcf_known <- predict(jester_ubcf_srec, 
  getData(jester_split_scheme, "known"), type="ratings")
> jester_ibcf_known <- predict(jester_ibcf_srec, 
  getData(jester_split_scheme, "known"), type="ratings") 

Finally, we can compute prediction accuracy by comparing these predictions against the withheld (unknown) ratings:

> (jester_ubcf_acc <- calcPredictionAccuracy(jester_ubcf_known, 
  getData(jester_split_scheme, "unknown")))
    RMSE      MSE      MAE 
 4.70765 22.16197  3.54130 
> (jester_ibcf_acc <- calcPredictionAccuracy(jester_ibcf_known, 
  getData(jester_split_scheme, "unknown")))
     RMSE       MSE       MAE 
 5.012211 25.122256  3.518815

We can see that the performance of the two algorithms is fairly close. User-based collaborative filtering performs better when we penalize larger errors (via the RMSE and MSE) through squaring. From the perspective of the mean absolute error, item-based collaborative filtering is very marginally better.

Consequently, in this case, we might base our decision on which type of recommendation system to use on the error behavior that more closely matches our business needs. In this section, we used the default parameter values for the two algorithms, but by using the parameter argument of the Recommender() function, we can play around with different configurations as we did before. This is left as an exercise for the reader.
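
As a starting point for that exercise, here is a sketch of one such configuration; the particular parameter values are arbitrary:

> # A user-based CF recommender with a larger neighborhood and Z-score normalization
> jester_ubcf_tuned <- Recommender(getData(jester_split_scheme, "train"), 
                        "UBCF", parameter = list(method = "Cosine", nn = 100, 
                        normalize = "Z-score"))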
