We have discussed all of the most relevant methods used in the commercial environment to date. The evaluation of a recommendation system can be executed offline (using only the data in the utility matrix) or online (using the utility matrix data together with the new data provided in real time by each user of the website). The online evaluation procedures are discussed in Chapter 7, Movie Recommendation System Web Application, together with a proper online movie recommendation system website. In this section, we will evaluate the performance of the methods using two offline tests often used to evaluate recommendation systems: root mean square error on ratings and ranking accuracy. For all the evaluations in which k-fold cross-validation (see Chapter 3, Supervised Machine Learning) is applicable, a 5-fold cross-validation has been performed to obtain more objective results. The utility matrix has been divided into 5 folds using the following function:
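The original listing does not survive in this excerpt; a minimal sketch of such a fold-splitting function, assuming the utility matrix is a pandas DataFrame with one row per user (the function name and signature are illustrative, not the book's original code), might look like this:

```python
import numpy as np
import pandas as pd

def SplitFolds(df, k, seed=0):
    """Split the rows (users) of the utility matrix df into k folds.

    Returns a list of k DataFrames of nearly equal size.
    (Sketch only: the book's original function may differ.)
    """
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(df))      # shuffle the user indices
    folds = np.array_split(idx, k)      # k nearly equal groups of users
    return [df.iloc[f] for f in folds]

# toy utility matrix: 10 users x 4 movies, ratings 0-5 (0 = not rated)
df = pd.DataFrame(np.random.RandomState(1).randint(0, 6, size=(10, 4)))
folds = SplitFolds(df, 5)
```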
Here, df is a data frame object that stores the utility matrix and k is the number of folds. In the validation set, for each user's ratings vector u_vec, half of the ratings have been hidden so that their real values can be predicted: u_vals stores the values to predict, while u_test contains the ratings used for testing the algorithms. Before we start to compare the different algorithms with the different measures, we load the utility matrix and the movie content matrix into data frames and split the data into 5 folds for cross-validation. df_vals contains the validation sets, so the HideRandomRatings function presented in this section needs to be applied to them.
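The HideRandomRatings listing itself is not reproduced in this excerpt; a possible sketch, hiding half of each user's ratings as the text describes (the exact signature is an assumption), is:

```python
import numpy as np

def HideRandomRatings(u_vec, seed=0):
    """Split a user's ratings vector into two vectors:
    u_test keeps half of the ratings, visible to the algorithms,
    while u_vals stores the hidden half whose values must be predicted.
    (Sketch only: the book's original function may differ.)
    """
    rng = np.random.RandomState(seed)
    u_vec = np.asarray(u_vec, dtype=float)
    rated = np.where(u_vec > 0)[0]                        # rated items
    hidden = rng.choice(rated, size=len(rated) // 2, replace=False)
    u_vals = np.zeros_like(u_vec)
    u_test = u_vec.copy()
    u_vals[hidden] = u_vec[hidden]                        # values to predict
    u_test[hidden] = 0                                    # hide them
    return u_vals, u_test

u_vec = [5, 0, 3, 0, 4, 1]          # toy ratings vector (0 = not rated)
u_vals, u_test = HideRandomRatings(u_vec)
```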
The data available in the movies matrix, the movieslist list, and the data frames df_trains, vals_vecs_folds, and tests_vecs_folds are now ready to be used for training and validating all the methods discussed in the previous sections. We can start by evaluating the root mean square error (RMSE).
This validation technique is applicable only to the CF methods and to linear-regression CBF, since only these algorithms generate predicted ratings. Given each rating r_ij in the u_vals vectors of the validation sets, the predicted rating r̂_ij is calculated with each method and the root mean square error is obtained:

RMSE = √( (1/N_val) Σ_(i,j) (r_ij − r̂_ij)² )

Here, N_val is the number of ratings in the u_vals vectors. The square in this formula heavily penalizes large errors, so the methods with low RMSE (the best values) are characterized by small errors spread over all the predicted ratings rather than by large errors on a few ratings, unlike the mean absolute error MAE = (1/N_val) Σ_(i,j) |r_ij − r̂_ij|, which penalizes large errors less heavily.
The code to calculate the RMSE for the memory-based user-based and item-based CF methods is as follows:
For each method, the SE function is called to compute the error on each fold, and the total RMSE over the folds is then obtained.
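The listing is omitted in this excerpt; the idea can be sketched as follows (the SE name comes from the text, while the predict callback and the data layout are assumptions for illustration):

```python
import numpy as np

def SE(predict, u_vals_fold):
    """Squared error and rating count for one fold.

    predict(user, item) -> predicted rating; u_vals_fold maps each user
    to the (item, hidden rating) pairs to predict. (Sketch only.)
    """
    se, n = 0.0, 0
    for user, pairs in u_vals_fold.items():
        for item, r in pairs:
            se += (r - predict(user, item)) ** 2
            n += 1
    return se, n

# toy example: a constant predictor evaluated over two folds
folds = [{0: [(1, 4.0), (2, 2.0)]}, {1: [(0, 5.0)]}]
tot_se, tot_n = 0.0, 0
for fold in folds:
    se, n = SE(lambda u, i: 3.0, fold)
    tot_se, tot_n = tot_se + se, tot_n + n
rmse = np.sqrt(tot_se / tot_n)      # total RMSE over all folds
```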
Using 5 nearest neighbors for the item-based CF and slope one methods, and 20 for user-based CF, the methods yield the following errors:
| Method | RMSE | Number of Predicted Ratings |
|---|---|---|
| CF user-based | 1.01 | 39,972 |
| CF item-based | 1.03 | 39,972 |
| Slope one | 1.08 | 39,972 |
| CF-CBF user-based | 1.01 | 39,972 |
All the methods have similar RMSE values, but the best results (1.01) are achieved by the user-based CF and hybrid CF-CBF user-based methods.
For the model-based methods, the not-hidden validation ratings u_test are included in the utility matrix used for training, and the RMSE is then calculated using the following script:
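The script is not reproduced in this excerpt; a hedged sketch of the SVD part, using numpy's SVD on a small dense utility matrix (all variable names are illustrative), could be:

```python
import numpy as np

# toy utility matrix, with the not-hidden validation ratings included
R = np.array([[5., 3., 0., 1.],
              [4., 0., 0., 1.],
              [1., 1., 0., 5.],
              [1., 0., 0., 4.]])

K = 2                                              # feature-space dimension
U, s, Vt = np.linalg.svd(R, full_matrices=False)   # factorize the matrix
R_hat = U[:, :K] @ np.diag(s[:K]) @ Vt[:K, :]      # rank-K reconstruction

# RMSE over the ratings to predict (here, all observed entries)
mask = R > 0
rmse = np.sqrt(np.mean((R[mask] - R_hat[mask]) ** 2))
```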
The code calculates the RMSE only for CBF regression and SVD; the reader can easily extend it to the other algorithms, since most of the required code is simply commented out (SVD expectation-maximization, SGD, ALS, and NMF). The results are shown in the following table (K is the dimension of the feature space):
| Method | RMSE | Number of Predicted Ratings |
|---|---|---|
| CBF linear regression (α=0.01, λ=0.0001, iterations=50) | 1.09 | 39,972 |
| SGD (K=20, 50 iterations, α=0.00001, λ=0.001) | 1.35 | 39,972 |
| ALS (K=20, 50 iterations, λ=0.001) | 2.58 | 39,972 |
| SVD (…) | 1.02 | 39,972 |
| SVD EM (…) | 1.03 | 39,972 |
| HYBRID SVD (…) | 1.01 | 39,972 |
| NMF (K=20, …) | 0.97 | 39,972 |
As expected, ALS and SGD are the worst methods, but they are discussed because they are instructive from a didactic point of view (they are also slow, because their implementation is not as optimized as the algorithms in the sklearn library). All the others have similar results; however, note that the hybrid methods achieve slightly better results than the corresponding SVD and user-based CF algorithms. Note also that the movies to predict are chosen randomly, so the results may vary.
The rating error RMSE does not really indicate the quality of a method; it is an academic measure that is rarely used in a commercial environment. The goal of a website is to present content that is relevant to the user, regardless of the exact rating the user gives. In order to evaluate the relevance of the recommended items, the precision, recall, and f1 measures (see Chapter 2, Unsupervised Machine Learning) are used, where the correct predictions are the items with ratings greater than 3. These measures are calculated on the first 50 items returned by each algorithm (the recommended list if the algorithm returns one, or the 50 items with the highest predicted ratings for the other methods). The function that calculates the measures is as follows:
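The function body is omitted in this excerpt; a sketch under the assumptions stated in the text (top-50 cut-off, "relevant" means a hidden rating greater than 3; the exact signature is a guess) is:

```python
def ClassificationMetrics(recommended, relevant, ratingsval=True, topn=50):
    """Precision, recall, and f1 on the first topn recommended items.

    recommended: items ordered by predicted rating (ratingsval=True)
                 or an already ranked recommendation list (ratingsval=False).
    relevant: set of hidden items the user rated above 3.
    (Sketch only: the book's original function may differ.)
    """
    top = list(recommended)[:topn]
    hits = sum(1 for item in top if item in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# toy usage: 4 recommended items, 3 relevant ones, 2 hits
p, r, f = ClassificationMetrics([3, 7, 1, 9], {7, 9, 2})
```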
Here, the Boolean ratingsval indicates whether the method returns ratings or a recommended list. We use the ClassificationMetrics function in the same way we computed the RMSE for all the methods, so the actual code to evaluate the measures is not shown (you can write it as an exercise). The following table summarizes the results for all the methods (neighs is the number of nearest neighbors, K is the dimension of the feature space):
| Method | Precision | Recall | f1 | Number of Predicted Ratings |
|---|---|---|---|---|
| CF user-based (neighs=20) | 0.6 | 0.18 | 0.26 | 39,786 |
| CBFCF user-based (neighs=20) | 0.6 | 0.18 | 0.26 | 39,786 |
| HYBRID SVD (K=20, …) | 0.54 | 0.12 | 0.18 | 39,786 |
| CF item-based (neighs=5) | 0.57 | 0.15 | 0.22 | 39,786 |
| Slope one (neighs=5) | 0.57 | 0.17 | 0.24 | 39,786 |
| SVD EM (K=20, iterations=30, …) | 0.58 | 0.16 | 0.24 | 39,786 |
| SVD (K=20, …) | 0.53 | 0.12 | 0.18 | 39,786 |
| CBF regression (α=0.01, λ=0.0001, iterations=50) | 0.54 | 0.13 | 0.2 | 39,786 |
| SGD (K=20, α=0.00001, λ=0.001) | 0.52 | 0.12 | 0.18 | 39,786 |
| ALS (K=20, λ=0.001, iterations=50) | 0.57 | 0.15 | 0.23 | 39,786 |
| CBF average | 0.56 | 0.12 | 0.19 | 39,786 |
| LLR | 0.63 | 0.3 | 0.39 | 39,786 |
| NMF (K=20, λ=0.001, …) | 0.53 | 0.13 | 0.19 | 39,786 |
| Association rules | 0.68 | 0.31 | 0.4 | 39,786 |
From the results, you can see that the best method is association rules; good precision is also achieved by the LLR, hybrid CBFCF user-based, and CF user-based methods. Note that the results may vary, since the movies to predict have been chosen randomly.