Evaluation of the recommendation systems

We have discussed the most relevant methods used in commercial environments to date. The evaluation of a recommendation system can be executed offline (using only the data in the utility matrix) or online (using the utility matrix data plus the new data provided in real time by each user of the website). The online evaluation procedures are discussed in Chapter 7, Movie Recommendation System Web Application, together with a proper online movie recommendation system website. In this section, we will evaluate the performance of the methods using two offline tests often used to evaluate recommendation systems: the root mean square error on the ratings and ranking accuracy. For all the evaluations in which k-fold cross-validation (see Chapter 3, Supervised Machine Learning) is applicable, 5-fold cross-validation has been performed to obtain more objective results. The utility matrix has been divided into 5 folds using the following function:

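What follows is a minimal sketch of such a function; the name SplitKFolds, the one-user-per-row layout of the utility matrix, and the shuffling strategy are assumptions made for illustration:

```python
import numpy as np

def SplitKFolds(df, k):
    # shuffle the user indices so the folds are not biased by row order
    idx = np.random.permutation(df.index)
    folds = np.array_split(idx, k)
    df_trains, df_vals = [], []
    for fold in folds:
        df_vals.append(df.loc[fold])     # users held out for validation
        df_trains.append(df.drop(fold))  # remaining users used for training
    return df_trains, df_vals
```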

Here, df is a data frame object that stores the utility matrix and k is the number of folds. In the validation set, for each user's ratings vector u_vec, half of the ratings have been hidden so that their real values can be predicted.

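A sketch of the HideRandomRatings function consistent with this description is shown below (using 0 to mark missing ratings is an assumption):

```python
import numpy as np

def HideRandomRatings(u_vec):
    u_vec = np.asarray(u_vec, dtype=float)
    rated = np.where(u_vec > 0)[0]   # indices of the known ratings
    # hide half of the known ratings, chosen at random
    hidden = np.random.choice(rated, size=len(rated) // 2, replace=False)
    u_test = u_vec.copy()
    u_test[hidden] = 0               # visible ratings fed to the algorithms
    u_vals = np.zeros_like(u_vec)
    u_vals[hidden] = u_vec[hidden]   # hidden ground truth to predict
    return u_test, u_vals
```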

u_vals stores the values to predict, while u_test contains the ratings used for testing the algorithms. Before we start to compare the different algorithms using the different measures, we load the utility matrix and the movie content matrix into data frames and split the data into 5 folds for cross-validation.

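A sketch of this step could look like the following; the CSV file names are hypothetical placeholders for your own data files:

```python
import pandas as pd

# hypothetical file names; substitute the paths of your own data files
df = pd.read_csv('utilitymatrix.csv')        # utility matrix: one row per user
movies = pd.read_csv('movies_content.csv')   # movie content (genres) matrix
movieslist = list(df.columns[1:])            # movie titles (first column is the user id)

# split the users into 5 folds for cross-validation
df_trains, df_vals = SplitKFolds(df, 5)
```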

df_vals contains the validation sets, so the HideRandomRatings function presented in this section needs to be applied to them:

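Reusing the HideRandomRatings sketch above, one possible way to write this step is:

```python
vals_vecs_folds, tests_vecs_folds = [], []
for df_val in df_vals:
    vals_vecs, tests_vecs = [], []
    for _, row in df_val.iterrows():
        u_test, u_vals = HideRandomRatings(row[movieslist].values)
        tests_vecs.append(u_test)   # visible ratings for this validation user
        vals_vecs.append(u_vals)    # hidden ratings to predict
    tests_vecs_folds.append(tests_vecs)
    vals_vecs_folds.append(vals_vecs)
```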

The movies matrix, the movieslist list, and the structures df_trains, vals_vecs_folds, and tests_vecs_folds are now ready to be used for training and validating all the methods discussed in the previous sections. We can start by evaluating the root mean square error (RMSE).

Root mean square error (RMSE) evaluation

This validation technique is applicable only to the CF methods and the linear regression CBF, since predicted ratings are generated only by these algorithms. For each rating $r_{ij}$ in the u_vals vectors of the validation sets, the predicted rating $\hat{r}_{ij}$ is calculated with each method, and the root mean square error is obtained:

$$\mathrm{RMSE}=\sqrt{\frac{1}{N_{val}}\sum_{i,j}\left(r_{ij}-\hat{r}_{ij}\right)^{2}}$$

Here, $N_{val}$ is the number of ratings in the u_vals vectors. The square factor in this formula heavily penalizes large errors, so methods with a low RMSE (the best values) are characterized by small errors spread over all the predicted ratings, rather than by large errors on a few ratings, which a measure such as the mean absolute error, $\mathrm{MAE}=\frac{1}{N_{val}}\sum_{i,j}\left|r_{ij}-\hat{r}_{ij}\right|$, would tolerate.

The code to calculate the RMSE for the memory-based CF methods (user-based and item-based) is as follows:

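A minimal sketch follows: the SE helper computes the squared error on the hidden ratings, while CFPredict is a trivial stand-in (the user's average rating) to be replaced by the actual user-based, item-based, or slope one predictor:

```python
import numpy as np

def SE(u_vals, u_pred):
    # squared error computed on the hidden ratings only, plus their count
    mask = u_vals > 0
    return np.sum((u_vals[mask] - u_pred[mask]) ** 2), int(mask.sum())

def CFPredict(df_train, u_test):
    # stand-in predictor: the user's average rating for every movie;
    # replace with the user-based, item-based, or slope one method
    u_test = np.asarray(u_test, dtype=float)
    rated = u_test > 0
    avg = u_test[rated].mean() if rated.any() else 0.0
    return np.full_like(u_test, avg)

total_se, total_n = 0.0, 0
for fold in range(5):
    for u_test, u_vals in zip(tests_vecs_folds[fold], vals_vecs_folds[fold]):
        u_pred = CFPredict(df_trains[fold], u_test)
        se, n = SE(np.asarray(u_vals, dtype=float), u_pred)
        total_se += se
        total_n += n
print('RMSE:', np.sqrt(total_se / total_n))
```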

For each method, the SE function is called to compute the error on each fold, and then the total RMSE over the folds is obtained.

Using 5 nearest neighbors for the item-based CF and slope one methods, and 20 for the user-based CF method, the algorithms produce the following errors:

| Method | RMSE | Number of Predicted Ratings |
| --- | --- | --- |
| CF user-based | 1.01 | 39,972 |
| CF item-based | 1.03 | 39,972 |
| Slope one | 1.08 | 39,972 |
| CF-CBF user-based | 1.01 | 39,972 |

All the methods have similar RMSE values, but the best results (1.01) are obtained by the user-based CF and hybrid CF-CBF user-based methods.

For the model-based methods, the non-hidden validation ratings u_test are included in the utility matrix used for training, and then the RMSE is calculated using the following script:

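As an illustrative sketch (not the full script), here is the SVD case with user-average imputation and K=20, built on the fold structures created above:

```python
import numpy as np

total_se, total_n = 0.0, 0
for fold in range(5):
    # append the visible validation ratings to the training utility matrix
    R = np.vstack([df_trains[fold][movieslist].values.astype(float),
                   np.array(tests_vecs_folds[fold])])
    # impute missing entries with each user's average rating before the SVD
    R_filled = R.copy()
    for u in range(R_filled.shape[0]):
        rated = R_filled[u] > 0
        R_filled[u, ~rated] = R_filled[u, rated].mean() if rated.any() else 0.0
    # rank-K truncated SVD reconstruction (K=20)
    U, s, Vt = np.linalg.svd(R_filled, full_matrices=False)
    R_pred = (U[:, :20] * s[:20]) @ Vt[:20]
    n_train = len(df_trains[fold])
    for i, u_vals in enumerate(vals_vecs_folds[fold]):
        u_vals = np.asarray(u_vals, dtype=float)
        mask = u_vals > 0
        total_se += np.sum((u_vals[mask] - R_pred[n_train + i][mask]) ** 2)
        total_n += int(mask.sum())
print('SVD RMSE:', np.sqrt(total_se / total_n))
```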

The code calculates the RMSE only for CBF regression and SVD; the reader can easily extend it to calculate the error for the other algorithms, since most of the required code is simply commented out (SVD expectation-maximization, SGD, ALS, and NMF). The results are shown in the following table (K is the dimension of the feature space):

| Method | RMSE | Number of Predicted Ratings |
| --- | --- | --- |
| CBF linear regression (α=0.01, λ=0.0001, iterations=50) | 1.09 | 39,972 |
| SGD (K=20, iterations=50, α=0.00001, λ=0.001) | 1.35 | 39,972 |
| ALS (K=20, iterations=50, λ=0.001) | 2.58 | 39,972 |
| SVD (imputation=useraverage, K=20) | 1.02 | 39,972 |
| SVD EM (imputation=itemaverage, iterations=30, K=20) | 1.03 | 39,972 |
| HYBRID SVD (imputation=useraverage, K=20) | 1.01 | 39,972 |
| NMF (K=20, imputation=useraverage) | 0.97 | 39,972 |

As expected, ALS and SGD are the worst methods, but they were discussed because they are instructive from a didactic point of view (they are also slow, since their implementation is not as optimized as the methods from the sklearn library).

All the other methods achieve similar results. Note, however, that the hybrid methods give slightly better results than the corresponding SVD and user-based CF algorithms. Since the movies to predict are chosen randomly, the results may vary.

Classification metrics

The rating error RMSE does not truly indicate the quality of a method; it is an academic measure that is rarely used in a commercial environment. The goal of a website is to present content that is relevant to the user, regardless of the exact rating the user gives. In order to evaluate the relevance of the recommended items, the precision, recall, and f1 measures (see Chapter 2, Unsupervised Machine Learning) are used, where the correct predictions are the items with ratings greater than 3. These measures are calculated on the first 50 items returned by each algorithm (the recommended list, if the algorithm returns one, or the 50 items with the highest predicted ratings for the other methods). The function that calculates the measures is as follows:

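A sketch of ClassificationMetrics consistent with this description follows; the exact signature and the handling of users with no relevant items are assumptions:

```python
import numpy as np

def ClassificationMetrics(preds, u_vals, movieslist, ratingsval, nitems=50):
    if ratingsval:
        # preds is a vector of predicted ratings: take the nitems highest
        order = np.argsort(np.asarray(preds))[::-1][:nitems]
        recommended = set(np.array(movieslist)[order])
    else:
        # preds is already a recommended list of movie titles
        recommended = set(preds[:nitems])
    # relevant items are the hidden ratings greater than 3
    relevant = set(np.array(movieslist)[np.asarray(u_vals) > 3])
    if not relevant:
        return None   # skip users with no relevant hidden items
    hits = len(recommended & relevant)
    precision = hits / float(nitems)
    recall = hits / float(len(relevant))
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1
```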

Here, the Boolean ratingsval indicates whether the method returns predicted ratings or a recommended list. We use the ClassificationMetrics function in the same way we computed the RMSE for all the methods, so the actual code to evaluate the measures is not shown (you can write it as an exercise). The following table summarizes the results for all the methods (neighs is the number of nearest neighbors, K is the dimension of the feature space):

| Method | Precision | Recall | f1 | Number of Predicted Ratings |
| --- | --- | --- | --- | --- |
| CF user-based (neighs=20) | 0.6 | 0.18 | 0.26 | 39,786 |
| CBFCF user-based (neighs=20) | 0.6 | 0.18 | 0.26 | 39,786 |
| HYBRID SVD (K=20, imputation=useraverage) | 0.54 | 0.12 | 0.18 | 39,786 |
| CF item-based (neighs=5) | 0.57 | 0.15 | 0.22 | 39,786 |
| Slope one (neighs=5) | 0.57 | 0.17 | 0.24 | 39,786 |
| SVD EM (K=20, iterations=30, imputation=useraverage) | 0.58 | 0.16 | 0.24 | 39,786 |
| SVD (K=20, imputation=itemaverage) | 0.53 | 0.12 | 0.18 | 39,786 |
| CBF regression (α=0.01, λ=0.0001, iterations=50) | 0.54 | 0.13 | 0.2 | 39,786 |
| SGD (K=20, α=0.00001, λ=0.001) | 0.52 | 0.12 | 0.18 | 39,786 |
| ALS (K=20, λ=0.001, iterations=50) | 0.57 | 0.15 | 0.23 | 39,786 |
| CBF average | 0.56 | 0.12 | 0.19 | 39,786 |
| LLR | 0.63 | 0.3 | 0.39 | 39,786 |
| NMF (K=20, λ=0.001, imputation=useraverage) | 0.53 | 0.13 | 0.19 | 39,786 |
| Association rules | 0.68 | 0.31 | 0.4 | 39,786 |

From the results, you can see that the best method is association rules, while the LLR, hybrid CBFCF user-based, and CF user-based methods also show good precision. Note that the results may vary since the movies to predict have been chosen randomly.
