Evaluation of the recommendation systems

We have discussed the most relevant methods used in commercial environments to date. The evaluation of a recommendation system can be executed offline (using only the data in the utility matrix) or online (using the utility matrix data plus the new data provided in real time by each user of the website). The online evaluation procedures are discussed in Chapter 7, Movie Recommendation System Web Application, together with a proper online movie recommendation system website. In this section, we will evaluate the performance of the methods using two offline tests often used to evaluate recommendation systems: the root mean square error on the ratings and ranking accuracy. For all the evaluations in which k-fold cross-validation (see Chapter 3, Supervised Machine Learning) is applicable, 5-fold cross-validation has been performed to obtain more objective results. The utility matrix has been divided into 5 folds using the following function:

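What follows is a minimal sketch of such a function; the name SplitKFolds, the one-user-per-row layout of the utility matrix, and the shuffling strategy are assumptions made for illustration:

```python
import numpy as np

def SplitKFolds(df, k):
    # shuffle the user indices so the folds are not biased by row order
    idx = np.random.permutation(df.index)
    folds = np.array_split(idx, k)
    df_trains, df_vals = [], []
    for fold in folds:
        df_vals.append(df.loc[fold])     # users held out for validation
        df_trains.append(df.drop(fold))  # remaining users used for training
    return df_trains, df_vals
```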

Here, df is a data frame object that stores the utility matrix and k is the number of folds. In the validation set, for each user's ratings vector u_vec, half of the ratings have been hidden so that their real values can be predicted.

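A sketch of the HideRandomRatings function consistent with this description is shown below (using 0 to mark missing ratings is an assumption):

```python
import numpy as np

def HideRandomRatings(u_vec):
    u_vec = np.asarray(u_vec, dtype=float)
    rated = np.where(u_vec > 0)[0]   # indices of the known ratings
    # hide half of the known ratings, chosen at random
    hidden = np.random.choice(rated, size=len(rated) // 2, replace=False)
    u_test = u_vec.copy()
    u_test[hidden] = 0               # visible ratings fed to the algorithms
    u_vals = np.zeros_like(u_vec)
    u_vals[hidden] = u_vec[hidden]   # hidden ground truth to predict
    return u_test, u_vals
```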

u_vals stores the values to predict, while u_test contains the ratings used for testing the algorithms. Before we start to compare the different algorithms using the different measures, we load the utility matrix and the movie content matrix into data frames and split the data into 5 folds for cross-validation.

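A sketch of this step could look like the following; the CSV file names are hypothetical placeholders for your own data files:

```python
import pandas as pd

# hypothetical file names; substitute the paths of your own data files
df = pd.read_csv('utilitymatrix.csv')        # utility matrix: one row per user
movies = pd.read_csv('movies_content.csv')   # movie content (genres) matrix
movieslist = list(df.columns[1:])            # movie titles (first column is the user id)

# split the users into 5 folds for cross-validation
df_trains, df_vals = SplitKFolds(df, 5)
```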

df_vals contains the validation sets, so the HideRandomRatings function presented in this section needs to be applied to them:

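Reusing the HideRandomRatings sketch above, one possible way to write this step is:

```python
vals_vecs_folds, tests_vecs_folds = [], []
for df_val in df_vals:
    vals_vecs, tests_vecs = [], []
    for _, row in df_val.iterrows():
        u_test, u_vals = HideRandomRatings(row[movieslist].values)
        tests_vecs.append(u_test)   # visible ratings for this validation user
        vals_vecs.append(u_vals)    # hidden ratings to predict
    tests_vecs_folds.append(tests_vecs)
    vals_vecs_folds.append(vals_vecs)
```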

The movies matrix, the movieslist list, and the structures df_trains, vals_vecs_folds, and tests_vecs_folds are now ready to be used for training and validating all the methods discussed in the previous sections. We can start by evaluating the root mean square error (RMSE).

Root mean square error (RMSE) evaluation

This validation technique is applicable only to the CF methods and the linear regression CBF, since predicted ratings are generated only by these algorithms. For each rating $r_{ij}$ in the u_vals vectors of the validation sets, the predicted rating $\hat{r}_{ij}$ is calculated with each method, and the root mean square error is obtained:

$$\mathrm{RMSE}=\sqrt{\frac{1}{N_{val}}\sum_{i,j}\left(r_{ij}-\hat{r}_{ij}\right)^{2}}$$

Here, $N_{val}$ is the number of ratings in the u_vals vectors. The square factor in this formula heavily penalizes large errors, so methods with a low RMSE (the best values) are characterized by small errors spread over all the predicted ratings, rather than by large errors on a few ratings, which a measure such as the mean absolute error, $\mathrm{MAE}=\frac{1}{N_{val}}\sum_{i,j}\left|r_{ij}-\hat{r}_{ij}\right|$, would tolerate.

The code to calculate the RMSE for the memory-based CF methods (user-based and item-based) is as follows:

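A minimal sketch follows: the SE helper computes the squared error on the hidden ratings, while CFPredict is a trivial stand-in (the user's average rating) to be replaced by the actual user-based, item-based, or slope one predictor:

```python
import numpy as np

def SE(u_vals, u_pred):
    # squared error computed on the hidden ratings only, plus their count
    mask = u_vals > 0
    return np.sum((u_vals[mask] - u_pred[mask]) ** 2), int(mask.sum())

def CFPredict(df_train, u_test):
    # stand-in predictor: the user's average rating for every movie;
    # replace with the user-based, item-based, or slope one method
    u_test = np.asarray(u_test, dtype=float)
    rated = u_test > 0
    avg = u_test[rated].mean() if rated.any() else 0.0
    return np.full_like(u_test, avg)

total_se, total_n = 0.0, 0
for fold in range(5):
    for u_test, u_vals in zip(tests_vecs_folds[fold], vals_vecs_folds[fold]):
        u_pred = CFPredict(df_trains[fold], u_test)
        se, n = SE(np.asarray(u_vals, dtype=float), u_pred)
        total_se += se
        total_n += n
print('RMSE:', np.sqrt(total_se / total_n))
```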

For each method, the SE function is called to compute the error on each fold, and then the total RMSE over the folds is obtained.

Using 5 nearest neighbors for the item-based CF and slope one methods, and 20 for the user-based CF method, the algorithms produce the following errors:

| Method | RMSE | Number of Predicted Ratings |
| --- | --- | --- |
| CF user-based | 1.01 | 39,972 |
| CF item-based | 1.03 | 39,972 |
| Slope one | 1.08 | 39,972 |
| CF-CBF user-based | 1.01 | 39,972 |

All the methods have similar RMSE values, but the best results (1.01) are obtained by the user-based CF and hybrid CF-CBF user-based methods.

For the model-based methods, the non-hidden validation ratings u_test are included in the utility matrix used for training, and then the RMSE is calculated using the following script:

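As an illustrative sketch (not the full script), here is the SVD case with user-average imputation and K=20, built on the fold structures created above:

```python
import numpy as np

total_se, total_n = 0.0, 0
for fold in range(5):
    # append the visible validation ratings to the training utility matrix
    R = np.vstack([df_trains[fold][movieslist].values.astype(float),
                   np.array(tests_vecs_folds[fold])])
    # impute missing entries with each user's average rating before the SVD
    R_filled = R.copy()
    for u in range(R_filled.shape[0]):
        rated = R_filled[u] > 0
        R_filled[u, ~rated] = R_filled[u, rated].mean() if rated.any() else 0.0
    # rank-K truncated SVD reconstruction (K=20)
    U, s, Vt = np.linalg.svd(R_filled, full_matrices=False)
    R_pred = (U[:, :20] * s[:20]) @ Vt[:20]
    n_train = len(df_trains[fold])
    for i, u_vals in enumerate(vals_vecs_folds[fold]):
        u_vals = np.asarray(u_vals, dtype=float)
        mask = u_vals > 0
        total_se += np.sum((u_vals[mask] - R_pred[n_train + i][mask]) ** 2)
        total_n += int(mask.sum())
print('SVD RMSE:', np.sqrt(total_se / total_n))
```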

The code calculates the RMSE only for CBF regression and SVD; the reader can easily extend it to calculate the error for the other algorithms, since most of the required code is simply commented out (SVD expectation-maximization, SGD, ALS, and NMF). The results are shown in the following table (K is the dimension of the feature space):

| Method | RMSE | Number of Predicted Ratings |
| --- | --- | --- |
| CBF linear regression (α=0.01, λ=0.0001, iterations=50) | 1.09 | 39,972 |
| SGD (K=20, iterations=50, α=0.00001, λ=0.001) | 1.35 | 39,972 |
| ALS (K=20, iterations=50, λ=0.001) | 2.58 | 39,972 |
| SVD (imputation=useraverage, K=20) | 1.02 | 39,972 |
| SVD EM (imputation=itemaverage, iterations=30, K=20) | 1.03 | 39,972 |
| HYBRID SVD (imputation=useraverage, K=20) | 1.01 | 39,972 |
| NMF (K=20, imputation=useraverage) | 0.97 | 39,972 |

As expected, ALS and SGD are the worst methods, but they were discussed because they are instructive from a didactic point of view (they are also slow, since their implementation is not as optimized as the methods from the sklearn library).

All the other methods achieve similar results. Note, however, that the hybrid methods give slightly better results than the corresponding SVD and user-based CF algorithms. Since the movies to predict are chosen randomly, the results may vary.

Classification metrics

The rating error RMSE does not truly indicate the quality of a method; it is an academic measure that is rarely used in a commercial environment. The goal of a website is to present content that is relevant to the user, regardless of the exact rating the user gives. In order to evaluate the relevance of the recommended items, the precision, recall, and f1 measures (see Chapter 2, Unsupervised Machine Learning) are used, where the correct predictions are the items with ratings greater than 3. These measures are calculated on the first 50 items returned by each algorithm (the recommended list, if the algorithm returns one, or the 50 items with the highest predicted ratings for the other methods). The function that calculates the measures is as follows:

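A sketch of ClassificationMetrics consistent with this description follows; the exact signature and the handling of users with no relevant items are assumptions:

```python
import numpy as np

def ClassificationMetrics(preds, u_vals, movieslist, ratingsval, nitems=50):
    if ratingsval:
        # preds is a vector of predicted ratings: take the nitems highest
        order = np.argsort(np.asarray(preds))[::-1][:nitems]
        recommended = set(np.array(movieslist)[order])
    else:
        # preds is already a recommended list of movie titles
        recommended = set(preds[:nitems])
    # relevant items are the hidden ratings greater than 3
    relevant = set(np.array(movieslist)[np.asarray(u_vals) > 3])
    if not relevant:
        return None   # skip users with no relevant hidden items
    hits = len(recommended & relevant)
    precision = hits / float(nitems)
    recall = hits / float(len(relevant))
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1
```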

Here, the Boolean ratingsval indicates whether the method returns predicted ratings or a recommended list. We use the ClassificationMetrics function in the same way we computed the RMSE for all the methods, so the actual code to evaluate the measures is not shown (you can write it as an exercise). The following table summarizes the results for all the methods (neighs is the number of nearest neighbors, K is the dimension of the feature space):

| Method | Precision | Recall | f1 | Number of Predicted Ratings |
| --- | --- | --- | --- | --- |
| CF user-based (neighs=20) | 0.6 | 0.18 | 0.26 | 39,786 |
| CBFCF user-based (neighs=20) | 0.6 | 0.18 | 0.26 | 39,786 |
| HYBRID SVD (K=20, imputation=useraverage) | 0.54 | 0.12 | 0.18 | 39,786 |
| CF item-based (neighs=5) | 0.57 | 0.15 | 0.22 | 39,786 |
| Slope one (neighs=5) | 0.57 | 0.17 | 0.24 | 39,786 |
| SVD EM (K=20, iterations=30, imputation=useraverage) | 0.58 | 0.16 | 0.24 | 39,786 |
| SVD (K=20, imputation=itemaverage) | 0.53 | 0.12 | 0.18 | 39,786 |
| CBF regression (α=0.01, λ=0.0001, iterations=50) | 0.54 | 0.13 | 0.2 | 39,786 |
| SGD (K=20, α=0.00001, λ=0.001) | 0.52 | 0.12 | 0.18 | 39,786 |
| ALS (K=20, λ=0.001, iterations=50) | 0.57 | 0.15 | 0.23 | 39,786 |
| CBF average | 0.56 | 0.12 | 0.19 | 39,786 |
| LLR | 0.63 | 0.3 | 0.39 | 39,786 |
| NMF (K=20, λ=0.001, imputation=useraverage) | 0.53 | 0.13 | 0.19 | 39,786 |
| Association rules | 0.68 | 0.31 | 0.4 | 39,786 |

From the results, you can see that the best method is association rules, while the LLR, hybrid CBFCF user-based, and CF user-based methods also show good precision. Note that the results may vary since the movies to predict have been chosen randomly.
