This class of methods relies on the data that describes the items, which is then used to derive the features of the users. In our MovieLens example, each movie j has a set of G binary fields that indicate whether it belongs to each of the following genres: unknown, action, adventure, animation, children's, comedy, crime, documentary, drama, fantasy, film noir, horror, musical, mystery, romance, sci-fi, thriller, war, or western.
Based on these features (genres), each movie is described by a binary vector $m_j$ with G dimensions (the number of movie genres), with entries equal to 1 for every genre that movie j belongs to and 0 otherwise. Given the dataframe dfout that stores the utility matrix (see the Utility matrix section earlier), these binary vectors $m_j$ are collected from the MovieLens database into a dataframe using the following script:
The movies content matrix has been saved in the movies_content.csv file, ready to be used by the CBF methods.
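As a rough illustration of this step, the following minimal sketch builds the content matrix with pandas. It assumes the MovieLens 100K `u.item` layout (pipe-separated, five metadata fields followed by 19 genre flags). The function name `build_movies_content`, the column names, and the `movie_ids` argument (the ids of the movies present in the utility matrix) are hypothetical and are not taken from the original script.

```python
import io
import pandas as pd

# Genre names in the order listed in the text (assumed to match the file's flag order).
GENRES = ['unknown', 'action', 'adventure', 'animation', "children's", 'comedy',
          'crime', 'documentary', 'drama', 'fantasy', 'film noir', 'horror',
          'musical', 'mystery', 'romance', 'sci-fi', 'thriller', 'war', 'western']

def build_movies_content(item_file, movie_ids):
    """Return a dataframe holding one binary genre vector per requested movie id.

    item_file: open text handle on a u.item-style file (assumed layout).
    movie_ids: ids of the movies in the utility matrix, in the desired row order.
    """
    meta = ['movie_id', 'title', 'release_date', 'video_release_date', 'imdb_url']
    items = pd.read_csv(item_file, sep='|', header=None, names=meta + GENRES)
    # Keep only the requested movies, in the requested order
    # (raises KeyError if an id is missing from the file).
    items = items.set_index('movie_id').loc[list(movie_ids)]
    return items[GENRES].astype(int)

# Possible usage (the path is an assumption):
# with open('ml-100k/u.item', encoding='latin-1') as f:
#     movies_content = build_movies_content(f, movie_ids)
# movies_content.to_csv('movies_content.csv')
```

Passing an open handle rather than a path leaves the choice of file encoding to the caller and makes the function easy to try on a small in-memory sample.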
The goal of a content-based recommendation system is to generate a profile for each user with the same fields as the items, indicating how much the user likes each genre. The drawback of this method is that a content description of the items is not always available, so the technique cannot always be employed in an e-commerce environment. The advantage is that the recommendations for a specific user are independent of the other users' ratings, so the method does not suffer from cold start problems caused by an insufficient number of user ratings for particular items.
Two approaches are discussed. The first simply builds each user's profile from the average ratings that the user gave to the movies of each genre, and uses cosine similarity to find the movies most similar to the user's preferences. The second is a regularized linear regression model that learns the user's profile features from the ratings and the movie features, so that the ratings of the movies not yet seen by each user can be predicted from these profiles.
The first approach is simple, and we explain it using the features that describe the movies in the MovieLens example, as discussed previously. The objective of the method is to generate the genre preferences vector for each user i (of length G). This is done by first calculating the user's average rating; each genre entry g is then given by the sum, over the movies seen by user i (Mi) that contain genre g, of the user's ratings minus that average, divided by the number of those movies:
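Written out from this description, with $v_{ig}$ for the profile entry, $r_{ik}$ for the rating given by user i to movie k, and $\bar{r}_i$ for the average rating of user i (all three symbol names are assumptions), the entry reads:

```latex
v_{ig} = \frac{\sum_{k=1}^{M_i} \left( r_{ik} - \bar{r}_i \right) I_{kg}}{\sum_{k=1}^{M_i} I_{kg}}
```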
Here, $I_{kg}$ is 1 if movie k contains genre g; otherwise it is 0.
The resulting user vector is then compared to the binary vectors $m_j$ using cosine similarity, and the movies with the highest similarity values are recommended to user i. The implementation of the method is given by the following Python class:
The constructor stores the list of the movie titles in Movieslist and the movie features in the Movies vector. The GetRecMovies function generates the user's genre preferences vector (applying the preceding formula), called features_u, and returns the items most similar to this vector.
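The class described above might be sketched as follows. This is a minimal sketch that reuses the names mentioned in the text (Movieslist, Movies, GetRecMovies, features_u). The class name `AverageProfileCBF`, the `top_n` parameter, the convention that 0 marks an unrated movie, and the choice to exclude movies the user has already rated are assumptions.

```python
import numpy as np

class AverageProfileCBF:
    """Content-based recommender built on mean-centered genre averages (sketch)."""

    def __init__(self, Movies, Movieslist):
        self.Movies = np.asarray(Movies, dtype=float)   # (n_movies, G) binary genres
        self.Movieslist = list(Movieslist)              # titles, same row order

    def GetRecMovies(self, u_vec, top_n=5):
        """Return up to top_n unrated titles ranked by cosine similarity.

        u_vec holds one rating per movie, with 0 for unrated movies
        (assumed convention); at least one rating is required.
        """
        u_vec = np.asarray(u_vec, dtype=float)
        rated = u_vec > 0
        # Ratings minus the user's average, zero where the movie is unrated.
        centered = np.where(rated, u_vec - u_vec[rated].mean(), 0.0)
        sums = centered @ self.Movies                   # per-genre sum of centered ratings
        counts = rated.astype(float) @ self.Movies      # rated movies per genre
        features_u = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
        # Cosine similarity between the profile and every movie vector.
        norms = np.linalg.norm(self.Movies, axis=1) * np.linalg.norm(features_u)
        sims = np.divide(self.Movies @ features_u, norms,
                         out=np.zeros(len(self.Movies)), where=norms > 0)
        sims[rated] = -np.inf                           # skip movies already rated
        order = np.argsort(-sims)[:top_n]
        return [self.Movieslist[i] for i in order if np.isfinite(sims[i])]
```

Genres with no rated movie keep a profile entry of 0, so they neither raise nor lower the similarity of a candidate movie.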
The second method learns the movie preferences of the users as the parameters of a linear model, $\theta_i \in \mathbb{R}^{G+1}$ for $i = 1, \dots, N$, where N is the number of users and G is the number of features (movie genres) of each item. We add an intercept entry to the user parameters $\theta_i$ ($\theta_{i0} = 1$) and to the movie vector $m_j$, which takes the same value $m_{j0} = 1$, so that $m_j \in \mathbb{R}^{G+1}$. To learn the parameter vectors $\theta_i$, we solve the following regularized minimization problem:
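One way to write this problem, consistent with the definitions given here, is a squared error over the observed ratings plus an L2 penalty on the non-intercept parameters. The squared-error form, the factors of 1/2, and the symbol M for the number of movies are assumptions:

```latex
\min_{\theta_1, \dots, \theta_N} \;
\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{M} I_{ij} \left( \theta_i^{T} m_j - r_{ij} \right)^2
+ \frac{\lambda}{2} \sum_{i=1}^{N} \sum_{k=1}^{G} \theta_{ik}^2
```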
Here, $I_{ij}$ is 1 if user i has watched movie j and 0 otherwise, and λ is the regularization parameter (see Chapter 3, Supervised Machine Learning).
The solution is given by applying gradient descent (see Chapter 3, Supervised Machine Learning). For each user i:
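Assuming a regularized squared-error loss over the observed ratings, and writing α for the learning rate and M for the number of movies (both symbol names are assumptions), the update rules can be sketched as:

```latex
\theta_{i0} \leftarrow \theta_{i0} - \alpha \sum_{j=1}^{M} I_{ij} \left( \theta_i^{T} m_j - r_{ij} \right) m_{j0}

\theta_{ik} \leftarrow \theta_{ik} - \alpha \left( \sum_{j=1}^{M} I_{ij} \left( \theta_i^{T} m_j - r_{ij} \right) m_{jk} + \lambda \theta_{ik} \right), \qquad k = 1, \dots, G
```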
Since we add an entry equal to 1 to both the movie and user vectors, the intercept parameter (k=0) must be treated separately from the others during learning: the intercept cannot cause overfitting, so there is no need to regularize it. After the parameters $\theta_i$ are learned, the recommendation is performed by simply computing $r_{ij} = \theta_i^{T} m_j$ for any missing rating $r_{ij}$.
The method is implemented by the following code:
The constructor of the class CBF_regression performs the gradient descent to find the parameters $\theta_i$ (stored in Pmatrix), while the function CalcRatings finds the most similar rating vector in the stored utility matrix R (in case the user is not present in the utility matrix) and then uses the corresponding parameter vector to predict the missing ratings.
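The behaviour just described might look like the following minimal sketch, which reuses the names from the text (CBF_regression, Pmatrix, CalcRatings). The hyperparameters `lam`, `alpha`, and `n_iter`, the zero initialization, the convention that 0 marks an unrated movie, and the use of cosine similarity to pick the closest stored user are all assumptions.

```python
import numpy as np

class CBF_regression:
    """Regularized linear regression on movie features, one parameter row per user (sketch)."""

    def __init__(self, Movies, R, lam=0.1, alpha=0.01, n_iter=2000):
        self.R = np.asarray(R, dtype=float)             # (N, M) ratings, 0 = unrated
        Movies = np.asarray(Movies, dtype=float)        # (M, G) binary genres
        # Prepend the constant feature m_j0 = 1 to every movie vector.
        self.Movies = np.hstack([np.ones((len(Movies), 1)), Movies])
        I = (self.R > 0).astype(float)                  # indicator of observed ratings
        self.Pmatrix = np.zeros((self.R.shape[0], self.Movies.shape[1]))
        for _ in range(n_iter):
            # Residuals on observed ratings only.
            err = I * (self.Pmatrix @ self.Movies.T - self.R)
            grad = err @ self.Movies                    # (N, G + 1)
            grad[:, 1:] += lam * self.Pmatrix[:, 1:]    # the intercept is not regularized
            self.Pmatrix -= alpha * grad

    def CalcRatings(self, u_vec):
        """Fill the missing ratings of u_vec using the most similar stored user."""
        u_vec = np.asarray(u_vec, dtype=float)
        norms = np.linalg.norm(self.R, axis=1) * np.linalg.norm(u_vec)
        sims = np.divide(self.R @ u_vec, norms,
                         out=np.zeros(len(self.R)), where=norms > 0)
        nearest = int(np.argmax(sims))                  # closest stored rating vector
        predicted = self.Pmatrix[nearest] @ self.Movies.T
        # Keep the known ratings and predict only the missing ones.
        return np.where(u_vec > 0, u_vec, predicted)
```

The fixed step size and iteration count are only suitable for small examples. On the full utility matrix they would need tuning, or a convergence check on the objective.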