Collaborative filtering

Having covered distances, we are ready to delve into the topic of collaborative filtering, which will help us define a strategy for making recommendations. Collaborative filtering describes an algorithm, or more precisely a family of algorithms, that aims to create recommendations for a test user given only information about the ratings of other users via the rating matrix, as well as any ratings that the test user has already made.

There are two very common variants of collaborative filtering: memory-based collaborative filtering and model-based collaborative filtering. With memory-based collaborative filtering, the entire history of all the ratings made by all the users is remembered and must be processed in order to make a recommendation. The prototypical memory-based collaborative filtering method is user-based collaborative filtering. Although this approach uses all the ratings available, the downside is that it can be computationally expensive, as the entire database is used in order to make rating predictions for our test user.

The alternative approach to this is embodied in model-based collaborative filtering. Here, we first create a model of the rating preferences of our users, such as a set of clusters of users who like similar items, and then use the model to generate recommendations. We will study item-based collaborative filtering, which is the most well-known model-based collaborative filtering method.

User-based collaborative filtering

User-based collaborative filtering is commonly described as a memory-based or lazy learning approach. Unlike most of the models we have built in this book, which assume that we will fit the data to a particular model and then use this model to make predictions, lazy learning simply uses the training data itself to make predictions directly. We saw an example of lazy learning with k-nearest neighbors in Chapter 1, Gearing Up for Predictive Modeling. In fact, the premise of the user-based collaborative filtering approach builds directly on the k-nearest neighbors approach.

Concretely, in user-based collaborative filtering, when we want to make recommendations for a new user, we will first pick a set of similar users using a particular distance metric. Then, we try to infer the ratings that our target user would assign to items that he or she has not yet rated as an average of the ratings made by these similar users on those items. We usually refer to this set of similar users as the user's neighborhood. Thus, the idea is that a user will prefer items that their neighborhood prefers.

Typically, there are two ways to define the user's neighborhood. We can compute a fixed neighborhood by finding the k-nearest neighbors. These are the k users in our database that have the smallest distance between them and our target user.

Alternatively, we can specify a similarity threshold and pick all the users in our database whose distance from our target user does not exceed this threshold. This second approach has the advantage that we will be making recommendations using users that are as close to our target user as we want, and therefore our confidence in our recommendation can be high. On the other hand, there might be very few users that satisfy our requirement, meaning that we will be relying on the recommendations of these few users. Worse, there might be no users in our database who are sufficiently similar to our target user and we might not be able to actually make a recommendation at all. If we don't mind our method sometimes failing to make a recommendation, for example because we have a backup plan to handle these cases, the second approach might be a good choice.
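The two ways of forming a neighborhood can be sketched in R as follows. The toy rating matrix, the target user, and the helper names knn_neighborhood() and threshold_neighborhood() are illustrative assumptions, not part of the text:

```r
# A small made-up rating matrix: one row per user, one column per item
ratings <- rbind(
  anna  = c(5, 3, 4, 1),
  bob   = c(4, 3, 5, 1),
  carla = c(1, 5, 2, 4),
  dave  = c(5, 2, 4, 2)
)
target <- c(5, 3, 5, 1)  # ratings made by our target user

# Euclidean distance between the target user and every user in the matrix
dists <- apply(ratings, 1, function(u) sqrt(sum((u - target)^2)))

# Fixed-size neighborhood: the k users with the smallest distance
knn_neighborhood <- function(dists, k) names(sort(dists))[seq_len(k)]

# Threshold neighborhood: all users within a maximum allowed distance
threshold_neighborhood <- function(dists, max_dist) names(dists[dists <= max_dist])

knn_neighborhood(dists, 2)
threshold_neighborhood(dists, 1.5)
```

Note that the first function always returns exactly k users, whereas the second can return anywhere from zero users to the whole database, which is exactly the trade-off discussed above.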

Another important consideration in a real-world setting is the problem of sparse ratings. In our simple restaurant example, every user had rated every restaurant. This rarely, if ever, happens in a real situation, simply because the number of items is usually too large for any user to rate them all. On e-commerce websites such as amazon.com, for example, it is easy to imagine that even the most prolific user has rated only a small fraction of the overall number of products on sale.

To compute distance metrics between users in order to determine similarity, we usually incorporate only the items that both users have rated. Consequently, in practice, we often make comparisons between users in a smaller number of dimensions.

Once we have decided on a distance metric and how to form a neighborhood of users similar to our test user, we then use this neighborhood to compute the missing ratings for the test user. The easiest way to do this is to simply compute the average rating for each item in the user neighborhood and report this value. Thus, for a test user t and an item j that the test user has not yet rated, we can predict the test user's rating for that item, r̂(t, j), as follows:

r̂(t, j) = (1 / |N(t)|) Σ_{u ∈ N(t)} r(u, j)

where N(t) is the neighborhood of the test user t and r(u, j) is the rating made by user u on item j.

This equation expresses the simple idea that the predicted rating of our test user t for item j is just the average of the ratings made by the test user's neighborhood on this item. Suppose we had a new user for our restaurant scenario and this user had already rated a few of the restaurants. Then imagine that from these ratings, we discovered that our new user's neighborhood comprised Oliver and Thibault. If we wanted to predict the rating the test user would give to the restaurant El Pollo Loco, we would average the ratings of Oliver and Thibault for this restaurant, which in this case would be the average of 2 and 4, yielding a rating of 3.
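A minimal sketch of this averaging rule in R; the ratings of 2 and 4 for Oliver and Thibault come from the example above, while the helper name predict_rating() is ours:

```r
# Ratings made on El Pollo Loco by the test user's neighborhood
neighborhood_ratings <- c(oliver = 2, thibault = 4)

# Predicted rating: the plain average over the neighborhood
predict_rating <- function(r) mean(r)

predict_rating(neighborhood_ratings)  # the average of 2 and 4 is 3
```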

If our objective was to obtain a top-N list of recommendations for our user, we would repeat this process for all the items in the database, sort them by descending rating so that the highest rated items appeared first, and then pick out the top N items from this list. In practice, we would only need to check the items that at least one of the users in the new user's neighborhood has rated in order to simplify this computation.
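The top-N step described above might be sketched as follows, assuming a small made-up matrix holding only the items that at least one neighbor has rated:

```r
# Made-up ratings by the two neighborhood users on three candidate items
neighborhood <- rbind(
  oliver   = c(ElPolloLoco = 2, FullMoon = 5, Acropolis = 3),
  thibault = c(ElPolloLoco = 4, FullMoon = 1, Acropolis = 5)
)

# Average each item over the neighborhood, sort by descending predicted
# rating, and return the names of the first n items
top_n <- function(mat, n) {
  predicted <- colMeans(mat)
  names(sort(predicted, decreasing = TRUE))[seq_len(n)]
}

top_n(neighborhood, 2)
```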

We can make some improvements to this very simple approach. A first possible improvement comes from the observation that some users will tend to consistently rate items more strictly or more leniently than other users, and we would like to smooth out this variation. In practice, we often use Z-score normalization, which takes into account the variance of the ratings. We can also center each rating made by a user by subtracting that user's average rating across all the items. In the rating matrix, this means subtracting the mean of each row from the elements of the row. Let's apply this last transformation to our restaurant rating matrix and see the results:

> centered_rm <- t(apply(ratingMatrix, 1, function(x) x - mean(x)))
> centered_rm
         Berny's La Traviata El Pollo Loco Joey's Pizza
oliver     -4.00       -4.00         -3.00          0.0
thibault   -0.12        3.88         -1.12         -4.1
maria      -3.50       -0.50         -2.50          0.5
pedro      -3.12        0.88          1.88         -3.1
ines       -4.12       -2.12         -3.12         -1.1
gertrude   -3.25        1.75          0.75          2.8
         The Old West Jake and Jill Full Moon Acropolis
oliver           2.00           3.0      4.00      2.00
thibault        -4.12           1.9     -0.12      3.88
maria            3.50           1.5     -2.50      3.50
pedro            0.88          -4.1      2.88      3.88
ines             2.88           3.9      1.88      1.88
gertrude        -1.25          -2.2      0.75      0.75
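Z-score normalization, mentioned above, can be applied with the same apply() pattern: in addition to subtracting each user's mean, we divide by the standard deviation of that user's ratings. The toy matrix below is made up for illustration:

```r
# A made-up rating matrix: one row per user
rm_toy <- rbind(ines = c(1, 3, 2, 4), gertrude = c(1, 6, 5, 8))

# Z-score normalize each row: subtract the row mean, divide by the row sd
zscore_rm <- t(apply(rm_toy, 1, function(x) (x - mean(x)) / sd(x)))

rowMeans(zscore_rm)       # every row now has mean 0
apply(zscore_rm, 1, sd)   # ... and standard deviation 1
```

Unlike plain centering, this also smooths out differences in how widely each user's ratings vary around their own average.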

Even though Ines and Gertrude originally gave Berny's the same rating of 1, the centering operation leaves Ines with a lower score for this restaurant than Gertrude. This is because Ines tends to make higher ratings on average than Gertrude, so a rating of 1 from Ines can be interpreted as a stronger negative signal than the same rating from Gertrude.

Another area of improvement concerns the way in which we incorporate the ratings of our new user's neighborhood to create the final ratings. By treating the ratings of all the neighboring users as equal, we are ignoring the fact that our distance metric may show that certain users in the neighborhood of the new user are more similar to the new user than others.

As we have already seen from the example of Jaccard similarity and Jaccard distance, we can often define a similarity metric from a distance metric by inverting it in some way, such as subtracting from one or taking the reciprocal. Consequently, for the distance metric of our choice, we can define its corresponding similarity metric and denote it with sim(u,t). A user similarity metric takes high values for similar users, which are users for whom a distance metric takes low values.

With this clarification established, we can incorporate the similarity between users u and t in our previous equation by taking a weighted average of the ratings made by the neighboring users of the new user as follows:

r̂(t, j) = ( Σ_{u ∈ N(t)} sim(u, t) · r(u, j) ) / ( Σ_{u ∈ N(t)} sim(u, t) )
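A sketch of this weighted average in R; the similarity values below are invented purely for illustration:

```r
# Weighted average of neighborhood ratings: each rating is weighted by the
# neighbor's similarity to the test user, normalized by the total similarity
weighted_prediction <- function(ratings, sims) {
  sum(sims * ratings) / sum(sims)
}

ratings <- c(oliver = 2, thibault = 4)
sims    <- c(oliver = 0.9, thibault = 0.3)  # oliver is the closer neighbor

weighted_prediction(ratings, sims)  # pulled toward oliver's rating of 2
```

Compare this with the unweighted average of 3 from the earlier example: because Oliver is assumed to be much more similar to the test user, the prediction of 2.5 sits closer to his rating.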

Other reasons why we might want to weight the ratings made by other users include trust. For example, we might trust a user who has been using our restaurant recommendation service for a long time more than a recent arrival. Equally, we might also want to consider the total number of items that a user has rated in common with the new user. For example, if a user has rated only two items in common with the new user, then even if the corresponding ratings are identical, the evidence that these two users are indeed very similar is weak.

All in all, the single largest difficulty with user-based collaborative filtering is that making recommendations for a test user requires access to the whole database of users in order to determine the user neighborhood. This is done by performing a similarity computation between the test user and every other user, a computationally expensive process. Next, we'll look at item-based collaborative filtering, which attempts to ameliorate this situation.

Item-based collaborative filtering

Item-based collaborative filtering is a model-based approach to collaborative filtering. The central idea underlying this method is that instead of looking at other users similar to the test user, we will directly recommend items that are similar to the items that have received a high rating by the test user. As we are directly comparing items instead of first comparing users in order to recommend items, we can build up a model to describe the similarity of items and then use the model rather than the entire database to make recommendations.

The process of building an item-based similarity model involves computing a similarity matrix for all pairs of items in our database. If we have N items, then we will end up with a similarity matrix with N² elements in total. To reduce the size of our model, we can store a list of the similarity values of the top k most similar items for every item in the database.

As k will be far smaller than N, we will have a very substantial reduction in the size of the data that we need to keep for our model. For every item in our database, this list of the k most similar items is analogous to the neighborhood of users for the user-based collaborative filtering approach. The same discussion regarding normalizing with respect to the bias and variance of user ratings in user-based collaborative filtering can be applied here. That is, we can compute item-to-item similarities after we normalize our rating matrix.
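Building such a model might be sketched as follows, using cosine similarity as an assumed choice of item similarity metric; the toy user-by-item matrix and all the names are illustrative:

```r
# Cosine similarity between two item rating vectors
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# A made-up rating matrix: rows are users, columns are items
rm_toy <- cbind(A = c(5, 4, 1), B = c(4, 5, 2), C = c(1, 2, 5))

# Full N-by-N item similarity matrix
n_items <- ncol(rm_toy)
sim_matrix <- outer(seq_len(n_items), seq_len(n_items),
                    Vectorize(function(i, j) cosine_sim(rm_toy[, i], rm_toy[, j])))
dimnames(sim_matrix) <- list(colnames(rm_toy), colnames(rm_toy))

# The model: for each item, keep only its k most similar other items
k <- 1
model <- lapply(colnames(rm_toy), function(item) {
  sims <- sim_matrix[item, setdiff(colnames(rm_toy), item)]
  sort(sims, decreasing = TRUE)[seq_len(k)]
})
names(model) <- colnames(rm_toy)
```

Only the truncated lists in `model` need to be kept at recommendation time; the full N-by-N matrix can be discarded once the top k lists have been extracted.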

This approach is not without its shortcomings. In the memory-based recommender, a new user rating can automatically be incorporated into the recommendation process because that approach uses the entire database (rating matrix). Model-based collaborative filtering requires us to periodically retrain the model to incorporate information from these new ratings. In addition, the fact that the modeling process discards some information from the original rating matrix by retaining only a short list of the most similar items for each item in our database, means that it can sometimes make non-optimal recommendations.

Despite these drawbacks, the space and time performance of item-based collaborative filtering means that it has been very successfully applied in a large number of real-world settings. Model retraining can be done offline and automatically scheduled, and the non-optimality of recommendations can often be tolerated.

We can devise an analogous equation to what we saw for user-based collaborative filtering that explains how to predict a new rating using the item-based collaborative filtering model. Suppose we want to estimate the rating, r̂(t, i), that our test user, t, would give to an item i. Suppose also that we have already chosen a similarity function, sim(i,j), between a pair of items i and j, and from this we constructed our model. Using the model, we can retrieve the stored item neighborhood, S(i), for the item in which we are interested. To compute the predicted rating that our test user will make on this item, we calculate the weighted average of the ratings our user has made on items that are similar to it:

r̂(t, i) = ( Σ_{j ∈ S(i)} sim(i, j) · r(t, j) ) / ( Σ_{j ∈ S(i)} sim(i, j) )

While this approach won't work if the user hasn't rated any items similar to the item in question, it does not require finding users that have similar preferences to the test user.
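A sketch of this prediction rule in R; the test user's ratings and the stored similarities for the neighborhood S(i) are invented values:

```r
# Predict the test user's rating for item i from the ratings the user has
# made on the items in i's stored neighborhood, weighted by similarity
predict_item_rating <- function(user_ratings, item_sims) {
  rated <- intersect(names(user_ratings), names(item_sims))
  if (length(rated) == 0) return(NA)  # user rated nothing similar to i
  sum(item_sims[rated] * user_ratings[rated]) / sum(item_sims[rated])
}

user_ratings <- c(B = 4, C = 2)      # ratings the test user has already made
item_sims    <- c(B = 0.9, C = 0.5)  # stored neighborhood S(i) for item i

predict_item_rating(user_ratings, item_sims)
```

The NA branch makes the failure case discussed above explicit: when the user has rated none of the items in S(i), no prediction can be made for item i.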
