Improving the movie-rating system

We don't want to build a recommendation engine on a system that considers the likely straight-to-DVD Santa with Muscles to be generally superior to Casablanca. Improving the naïve scoring approach used previously is therefore the focus of this recipe.

Getting ready

Make sure that you have completed the previous recipes in this chapter first.

How to do it…

The following steps implement and test a new movie-scoring algorithm:

  1. Let's implement a new Bayesian movie-scoring algorithm as shown in the following function, adding it to the MovieLens class:
    def bayesian_average(self, c=59, m=3):
         """
         Reports the Bayesian average with parameters c and m.
         """
         for movieid in self.movies:
             reviews = [r['rating'] for r in self.reviews_for_movie(movieid)]
             # Blend c pseudo-ratings of value m with the observed ratings
             average = ((c * m) + sum(reviews)) / float(c + len(reviews))
             yield (movieid, average, len(reviews))
    
  2. Next, we will replace the top_rated method in the MovieLens class with the following version, which uses the new bayesian_average method from the preceding step:
    def top_rated(self, n=10):
         """
         Returns the n top-rated movies, ranked by Bayesian average
         """
         # Assumes heapq and operator.itemgetter are imported at module level
         return heapq.nlargest(n, self.bayesian_average(), key=itemgetter(1))
    
  3. Printing our new top-10 list gives a result that looks more familiar, and Casablanca is now happily rated number 4:
     [4.234 average rating (583 reviews)] Star Wars (1977)
     [4.224 average rating (298 reviews)] Schindler's List (1993)
     [4.196 average rating (283 reviews)] Shawshank Redemption, The (1994)
     [4.172 average rating (243 reviews)] Casablanca (1942)
     [4.135 average rating (267 reviews)] Usual Suspects, The (1995)
     [4.123 average rating (413 reviews)] Godfather, The (1972)
     [4.120 average rating (390 reviews)] Silence of the Lambs, The (1991)
     [4.098 average rating (420 reviews)] Raiders of the Lost Ark (1981)
     [4.082 average rating (209 reviews)] Rear Window (1954)
     [4.066 average rating (350 reviews)] Titanic (1997)
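
The effect of the steps above can be reproduced without the MovieLens class. The following is a minimal sketch on made-up data: the `ratings` dictionary, its titles, and the free functions `naive_average`, `bayesian_average`, and `top_rated` are assumptions for illustration only, not the chapter's code.

```python
import heapq
from operator import itemgetter

# Hypothetical data: movie title -> list of star ratings (1 to 5).
ratings = {
    "Obscure Film": [5],                    # a single enthusiastic review
    "Well-Known Classic": [5, 4] * 50,      # 100 reviews, mean 4.5
    "Mediocre Film": [3, 2, 3, 3, 2] * 8,   # 40 reviews, mean 2.6
}

def naive_average(ratings):
    """Yield (title, plain mean, review count) for each movie."""
    for title, stars in ratings.items():
        yield (title, sum(stars) / float(len(stars)), len(stars))

def bayesian_average(ratings, c=59, m=3):
    """Yield (title, Bayesian average, review count) for each movie."""
    for title, stars in ratings.items():
        score = ((c * m) + sum(stars)) / float(c + len(stars))
        yield (title, score, len(stars))

def top_rated(scores, n=10):
    """Return the n highest-scoring tuples, best first."""
    return heapq.nlargest(n, scores, key=itemgetter(1))

naive_top = top_rated(naive_average(ratings), n=1)[0][0]
bayes_top = top_rated(bayesian_average(ratings), n=1)[0][0]
print(naive_top)
print(bayes_top)
```

With the plain mean, the single five-star review puts the obscure title first; with the Bayesian average, the heavily reviewed title takes the top spot.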
    

How it works…

Taking the plain average of movie reviews, as shown in the previous recipe, did not work because some movies do not have enough ratings to be compared meaningfully with movies that have many. Ideally, every movie critic would rate every movie. Since that is impossible, we could instead estimate how a movie would be rated if an infinite number of people rated it; however, this is hard to infer from a single data point. A more practical goal is to estimate how each movie would be rated if every movie had received the same, average number of ratings (for example, by filtering our results based on the number of reviews).

This estimate can be computed with a Bayesian average, implemented in the bayesian_average() function, to infer these ratings based on the following equation:

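\[
\bar{r} = \frac{C \cdot m + \sum_{i=1}^{n} r_i}{C + n}
\]

Here, r_1 through r_n are the n ratings that the movie has received, m is our prior for the average number of stars, and C is a confidence parameter that acts as the number of pseudo-observations behind the prior: the estimate behaves as if the movie had already received C ratings of m stars each.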
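
To make the roles of m and C concrete, here is a small sketch that applies the same arithmetic as bayesian_average() to a hypothetical movie whose reviewers all award 5 stars. The standalone function and its `total` and `n` parameters are assumptions made for illustration; the chapter's method works on the MovieLens data instead.

```python
def bayesian_average(total, n, c=59, m=3):
    """Bayesian average from the sum of ratings (total) and their count (n)."""
    return ((c * m) + total) / float(c + n)

# A hypothetical movie that only ever receives 5-star ratings.
for n in (0, 1, 59, 590):
    print(n, round(bayesian_average(5 * n, n), 3))
```

With no reviews the score equals the prior m = 3, and a single 5-star review barely moves it. Once the number of reviews equals C, the score sits exactly halfway between the prior and the observed mean (4.0 here), and it approaches the observed mean only as reviews greatly outnumber C.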

Determining priors can be a complicated and magical art. Rather than taking the complex path of fitting a Dirichlet distribution to our data, we can simply choose an m prior of 3 for our 5-star rating system, which means that our prior assumes that ratings tend to cluster around the midpoint of the scale. In choosing C, you are expressing how many reviews are needed to move away from the prior; a reasonable value is the average number of reviews per movie, which we can compute as follows:

print(float(sum(num for mid, avg, num in model.average_reviews())) / len(model.movies))

This gives us an average of 59.4 reviews per movie, which we round down to 59 and use as the default value of c in our function definition.
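
The same calculation can be sketched on a small, made-up data set. The tuples below are hypothetical; they only mirror the (movieid, average, count) shape that the one-liner above unpacks from average_reviews(), and the sketch assumes one tuple per movie.

```python
# Hypothetical (movieid, average rating, review count) tuples.
average_reviews = [
    (1, 3.9, 452),
    (2, 3.2, 131),
    (3, 5.0, 1),
    (4, 2.4, 12),
]

def mean_review_count(rows):
    """Mean number of reviews per movie, a candidate value for C."""
    rows = list(rows)  # allow generators as well as lists
    return sum(num for _mid, _avg, num in rows) / float(len(rows))

c = mean_review_count(average_reviews)
print(c)
```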

There's more…

Play around with the C parameter. You should find that if you lower it to C = 50, the top-10 list shifts subtly: Schindler's List and Star Wars swap places, as do Raiders of the Lost Ark and Rear Window. In both pairs, the movie that moves up has roughly half as many reviews as the one it overtakes (298 versus 583, and 209 versus 420), which shows that the higher value of C was pulling the less-reviewed movie more strongly toward the prior.
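
One way to explore this without the data set is to work backward from the printed top-10 list. The sketch below is an approximation under stated assumptions: it inverts the Bayesian average to recover each movie's rating total from the rounded scores shown earlier (computed with c=59 and m=3), then re-scores with a different C. The helper names `rating_total` and `ranking` are made up for this example.

```python
C_DEFAULT, M = 59, 3

# (title, Bayesian average at C=59, review count), copied from the top-10 list.
listed = [
    ("Star Wars (1977)", 4.234, 583),
    ("Schindler's List (1993)", 4.224, 298),
]

def rating_total(score, n, c=C_DEFAULT, m=M):
    """Invert the Bayesian average to recover the approximate sum of ratings."""
    return score * (c + n) - c * m

def bayesian_average(total, n, c, m=M):
    """Bayesian average from a rating total and a review count."""
    return ((c * m) + total) / float(c + n)

def ranking(c):
    """Titles ordered best first when scored with confidence parameter c."""
    scored = [(title, bayesian_average(rating_total(s, n), n, c))
              for title, s, n in listed]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [title for title, _score in scored]

for c in (59, 50):
    print(c, ranking(c))
```

Because the printed scores are rounded to three decimals, the recovered totals are only approximate; the gap between these two movies is large enough for the swap at C = 50 to show up anyway.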
