Finding the highest-scoring movies

If you're looking for a good movie, you'll often want to see the most popular or best rated movies overall. Initially, we'll take a naïve approach to compute a movie's aggregate rating by averaging the user reviews for each movie. This technique will also demonstrate how to access the data in our MovieLens class.

Getting ready

These recipes are sequential in nature. Thus, you should have completed the previous recipes in the chapter before starting with this one.

How to do it…

Follow these steps to output numeric scores for all movies in the dataset and compute a top-10 list:

  1. Augment the MovieLens class with a new method to get all reviews for a particular movie:
    class MovieLens(object):
    
         ...
    
         def reviews_for_movie(self, movieid):
             """
             Yields the reviews for a given movie
             """
             for review in self.reviews.values():
                 if movieid in review:
                     yield review[movieid]
    
  2. Then, add two more methods: one to average the ratings for each movie and one to compute the top 10 movies reviewed by users:
    import heapq
    from operator import itemgetter
    
    class MovieLens(object):
    
         ...
    
         def average_reviews(self):
             """
             Averages the star rating for all movies.
             Yields a tuple of movieid, the average rating,
             and the number of reviews.
             """
             for movieid in self.movies:
                 reviews = [r['rating'] for r in self.reviews_for_movie(movieid)]
                 if not reviews:
                     continue  # skip unreviewed movies to avoid dividing by zero
                 average = sum(reviews) / float(len(reviews))
                 yield (movieid, average, len(reviews))
    
         def top_rated(self, n=10):
             """
             Returns a list of the n top-rated movies
             """
             return heapq.nlargest(n, self.average_reviews(), key=itemgetter(1))
    

    Tip

    Note that the ... notation just below class MovieLens(object): signifies that we are appending the new methods to the existing MovieLens class rather than redefining it.

  3. Now, let's print the top-rated results:
    for mid, avg, num in model.top_rated(10):
         title = model.movies[mid]['title']
         print("[%0.3f average rating (%i reviews)] %s" % (avg, num, title))
    
  4. Executing the preceding commands in your REPL should produce the following output:
    [5.000 average rating (1 reviews)] Entertaining Angels: The Dorothy Day Story (1996)
    [5.000 average rating (2 reviews)] Santa with Muscles (1996)
    [5.000 average rating (1 reviews)] Great Day in Harlem, A (1994)
    [5.000 average rating (1 reviews)] They Made Me a Criminal (1939)
    [5.000 average rating (1 reviews)] Aiqing wansui (1994)
    [5.000 average rating (1 reviews)] Someone Else's America (1995)
    [5.000 average rating (2 reviews)] Saint of Fort Washington, The (1993)
    [5.000 average rating (3 reviews)] Prefontaine (1997)
    [5.000 average rating (3 reviews)] Star Kid (1997)
    [5.000 average rating (1 reviews)] Marlene Dietrich: Shadow and Light (1996)
    

How it works…

The new reviews_for_movie() method that we added to the MovieLens class iterates through the values of our reviews dictionary (which is keyed by userid), checks whether each user has reviewed the given movieid, and yields that review dictionary if so. We will need this functionality for the next method.
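
To make the lookup concrete, here is a minimal standalone sketch of the same generator pattern. The reviews dictionary is made-up toy data, and its nested userid → movieid → review layout is assumed from the description above rather than taken from the actual MovieLens loader:

```python
# Toy stand-in for MovieLens.reviews (hypothetical data):
# outer keys are user IDs, inner keys are movie IDs.
reviews = {
    1: {10: {'rating': 5}, 20: {'rating': 3}},
    2: {10: {'rating': 4}},
    3: {20: {'rating': 2}, 30: {'rating': 1}},
}

def reviews_for_movie(reviews, movieid):
    """Yield each user's review of movieid, skipping users who did not rate it."""
    for user_reviews in reviews.values():
        if movieid in user_reviews:
            yield user_reviews[movieid]

# Users 1 and 2 rated movie 10; user 3 did not, so only two reviews come back.
ratings_for_10 = sorted(r['rating'] for r in reviews_for_movie(reviews, 10))
```

Because the function is a generator, nothing is scanned until the caller iterates over it, and a movie that nobody rated simply produces an empty sequence.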

With the average_reviews() method, we have created another generator function that goes through all of our movies and their reviews and yields the movie ID, the average rating, and the number of reviews. The top_rated() method then uses the heapq module to select the n movies with the highest average rating without fully sorting the results.
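
The selection step can be tried in isolation. The following sketch uses a few illustrative (movieid, average, count) tuples shaped like the output described above; the values are placeholders, not results computed from the dataset:

```python
import heapq
from operator import itemgetter

# Illustrative (movieid, average rating, number of reviews) tuples.
averages = [(1, 3.878, 452), (2, 3.206, 131), (3, 5.0, 1), (4, 4.457, 243)]

# itemgetter(1) tells nlargest to rank the tuples by their average rating.
top_two = heapq.nlargest(2, averages, key=itemgetter(1))

# nlargest gives the same result as a full descending sort followed by a slice.
same = sorted(averages, key=itemgetter(1), reverse=True)[:2]

# It also accepts any iterable, including a generator, as in top_rated().
from_generator = heapq.nlargest(2, (t for t in averages), key=itemgetter(1))
```

For a small n relative to the number of movies, heapq.nlargest avoids sorting the whole collection, which is why it suits a top-10 query.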

The heapq module is Python's implementation of the heap queue algorithm, also known as the priority queue algorithm. Heaps are binary trees built so that every parent node has a value less than or equal to that of any of its children. The smallest element is therefore always at the root of the tree, where it can be read in constant time, which is a very desirable property. With heapq, Python developers have an efficient means to insert new values into a partially ordered data structure and to retrieve values in sorted order.
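
As a small illustration of these properties, the following sketch pushes a handful of arbitrary integers onto a heap, peeks at the root, and then pops everything off in order:

```python
import heapq

heap = []
for value in [5, 1, 4, 2, 3]:
    # Each push takes O(log n) time and preserves the heap invariant.
    heapq.heappush(heap, value)

# heapq stores the tree in a plain list; index 0 is the root, the minimum.
smallest = heap[0]

# Popping repeatedly removes the current minimum each time,
# so the values come out in ascending order.
ordered = [heapq.heappop(heap) for _ in range(len(heap))]
```

Note that the list itself is only partially ordered while values sit in the heap; the sorted order appears only as items are popped.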

There's more…

Here, we run into our first problem: some of the top-rated movies only have one review (and the same is true of the worst-rated movies). How do you compare Casablanca, which has a 4.457 average rating (243 reviews), with Santa with Muscles, which has a 5.000 average rating (2 reviews)? We are sure that those two reviewers really liked Santa with Muscles, but the high rating for Casablanca is probably more meaningful because more people liked it. Most recommenders with star ratings will simply output the average rating along with the number of reviewers, leaving users to judge the quality for themselves; however, as data scientists, we can do better in the next recipe.
