Finding the best critic for a user

Now that we have two different ways to compute a similarity distance between users, we can determine the best critics for a particular user and see how closely their preferences match that user's.

Getting ready

Make sure that you have completed the previous recipes before tackling this one.

How to do it…

Implement a new method for the MovieLens class, similar_critics(), that locates the best match for a user:

import heapq
from operator import itemgetter

 ...

     def similar_critics(self, user, metric='euclidean', n=None):
         """
         Finds and ranks similar critics for the user according to the
         specified distance metric. Returns the top n similar critics
         if n is specified.
         """

         # Metric jump table
         metrics  = {
             'euclidean': self.euclidean_distance,
             'pearson':   self.pearson_correlation,
         }

         distance = metrics.get(metric, None)

         # Handle problems that might occur
         if user not in self.reviews:
             raise KeyError("Unknown user, '%s'." % user)
         if not distance or not callable(distance):
             raise KeyError("Unknown or unprogrammed distance metric '%s'." % metric)

         # Compute user to critic similarities for all critics
         critics = {}
         for critic in self.reviews:
             # Don't compare against yourself!
             if critic == user:
                 continue
             critics[critic] = distance(user, critic)

         if n:
             return heapq.nlargest(n, critics.items(), key=itemgetter(1))
         return critics

How it works…

The similar_critics method, added to the MovieLens class, is the heart of this recipe. It takes the target user plus two optional parameters: the metric to use, which defaults to euclidean, and the number of results to return, which defaults to None. The method uses a jump table to select the algorithm (pass in euclidean or pearson to choose the distance metric). Every other critic is compared to the current user; the user is never compared against themselves. If n is given, heapq.nlargest selects the n highest-scoring critics and returns them as a list of (critic, score) pairs, best first; otherwise, the method returns the full dictionary of scores.
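The jump table and the heapq selection can be tried in isolation. The following is a minimal sketch, not the recipe's code: the names toy_closeness, toy_exact, METRICS, and rank are hypothetical stand-ins, and the two toy metrics compare plain numbers rather than users.

```python
import heapq
from operator import itemgetter

def toy_closeness(a, b):
    """Toy similarity for two numbers: 1.0 when equal, smaller as they diverge."""
    return 1.0 / (1.0 + abs(a - b))

def toy_exact(a, b):
    """Toy similarity: 1.0 for an exact match, 0.0 otherwise."""
    return 1.0 if a == b else 0.0

# Jump table: metric name -> function implementing it (hypothetical metrics)
METRICS = {'closeness': toy_closeness, 'exact': toy_exact}

def rank(target, others, metric='closeness', n=None):
    score = METRICS.get(metric)
    if score is None:
        raise KeyError("Unknown metric '%s'." % metric)
    # Score every candidate against the target
    results = {name: score(target, value) for name, value in others.items()}
    if n:
        # Top n (name, score) pairs, highest score first
        return heapq.nlargest(n, results.items(), key=itemgetter(1))
    return results

others = {'a': 3, 'b': 7, 'c': 4}
top_two = rank(4, others, n=2)         # list of (name, score) pairs
everything = rank(4, others, 'exact')  # dict of all scores
```

As in similar_critics, the return type depends on n: a list of pairs when n is given, a dictionary otherwise, so callers need to handle both shapes.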

To test out our implementation, print out the results of the run for both similarity distances:

>>> for item in model.similar_critics(232, 'euclidean', n=10):
...     print("%4i: %0.3f" % item)
  688: 1.000
  914: 1.000
   47: 0.500
   78: 0.500
  170: 0.500
  335: 0.500
  341: 0.500
  101: 0.414
  155: 0.414
  309: 0.414

>>> for item in model.similar_critics(232, 'pearson', n=10):
...     print("%4i: %0.3f" % item)
   33: 1.000
   36: 1.000
  155: 1.000
  260: 1.000
  289: 1.000
  302: 1.000
  309: 1.000
  317: 1.000
  511: 1.000
  769: 1.000

These scores are clearly very different, and Pearson appears to find many more highly similar users than the Euclidean distance metric does. The Euclidean distance metric tends to favor users who share only a few ratings with the target user, all of them exactly the same. Pearson correlation favors scores that fit well linearly, so it corrects for grade inflation, where two critics rate movies very similarly but one consistently rates them one star higher than the other.
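The grade inflation effect is easy to reproduce with two made-up rating lists. This is a sketch under stated assumptions: the standalone functions euclidean_similarity and pearson_correlation below are hypothetical and are not the MovieLens methods from the earlier recipes, and the 1 / (1 + distance) form is an assumption chosen because it is consistent with the 0.500 and 0.414 scores shown above.

```python
from math import sqrt

def euclidean_similarity(x, y):
    """Assumed form: 1 / (1 + Euclidean distance); 1.0 means identical ratings."""
    distance = sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return 1.0 / (1.0 + distance)

def pearson_correlation(x, y):
    """Pearson correlation of two equal-length rating lists."""
    n = float(len(x))
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined when one list is constant
    return covariance / sqrt(var_x * var_y)

# Made-up ratings for four movies seen by both critics
strict = [1, 2, 3, 4]
generous = [2, 3, 4, 5]  # same taste, always one star higher

euclid = euclidean_similarity(strict, generous)  # distance is 2, so 1/3
pearson = pearson_correlation(strict, generous)  # perfectly linear, so 1.0
```

The Euclidean score penalizes the constant one-star offset, while Pearson ignores it because the two lists rise and fall together.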

If you plot out how many rankings each critic shares with the user, you'll see that the data is very sparse. Here are the top five results from each run with the number of shared rankings appended:

Euclidean scores: 
  688: 1.000 (1 shared rankings)
  914: 1.000 (2 shared rankings)
   47: 0.500 (5 shared rankings)
   78: 0.500 (3 shared rankings)
  170: 0.500 (1 shared rankings)
  
Pearson scores:
   33: 1.000 (2 shared rankings)
   36: 1.000 (3 shared rankings)
  155: 1.000 (2 shared rankings)
  260: 1.000 (3 shared rankings)
  289: 1.000 (3 shared rankings)

Therefore, it is not enough to find similar critics and use their ratings to predict our user's scores; with so few shared rankings, a perfect similarity score may rest on only one or two movies. Instead, we will have to aggregate the scores of all of the critics, regardless of similarity, and predict ratings for the movies the user hasn't rated.
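Shared-ranking counts like the ones listed above could be computed along the following lines. This is only a sketch: it assumes reviews is a dictionary mapping each user ID to a dictionary keyed by movie ID, the data is made up, and shared_count is a hypothetical helper rather than a method from the earlier recipes.

```python
# Hypothetical data: user ID -> {movie ID -> rating}
reviews = {
    1: {10: 4, 11: 3, 12: 5},
    2: {10: 4},
    3: {11: 2, 12: 5, 13: 1},
}

def shared_count(reviews, user, critic):
    """Number of movies that both the user and the critic have rated."""
    return len(set(reviews[user]) & set(reviews[critic]))

# Count shared movies between user 1 and every other critic
counts = {
    critic: shared_count(reviews, 1, critic)
    for critic in reviews
    if critic != 1
}
```

Reporting such a count next to each similarity score makes it obvious when a score of 1.000 is backed by a single movie.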
