Computing the correlation between users

In the previous recipe, we used one of many possible distance measures to capture the distance between the movie reviews of users. The distance between two specific users does not change whether there are five or five million other users in the dataset.

In this recipe, we will compute the correlation between users in the preference space. As with distance metrics, there are many correlation metrics; the most popular of these are the Pearson and Spearman correlations and cosine distance. Unlike distance metrics, the correlation will change depending on the number of users and movies.

Getting ready

We will continue building on the previous recipes, so make sure you understand each one.

How to do it…

The following pearson_correlation function computes the Pearson correlation for two critics, criticA and criticB, and is added to the MovieLens class. Note that it uses sqrt, so from math import sqrt must appear at the top of the module:

     def pearson_correlation(self, criticA, criticB):
         """
         Returns the Pearson Correlation of two critics, A and B by
         performing the PPMC calculation on the scatter plot of (a, b)
         ratings on the shared set of critiqued titles.
         """

         # Get the set of mutually rated items
         preferences = self.shared_preferences(criticA, criticB)

         # Store the length to save traversals of the len computation.
         # If they have no rankings in common, return 0.
         length = len(preferences)
         if length == 0: return 0

         # Loop through the preferences of each critic once and compute the
         # various summations that are required for our final calculation.
         sumA = sumB = sumSquareA = sumSquareB = sumProducts = 0
         for a, b in preferences.values():
             sumA += a
             sumB += b
             sumSquareA  += pow(a, 2)
             sumSquareB  += pow(b, 2)
             sumProducts += a*b

         # Calculate Pearson Score
         numerator   = (sumProducts*length) - (sumA*sumB)
         denominator = sqrt(((sumSquareA*length) - pow(sumA, 2)) * ((sumSquareB*length) - pow(sumB, 2)))

         # Prevent division by zero.
         if denominator == 0: return 0

         return abs(numerator / denominator)
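
Hand-rolled statistical functions are easy to get subtly wrong, so it is worth sanity-checking this method against SciPy's implementation on the same pair of rating vectors. The following is a minimal sketch, assuming the model object and the shared_preferences method from the previous recipes are available; the critic IDs are arbitrary:

     from scipy.stats import pearsonr

     # Unpack the shared (a, b) rating pairs into two parallel lists.
     preferences = model.shared_preferences(232, 532)
     a_ratings = [a for a, b in preferences.values()]
     b_ratings = [b for a, b in preferences.values()]

     # pearsonr returns the signed coefficient and a two-tailed p-value;
     # our method returns the absolute value of the coefficient.
     r, p_value = pearsonr(a_ratings, b_ratings)
     print(abs(r))  # should match model.pearson_correlation(232, 532)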

How it works…

The Pearson correlation computes the "product moment", the mean of the products of the mean-adjusted random variables; it is defined as the covariance of the two variables (a and b, in our case) divided by the product of their standard deviations. As a formula, this looks like the following:

\rho_{a,b} = \frac{\operatorname{cov}(a, b)}{\sigma_a \, \sigma_b}

For a finite sample, which is what we have, the detailed formula implemented in the preceding function is as follows:

r = \frac{n \sum a_i b_i - \sum a_i \sum b_i}{\sqrt{\left(n \sum a_i^2 - \left(\sum a_i\right)^2\right)\left(n \sum b_i^2 - \left(\sum b_i\right)^2\right)}}

Here, n is the number of mutually rated titles, which corresponds to the length variable in the code.

Another way to think about the Pearson correlation is as a measure of the linear dependence between two variables. It returns a score between -1 and 1, where scores closer to -1 indicate a stronger negative correlation and scores closer to 1 indicate a stronger positive correlation. A score of 0 means that the two variables are not linearly correlated.
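
To make this range concrete, here is a small, self-contained sketch, independent of the MovieLens class and using made-up ratings, showing that ratings that move in lockstep score 1.0 while ratings that move in opposite directions score -1.0:

     from math import sqrt

     def pearson(xs, ys):
         # Sample Pearson correlation of two equal-length rating lists.
         n = len(xs)
         sum_x, sum_y = sum(xs), sum(ys)
         sum_xx = sum(x * x for x in xs)
         sum_yy = sum(y * y for y in ys)
         sum_xy = sum(x * y for x, y in zip(xs, ys))
         numerator   = (n * sum_xy) - (sum_x * sum_y)
         denominator = sqrt((n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
         return numerator / denominator if denominator else 0

     print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0, a perfect positive correlation
     print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))   # -1.0, a perfect negative correlation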

To allow comparisons between critic pairs, we want to normalize our similarity metrics into the space of [0, 1], where 0 means less similar and 1 means more similar, so we return the absolute value:

>>> print(model.pearson_correlation(232, 532))
0.06025793538385047

There's more…

We have explored two distance metrics: the Euclidean distance and the Pearson correlation. There are many more, including the Spearman correlation, Tanimoto scores, Jaccard distance, cosine similarity, and Manhattan distance, to name a few. Choosing the right distance metric for your recommender's dataset, along with the type of preference expression used, is crucial to the success of this style of recommender. It's up to you to explore this space further based on your interests and your particular dataset; the sketch that follows is one possible starting point.
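
As a starting point for that exploration, the following is a minimal sketch of cosine similarity as another method on the MovieLens class. It reuses the shared_preferences helper from the previous recipe and, like pearson_correlation, assumes sqrt has been imported from math:

     def cosine_similarity(self, criticA, criticB):
         """
         Returns the cosine of the angle between the rating vectors of
         two critics, restricted to the set of mutually rated titles.
         """
         preferences = self.shared_preferences(criticA, criticB)
         if len(preferences) == 0: return 0

         sumProducts = sum(a * b for a, b in preferences.values())
         normA = sqrt(sum(a * a for a, b in preferences.values()))
         normB = sqrt(sum(b * b for a, b in preferences.values()))

         # Prevent division by zero.
         if normA == 0 or normB == 0: return 0

         return sumProducts / (normA * normB)

Because movie ratings are non-negative, this score already falls in the range [0, 1], so no absolute value is needed.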
