So, what's collaborative filtering?

Collaborative filtering is based on the idea that, somewhere out there in the world, you have a taste doppelganger: someone who shares the same notions about how good Star Wars is and how awful Love Actually is.

The idea is that you've rated some set of items in a way that's very similar to the way this other person, this doppelganger, has rated them, but then each of you has rated additional items that the other hasn't. Because you've established that your tastes are similar, recommendations can be generated from the items your doppelganger has rated highly but which you haven't rated and vice versa. It's in a way much like digital matchmaking, but with the outcome being songs or products you would like, rather than actual people.

So, in the case of our pregnant high schooler, when she bought the right combination of unscented lotions, cotton balls, and vitamin supplements, she likely found herself paired up with people who went on to buy cribs and diapers at some point later.

Let's go through an example to see how this works in practice.

We'll start with what's called a utility matrix. This is similar to a term-document matrix but, instead of terms and documents, we'll be representing products and users.

Here we'll assume that we have customers A-D and a set of products that they've rated on a scale from 0 to 5:

| Customer | Snarky's Potato Chips | SoSo Smooth Lotion | Duffly Beer | BetterTap Water | XXLargeLivin' Football Jersey | Snowy Cotton Balls | Disposos' Diapers |
|----------|----------------------|--------------------|-------------|-----------------|-------------------------------|--------------------|-------------------|
| A        | 4                    |                    | 5           | 3               | 5                             |                    |                   |
| B        |                      | 4                  |             | 4               |                               | 5                  |                   |
| C        | 2                    |                    | 2           |                 | 1                             |                    |                   |
| D        |                      | 5                  |             | 3               |                               | 5                  | 4                 |
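To make the examples that follow easier to trace, here is one way to hold this utility matrix in code: a sketch using pandas, with NaN marking unrated cells. The shortened column names are my own; the placement of each rating follows the table above:

```python
import numpy as np
import pandas as pd

# The utility matrix above, with NaN marking unrated products
# (column names abbreviated for readability)
products = ["Chips", "Lotion", "Beer", "Water",
            "Jersey", "Cotton Balls", "Diapers"]
ratings = pd.DataFrame(
    [[4, np.nan, 5, 3, 5, np.nan, np.nan],       # A
     [np.nan, 4, np.nan, 4, np.nan, 5, np.nan],  # B
     [2, np.nan, 2, np.nan, 1, np.nan, np.nan],  # C
     [np.nan, 5, np.nan, 3, np.nan, 5, 4]],      # D
    index=list("ABCD"), columns=products)

print(ratings)
```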

We've seen previously that, when we want to find similar items, we can use cosine similarity. Let's try that here to find the user most similar to user A. Because each user vector is sparse, with many unrated items, we'll have to impute something for those missing values. We'll just go with 0 here. We'll start by comparing user A to user B:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity(np.array([4,0,5,3,5,0,0]).reshape(1,-1),
                  np.array([0,4,0,4,0,5,0]).reshape(1,-1))

The previous code returns a cosine similarity of approximately 0.184.

As you can see, the two don't have a high similarity rating, which makes sense, as they have only a single rated product (BetterTap Water) in common.

Let's now look at user C compared to user A:

cosine_similarity(np.array([4,0,5,3,5,0,0]).reshape(1,-1), 
                  np.array([2,0,2,0,1,0,0]).reshape(1,-1)) 

The previous code returns a cosine similarity of approximately 0.885.

Here, we see that the two have a high similarity rating (remember that 1 is perfect similarity), despite having rated the same products very differently. Why are we getting these results? The problem lies with our choice of 0 for the unrated products: it registers as strong agreement on every item neither user rated, as if both felt the same (maximally negative) way about all of them. 0 isn't neutral in this case.

So, how can we fix this?

What we can do instead of just using 0 for the missing values is to re-center each user's ratings so that the mean rating is 0, or neutral. We do this by taking each user rating and subtracting the mean for all ratings of that user. For example, for user A, the mean is 17/4, or 4.25. We then subtract that from every individual rating that user A provided.

Once that's been done, we then continue on to find the mean for every other user and subtract it from each of their ratings until every user has been processed.

This procedure will result in a table like the following. You will notice each user row sums to 0 (ignore the rounding issues here):

| Customer | Snarky's Potato Chips | SoSo Smooth Lotion | Duffly Beer | BetterTap Water | XXLargeLivin' Football Jersey | Snowy Cotton Balls | Disposos' Diapers |
|----------|----------------------|--------------------|-------------|-----------------|-------------------------------|--------------------|-------------------|
| A        | -.25                 |                    | .75         | -1.25           | .75                           |                    |                   |
| B        |                      | -.33               |             | -.33            |                               | .66                |                   |
| C        | .33                  |                    | .33         |                 | -.66                          |                    |                   |
| D        |                      | .75                |             | -1.25           |                               | .75                | -.25              |
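The centering step can be sketched in a few lines of NumPy. Here the rows are users A-D, the columns follow the product order in the table above, and NaN stands in for unrated items until we have a neutral value to fill them with:

```python
import numpy as np

# Utility matrix: rows are users A-D, columns are the seven
# products in table order, NaN marks unrated items
ratings = np.array([
    [4, np.nan, 5, 3, 5, np.nan, np.nan],       # A
    [np.nan, 4, np.nan, 4, np.nan, 5, np.nan],  # B
    [2, np.nan, 2, np.nan, 1, np.nan, np.nan],  # C
    [np.nan, 5, np.nan, 3, np.nan, 5, 4],       # D
])

# Subtract each user's mean rating (computed over rated items
# only), then fill the still-missing entries with the now-neutral 0
user_means = np.nanmean(ratings, axis=1, keepdims=True)
centered = np.nan_to_num(ratings - user_means)

print(centered[0])  # A's row: -0.25, 0, 0.75, -1.25, 0.75, 0, 0
```

Each row of `centered` now sums to 0, matching the table above (without the rounding of the -.33 and .66 entries).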

Let's now try our cosine similarity on the newly centered data, again comparing user A to users B and C.

First, let's compare user A to user B:

cosine_similarity(np.array([-.25,0,.75,-1.25,.75,0,0]) 
                  .reshape(1,-1), 
                  np.array([0,-.33,0,-.33,0,.66,0]) 
                  .reshape(1,-1)) 

The preceding code returns a cosine similarity of approximately 0.308.

Now let's try between users A and C:

cosine_similarity(np.array([-.25,0,.75,-1.25,.75,0,0]) 
                  .reshape(1,-1), 
                  np.array([.33,0,.33,0,-.66,0,0]) 
                  .reshape(1,-1)) 

The preceding code returns a cosine similarity of approximately -0.246.

What we can see is that the similarity between A and B increased slightly, while the similarity between A and C decreased dramatically. This is exactly as we would hope.

This centering process, besides helping us deal with missing values, has the side benefit of correcting for tough and generous raters, since every user is now centered on a mean of 0. You may notice that this formula is equivalent to the Pearson correlation coefficient and, just like that coefficient, its values fall between -1 and 1.
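We can check that equivalence numerically. On dense vectors with no missing values, the cosine similarity of mean-centered ratings matches NumPy's Pearson correlation exactly (the two rating vectors here are made up for illustration):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two made-up, fully dense rating vectors
u = np.array([4.0, 5.0, 3.0, 5.0, 2.0])
v = np.array([3.0, 4.0, 2.0, 5.0, 1.0])

# Cosine similarity after centering each vector on its own mean...
centered_cos = cosine_similarity((u - u.mean()).reshape(1, -1),
                                 (v - v.mean()).reshape(1, -1))[0, 0]

# ...equals the Pearson correlation coefficient
pearson = np.corrcoef(u, v)[0, 1]
print(centered_cos, pearson)
```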
