So, what do we do with this data? Well, what we want to do is recommend movies for people. The way we do that is we look at all the ratings for a given person, find movies similar to the ones they rated, and treat those as candidates for recommendations to that person.
Let's start by creating a fake person to create recommendations for. I've actually already added a fake user by hand, ID number 0, to the MovieLens dataset that we're processing. You can see that user with the following code:
myRatings = userRatings.loc[0].dropna()
myRatings
This gives the following output:
That represents someone like me, who loved Star Wars and The Empire Strikes Back but hated Gone with the Wind. In other words, someone who really loves Star Wars but does not like old-style romantic dramas. I gave 5 stars to The Empire Strikes Back (1980) and Star Wars (1977), and 1 star to Gone with the Wind (1939). So, I'm going to try to find recommendations for this fictitious user.
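If you want to see what that lookup does without the full MovieLens data, here is a minimal sketch on a made-up table. The name toyRatings and every number in it are hypothetical; I'm only assuming the same layout as userRatings, with one row per user ID, one column per movie title, and NaN wherever a user has not rated a movie.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings: rows are user IDs, columns are movie titles,
# NaN means the user has not rated that movie.
toyRatings = pd.DataFrame(
    {
        "Star Wars (1977)": [5.0, 4.0, np.nan],
        "Gone with the Wind (1939)": [1.0, np.nan, 5.0],
        "Fargo (1996)": [np.nan, 5.0, 3.0],
    },
    index=pd.Index([0, 1, 2], name="user_id"),
)

# Select user 0's row by label, then drop the movies they never rated
toyMyRatings = toyRatings.loc[0].dropna()
print(toyMyRatings)
```

The result is a Series indexed by movie title that holds only the movies user 0 actually rated, which is the shape the rest of this walkthrough relies on.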
So, how do I do that? Well, let's start by creating a series called simCandidates and I'm going to go through every movie that I rated.
simCandidates = pd.Series(dtype=float)
for i in range(0, len(myRatings.index)):
    print("Adding sims for " + myRatings.index[i] + "...")
    # Retrieve similar movies to this one that I rated
    sims = corrMatrix[myRatings.index[i]].dropna()
    # Now scale its similarity by how well I rated this movie
    # (.iloc[i] looks up my i-th rating by position)
    sims = sims.map(lambda x: x * myRatings.iloc[i])
    # Add the score to the list of similarity candidates
    # (pd.concat replaces Series.append, which was removed in pandas 2.0)
    simCandidates = pd.concat([simCandidates, sims])

# Glance at our results so far:
print("sorting...")
simCandidates.sort_values(inplace=True, ascending=False)
print(simCandidates.head(10))
For i in range 0 through the number of ratings that I have in myRatings, I am going to collect the movies similar to each one that I rated. So, I'm going to take that corrMatrix DataFrame, that magical one that has all of the movie similarities, pull out the column of correlation scores for the movie I rated, drop any missing values, and then scale those correlation scores by how well I rated that movie.
So, the idea here is I'm going to go through all the similarities for The Empire Strikes Back, for example, and I will scale it all by 5, because I really liked The Empire Strikes Back. But, when I go through and get the similarities for Gone with the Wind, I'm only going to scale those by 1, because I did not like Gone with the Wind. So, this will give more strength to movies that are similar to movies that I liked, and less strength to movies that are similar to movies that I did not like, okay?
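To make the scaling concrete, here is a small sketch on invented numbers. The toyCorr matrix and toyMyRatings below are hypothetical stand-ins for corrMatrix and myRatings, and the similarity values are made up purely for illustration.

```python
import pandas as pd

titles = [
    "Star Wars (1977)",
    "Return of the Jedi (1983)",
    "Gone with the Wind (1939)",
]

# Hypothetical similarity scores between movies (made up for illustration)
toyCorr = pd.DataFrame(
    [[1.0, 0.75, -0.25],
     [0.75, 1.0, -0.5],
     [-0.25, -0.5, 1.0]],
    index=titles,
    columns=titles,
)

# Hypothetical ratings: loved Star Wars, disliked Gone with the Wind
toyMyRatings = pd.Series(
    {"Star Wars (1977)": 5.0, "Gone with the Wind (1939)": 1.0}
)

scaled = []
for title, rating in toyMyRatings.items():
    # Similarities of every movie to the one I rated
    sims = toyCorr[title].dropna()
    # Scale each similarity by my rating for that movie
    scaled.append(sims * rating)

# Stack all scaled scores into one Series and sort, best first
toyCandidates = pd.concat(scaled).sort_values(ascending=False)
print(toyCandidates)
```

With these numbers, Return of the Jedi picks up 0.75 * 5 = 3.75 from Star Wars but only -0.5 * 1 = -0.5 from Gone with the Wind. Notice that under this scheme a 1-star rating still adds a small positive score to movies similar to the one I disliked, rather than penalizing them; it just counts for much less than a 5-star rating.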
So, I just go through and build up this list of similarity candidates, recommendation candidates if you will, sort the results and print them out. Let's see what we get:
Hey, those don't look too bad, right? Obviously, The Empire Strikes Back (1980) and Star Wars (1977) come out on top, because those are movies I explicitly said I liked; I already watched them and rated them. But bubbling up to the top of the list are Return of the Jedi (1983), which we would expect, and Raiders of the Lost Ark (1981).
Let's start to refine these results a little bit more. We're seeing that we're getting duplicate values back. If a movie is similar to more than one movie that I rated, it will come back more than once in the results, so we want to combine those entries. When the same movie shows up several times, maybe its scores should be added together into a combined, stronger recommendation score. Return of the Jedi, for example, was similar to both Star Wars and The Empire Strikes Back. How would we do that?
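One possible way to do it, sketched here on made-up scores, is to group the candidates by movie title and sum each group. The toyCandidates Series and its values are hypothetical, and I'm assuming, as above, that the Series is indexed by movie title.

```python
import pandas as pd

# Hypothetical candidate scores in which one title appears twice
toyCandidates = pd.Series(
    [5.0, 3.75, 3.5, 1.0],
    index=[
        "Star Wars (1977)",
        "Return of the Jedi (1983)",
        "Return of the Jedi (1983)",
        "Gone with the Wind (1939)",
    ],
)

# Group rows that share a title and add their scores together
combined = toyCandidates.groupby(toyCandidates.index).sum()

# Sort so the strongest combined recommendation comes first
combined = combined.sort_values(ascending=False)
print(combined)
```

Here the two Return of the Jedi entries merge into a single score of 7.25, which moves it ahead of Star Wars at 5.0.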