Let's figure out what went wrong with our movie similarities there. We went through all this exciting work to compute correlation scores between movies based on their user ratings vectors, and the results we got kind of sucked. So, just to remind you, we looked for movies that are similar to Star Wars using that technique, and we ended up with a bunch of weird recommendations at the top that had a perfect correlation.
Most of them were very obscure movies. So, what do you think might be going on there? Well, one explanation is this: lots of people watched Star Wars, but only one or two of them also watched some other obscure film. If those few people happened to rate both movies in a similar way, we'd end up with a perfect-looking correlation between the two, but at the end of the day, do we really want to base our recommendations on the behavior of one or two people who watched some obscure movie?
Probably not! I mean yes, the two people in the world, or whatever it is, who watched the movie Full Speed, and both liked it in addition to Star Wars, maybe that is a good recommendation for them, but it's probably not a good recommendation for the rest of the world. We need to have some sort of confidence level in our similarities by enforcing a minimum number of people who rated a given movie. We can't make a judgment that a given movie is similar to another just based on the behavior of one or two people.
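To see why a handful of shared viewers produces those perfect scores, here is a minimal sketch with made-up ratings (the numbers are hypothetical, not taken from the dataset). When only two people have rated both films, and their ratings differ, the Pearson correlation can only come out as +1 or -1, because any two points sit exactly on a straight line. Adding a third viewer is enough to break that.

```python
import numpy as np

# Hypothetical ratings from the only two users who rated both films.
star_wars_ratings = np.array([5.0, 4.0])
obscure_film_ratings = np.array([4.0, 2.0])

# Pearson correlation between the two rating vectors.
two_user_corr = np.corrcoef(star_wars_ratings, obscure_film_ratings)[0, 1]
print(round(two_user_corr, 3))

# Add a third hypothetical viewer who disagrees, and the score is no longer perfect.
three_user_corr = np.corrcoef([5.0, 4.0, 5.0], [4.0, 2.0, 1.0])[0, 1]
print(round(three_user_corr, 3))
```

So a perfect correlation built on two viewers tells us almost nothing, which is why a minimum number of ratings is a sensible first filter.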
So, let's try to put that insight into action using the following code:
import numpy as np

# For each title, compute the number of ratings (size) and the average rating (mean)
movieStats = ratings.groupby('title').agg({'rating': [np.size, np.mean]})
movieStats.head()
What we're going to do is try to identify the movies that weren't actually rated by many people and we'll just throw them out and see what we get. So, to do that we're going to take our original ratings DataFrame and we're going to say groupby('title'), again Pandas has all sorts of magic in it. And, this will basically construct a new DataFrame that aggregates together all the rows for a given title into one row.
We can say that we want to aggregate specifically on the rating, and we want to show both the size, which is the number of ratings for each movie, and the mean, which is the average rating for that movie. So, when we do that, we end up with something like the following:
This is telling us, for example, for the movie 101 Dalmatians (1996), 109 people rated that movie and their average rating was 2.9 stars, so not that great of a score really! So, if we just eyeball this data, we can say okay well, movies that I consider obscure, like 187 (1997), had 41 ratings, but 101 Dalmatians (1996), I've heard of that, you know 12 Angry Men (1957), I've heard of that. It seems like there's sort of a natural cutoff value at around 100 ratings, where maybe that's the magic value where things start to make sense.
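If you want to see the aggregation step in isolation, here is a small sketch on a made-up table. The titles and ratings are hypothetical, and it passes the string names 'size' and 'mean' to agg rather than the NumPy functions, which is an assumption on my part about what reads most cleanly in current Pandas.

```python
import pandas as pd

# Hypothetical toy ratings; the real ratings DataFrame has these two columns plus others.
toy_ratings = pd.DataFrame({
    'title': ['Film A', 'Film A', 'Film A', 'Film B'],
    'rating': [4, 5, 3, 5],
})

# One row per title, holding the number of ratings and their mean.
toy_stats = toy_ratings.groupby('title').agg({'rating': ['size', 'mean']})
print(toy_stats)
```

Notice that Film B ends up with a perfect mean rating from a single viewer, which is exactly the kind of row we want to be suspicious of.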
Let's go ahead and get rid of movies rated by fewer than 100 people, and yes, you know I'm kind of doing this intuitively at this point. As we'll talk about later, there are more principled ways of doing this, where you could actually experiment and do train/test experiments on different threshold values, to find the one that actually performs the best. But initially, let's just use our common sense and filter out movies that were rated by fewer than 100 people. Again, Pandas makes that really easy to do. Let's figure it out with the following example:
# Boolean mask: True for titles with at least 100 ratings
popularMovies = movieStats['rating']['size'] >= 100
# Keep only those titles and show the 15 with the highest mean rating
movieStats[popularMovies].sort_values([('rating', 'mean')], ascending=False)[:15]
Here, popularMovies is a Boolean Series that is True for every row of movieStats where the rating size is greater than or equal to 100. Indexing movieStats with it keeps only those rows, and I'm then going to sort that by mean rating, just for fun, to see the top rated, widely watched movies.
What we have here is a list of movies that were rated by at least 100 people, sorted by their average rating score, and this in itself is a recommender system. These are highly-rated popular movies. A Close Shave (1995), apparently, was a really good movie and a lot of people watched it and they really liked it.
So again, this is a very old dataset, from the late 90s, so even though you might not be familiar with the film A Close Shave (1995), it might be worth going back and rediscovering it; add it to your Netflix! Schindler's List (1993) is not a big surprise there; it comes up at the top of most top movies lists. The Wrong Trousers (1993) is another example of an obscure film that apparently was really good and was also pretty popular. So, some interesting discoveries there already, just by doing that.
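The filtering step above is easy to try on its own. This is a minimal sketch on a hypothetical statistics table that has the same two-level columns as movieStats; the titles and numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical per-movie statistics with the same two-level columns as movieStats.
toy_stats = pd.DataFrame(
    {('rating', 'size'): [150, 3, 220], ('rating', 'mean'): [3.9, 5.0, 4.4]},
    index=pd.Index(['Film A', 'Film B', 'Film C'], name='title'),
)

# A Boolean Series: True for titles with at least 100 ratings.
is_popular = toy_stats['rating']['size'] >= 100

# Keep only the True rows, then sort by mean rating, best first.
toy_top = toy_stats[is_popular].sort_values([('rating', 'mean')], ascending=False)
print(toy_top)
```

Film B has the best mean rating in the toy table, but with only three ratings it gets thrown out before the sort ever sees it.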
Things look a little bit better now, so let's go ahead and make our new DataFrame of Star Wars recommendations, movies similar to Star Wars, where we only base it on movies that appear in this new DataFrame. So, we're going to use the join operation to join our original similarMovies DataFrame to this new DataFrame of only movies that have at least 100 ratings:
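# Selecting ['rating'] flattens the columns to plain 'size' and 'mean';
# newer pandas versions refuse to join frames whose column levels differ.
popularStats = movieStats[popularMovies]['rating']
df = popularStats.join(pd.DataFrame(similarMovies, columns=['similarity']))
df.head()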
In this code, we wrap similarMovies in a new DataFrame with a single similarity column, join it onto the movieStats rows that passed our popularMovies filter, and look at the combined results. Because join keeps only the titles on the left-hand side, the obscure movies drop out. And, there we go with that output!
Now we have, restricted only to movies that are rated by at least 100 people, the similarity score to Star Wars. So, now all we need to do is sort that using the following code:
df.sort_values(['similarity'], ascending=False)[:15]
Here, we're reverse sorting it and we're just going to take a look at the first 15 results. If you run that now, you should see the following:
This is starting to look a little bit better! So, Star Wars (1977) comes out on top because it's similar to itself, The Empire Strikes Back (1980) is number 2, Return of the Jedi (1983) is number 3, and Raiders of the Lost Ark (1981) is number 4. You know, it's still not perfect, but these make a lot more sense, right? You would expect the three Star Wars films from the original trilogy to be similar to each other (this data goes back to before the next three films), and Raiders of the Lost Ark (1981) is also a very similar movie to Star Wars in style, so it makes sense that it comes out as number 4. So, I'm starting to feel a little bit better about these results. There's still room for improvement, but hey! We got some results that make sense, whoo-hoo!
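If the join and sort steps feel a bit abstract, here is a small sketch that reproduces them on invented data. The titles, counts, and similarity scores are all hypothetical, and the statistics table is assumed to have plain size and mean columns to keep things simple.

```python
import pandas as pd

# Hypothetical stats for movies that passed the ratings threshold (flat columns).
toy_popular = pd.DataFrame(
    {'size': [150, 220, 130], 'mean': [3.9, 4.4, 3.2]},
    index=pd.Index(['Film A', 'Film C', 'Film D'], name='title'),
)

# Hypothetical similarity scores; note that 'Film B' is absent from toy_popular.
toy_similar = pd.Series({'Film A': 0.75, 'Film B': 1.0, 'Film C': 0.4, 'Film D': 0.1})

# join() is a left join on the index, so only titles in toy_popular survive.
toy_df = toy_popular.join(pd.DataFrame({'similarity': toy_similar}))

# Highest similarity first.
toy_ranked = toy_df.sort_values('similarity', ascending=False)
print(toy_ranked)
```

Film B has the highest similarity score of all, but because it never made it into the popular table, the join quietly leaves it out.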
Now, ideally, we'd also filter out Star Wars itself, since you don't want to be looking at similarities to the movie that you started from, but we'll worry about that later! If you want to play with this a little bit more, remember that 100 was sort of an arbitrary cutoff for the minimum number of ratings. If you do want to experiment with different cutoff values, I encourage you to go back and do so, and see what that does to the results. You can see in the preceding table that the results that we really like actually had many more than 100 ratings. Meanwhile, Austin Powers: International Man of Mystery (1997) comes in pretty high with only 130 ratings, so maybe 100 isn't high enough! Pinocchio (1940) snuck in at 101 and is not very similar to Star Wars, so you might want to consider an even higher threshold there and see what it does.
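One way to set up that kind of experiment is sketched below. The top_similar helper is hypothetical (it is not part of Pandas or of the code above), the data is invented, and it assumes a joined table with plain size, mean, and similarity columns. It applies a cutoff, drops the movie we started from, and returns the best matches, so you can compare cutoffs side by side.

```python
import pandas as pd

def top_similar(scored, seed_title, min_ratings, n=15):
    """Hypothetical helper: rank titles by similarity, skipping the seed movie."""
    # Keep only titles with at least min_ratings ratings.
    kept = scored[scored['size'] >= min_ratings]
    # Remove the movie we started from, if it is present.
    kept = kept.drop(index=seed_title, errors='ignore')
    return kept.sort_values('similarity', ascending=False).head(n)

# Hypothetical joined table: rating count, mean rating, similarity to the seed.
toy_scored = pd.DataFrame(
    {'size': [580, 360, 130, 101],
     'mean': [4.4, 4.2, 3.2, 3.7],
     'similarity': [1.0, 0.75, 0.38, 0.35]},
    index=pd.Index(['Seed Film', 'Sequel', 'Comedy', 'Cartoon'], name='title'),
)

# Compare what survives at two different cutoffs.
for cutoff in (100, 200):
    print(cutoff, list(top_similar(toy_scored, 'Seed Film', cutoff).index))
```

Raising the cutoff trades coverage for confidence: fewer movies survive, but the ones that do are backed by more ratings.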
Now let's move on and actually do full-blown item-based collaborative filtering, where we recommend movies to people using a more complete system. We'll do that next.