Understanding the code

The first thing we're going to do is import the u.data file, which is part of the MovieLens dataset. It's a tab-delimited file that contains every rating in the dataset.

import pandas as pd

# u.data is tab-delimited; keep only its first three columns.
r_cols = ['user_id', 'movie_id', 'rating']
ratings = pd.read_csv('e:/sundog-consult/packt/datascience/ml-100k/u.data',
                      sep='\t', names=r_cols, usecols=range(3))

Note that you'll need to change the path here to wherever you stored the downloaded MovieLens files on your computer. Even though we're calling read_csv on Pandas, we can specify a separator other than a comma. In this case, it's a tab.

We're basically saying: take the first three columns in the u.data file and import them into a new DataFrame with three columns, user_id, movie_id, and rating.
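To see the separator and column selection at work without the real file, here is a minimal sketch that reads a few tab-delimited rows from memory. The sample rows and the names sample and sample_ratings are made up for illustration, and the fourth column (a timestamp) is an assumption about the file layout that usecols=range(3) is there to skip.

```python
import io
import pandas as pd

# Illustrative rows only: user id, movie id, rating, and an assumed
# trailing timestamp column, separated by tabs.
sample = io.StringIO(
    "196\t242\t3\t881250949\n"
    "186\t302\t3\t891717742\n"
    "22\t377\t1\t878887116\n"
)

r_cols = ['user_id', 'movie_id', 'rating']

# sep='\t' switches read_csv from commas to tabs; usecols=range(3)
# keeps columns 0, 1 and 2 and drops everything after them.
sample_ratings = pd.read_csv(sample, sep='\t', names=r_cols, usecols=range(3))
print(sample_ratings)
```

Passing names also tells Pandas that the file has no header row, so the first line is treated as data.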

What we end up with here is a DataFrame that has one row for every rating. Each row has a user_id, which identifies some person, a movie_id, which is some numerical shorthand for a given movie (so Star Wars might be movie 53 or something), and that person's rating of the movie, from 1 to 5 stars. So, we have here a DataFrame of every user and every movie they rated.

Now, we want to be able to work with movie titles, so we can interpret these results more intuitively, so we're going to use their human-readable names instead.

If you're using a truly massive dataset, you'd save that to the end because you want to be working with numbers, they're more compact, for as long as possible. For the purpose of example and teaching, though, we'll keep the titles around so you can see what's going on.

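# u.item is pipe-delimited; keep only the movie ID and title columns.
# The titles contain non-UTF-8 characters, so on Python 3 an explicit
# encoding is needed to avoid a UnicodeDecodeError.
m_cols = ['movie_id', 'title']
movies = pd.read_csv('e:/sundog-consult/packt/datascience/ml-100k/u.item',
                     sep='|', names=m_cols, usecols=range(2),
                     encoding='ISO-8859-1')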

There's a separate data file in the MovieLens dataset called u.item. It is pipe-delimited, and the first two columns, which are the ones we import, are the movie_id and the title of that movie. So, now we have two DataFrames: ratings has all the user ratings and movies has the title for every movie_id. We can then use the magical merge function in Pandas to mush it all together.

ratings = pd.merge(movies, ratings) 

Let's add a ratings.head() command and then run those cells. What we end up with is something like the following table. That was pretty quick!

We end up with a new DataFrame that contains the user_id and rating for each movie that a user rated, and we have both the movie_id and the title that we can read and see what it really is. So, the way to read this is user_id number 308 rated the Toy Story (1995) movie 4 stars, user_id number 287 rated the Toy Story (1995) movie 5 stars, and so on and so forth. And, if we were to keep looking at more and more of this DataFrame, we'd see different ratings for different movies as we go through it.
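The merge works because the two DataFrames share a movie_id column. Here is a small sketch of that behavior on toy data; the frames movies_demo and ratings_demo and most of their values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-ins for the movies and ratings DataFrames (hypothetical values).
movies_demo = pd.DataFrame({
    'movie_id': [1, 2],
    'title': ['Toy Story (1995)', 'GoldenEye (1995)'],
})
ratings_demo = pd.DataFrame({
    'user_id': [308, 287, 308],
    'movie_id': [1, 1, 2],
    'rating': [4, 5, 3],
})

# With no `on=` argument, merge joins on the columns the two frames have
# in common (here only movie_id) and does an inner join by default.
merged_demo = pd.merge(movies_demo, ratings_demo)
print(merged_demo)

# Naming the key explicitly gives the same result and documents the intent.
explicit_demo = pd.merge(movies_demo, ratings_demo, on='movie_id', how='inner')
```

Because the join is an inner join, a movie with no ratings, or a rating whose movie_id has no title, would simply not appear in the result.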

Now the real magic of Pandas comes in. So, what we really want is to look at relationships between movies based on all the users that watched each pair of movies, so we need, at the end, a matrix of every movie, and every user, and all the ratings that every user gave to every movie. The pivot_table command in Pandas can do that for us. It can basically construct a new table from a given DataFrame, pretty much any way that you want it. For this, we can use the following code:

movieRatings = ratings.pivot_table(index=['user_id'],
                                   columns=['title'], values='rating')
movieRatings.head()

So, what we're saying with this code is: take our ratings DataFrame and create a new DataFrame called movieRatings. We want its index to be the user IDs, so we'll have a row for every user_id, and we want its columns to be the movie titles, so we'll have a column for every title that we encounter in that DataFrame. Each cell will contain the rating value, if it exists. So, let's go ahead and run it.

And, we end up with a new DataFrame that looks like the following table:

It's kind of amazing how that just put it all together for us. Now, you'll see some NaN values. NaN stands for Not a Number, and it's just how Pandas indicates a missing value. So, the way to interpret this is that user_id number 1, for example, did not rate the movie 1-900 (1994), but user_id number 1 did rate 101 Dalmatians (1996) and gave it 2 stars. The user_id number 1 also rated 12 Angry Men (1957) and gave it 5 stars, but did not rate the movie 2 Days in the Valley (1996). So, what we end up with here is basically a sparse matrix that contains every user and every movie, and at every intersection where a user rated a movie, there's a rating value.
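The same reshaping can be tried on a handful of rows to see where the NaN values come from. This is only a sketch: long_demo, wide_demo, and the placeholder titles Movie A and Movie B are made up for illustration.

```python
import pandas as pd

# One row per rating, like the merged ratings DataFrame (hypothetical values).
long_demo = pd.DataFrame({
    'user_id': [1, 1, 2, 3],
    'title': ['Movie A', 'Movie B', 'Movie A', 'Movie B'],
    'rating': [5, 2, 4, 3],
})

# One row per user, one column per title. Cells with no matching rating
# are filled with NaN, which also turns the ratings into floats.
wide_demo = long_demo.pivot_table(index='user_id', columns='title',
                                  values='rating')
print(wide_demo)

# Count the missing ratings per movie.
print(wide_demo.isna().sum())
```

Note that pivot_table aggregates with the mean by default, so if the same user had rated the same title more than once, the cell would hold the average of those ratings.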

So, you can see now that we can very easily extract vectors of every movie that a given user rated, and we can also extract vectors of every user that rated a given movie, which is what we want. That's useful for both user-based and item-based collaborative filtering. If I wanted to find relationships between users, I could look at correlations between the user rows, but if I want to find correlations between movies, for item-based collaborative filtering, I can look at correlations between columns based on the user behavior. This is where flipping things on their head, user-based versus item-based similarities, really comes into play.
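As a rough illustration of the row-versus-column idea, the sketch below builds a tiny user-by-movie matrix and computes both kinds of similarity. The data and the names matrix_demo, item_similarity, and user_similarity are hypothetical, and using DataFrame.corr() here is an assumption made for the example rather than the approach this chapter goes on to use.

```python
import numpy as np
import pandas as pd

# Tiny user-by-movie matrix with some missing ratings (hypothetical values).
matrix_demo = pd.DataFrame(
    {'Movie A': [5.0, 4.0, np.nan, 1.0],
     'Movie B': [4.0, 5.0, 2.0, np.nan],
     'Movie C': [1.0, 2.0, 5.0, 4.0]},
    index=pd.Index([1, 2, 3, 4], name='user_id'),
)

# Item-based view: a column is one movie's ratings across all users.
movie_vector = matrix_demo['Movie A']

# User-based view: a row is one user's ratings across all movies.
user_vector = matrix_demo.loc[1]

# corr() correlates columns pairwise, using only the rows where both
# columns have a value, so this is movie-to-movie similarity.
item_similarity = matrix_demo.corr()

# Transposing first makes the users the columns, giving user-to-user similarity.
user_similarity = matrix_demo.T.corr()

print(item_similarity)
print(user_similarity)
```

With so few overlapping ratings, some of these correlations come out as exactly 1, -1, or NaN, which is a reminder that similarity scores built on a small number of shared ratings should not be trusted much.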

Now, we're going with item-based collaborative filtering, so we want to extract columns. Let's run the following code to extract the ratings from all the users who rated Star Wars (1977):

starWarsRatings = movieRatings['Star Wars (1977)']
starWarsRatings.head()


And, we can see most people have, in fact, watched and rated Star Wars (1977) and everyone liked it, at least in this little sample that we took from the head of the DataFrame. So, we end up with a resulting set of user IDs and their ratings for Star Wars (1977). The user ID 3 did not rate Star Wars (1977) so we have a NaN value, indicating a missing value there, but that's okay. We want to make sure that we preserve those missing values so we can directly compare columns from different movies. So, how do we do that?
