Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Understanding the data

Understanding your data is critical to all data-related work. In this recipe, we acquire and take a first look at the data that we will be using to build our recommendation engine.

Getting ready

To prepare for this recipe, and the rest of the chapter, download the MovieLens data from the GroupLens website of the University of Minnesota. You can find the data at http://grouplens.org/datasets/movielens/.

In this chapter, we will use the smaller MoveLens 100k dataset (4.7 MB in size) in order to load the entire model into the memory with ease.

How to do it…

Perform the following steps to better understand the data that we will be working with throughout this chapter:

Download the data from http://grouplens.org/datasets/movielens/. The 100K dataset is the one that you want (ml-100k.zip).
Unzip the downloaded data into the directory of your choice.

The two files that we are mainly concerned with are u.data, which contains the user movie ratings, and u.item, which contains movie information and details. To get a sense of each file, use the head command at the command prompt for Mac and Linux or the more command for Windows:

head -n 5 u.item

Tip

Note that if you are working on a computer running the Microsoft Windows operating system and not using a virtual machine (not recommended), you do not have access to the head command; instead, use the following command:

more u.item 2 n

The preceding command gives you the following output:

1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
4|Get Shorty (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Get%20Shorty%20(1995)|0|1|0|0|0|1|0|0|1|0|0|0|0|0|0|0|0|0|0
5|Copycat (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Copycat%20(1995)|0|0|0|0|0|0|1|0|1|0|0|0|0|0|0|0|1|0|0

The following command will produce the given output:

head -n 5 u.data

For Windows, you can use the following command:

more u.item 2 n
196  242  3  881250949
186  302  3  891717742
22  377  1  878887116
244  51  2  880606923
166  346  1  886397596

How it works…

The two main files that we will be using are as follows:

u.data: This contains the user moving ratings
u.item: This contains the movie information and other details

Both are character-delimited files; u.data, which is the main file, is tab delimited, and u.item is pipe delimited.

For u.data, the first column is the user ID, the second column is the movie ID, the third is the star rating, and the last is the timestamp. The u.item file contains much more information, including the ID, title, release date, and even a URL to IMDB. Interestingly, this file also has a Boolean array indicating the genre(s) of each movie, including (in order) action, adventure, animation, children, comedy, crime, documentary, drama, fantasy, film-noir, horror, musical, mystery, romance, sci-fi, thriller, war, and western.

There's more…

Free, web-scale datasets that are appropriate for building recommendation engines are few and far between. As a result, the movie lens dataset is a very popular choice for such a task but there are others as well. The well-known Netflix Prize dataset has been pulled down by Netflix. However, there is a dump of all user-contributed content from the Stack Exchange network (including Stack Overflow) available via the Internet Archive (https://archive.org/details/stackexchange). Additionally, there is a book-crossing dataset that contains over a million ratings of about a quarter million different books (http://www2.informatik.uni-freiburg.de/~cziegler/BX/).

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Understanding the data

Create new playlist

Sign In

Sign Up

Understanding the data

Getting ready

How to do it…

Tip

How it works…

There's more…

Table of Contents for
Understanding the data