Understanding your data is critical to all data-related work. In this recipe, we acquire and take a first look at the data that we will be using to build our recommendation engine.
To prepare for this recipe, and the rest of the chapter, download the MovieLens data from the GroupLens website of the University of Minnesota. You can find the data at http://grouplens.org/datasets/movielens/.
In this chapter, we will use the smaller MoveLens 100k dataset (4.7 MB in size) in order to load the entire model into the memory with ease.
Perform the following steps to better understand the data that we will be working with throughout this chapter:
ml-100k.zip
).u.data
, which contains the user movie ratings, and u.item
, which contains movie information and details. To get a sense of each file, use the head
command at the command prompt for Mac and Linux or the more
command for Windows:head -n 5 u.item
The preceding command gives you the following output:
1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0 2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0 3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0 4|Get Shorty (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Get%20Shorty%20(1995)|0|1|0|0|0|1|0|0|1|0|0|0|0|0|0|0|0|0|0 5|Copycat (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Copycat%20(1995)|0|0|0|0|0|0|1|0|1|0|0|0|0|0|0|0|1|0|0
The following command will produce the given output:
head -n 5 u.data
For Windows, you can use the following command:
more u.item 2 n 196 242 3 881250949 186 302 3 891717742 22 377 1 878887116 244 51 2 880606923 166 346 1 886397596
The two main files that we will be using are as follows:
u.data
: This contains the user moving ratingsu.item
: This contains the movie information and other detailsBoth are character-delimited files; u.data
, which is the main file, is tab delimited, and u.item
is pipe delimited.
For u.data
, the first column is the user ID, the second column is the movie ID, the third is the star rating, and the last is the timestamp. The u.item
file contains much more information, including the ID, title, release date, and even a URL to IMDB. Interestingly, this file also has a Boolean array indicating the genre(s) of each movie, including (in order) action, adventure, animation, children, comedy, crime, documentary, drama, fantasy, film-noir, horror, musical, mystery, romance, sci-fi, thriller, war, and western.
Free, web-scale datasets that are appropriate for building recommendation engines are few and far between. As a result, the movie lens dataset is a very popular choice for such a task but there are others as well. The well-known Netflix Prize dataset has been pulled down by Netflix. However, there is a dump of all user-contributed content from the Stack Exchange network (including Stack Overflow) available via the Internet Archive (https://archive.org/details/stackexchange). Additionally, there is a book-crossing dataset that contains over a million ratings of about a quarter million different books (http://www2.informatik.uni-freiburg.de/~cziegler/BX/).
3.21.39.142