Loading and pre-processing the data

Our first goal in building our recommender systems is to load the data into R, preprocess it, and convert it into a rating matrix. More precisely, in each case, we will create a realRatingMatrix object, which is the data structure that the recommenderlab package uses to store numerical ratings. We will start with the Jester dataset. If we download and unzip the archive from the website, we'll see that the file jesterfinal151cols.csv contains the ratings. Each row in this file holds the ratings made by a particular user, and each column corresponds to a particular joke.

The columns are comma-separated and there is no header row. In fact, the format is almost a rating matrix already, except that the first column is special: it contains the total number of ratings made by each user. We will load this data into a data table using the function fread(), which is a fast implementation of read.table() and efficiently loads a data file into a data table. We'll then drop the first column efficiently using the data.table syntax:

> library(data.table)
> jester <- fread("jesterfinal151cols.csv", sep = ",", header = FALSE)
> jester[, V1 := NULL]

The last line used the assignment operator := to set the first column, V1, to NULL, which is how we drop a column from a data table. We now have one final preprocessing step left to do on our data table, jester, before we are ready to convert it to a realRatingMatrix object. Specifically, we will convert the data table into a matrix and replace all occurrences of the rating 99 with NA, as 99 is the special value used to represent missing ratings:

> jester_m <- as.matrix(jester)
> jester_m <- ifelse(jester_m == 99, NA, jester_m)
> library(recommenderlab)
> jester_rrm <- as(jester_m, "realRatingMatrix")
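
The missing-value replacement can be tried out on a small matrix before running it on the full data. The following is a minimal sketch in base R, assuming a made-up 3 x 4 matrix (the names toy, toy_a, and toy_b and all the rating values are hypothetical, not taken from the Jester file). It also shows logical indexing as an alternative to ifelse():

```r
# Hypothetical toy ratings: 3 users x 4 jokes, with 99 marking "not rated".
toy <- matrix(c( 2.5, 99, -4.1, 99,
                 99,  7.0, 99,   0.3,
                -1.2, 99,  99,   5.6),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("u", 1:3), paste0("j", 1:4)))

# Approach used in the text: ifelse() returns a matrix of the same shape.
toy_a <- ifelse(toy == 99, NA, toy)

# Alternative: overwrite the matching cells of a copy via logical indexing.
toy_b <- toy
toy_b[toy_b == 99] <- NA

all.equal(toy_a, toy_b)   # compares the two results
sum(is.na(toy_a))         # number of missing ratings
rowSums(!is.na(toy_a))    # number of actual ratings per user
```

Either form leaves the real ratings untouched; the choice between them is largely a matter of taste.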

Depending on the computational resources of the computer available to us (most notably, the available memory), we may want to process a single dataset in its entirety instead of loading both datasets at once. Here, we have chosen to work with the two datasets in parallel in order to showcase the main steps in the analysis and highlight any differences or particularities of an individual dataset with respect to a particular step.

Let's move on to the MovieLens data. Downloading the MovieLens1M archive and unzipping reveals three main data files. The users.dat file contains background information about the users, such as age and gender. The movies.dat data file, in turn, contains information about the movies being rated, namely the title and a list of genres (for example, comedy) to which the movie belongs.

We are mainly interested in the ratings.dat file, which contains the ratings themselves. Unlike the raw Jester data, here each line corresponds to a single rating made by a user. Each line contains the User ID, Movie ID, rating, and timestamp, all separated by two colon characters, ::. Unfortunately, fread() requires a single-character separator, so we will specify a single colon. As a result, every double colon in the raw data produces an extra column of NA values. We will have to remove these extra columns, along with the final column, which contains the timestamp:

> movies <- fread("ratings.dat", sep = ":", header = FALSE)
> movies[, c("V2", "V4", "V6", "V7") := NULL]
> head(movies)
   V1   V3 V5
1:  1 1193  5
2:  1  661  3
3:  1  914  3
4:  1 3408  4
5:  1 2355  5
6:  1 1197  3
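
To see where the extra columns come from, it can help to parse a couple of lines by hand. The sketch below is an illustration only: it uses base R's read.table() as a stand-in for fread(), and the two sample lines in raw, as well as the names parsed and keep, are assumptions introduced here rather than part of the original code:

```r
# Two sample lines in the UserID::MovieID::Rating::Timestamp layout.
raw <- c("1::1193::5::978300760",
         "1::661::3::978302109")

# Splitting on a single colon leaves an empty field inside every "::".
parsed <- read.table(text = raw, sep = ":", header = FALSE)
ncol(parsed)   # 4 real fields plus 3 empty ones
str(parsed)    # V2, V4 and V6 contain only NA

# Keep the user, movie, and rating fields; drop the empty columns and the timestamp.
keep <- parsed[, c("V1", "V3", "V5")]
names(keep) <- c("user", "movie", "rating")
keep
```

Counting the fields this way explains why V2, V4, and V6 are the empty columns and V7 is the timestamp.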

As we can see, we are now left with three columns, where the first is the UserID, the second is the MovieID, and the last is the rating. We will now aggregate all the ratings made by each user in order to form an object that can be interpreted as, or converted to, a rating matrix. We should aggregate the data in a way that minimizes memory usage. We will do this by building a sparse matrix using the sparseMatrix() function from the Matrix package.

This package is loaded automatically when we use the recommenderlab package, as it is one of its dependencies. To build a sparse matrix using this function, we can simply specify a vector of row coordinates, a vector of matching column coordinates, and a vector with the nonzero values that fill up the sparse matrix. Remember, as our matrix is sparse, all we need are the locations and values of the entries that are nonzero.

Right now, it is slightly inconvenient that we cannot directly interpret the User IDs and Movie IDs as coordinates. This is because, if we have a user with a User ID value of 1 and a user with a User ID value of 3, using the IDs as row coordinates will also create a row for a User ID value of 2. That row will be empty, even though that user does not actually exist in the data. The situation is similar for columns. Consequently, we must first make factors out of our UserID and MovieID columns, so that their integer codes run consecutively, before proceeding to create our rating matrix as described earlier. Here is the code for building our rating matrix for the MovieLens data:

> userid_factor <- as.factor(movies[, V1])
> movieid_factor <- as.factor(movies[, V3])
> movies_sm <- sparseMatrix(i = as.numeric(userid_factor),
                            j = as.numeric(movieid_factor),
                            x = as.numeric(movies[, V5]))
> movies_rrm <- new("realRatingMatrix", data = movies_sm)
> colnames(movies_rrm) <- levels(movieid_factor)
> rownames(movies_rrm) <- levels(userid_factor)
> dim(movies_rrm)
[1] 6040 3706

It is a good exercise to check that the dimensions of the result correspond to our expectations on the number of users and movies respectively.
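
The effect of converting IDs to factors can be checked on a tiny example. The following is a sketch under stated assumptions: the vectors user, movie, and score hold made-up values, and the names raw_sm and compact_sm are hypothetical. It builds one sparse matrix from the raw IDs and one from the factor codes, then compares their dimensions:

```r
library(Matrix)

# Made-up ratings: only user IDs 1 and 3 and movie IDs 10 and 50 occur.
user  <- c(1, 1, 3)
movie <- c(10, 50, 50)
score <- c(5, 3, 4)

# Raw IDs as coordinates: the matrix extends to the largest ID on each axis,
# so it contains rows and columns for IDs that never appear in the data.
raw_sm <- sparseMatrix(i = user, j = movie, x = score)
dim(raw_sm)

# Factor codes as coordinates: one row per distinct user and
# one column per distinct movie, labelled with the original IDs.
user_f  <- as.factor(user)
movie_f <- as.factor(movie)
compact_sm <- sparseMatrix(i = as.numeric(user_f),
                           j = as.numeric(movie_f),
                           x = score,
                           dimnames = list(levels(user_f), levels(movie_f)))
dim(compact_sm)

# The dimensions should match the number of distinct users and movies.
c(length(unique(user)), length(unique(movie)))
```

The same comparison against the number of distinct IDs is a quick way to verify the dimensions reported for the full MovieLens matrix.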
