Understanding the Jokes recommendation problem and the dataset

Dr. Ken Goldberg and his colleagues Theresa Roeder, Dhruv Gupta, and Chris Perkins introduced this dataset in their paper Eigentaste: A Constant Time Collaborative Filtering Algorithm, which is popular in the recommender-systems domain. The dataset is named the Jester's jokes dataset. To create it, a number of users were presented with several jokes and asked to rate them. The ratings these users gave to the various jokes form the dataset. The data was collected between April 1999 and May 2003. The following are the attributes of the dataset:

  • Over 11,000,000 ratings of 150 jokes from 79,681 users
  • Each row is a user (Row 1 = User #1)
  • Each column is a joke (Column 1 = Joke #1)
  • Ratings are given as real values from -10.00 to +10.00; -10 being the lowest possible rating and 10 being the highest
  • 99 corresponds to a null rating
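Because 99 lies outside the valid rating range, it has to be recoded before any statistics are computed; otherwise it would inflate every average. The following is a minimal sketch in base R, assuming the raw file has already been read into a numeric matrix. The toy matrix raw and its values are made up for illustration:

```r
# Hypothetical 3 x 4 excerpt of the raw layout: rows are users, columns are jokes.
raw <- matrix(c( 7.91, 99.00, -3.20,  5.34,
                99.00,  1.21, 99.00, -8.74,
                 0.10,  4.17,  4.90, 99.00),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("u", 1:3), paste0("j", 1:4)))

# Recode the null-rating sentinel (99) as NA so that it is ignored in summaries.
ratings <- raw
ratings[ratings == 99] <- NA

# Number of jokes each user actually rated, and the mean rating per joke.
ratings_per_user <- rowSums(!is.na(ratings))
mean_per_joke <- colMeans(ratings, na.rm = TRUE)
print(ratings_per_user)
print(mean_per_joke)
```

The subset shipped with recommenderlab already handles missing ratings, so this step only matters when working from the raw files.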

The recommenderlab package in R provides a subset of the original dataset released by Dr. Ken Goldberg's group. We will use this subset for the projects covered in this chapter.

The Jester5k dataset provided in the recommenderlab library contains a 5,000 x 100 rating matrix (5,000 users and 100 jokes) with ratings between -10.00 and +10.00. All selected users have rated 36 or more jokes. The dataset is in the realRatingMatrix format, a special matrix format that recommenderlab expects the data to be in before the functions packaged in the library can be applied to it.
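To make the realRatingMatrix format more concrete, here is a minimal sketch of how an ordinary matrix can be coerced into it. The toy matrix m and its values are made up for illustration; only the coercion with as() and the accessor functions come from recommenderlab:

```r
library(recommenderlab)

# Hypothetical toy ratings: 3 users x 4 jokes; NA marks a joke the user did not rate.
m <- matrix(c( 7.91,    NA, -3.20,  5.34,
                 NA,  1.21,    NA, -8.74,
               0.10,  4.17,  4.90,    NA),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("u", 1:3), paste0("j", 1:4)))

# Coerce the base matrix to the format recommenderlab works with.
r <- as(m, "realRatingMatrix")

print(dim(r))        # users x jokes
print(nratings(r))   # NA entries are not counted as ratings
print(rowCounts(r))  # number of ratings per user
```

Jester5k is already stored in this format, so no conversion is needed for the dataset used in this chapter.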

As we are already aware, exploratory data analysis (EDA) is the first step for any data science project. Going by this principle, let's begin by reading the data, and then proceed with the EDA step on the dataset:

# including the required libraries
library(data.table)
library(recommenderlab)
# setting the seed so as to reproduce the results
set.seed(54)
# loading the Jester5k dataset into the session
data(Jester5k)
str(Jester5k)

This will result in the following output:

Formal class 'realRatingMatrix' [package "recommenderlab"] with 2 slots
..@ data :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
.. .. ..@ i : int [1:362106] 0 1 2 3 4 5 6 7 8 9 ...
.. .. ..@ p : int [1:101] 0 3314 6962 10300 13442 18440 22513 27512 32512 35685 ...
.. .. ..@ Dim : int [1:2] 5000 100
.. .. ..@ Dimnames:List of 2
.. .. .. ..$ : chr [1:5000] "u2841" "u15547" "u15221" "u15573" ...
.. .. .. ..$ : chr [1:100] "j1" "j2" "j3" "j4" ...
.. .. ..@ x : num [1:362106] 7.91 -3.2 -1.7 -7.38 0.1 0.83 2.91 -2.77 -3.35 -1.99 ...
.. .. ..@ factors : list()
..@ normalize: NULL
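
The i, p, and x slots in this output belong to the column-compressed sparse format of the Matrix package: x holds the stored ratings, i holds the zero-based row index of each stored rating, and p holds the column pointers, which is why p has 101 entries for 100 jokes. The second entry of p, 3314, therefore says that 3,314 users rated joke j1. The following sketch reproduces the same layout on a made-up 3 x 3 matrix; the values are hypothetical:

```r
library(Matrix)

# Hypothetical 3 x 3 example: four stored ratings, everything else is absent.
m <- sparseMatrix(i = c(1, 3, 2, 3),
                  j = c(1, 1, 2, 3),
                  x = c(7.91, -3.20, 1.55, 4.90),
                  dims = c(3, 3))

print(class(m))   # column-compressed sparse matrix class
print(m@i)        # zero-based row index of each stored value
print(m@p)        # column pointers: one more entry than there are columns
print(m@x)        # the stored values, column by column
print(diff(m@p))  # number of stored values in each column
```

In a realRatingMatrix, entries that are absent from the sparse matrix are treated as missing ratings rather than as ratings of zero.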

The structure output confirms the details we have already discussed: a matrix of 5,000 users by 100 jokes, holding 362,106 ratings. Let's continue with our EDA:

# Viewing the first 5 records in the dataset
head(getRatingMatrix(Jester5k),5)

This will result in the following output:

5 x 100 sparse Matrix of class "dgCMatrix"
[[ suppressing 100 column names ‘j1’, ‘j2’, ‘j3’ ... ]]
u2841 7.91 9.17 5.34 8.16 -8.74 7.14 8.88 -8.25 5.87 6.21 7.72 6.12 -0.73 7.77 -5.83 -8.88 8.98
u15547 -3.20 -3.50 -9.56 -8.74 -6.36 -3.30 0.78 2.18 -8.40 -8.79 -7.04 -6.02 3.35 -4.61 3.64 -6.41 -4.13
u15221 -1.70 1.21 1.55 2.77 5.58 3.06 2.72 -4.66 4.51 -3.06 2.33 3.93 0.05 2.38 -3.64 -7.72 0.97
u15573 -7.38 -8.93 -3.88 -7.23 -4.90 4.13 2.57 3.83 4.37 3.16 -4.90 -5.78 -5.83 2.52 -5.24 4.51 4.37
u21505 0.10 4.17 4.90 1.55 5.53 1.50 -3.79 1.94 3.59 4.81 -0.68 -0.97 -6.46 -0.34 -2.14 -2.04 -2.57
u2841 -9.32 -9.08 -9.13 7.77 8.59 5.29 8.25 6.02 5.24 7.82 7.96 -8.88 8.25 3.64 -0.73 8.25 5.34 -7.77
u15547 -0.15 -1.84 -1.84 1.84 -1.21 -8.59 -5.19 -2.18 0.19 2.57 -5.78 1.07 -8.79 3.01 2.67 -9.22 -9.32 3.69
u15221 2.04 1.94 4.42 1.17 0.10 -5.10 -3.25 3.35 3.30 -1.70 3.16 -0.29 1.36 3.54 6.17 -2.72 3.11 4.81
u15573 4.95 5.49 -0.49 3.40 -2.14 5.29 -3.11 -4.56 -5.44 -6.89 -0.24 -5.15 -3.59 -8.20 2.18 0.39 -1.21 -2.62
u21505 -0.15 2.43 3.16 1.50 4.37 -0.10 -2.14 3.98 2.38 6.84 -0.68 0.87 3.30 6.21 5.78 -6.21 -0.78 -1.36

Next, we will check the total number of ratings in the dataset:

## number of ratings
print(nratings(Jester5k))

This will result in the following output:

[1] 362106

We will print a summary of the number of ratings per user using the following command:

## number of ratings per user
print(summary(rowCounts(Jester5k)))

This will result in the following output:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  36.00   53.00   72.00   72.42  100.00  100.00
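The mean in this summary follows directly from the counts we have already seen, and the same counts tell us how densely the rating matrix is filled. A short worked calculation using only the figures reported above (the variable names are our own):

```r
# Figures taken from the outputs shown earlier in this section.
n_users   <- 5000
n_jokes   <- 100
n_ratings <- 362106

# Average number of ratings per user; matches the Mean column of the summary.
mean_per_user <- n_ratings / n_users
print(mean_per_user)

# Share of the user x joke cells that hold a rating, as a percentage.
density <- n_ratings / (n_users * n_jokes)
print(round(100 * density, 1))
```

Roughly 72 percent of the cells are filled, which is unusually dense for a ratings dataset; the median of 72 and the third quartile of 100 show that at least a quarter of the users rated every joke.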

We will now plot a histogram of the ratings:

## rating distribution
hist(getRatings(Jester5k), main="Distribution of ratings")

This will result in the following output:

The histogram shows a somewhat normal distribution. It also shows that positive ratings outnumber negative ratings.
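The visual impression from the histogram can be checked numerically. The following is a small sketch that computes the share of ratings above and below zero; the variable names are our own, and no output is shown here because it has not been reproduced from the source:

```r
library(recommenderlab)
data(Jester5k)

# All observed ratings as a plain numeric vector.
r <- getRatings(Jester5k)

# Share of ratings above and below zero.
share_positive <- mean(r > 0)
share_negative <- mean(r < 0)
print(c(positive = share_positive, negative = share_negative))

# Central tendency and spread of the ratings.
print(summary(r))
```

If positive ratings do outnumber negative ones, share_positive will be the larger of the two values and the median reported by summary() will be above zero.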

The Jester5k dataset also provides a character vector called JesterJokes. The vector is of length 100 and holds the text of the 100 jokes in the rating matrix. We can examine the jokes with the following command:

head(JesterJokes,5)

This will result in the following output:

j1 "A man visits the doctor. The doctor says "I have bad news for you.You have cancer and Alzheimer's disease". The man replies "Well,thank God I don't have cancer!""
j2 "This couple had an excellent relationship going until one day he came home from work to find his girlfriend packing. He asked her why she was leaving him and she told him that she had heard awful things about him. "What could they possibly have said to make you move out?" "They told me that you were a pedophile." He replied, "That's an awfully big word for a ten year old.""
j3 "Q. What's 200 feet long and has 4 teeth? A. The front row at a Willie Nelson Concert."
j4 "Q. What's the difference between a man and a toilet? A. A toilet doesn't follow you around after you use it."
j5 "Q. What's O. J. Simpson's Internet address? A. Slash, slash, backslash, slash, slash, escape."

Based on the ratings from the 5,000 users, we can perform additional EDA to identify the joke that users rated best, that is, the joke with the highest average rating. This can be done through the following code:

## 'best' joke with highest average rating
best <- which.max(colMeans(Jester5k))
cat(JesterJokes[best])

This will result in the following output:

A guy goes into confession and says to the priest, "Father, I'm 80 years old, widower, with 11 grandchildren. Last night I met two beautiful flight attendants. They took me home and I made love to both of them. Twice." The priest said: "Well, my son, when was the last time you were in confession?" "Never Father, I'm Jewish." "So then, why are you telling me?" "I'm telling everybody."

We could perform additional EDA, including univariate and multivariate analysis and visualization. Such exploration would help us understand each of the variables in detail as well as the relationships between them. While we do not delve deep into each of these aspects, here are some ideas that can be explored:

  • Exploring the users who always provide high ratings to most jokes
  • Correlation between the ratings provided to jokes
  • Identification of users that are very critical
  • Exploring the most popular jokes or least popular jokes
  • Identifying the jokes with the fewest ratings and identifying the associations between them
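Several of these ideas can be started with a few lines of code. The following is a rough sketch rather than a complete analysis; the variable names are our own, and using a user's average rating as a measure of how generous or critical that user is remains a simplifying assumption:

```r
library(recommenderlab)
data(Jester5k)

# Average rating per user: high values suggest generous raters, low values critical ones.
user_means <- rowMeans(Jester5k)
print(head(sort(user_means, decreasing = TRUE), 5))
print(head(sort(user_means), 5))

# Jokes ordered by average rating, and the jokes with the fewest ratings.
joke_means  <- colMeans(Jester5k)
joke_counts <- colCounts(Jester5k)
print(head(sort(joke_means, decreasing = TRUE), 5))
print(head(sort(joke_counts), 5))

# Pairwise correlation between jokes, using only the users who rated both jokes.
# Coercing to a base matrix yields NA for missing ratings.
rating_matrix <- as(Jester5k, "matrix")
joke_cor <- cor(rating_matrix, use = "pairwise.complete.obs")
print(round(joke_cor[1:5, 1:5], 2))
```

Because every user in Jester5k rated at least 36 jokes, the per-user averages rest on a reasonable number of ratings, but averages for rarely rated jokes should be read with more caution.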
