Chapter 5. Case Study – Building Your Own Recommendation Engine

The previous two chapters showed you how to build, test, and optimize recommender systems using R. Although the chapters were full of examples, they were based on datasets provided by an R package. The data was already structured and ready to be processed. In real life, however, data preparation is an important, time-consuming, and difficult step.

Another limitation of the previous examples is that they are based on ratings only. In most situations, there are other data sources, such as item descriptions and user profiles. A good solution combines all the relevant information.

This chapter shows a practical example in which we will build and optimize a recommender system, starting from raw data. This chapter will cover the following topics:

  • Preparing the data to build a recommendation engine
  • Exploring the data through visualization techniques
  • Choosing and building a recommendation model
  • Optimizing the performance of the recommendation model by setting its parameters

In the end, we will build an engine that generates recommendations.

Preparing the data

Starting from raw data, this section will show you how to prepare the input for the recommendation models.

Description of the data

The data describes users visiting the Microsoft website during one week. For each user, it records which areas of the website they visited. For the sake of simplicity, from now on we will refer to the website areas as "items".

There are 5,000 users, represented by sequential numbers between 10,001 and 15,000. Items are represented by numbers between 1,000 and 1,297, although there are fewer than 298 of them.

The dataset is an unstructured text file. Each record contains between two and six fields. The first field is a letter defining what the record contains. There are three main types of records, which are as follows:

  • Attribute (A): This is the description of a website area
  • Case (C): This is the record for each user, containing the user ID
  • Vote (V): These are the vote lines for the case, that is, the areas that the user visited

Each case record is followed by one or more votes, and there is just one case for each user.
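To make the record layout concrete, here is a minimal sketch that tags a few illustrative lines by their first field. The lines are simplified to the fields discussed in this chapter and are not copied from the file; sample_lines and record_type are names chosen for this example only.

```r
# Illustrative, simplified records: one attribute line and two cases with their votes
sample_lines <- c(
  'A,1038,1,"SiteBuilder Network Membership","/sbnmember"',
  "C,10001",
  "V,1038",
  "V,1026",
  "C,10002",
  "V,1008"
)

# The first character of each line is the record type
record_type <- substr(sample_lines, 1, 1)

# Count how many records of each type we have
type_counts <- table(record_type)
print(type_counts)
```

Each C line opens a new block, and the V lines that follow belong to that user until the next C line appears.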

Our goal is to recommend to each user some areas of the website that they haven't explored yet.

Importing the data

This section will show you how to import data. First, let's load the packages that we will use:

library("data.table")
library("ggplot2")
library("recommenderlab")
library("countrycode")

The preceding code is explained in the following points:

  • data.table: This manipulates the data
  • ggplot2: This builds charts
  • recommenderlab: This builds recommendation engines
  • countrycode: This package contains the country names

Then, let's load the table into memory. If the text file is already in our working directory, it's enough to define its name. Otherwise, we need to define its full path:

file_in <- "anonymous-msweb.test.txt"

The rows contain different numbers of columns, which means that the data is unstructured. However, there are at most six columns, so we can load the file into a table using read.csv. The rows with fewer than six fields will have just empty values:

table_in <- read.csv(file_in, header = FALSE)
head(table_in)

V1  V2  V3                 V4                    V5  V6
I   4   www.microsoft.com  created by getlog.pl
T   1   VRoot              0                     0   VRoot
N   0   0
N   1   1
T   2   Hide1              0                     0   Hide
N   0   0

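The first two columns contain the record type and the ID of the user or item. They are enough to reconstruct which areas each user visited, which we will call purchases from now on, so we can drop the other columns: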

table_users <- table_in[, 1:2]
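As a side note on the import step, read.table (which read.csv calls) decides how many columns to create by inspecting only the first five lines of the input, unless col.names is supplied. The following sketch, on illustrative lines rather than the real file, passes six column names explicitly so that a long record appearing late in the input is still read correctly. The sample_lines and sample_in names are assumptions made for this example.

```r
# Illustrative input: the longest record is the last one, beyond the first five lines
sample_lines <- c(
  "C,10001",
  "V,1038",
  "V,1026",
  "C,10002",
  "V,1008",
  "V,1056",
  'A,1038,1,"SiteBuilder Network Membership","/sbnmember"'
)

# Naming six columns up front fixes the table width;
# fill = TRUE pads the shorter rows with empty values
sample_in <- read.csv(
  text = sample_lines,
  header = FALSE,
  fill = TRUE,
  col.names = paste0("V", 1:6),
  stringsAsFactors = FALSE
)
print(sample_in)
```

Short rows get NA in numeric columns and empty strings in character columns, which is why the unused fields look blank when printed.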

In order to process the data more easily, we can convert it into a data table, using this command:

table_users <- data.table(table_users)

The columns are as follows:

  • category: This is a letter specifying the content of the row. The rows containing a user or an item ID belong to the categories C and V, respectively.
  • value: This is a number specifying the user or item ID.

We can assign the column names and select the rows containing either users or items:

setnames(table_users, 1:2, c("category", "value"))
table_users <- table_users[category %in% c("C", "V")]
head(table_users)

category  value
C         10001
V         1038
V         1026
V         1034
C         10002
V         1008

The table_users object contains structured data, which is our starting point to define a rating matrix.

Defining a rating matrix

Our target is to define a table having a row for each user and a column for each item. For each user, table_users contains the user ID and the purchases in separate rows. In each block of rows, the first row contains the user ID and the following rows contain the item IDs.

You can use the following steps to define a rating matrix:

  1. Label the cases.
  2. Define a table in the long format.
  3. Define a table in the wide format.
  4. Define the rating matrix.

In order to reshape the table, the first step is to define a field called chunk_user containing an incremental number for each user. The category == "C" condition is true for the user rows, which are the first rows of the chunks. Using cumsum, we increment the index by 1 whenever there is a row with a new user:

table_users[, chunk_user := cumsum(category == "C")]
head(table_users)

category  value  chunk_user
C         10001  1
V         1038   1
V         1026   1
V         1034   1
C         10002  2
V         1008   2
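The labeling trick can be seen in isolation on a short vector. This is a minimal base R sketch on illustrative values; is_new_user is a name introduced here for clarity.

```r
# Record types in file order: each "C" starts a new user block
category <- c("C", "V", "V", "V", "C", "V")

# TRUE on the first row of each block, FALSE on the vote rows
is_new_user <- category == "C"

# TRUE counts as 1 and FALSE as 0, so the running sum
# goes up by one exactly when a new user starts
chunk_user <- cumsum(is_new_user)
print(chunk_user)
```

All the rows between two consecutive C records share the same label, which is what lets us group them in the next step.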

The next step is to define a table in which rows correspond to the purchases. We need a column with the user ID and a column with the item ID. The new table is called table_long, because it's in a long format:

table_long <- table_users[, list(user = value[1], item = value[-1]), by = "chunk_user"]
head(table_long)

chunk_user  user   item
1           10001  1038
1           10001  1026
1           10001  1034
2           10002  1008
2           10002  1056
2           10002  1032
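For readers less familiar with the data.table grouping syntax, the same long table can be sketched in base R. This is an alternative illustration on a small hand-made table, not the chapter's method, and it assumes that every case has at least one vote, as stated in the description of the data.

```r
# Small hand-made table with the same layout as table_users
table_users <- data.frame(
  category = c("C", "V", "V", "V", "C", "V"),
  value = c(10001, 1038, 1026, 1034, 10002, 1008)
)

# Label the blocks, then split the values by block
chunk_user <- cumsum(table_users$category == "C")
chunks <- split(table_users$value, chunk_user)

# In each block, the first value is the user and the rest are items
# (assumes at least one item per block)
long_list <- lapply(chunks, function(v) {
  data.frame(user = v[1], item = v[-1])
})

# Stack the per-user tables into one long table
table_long <- do.call(rbind, long_list)
rownames(table_long) <- NULL
print(table_long)
```

The user ID is recycled across the items of its block, which is the same behavior that value[1] and value[-1] produce inside the data.table call.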

Now, we can define a table having a row for each user and a column for each item. The values are equal to 1 if the item has been purchased, and 0 otherwise. We can build the table using the reshape function. Its inputs are as follows:

  • data: This is the table in the long format.
  • direction: This shows whether we are reshaping from long to wide or vice versa.
  • idvar: This is the variable identifying the group, which, in this case, is the user.
  • timevar: This is the variable identifying the record within the same group. In this case, it's the item.
  • v.names: This is the name of the values. In this case, it's the rating, which is always equal to one. Missing user-item combinations will be NA values.

After defining the column value equal to 1, we can build table_wide using reshape:

table_long[, value := 1]
table_wide <- reshape(data = table_long, direction = "wide", idvar = "user", timevar = "item", v.names = "value")
head(table_wide[, 1:5, with = FALSE])

chunk_user  user   value.1038  value.1026  value.1034
1           10001  1           1           1
2           10002  NA          NA          NA
3           10003  1           1           NA
4           10004  NA          NA          NA
5           10005  1           1           1
6           10006  NA          NA          1
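As a cross-check on the wide table, a user-by-item matrix can also be sketched with a base R cross-tabulation. This is an alternative shown on illustrative data, not the approach used in this chapter; unlike reshape, it fills the missing combinations with 0 rather than NA and sorts the item columns.

```r
# Small long table: one row per user-item pair
table_long <- data.frame(
  user = c(10001, 10001, 10001, 10002, 10003, 10003),
  item = c(1038, 1026, 1034, 1008, 1038, 1026)
)

# Cross-tabulate users against items: each cell counts the matching rows
counts <- table(table_long$user, table_long$item)

# Turn the counts into a plain 0/1 matrix, keeping the row and column names
matrix_wide <- (unclass(counts) > 0) * 1
print(matrix_wide)
```

Comparing the number of ones in such a matrix with the number of rows in the long table is a quick way to confirm that no purchase was lost or duplicated during reshaping.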

In order to build the rating matrix, we need to keep only the columns containing ratings. In addition, the user IDs will be the matrix row names, so we need to store them in the vector_users vector:

vector_users <- table_wide[, user]
table_wide[, user := NULL]
table_wide[, chunk_user := NULL]

In order to have the column names equal to the item names, we need to remove the value. prefix. The prefix is six characters long, so we can use the substring function to keep each name from the seventh character onward:

setnames(x = table_wide, old = names(table_wide), new = substring(names(table_wide), 7))
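The position-based approach works because the prefix has a fixed length. As a hedged alternative, a pattern-based version does not depend on counting characters. The sketch below compares the two on illustrative names; wide_names is a name made up for this example.

```r
# Column names as produced by reshape
wide_names <- c("value.1038", "value.1026", "value.1034")

# Position-based: keep everything from the seventh character onward
by_position <- substring(wide_names, 7)

# Pattern-based: strip a leading "value." wherever it occurs
by_pattern <- sub("^value\\.", "", wide_names)

print(by_position)
print(by_pattern)
```

Both give the same item IDs here. The pattern-based form leaves untouched any name that does not start with the prefix.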

We need to store the rating matrix within a recommenderlab object. For this purpose, we need to convert table_wide into a matrix first. In addition, we need to set the row names equal to the user IDs:

matrix_wide <- as.matrix(table_wide)
rownames(matrix_wide) <- vector_users
head(matrix_wide[, 1:6])

user   1038  1026  1034  1008  1056  1032
10001  1     1     1     NA    NA    NA
10002  NA    NA    NA    1     1     1
10003  1     1     NA    NA    NA    NA
10004  NA    NA    NA    NA    NA    NA
10005  1     1     1     1     NA    NA
10006  NA    NA    1     NA    NA    NA

The last step is coercing matrix_wide into a binary rating matrix using as, in the following way:

matrix_wide[is.na(matrix_wide)] <- 0
ratings_matrix <- as(matrix_wide, "binaryRatingMatrix")
ratings_matrix
## 5000 x 236 rating matrix of class binaryRatingMatrix with 15191 ratings.
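The dimensions and the rating count printed above are enough to quantify how full the matrix is. The following sketch only redoes that arithmetic; the variable names are chosen for this example.

```r
# Figures taken from the printed summary of ratings_matrix
n_users <- 5000
n_items <- 236
n_ratings <- 15191

# Share of user-item cells that contain a rating
density <- n_ratings / (n_users * n_items)

# Average number of items per user
items_per_user <- n_ratings / n_users

print(round(100 * density, 2))
print(round(items_per_user, 2))
```

Only a little more than 1 percent of the cells are filled, and the average user has visited about three areas.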

Let's take a look at the matrix using image:

image(ratings_matrix[1:50, 1:50], main = "Binary rating matrix")

The following image shows the binary rating matrix:

[Figure: Binary rating matrix]

As expected, the matrix is sparse. We can also visualize the distribution of the number of users purchasing each item:

n_users <- colCounts(ratings_matrix)
qplot(n_users) + stat_bin(binwidth = 100) + ggtitle("Distribution of the number of users")

The following image displays the distribution of the number of users:

[Figure: Distribution of the number of users]

There are some outliers, that is, items purchased by many users. Let's visualize the distribution excluding them:

qplot(n_users[n_users < 100]) + stat_bin(binwidth = 10) + ggtitle("Distribution of the number of users")

The following image displays the distribution of the numbers of users:

[Figure: Distribution of the number of users, excluding the outliers]

There are many items that have been purchased by a few users only, and we won't recommend them. Since they increase the computational time, we can just remove them by defining a minimum number of purchases, for example, 5:

ratings_matrix <- ratings_matrix[, colCounts(ratings_matrix) >= 5]
ratings_matrix
## 5000 x 166 rating matrix of class 'binaryRatingMatrix' with 15043 ratings.

Now, we have 166 items, compared to the initial 236. As regards users, we want to recommend items to everyone. However, there might be users that have purchased only items that we removed. Let's check it:

sum(rowCounts(ratings_matrix) == 0)
## 15

There are 15 users with no purchases. These users should be removed. In addition, users who have purchased just a few items are difficult to deal with. Therefore, we only keep users who have purchased at least five items:

ratings_matrix <- ratings_matrix[rowCounts(ratings_matrix) >= 5, ]
ratings_matrix
## 959 x 166 rating matrix of class 'binaryRatingMatrix' with 6816 ratings
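The effect of the two filters is easier to follow on a tiny matrix. The sketch below uses a plain 0/1 matrix with colSums and rowSums as base R stand-ins for colCounts and rowCounts; the data and the thresholds of 2 are illustrative assumptions, not the values used above.

```r
# Toy purchase matrix: rows are users, columns are items
toy <- matrix(
  c(1, 1, 0, 0,
    1, 0, 0, 0,
    1, 1, 1, 0,
    0, 0, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("u", 1:4), paste0("i", 1:4))
)

# Illustrative thresholds
min_users_per_item <- 2
min_items_per_user <- 2

# Drop the items purchased by too few users
toy <- toy[, colSums(toy) >= min_users_per_item, drop = FALSE]

# Then drop the users left with too few items
toy <- toy[rowSums(toy) >= min_items_per_user, , drop = FALSE]
print(toy)
```

User u4 purchased only a rare item, so it is left with no purchases once that item is removed. This is the same situation as the users with no purchases found above.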

Extracting item attributes

The table_in raw data contains some records starting with A, and they display some information about the items. In order to extract these records, we can convert table_in into a data table and extract the rows having A in the first column:

table_in <- data.table(table_in)
table_items <- table_in[V1 == "A"]
head(table_items)

V1  V2    V3  V4                                V5
A   1277  1   NetShow for PowerPoint            /stream
A   1253  1   MS Word Development               /worddev
A   1109  1   TechNet (World Wide Web Edition)  /technet
A   1038  1   SiteBuilder Network Membership    /sbnmember
A   1205  1   Hardware Supprt                   /hardwaresupport
A   1076  1   NT Workstation Support            /ntwkssupport

The relevant columns are:

  • V2: Item ID
  • V4: Item description
  • V5: Web page URL

In order to have a clearer table, we can extract and rename them. In addition, we can sort the table by item ID:

table_items <- table_items[, c(2, 4, 5), with = FALSE]
setnames(table_items, 1:3, c("id", "description", "url"))
table_items <- table_items[order(id)]
head(table_items)

id    description             url
1000  regwiz                  /regwiz
1001  Support Desktop         /support
1002  End User Produced View  /athome
1003  Knowledge Base          /kb
1004  Microsoft.com Search    /search
1005  Norway                  /norge

We need to identify one or more features describing the items. If we look at the table, we can identify two categories of web pages:

  • Microsoft product
  • Geographic location

We can identify the records containing a geographic location, and consider the remaining ones as products. For this purpose, we can start by defining the category field, which, at the moment, is equal to product for all the records:

table_items[, category := "product"]

The countrycode package provides the countrycode_data object, which contains most of the country names (in countrycode 1.0 and later, the equivalent object is named codelist and the English names are in its country.name.en column). We can define the name_countries vector, which contains the names of countries and other geographic locations. Then, we can categorize as region all the records whose description is in name_countries:

name_countries <- c(countrycode_data$country.name, "Taiwan", "UK", "Russia", "Venezuela", "Slovenija", "Caribbean", "Netherlands (Holland)", "Europe", "Central America", "MS North Africa")
table_items[description %in% name_countries, category := "region"]

There are other records whose description contains the word Region. We can identify them through a regular expression using grepl:

table_items[grepl("Region", description), category := "region"]
head(table_items)

id    description             url      category
1000  regwiz                  /regwiz  product
1001  Support Desktop         /support product
1002  End User Produced View  /athome  product
1003  Knowledge Base          /kb      product
1004  Microsoft.com Search    /search  product
1005  Norway                  /norge   region
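The two matching rules can be tried out on a small data frame in base R. In this sketch, the third item and its ID are hypothetical, and name_countries is shortened to a few entries, so it only illustrates the logic rather than reproducing the real table.

```r
# Small item table; the third row is a hypothetical example
table_items <- data.frame(
  id = c(1000, 1005, 1234),
  description = c("regwiz", "Norway", "Hypothetical Region Page"),
  stringsAsFactors = FALSE
)

# Shortened list of geographic names
name_countries <- c("Norway", "Taiwan", "UK")

# Start with everything as a product
table_items$category <- "product"

# Rule 1: the description is exactly a known geographic name
is_country <- table_items$description %in% name_countries

# Rule 2: the description contains the word "Region" (case-sensitive)
has_region <- grepl("Region", table_items$description, fixed = TRUE)

table_items$category[is_country | has_region] <- "region"
print(table_items)
```

The %in% test requires an exact match, which is why spelling variants such as "Netherlands (Holland)" have to be listed explicitly, while grepl matches anywhere in the description.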

Let's take a look at the result and find out the number of items we have for each category:

table_items[, list(n_items = .N), by = category]

category  n_items
product   248
region    46

About 84 percent of the web pages are products, and the remaining 16 percent are regions.
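The shares can be checked directly from the counts in the table. This short sketch simply redoes the arithmetic; n_items and share are names chosen for this example.

```r
# Item counts per category, from the table above
n_items <- c(product = 248, region = 46)

# Percentage of items in each category, rounded to one decimal
share <- round(100 * prop.table(n_items), 1)
print(share)
```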

We are now ready to build recommendation models.
