Chapter 5. Recommendation Systems

Recommendation systems find their natural application whenever a user faces a choice of products or services too wide to evaluate in a reasonable timeframe. These engines are an important part of an e-commerce business because they help web customers pick the appropriate items to buy from a large number of candidates, most of which are not relevant to them. Typical examples are Amazon, Netflix, eBay, and the Google Play store, which suggest to each user the items they may like to buy, using the historical data they have collected. Different techniques have been developed over the past 20 years, and we will focus on the most important (and most widely employed) methods used in the industry to date, specifying the advantages and disadvantages of each. Recommendation systems are classified into Content-based Filtering (CBF) and Collaborative Filtering (CF) techniques; other approaches (association rules, the log-likelihood method, and hybrid methods) will also be discussed, together with different ways to evaluate their accuracy.

The methods will be tested on the MovieLens dataset (from http://grouplens.org/datasets/movielens/), consisting of 100,000 movie ratings (values from 1 to 5) from 943 users on 1,682 movies. Each user has at least 20 ratings, and each movie has a list of genres that it belongs to. All the code shown in this chapter is available, as usual, at https://github.com/ai2010/machine_learning_for_the_web/tree/master/chapter_5 in the rec_sys_methods.ipynb file.

We will start by introducing the main matrix used to arrange the dataset employed by the recommendation system, together with the metrics typically used, before discussing the algorithms in the following sections.

Utility matrix

The data used in a recommendation system is divided into two categories: the users and the items. Each user likes certain items, and the rating value r_ij (from 1 to 5) is the data associated with user i and item j; it represents how much the user appreciates the item. These rating values are collected in a matrix, called the utility matrix R, in which each row i holds the ratings given by user i and each column j holds the ratings received by item j. In our case, the data folder ml-100k contains a file called u.data (and also u.item, with the list of movie titles) that has been converted into a pandas DataFrame (and saved to a CSV file, utilitymatrix.csv) by the following script:

[Listing: script that builds the utility matrix from u.data and u.item]
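A minimal sketch of such a conversion, not the original listing, may help fix ideas. It assumes that u.data is tab-separated (user id, item id, rating, timestamp) and that u.item is pipe-separated with the movie id and title in the first two fields; the names build_utility_matrix and load_ml100k are hypothetical, and the threshold is read as "keep movies with at least 50 ratings". A toy dataset is used so that the block runs without the MovieLens files.

```python
import pandas as pd


def build_utility_matrix(ratings, titles, min_ratings=50):
    """Return a users x movies DataFrame with 0 for missing ratings.

    ratings: DataFrame with columns 'user', 'item', 'rating'.
    titles:  mapping (dict or Series) from item id to movie title.
    """
    # Keep only the movies that received at least `min_ratings` ratings.
    counts = ratings.groupby("item")["rating"].count()
    popular = counts[counts >= min_ratings].index
    kept = ratings[ratings["item"].isin(popular)]

    # One row per user, one column per movie; unrated cells become 0.
    R = kept.pivot_table(index="user", columns="item",
                         values="rating", fill_value=0)

    # Users whose ratings were all filtered out still get an all-zero row.
    all_users = sorted(ratings["user"].unique())
    R = R.reindex(all_users, fill_value=0)

    # Label each column as "title;id", as described in the text.
    R.columns = [f"{titles[item]};{item}" for item in R.columns]
    return R.rename_axis("user").reset_index()


def load_ml100k(folder="ml-100k"):
    """Read u.data and u.item (file layout is an assumption, see above)."""
    ratings = pd.read_csv(f"{folder}/u.data", sep="\t",
                          names=["user", "item", "rating", "timestamp"])
    items = pd.read_csv(f"{folder}/u.item", sep="|", header=None,
                        encoding="latin-1")
    return ratings, items.set_index(0)[1]  # Series: item id -> title


# Toy data: movie 10 has two ratings, movie 20 only one.
toy_ratings = pd.DataFrame({
    "user":   [1, 2, 3],
    "item":   [10, 10, 20],
    "rating": [4, 5, 3],
})
toy_titles = {10: "Movie A", 20: "Movie B"}
toy_R = build_utility_matrix(toy_ratings, toy_titles, min_ratings=2)
print(toy_R)
```

On the real dataset, the same function could be applied to the pair returned by load_ml100k("ml-100k") and the result written out with to_csv("utilitymatrix.csv", index=False).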

The output of the first two lines is as follows:

[Output: first two rows of the utility matrix]

Each column name, apart from the first (which is the user ID), gives the name of the movie and its ID in the MovieLens database, separated by a semicolon. The 0 values represent missing ratings, and we expect a large number of them because each user evaluated far fewer than the roughly 1,600 movies available. Note that the movies with fewer than 50 ratings have been removed from the utility matrix, so the number of columns is 604 (the user ID column plus 603 movies rated more than 50 times). The goal of the recommendation system is to predict these missing values, but for some techniques to work properly we first need to fill them in with initial estimates (imputation). Usually, two imputation approaches are used: the average rating per user or the average rating per item. Both are implemented in the following function:

[Listing: imputation function]
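As an illustration of the two imputation strategies, here is a minimal NumPy sketch. It is not the chapter's original function: the name impute_missing and its by parameter are hypothetical, and the sketch assumes that 0 marks a missing rating and that a user or item with no ratings at all is left at 0.

```python
import numpy as np


def impute_missing(R, by="user"):
    """Replace the 0 entries of R with the mean of the observed ratings.

    by="user": use each row's (user's) average rating.
    by="item": use each column's (item's) average rating.
    """
    if by not in ("user", "item"):
        raise ValueError("by must be 'user' or 'item'")

    R = np.asarray(R, dtype=float)
    observed = R > 0                      # True where a rating exists
    axis = 1 if by == "user" else 0       # average along rows or columns

    sums = (R * observed).sum(axis=axis, keepdims=True)
    counts = observed.sum(axis=axis, keepdims=True)

    # Mean of observed ratings; stays 0 where nothing was rated.
    means = np.divide(sums, counts,
                      out=np.zeros_like(sums), where=counts > 0)

    # Keep observed ratings, fill the gaps with the broadcast means.
    return np.where(observed, R, means)


R_toy = np.array([[5, 0, 3],
                  [0, 2, 0],
                  [4, 0, 0]])
by_user = impute_missing(R_toy, by="user")
by_item = impute_missing(R_toy, by="item")
print(by_user)
print(by_item)
```

In the toy matrix, the first user's missing rating becomes 4 (the mean of 5 and 3) under user-based imputation, while under item-based imputation the second user's missing rating for the first movie becomes 4.5 (the mean of 5 and 4).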

This function will be called by many of the algorithms implemented in this chapter, so we discuss it here as a reference for future use. Furthermore, in this chapter the utility matrix R will have dimensions N×M, where N is the number of users and M is the number of items. Because similarity measures are used repeatedly by different algorithms, we define the most common ones hereafter.
