Chapter 8. Building a Recommendation System

If one were to choose an algorithm to showcase data science to the public, a recommendation system would certainly be in the frame. Today, recommendation systems are everywhere. Their popularity comes down to their versatility, usefulness, and broad applicability. Whether they are used to recommend products based on a user's shopping behavior or to suggest new movies based on viewing preferences, recommenders are now a fact of life. It is even possible that this book was magically suggested to you based on what marketing companies know about you, such as your social network preferences, your job status, or your browsing history.

In this chapter, we will demonstrate how to recommend music content using raw audio signals. For that purpose, we will cover the following topics:

  • Using Spark to process audio files stored on HDFS
  • Learning about the Fourier transform for audio signal transformation
  • Using Cassandra as a caching layer between online and offline layers
  • Using PageRank as an unsupervised recommendation algorithm
  • Integrating Spark Job Server with the Play framework to build an end-to-end prototype

Different approaches

The end goal of a recommendation system is to suggest new items based on a user's historical usage and preferences. The basic idea is to use a ranking for any product a customer has previously shown interest in. This ranking can be explicit (asking a user to rank a movie from 1 to 5) or implicit (counting how many times a user visited a page). Whether it is a product to buy, a song to listen to, or an article to read, data scientists usually address this issue from two different angles: collaborative filtering and content-based filtering.

Collaborative filtering

Using this approach, we leverage big data by collecting more information about the behavior of people. Although an individual is by definition unique, their shopping behavior is usually not, and some similarities can always be found with others. The recommended items will be targeted for a particular individual, but they will be derived by combining the user's behavior with that of similar users. This is the famous quote from most retail websites:

"People who bought this also bought that..."

Of course, this requires prior knowledge about the customer and their past purchases, and you must also have enough information about other customers to compare against. A major limiting factor follows: an item cannot be recommended until it has been viewed or bought at least once.
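
To make the idea concrete, here is a minimal sketch in plain Python of the "people who bought this also bought that" logic, using simple item co-occurrence counts. It is only an illustration under stated assumptions, not the approach built later in this chapter: the purchase history, the user and item names, and the helper functions co_occurrence and also_bought are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations

# Hypothetical purchase history (user -> set of items bought); illustration only.
purchases = {
    "alice": {"guitar", "amplifier", "strings"},
    "bob": {"guitar", "strings"},
    "carol": {"guitar", "amplifier"},
    "dave": {"piano", "metronome"},
}


def co_occurrence(purchases):
    """Count how often each ordered pair of items appears in the same basket."""
    counts = defaultdict(Counter)
    for basket in purchases.values():
        for item, other in permutations(basket, 2):
            counts[item][other] += 1
    return counts


def also_bought(item, counts, top_n=3):
    """Return the items most often bought together with `item`.

    Ties are broken alphabetically so that the result is deterministic.
    """
    ranked = sorted(counts.get(item, {}).items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:top_n]]


counts = co_occurrence(purchases)
print(also_bought("guitar", counts))   # items co-purchased with a guitar
print(also_bought("ukulele", counts))  # never purchased, so nothing to suggest
```

The second call illustrates the limiting factor described above: an item that nobody has bought has no co-occurrence counts, so it can neither trigger recommendations nor appear in them.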

Note

The iris dataset of collaborative filtering, that is, the sample data most tutorials start from, is usually a subset of the LastFM dataset: http://labrosa.ee.columbia.edu/millionsong/lastfm.

Content-based filtering

An alternative approach, rather than using similarities with other users, involves looking at the product itself and the type of products a customer has previously been interested in. If you are interested in both classical music and speed metal, it is safe to assume that you would probably buy (or at least consider) any new album mixing classical rhythms with heavy metal riffs. Such a recommendation would be difficult to find with a collaborative filtering approach, as no one in your neighborhood shares your musical taste.

The main advantage of this approach is that, assuming we have enough knowledge about the content to recommend (such as the categories, labels, and so on), we can recommend a new item even when no one has seen it before. The downside is that the model can be more difficult to build, and selecting the right features with no loss of information can be challenging.
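
As a rough illustration of content-based filtering, the sketch below scores items by the cosine similarity between their feature vectors and a user profile built from the items that user already likes. Everything in it is an assumption made for the example: the catalogue, the three hand-picked features (classical, metal, jazz), and the helper functions cosine, profile, and recommend. Note that the new release gets recommended although no one has listened to it yet.

```python
import math

# Hypothetical catalogue: item -> feature vector (classical, metal, jazz).
catalogue = {
    "symphony_no_5": (1.0, 0.0, 0.0),
    "speed_metal_classics": (0.0, 1.0, 0.0),
    "smooth_jazz_evening": (0.0, 0.0, 1.0),
    "symphonic_metal_debut": (0.6, 0.8, 0.0),  # new release, no listeners yet
}


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def profile(liked, catalogue):
    """Average the feature vectors of the items a user likes."""
    vectors = [catalogue[item] for item in liked]
    return tuple(sum(column) / len(vectors) for column in zip(*vectors))


def recommend(liked, catalogue, top_n=1):
    """Rank the items the user has not liked yet by similarity to their profile."""
    user = profile(liked, catalogue)
    candidates = [item for item in catalogue if item not in liked]
    ranked = sorted(candidates, key=lambda item: (-cosine(user, catalogue[item]), item))
    return ranked[:top_n]


liked = {"symphony_no_5", "speed_metal_classics"}
print(recommend(liked, catalogue))
```

The hard part in practice is the one mentioned above: choosing features that describe the content well enough. Here they are simply hand-assigned numbers.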

Custom approach

As the focus of this book is Mastering Spark for Data Science, we wish to provide the reader with a new and innovative way of addressing the recommendation issue, rather than just explaining the standard collaborative filtering algorithm that anyone could build using the out-of-the-box Spark APIs and following a basic tutorial (http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html). Let's start with a hypothesis:

If we were to recommend songs to end-users, couldn't we build a system that would recommend songs, not based on what people like or dislike, nor on the song attributes (genre, artist), but rather on how the song really sounds and how you feel about it?

In order to demonstrate how to build such a system (and since you likely do not have access to a public dataset containing both music content and rankings, or at least not a legitimate one), we will explain how to construct it locally using your own personal music library. Feel free to play along!
