Chapter 6. Recommendation Engine with Apache Mahout

Recommendation engines are probably one of the most widely applied data science approaches in startups today. There are two principal techniques for building a recommendation system: content-based filtering and collaborative filtering. Content-based algorithms use the properties of items to find other items with similar properties. Collaborative filtering algorithms take user ratings or other user behavior and make recommendations based on what users with similar behavior liked or purchased.

This chapter first explains the basic concepts required to understand recommendation engine principles and then demonstrates how to use Apache Mahout's implementations of various algorithms to quickly build a scalable recommendation engine. It covers the following topics:

  • How to build a recommendation engine
  • Getting Apache Mahout ready
  • Content-based approach
  • Collaborative filtering approach

By the end of the chapter, you will know which kind of recommendation engine is appropriate for your problem and how to quickly implement one.

Basic concepts

Recommendation engines aim to show users items of interest. What makes them different from search engines is that relevant content usually appears on a website without the user requesting it. Users don't have to build queries, because the recommendation engine observes their actions and constructs queries for them without their knowledge.

Arguably, the best-known example of a recommendation engine is www.amazon.com, which provides personalized recommendations in a number of ways. The following image shows an example of Customers Who Bought This Item Also Bought. As we will see later, this is an example of collaborative item-based recommendation, where items similar to a particular item are recommended:

An example of a recommendation engine from www.amazon.com.

In this section, we will introduce key concepts related to understanding and building recommendation engines.

Key concepts

A recommendation engine requires the following four inputs to make recommendations:

  • Item information described with attributes
  • User profile such as age range, gender, location, friends, and so on
  • User interactions in the form of ratings, browsing, tagging, comparing, saving, and emailing
  • Context where the items will be displayed, for example, item category and item's geographical location

These inputs are then combined by the recommendation engine to produce recommendations of the following kinds:

  • Users who bought, watched, viewed, or bookmarked this item also bought, watched, viewed, or bookmarked…
  • Items similar to this item…
  • Other users you may know…
  • Other users who are similar to you…

Now let's have a closer look at how this combining works.

User-based and item-based analysis

How a recommendation engine is built depends on whether it searches for related items or for related users when deciding what to recommend.

In item-based analysis, the engine focuses on identifying items that are similar to a particular item, while in user-based analysis, users similar to a particular user are determined first. For example, users with the same profile information (age, gender, and so on) or action history (bought, watched, viewed, and so on) are identified, and then the same items are recommended to the other similar users.

Both approaches require us to compute a similarity matrix, that is, a table holding a similarity score for every pair of items or every pair of users, depending on whether we're analyzing item attributes or user actions. Let's take a deeper look at how this is done.
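To make the idea of a similarity matrix concrete, the following is a minimal sketch in plain Java. It does not use Mahout's API: the class name ItemSimilarityMatrix, the toy ratings, and the choice of cosine similarity are all assumptions made for illustration. Rows are users, columns are items, and 0 stands for "not rated".

```java
import java.util.Arrays;

// Hypothetical illustration (not Mahout code): builds an item-item similarity
// matrix from a small user-item rating matrix.
// Rows = users, columns = items, 0 = no rating.
class ItemSimilarityMatrix {

    // Cosine similarity between two item columns of the rating matrix.
    static double cosine(double[][] ratings, int itemA, int itemB) {
        double dot = 0, normA = 0, normB = 0;
        for (double[] userRow : ratings) {
            dot += userRow[itemA] * userRow[itemB];
            normA += userRow[itemA] * userRow[itemA];
            normB += userRow[itemB] * userRow[itemB];
        }
        if (normA == 0 || normB == 0) {
            return 0; // an item nobody has rated is similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Fills a square matrix with the similarity of every pair of items.
    static double[][] similarityMatrix(double[][] ratings) {
        int items = ratings[0].length;
        double[][] sim = new double[items][items];
        for (int i = 0; i < items; i++) {
            for (int j = 0; j < items; j++) {
                sim[i][j] = cosine(ratings, i, j);
            }
        }
        return sim;
    }

    public static void main(String[] args) {
        double[][] ratings = {
            {5, 4, 0},  // user 0
            {4, 5, 1},  // user 1
            {0, 1, 5}   // user 2
        };
        for (double[] row : similarityMatrix(ratings)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

In this toy data, items 0 and 1 are rated alike by the same users, so they score as more similar to each other than either does to item 2. A user-based analysis would apply the same computation to the rows instead of the columns.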

Approaches to calculate similarity

There are three fundamental approaches to calculating similarity, as follows:

  • Collaborative filtering algorithms take user ratings or other user behavior and make recommendations based on what users with similar behavior liked or purchased
  • The content-based algorithm uses the properties of the items to find items with similar properties
  • A hybrid approach combining collaborative and content-based filtering

Let's take a look at each approach in detail.

Collaborative filtering

Collaborative filtering is based solely on user ratings or other user behavior, making recommendations based on what users with similar behavior liked or purchased.

A key advantage of collaborative filtering is that it does not rely on item content, and therefore, it is capable of accurately recommending complex items such as movies, without understanding the item itself. The underlying assumption is that people who agreed in the past will agree in the future, and that they will like similar kinds of items as they liked in the past.

A major disadvantage of this approach is the so-called cold start problem: to build an accurate collaborative filtering system, the algorithm often needs a large number of user ratings, which a new product does not yet have. As a result, collaborative filtering is usually left out of the first version of a product and introduced later, once a decent amount of data has been collected.
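The collaborative filtering idea can be sketched in a few lines of plain Java. This is an illustrative toy, not Mahout's implementation: the class UserBasedCf, the use of cosine similarity over co-rated items, and the convention that 0 means "not rated" are all assumptions. A missing rating is predicted as the similarity-weighted average of the ratings that other users gave to the same item.

```java
// Hypothetical illustration (not Mahout code) of user-based collaborative
// filtering. Rows = users, columns = items, 0 = no rating.
class UserBasedCf {

    // Cosine similarity computed only over the items both users have rated.
    static double similarity(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] == 0 || b[i] == 0) {
                continue; // skip items that are not co-rated
            }
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    // Predicts the rating of `user` for `item`. Returns NaN when no user with
    // overlapping ratings has rated the item, so no prediction is possible.
    static double predict(double[][] ratings, int user, int item) {
        double weighted = 0, totalWeight = 0;
        for (int other = 0; other < ratings.length; other++) {
            if (other == user || ratings[other][item] == 0) {
                continue;
            }
            double w = similarity(ratings[user], ratings[other]);
            weighted += w * ratings[other][item];
            totalWeight += w;
        }
        return totalWeight == 0 ? Double.NaN : weighted / totalWeight;
    }

    public static void main(String[] args) {
        double[][] ratings = {
            {5, 1, 0},  // target user, item 2 unknown
            {5, 1, 5},  // agrees with the target user on items 0 and 1
            {1, 5, 1}   // disagrees with the target user
        };
        System.out.println("Predicted rating: " + predict(ratings, 0, 2));
    }
}
```

Because the second user agreed with the target user in the past, that user's rating dominates the prediction. When no overlapping ratings exist, the method has nothing to return, which is the cold start situation described above.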

Content-based filtering

Content-based filtering, on the other hand, is based on a description of items and a profile of the user's preferences, which are combined as follows. First, the items are described with attributes, and to find similar items, we compare items using a measure such as cosine distance or the Pearson correlation coefficient (more about distance measures is in Chapter 1, Applied Machine Learning Quick Start). Next, the user profile enters the equation. Given feedback about the kind of items the user likes, we can introduce weights specifying the importance of each item attribute. For instance, the Pandora Radio streaming service applies content-based filtering to create stations using more than 400 attributes. A user initially picks a song with specific attributes, and as the user provides feedback, the important song attributes are emphasized.

This approach initially needs very little user feedback, so it effectively avoids the cold-start issue.
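The role of the attribute weights can be illustrated with a small sketch. Everything here is assumed for the sake of the example: the class ContentBasedSimilarity, the three made-up song attributes, and the use of a weighted cosine similarity. It is neither Pandora's nor Mahout's actual method.

```java
// Hypothetical illustration (not Mahout code) of content-based filtering:
// items are attribute vectors, and per-user weights emphasize the attributes
// that the user's feedback has shown to matter.
class ContentBasedSimilarity {

    // Cosine similarity in which attribute i is scaled by weights[i].
    // Weights are expected to be non-negative.
    static double weightedCosine(double[] a, double[] b, double[] weights) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += weights[i] * a[i] * b[i];
            normA += weights[i] * a[i] * a[i];
            normB += weights[i] * b[i] * b[i];
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        // Made-up attribute order: {fast tempo, acoustic, prominent vocals}
        double[] seed  = {1, 0, 1};
        double[] songA = {1, 1, 0};  // shares the tempo of the seed song
        double[] songB = {0, 0, 1};  // shares the vocals of the seed song

        double[] uniform  = {1, 1, 1};    // no feedback yet
        double[] tempoFan = {3, 1, 0.5};  // feedback suggests tempo matters most

        System.out.printf("uniform:   A=%.3f B=%.3f%n",
                weightedCosine(seed, songA, uniform),
                weightedCosine(seed, songB, uniform));
        System.out.printf("tempo fan: A=%.3f B=%.3f%n",
                weightedCosine(seed, songA, tempoFan),
                weightedCosine(seed, songB, tempoFan));
    }
}
```

With uniform weights, song B is the closer match to the seed song. Once the weights emphasize tempo, song A overtakes it, which is how feedback changes the recommendations without requiring ratings from any other user.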

Hybrid approach

Now, which should we choose: collaborative or content-based filtering? Collaborative filtering is able to learn user preferences from a user's actions regarding one content source and use them across other content types. Content-based filtering is limited to recommending content of the same type that the user is already using. This matters for some use cases: recommending news articles based on news browsing is useful, but it is much more useful if content from different sources, such as books and movies, can be recommended based on news browsing.

Collaborative filtering and content-based filtering are not mutually exclusive; they can be combined to be more effective in some cases. For example, Netflix uses collaborative filtering to analyze searching and watching patterns of similar users, as well as content-based filtering to offer movies that share characteristics with films that the user has rated highly.

There is a wide variety of hybridization techniques, such as weighted, switching, mixed, feature combination, feature augmentation, cascade, meta-level, and so on. Recommendation systems are an active area in the machine learning and data mining community, with special tracks at data science conferences. A good overview of techniques is given in the paper Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions by Adomavicius and Tuzhilin (2005), where the authors discuss different approaches and underlying algorithms and provide references to further papers. To get more technical and understand in detail when a particular approach makes sense, you should look at the book edited by Ricci et al. (2010) Recommender Systems Handbook (1st ed.), Springer-Verlag New York.
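Of the hybridization techniques named above, the weighted and switching variants are the simplest to sketch. The following plain Java example is only an illustration under assumptions: the class HybridScorer, the blending parameter alpha, and the convention that NaN means "no collaborative score available" are hypothetical and not taken from Mahout or from the cited literature.

```java
// Hypothetical illustration (not Mahout code) of two simple ways to combine
// a collaborative filtering score with a content-based score for one item.
class HybridScorer {

    // Weighted hybrid: a linear blend of the two scores. alpha in [0, 1] is
    // the share given to the collaborative filtering score.
    static double weighted(double cfScore, double contentScore, double alpha) {
        if (alpha < 0 || alpha > 1) {
            throw new IllegalArgumentException("alpha must be in [0, 1]");
        }
        return alpha * cfScore + (1 - alpha) * contentScore;
    }

    // Switching hybrid: use the collaborative score when one exists, and fall
    // back to the content-based score otherwise (for example, at cold start).
    static double switching(double cfScore, double contentScore) {
        return Double.isNaN(cfScore) ? contentScore : cfScore;
    }

    public static void main(String[] args) {
        System.out.println(weighted(0.8, 0.4, 0.75));
        System.out.println(switching(Double.NaN, 0.4));
    }
}
```

In practice, a parameter such as alpha would be tuned on held-out data rather than fixed by hand.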

Exploitation versus exploration

In a recommendation system, there is always a tradeoff between recommending items that fall into the user's sweet spot based on what we already know about the user (exploitation) and recommending items that don't fall into the user's sweet spot, with the aim of exposing the user to some novelty (exploration). Recommendation systems with little exploration will only recommend items consistent with previous user ratings, never showing items outside the user's current bubble. In practice, the serendipity of getting new items from outside the user's sweet spot is often desirable, leading to pleasant surprises and the potential discovery of new sweet spots.
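One common way to control this tradeoff is an epsilon-greedy rule, sketched below in plain Java. This particular strategy is an assumption brought in for illustration, not something this chapter or Mahout prescribes, and the class EpsilonGreedyRecommender and its parameters are hypothetical.

```java
import java.util.Random;

// Hypothetical illustration (not Mahout code) of an epsilon-greedy rule for
// balancing exploitation and exploration.
class EpsilonGreedyRecommender {

    // With probability epsilon, returns the index of a random candidate
    // (exploration); otherwise returns the index of the candidate with the
    // highest predicted score (exploitation).
    static int pick(double[] predictedScores, double epsilon, Random rng) {
        if (rng.nextDouble() < epsilon) {
            return rng.nextInt(predictedScores.length);
        }
        int best = 0;
        for (int i = 1; i < predictedScores.length; i++) {
            if (predictedScores[i] > predictedScores[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] scores = {0.2, 0.9, 0.5};
        Random rng = new Random();
        // Roughly one recommendation in ten is chosen at random.
        System.out.println("Recommended item index: " + pick(scores, 0.1, rng));
    }
}
```

Setting epsilon to 0 gives pure exploitation, while larger values show the user more items from outside the current sweet spot.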

In this section, we discussed the essential concepts required to start building recommendation engines. Now, let's take a look at how to actually build one with Apache Mahout.
