Content-based recommendation steps

We follow these steps to arrive at a model for making content-based recommendations:

  1. Compute vectors to describe items.
  2. Build profiles of user preferences.
  3. Predict user interest in items.

First, we take our items dataset and identify the features we want to encode for each item. Next, we generate a pretend item for a user, based on that user's interactions with items, such as clicks, likes, purchases, and reviews. Each user is now encoded with the same features as the items, so the user is represented in the same way as any other item in the dataset. We therefore have a set of feature vectors for all items, plus a pretend feature vector for the target user:

User -> Likes -> Item profile

Now the idea is simple. Apply a similarity function between the user's features and every item's features, and sort the items with the most similar at the top. For example:

Let the item feature vectors be v1, v2, ..., vn, let the pretend item for the user be u, and let our similarity function be sim(u, v). In pseudo-code form, finding the top K similar items amounts to computing sim(u, vi) for every item, sorting the items by that score in descending order, and taking the first K; equivalently, we select the K items whose vectors are most similar to u.
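As a concrete sketch of this brute-force step (not the book's original listing; the choice of cosine similarity, the helper names, and the item representation are assumptions), the lookup could be written as:

import org.apache.spark.mllib.linalg.Vector

// Cosine similarity between two MLlib vectors
def cosine(a: Vector, b: Vector): Double = {
  val (xs, ys) = (a.toArray, b.toArray)
  val dot   = xs.zip(ys).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(xs.map(x => x * x).sum)
  val normB = math.sqrt(ys.map(y => y * y).sum)
  if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
}

// Score every item against the user's pretend vector and keep the K best.
def topK(user: Vector, items: Seq[(String, Vector)], k: Int): Seq[(String, Double)] =
  items.map { case (asin, vec) => (asin, cosine(user, vec)) }
       .sortBy { case (_, score) => -score }
       .take(k)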

This gives us the top K items that are most similar to the pretend item we created for our target user, and that is exactly what we would recommend to the user. However, this is an expensive operation. If you can afford to score every item this way, it is perfectly fine, but when the data size is huge, it won't be an option.

To improve on this, we should understand exactly what is happening here. We will discuss that in the next section. Can you think about it on your own for now?

Note here that the key ideas are:

  • To model items according to their relevant attributes
  • To infer a user's likes by modeling the user as a set of item features, and then to use the model built above to make recommendations

Let's now discuss how clustering can give us some performance benefits.

Clustering for performance

In the items-and-user example above, we have to scan all the items every time we want to find the top K recommendations for a user. First, we need to choose a similarity function. Then we are essentially finding the items nearest to the user, and from these we keep only the K nearest items. But we really don't need to compute the similarity of the user with every item each time. We can pre-process the items and cluster them so that the most similar items are always grouped together. This is where we will use the K-Means clustering algorithm. K-Means gives us a nice model for finding the set of points closest to a given point using cluster centroids. Here is how it will look:

[Figure: items clustered into three groups (hexagon, star, and ring), with users A, B, and C each assigned to their nearest cluster]

In this figure, there are three users, labeled A, B, and C. These three points are the pretend items built from each user's interactions with the system. First, we cluster all the items into three clusters: hexagon, star, and ring. Once we have done that, we can find the cluster a user most likely belongs to and recommend items only from that cluster.
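As a rough sketch of this clustering step, assuming the item feature vectors (built later in this section) are available as an RDD[Vector], and using an arbitrary choice of k rather than the book's tuned value:

import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Cluster the item vectors and label each item with its nearest centroid.
def clusterItems(itemVectors: RDD[Vector], k: Int = 10, maxIterations: Int = 20): (KMeansModel, RDD[Int]) = {
  val model = KMeans.train(itemVectors, k, maxIterations)
  // Within-set sum of squared errors, the same metric printed in the run output at the end of this section
  println("Within Set Sum of Squared Errors = " + model.computeCost(itemVectors))
  val clusterIds = model.predict(itemVectors)   // cluster ID per item
  (model, clusterIds)
}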

We will run our implementation of this recommendation approach on the Amazon dataset. So first let's extract the data for all the items from MongoDB into a CSV file.

$ mongo amazon_dataset --quiet < ../scripts/item-features.js > datasets/content-based-dataset-new.csv

We need to represent each item as a set of features. The following attributes are available for each item:

  • Title
  • Group
  • Sales rank
  • Average rating
  • Categories

Let's look at a sample entry:

  • ASIN: 078510870X
  • Title: Ultimate Marvel Team-Up
  • Group: Book
  • Sales rank: 612475
  • Average rating: 3.5
  • Categories: Books::Subjects::Children's Books::Literature::Science Fiction, Fantasy, Mystery & Horror::Comics & Graphic Novels::Books::Subjects::Comics & Graphic Novels::Publishers::Marvel::Books::Subjects::Teens::Science Fiction & Fantasy::Fantasy::Books::Subjects::Comics & Graphic Novels::Graphic Novels::Superheroes

Our first task is to convert each item into a feature vector. For the preceding text attributes, we will use a HashingTF transformer. We set it up with a cardinality of 1024 (number of features) in the following code:

import org.apache.spark.mllib.feature.HashingTF

val dim = math.pow(2, 10).toInt   // 1024 features
val hashingTF = new HashingTF(dim)
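For instance (a hypothetical call, just to show the shape of the output), hashing a small bag of terms produces a sparse term-frequency vector in the 1024-dimensional space:

// The terms here are made up for illustration.
val termVector = hashingTF.transform(Seq("marvel", "comics", "graphic", "novels"))
// termVector is an org.apache.spark.mllib.linalg.SparseVector of length 1024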

Notice how we create a user-defined function (UDF) to extract these values and assemble them into a feature vector; a sketch of such a function appears after the steps below.

Based on this limited and mixed data (both numeric and textual), we still need to extract purely numeric features, because K-Means (or any distance-based clustering) works only with numeric data. Here is what we will do:

  1. We will extract terms from the title, group, and categories.
  2. We will calculate the term frequencies of all these terms using HashingTF.
  3. We will create a vector out of these term frequencies and append two more features, sales rank and average rating, to this vector.

For this, we will create a Spark UDF (user-defined function) that operates nicely on a DataFrame. Once we have converted every item into a vector representation, we learn a K-Means model, evaluate it, update its hyperparameters, and so on. Finally, when we have obtained a good K-Means model, we label each item with the cluster ID to which it belongs.
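The original UDF listing is not reproduced in this excerpt. A minimal sketch of the kind of function it wraps, reusing the hashingTF defined above, might look like the following (the tokenization rules and parameter names are assumptions; the book's version registers this logic as a Spark SQL UDF so it can be applied to the DataFrame):

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Turn one item's raw fields into a numeric feature vector:
// hashed term frequencies of the text fields, plus sales rank and average rating.
def toFeatureVector(title: String, group: String, categories: String,
                    salesRank: Double, avgRating: Double): Vector = {
  val terms = (title.split("\\s+") ++ Array(group) ++ categories.split("::"))
                .map(_.toLowerCase)
                .filter(_.nonEmpty)
  val tf = hashingTF.transform(terms.toSeq).toArray   // 1024 term-frequency features
  Vectors.dense(tf ++ Array(salesRank, avgRating))    // append the two numeric features
}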


The full code for this implementation is in the src/main/scala/chapter06/ContentBasedRSExample.scala file.

First, we load the CSV data into a dataframe and transform it into feature vectors using the UDF we defined earlier. Next, we train the model and assign cluster IDs to all the items. Later, to make recommendations, we will also need to store the K-Means model to disk. The bad news is that this feature is only available from Spark 1.4 onwards, and we are using Spark 1.3 (see SPARK-5986 at https://issues.apache.org/jira/browse/SPARK-5986 ). But don't worry, we only need to save the K centroids (see spark/pull/4951 at https://github.com/apache/spark/pull/4951/files).
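A minimal sketch of that workaround, assuming we persist the centroid vectors with the RDD object-file API and rebuild the model from them with KMeansModel's constructor (paths and helper names are placeholders, not the book's listing):

import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.Vector

// Persist only the K centroids of a trained model
def saveCentroids(sc: SparkContext, model: KMeansModel, path: String): Unit =
  sc.parallelize(model.clusterCenters).saveAsObjectFile(path)

// Rebuild a KMeansModel from previously saved centroids
def loadCentroids(sc: SparkContext, path: String): KMeansModel =
  new KMeansModel(sc.objectFile[Vector](path).collect())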


Once we have a K-Means model and the clusters of items, we are left with only one task: picking a user and making item recommendations.
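That last step is not shown in this excerpt. A minimal sketch of it, assuming the pretend user vector is built by averaging the vectors of items the user interacted with (one of the options listed at the end of this section) and that each item already carries its cluster ID, might look like this:

import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Build a pretend item for the user by averaging the feature vectors
// of the items the user has interacted with.
def pretendUserVector(likedItemVectors: Seq[Vector]): Vector = {
  val summed = likedItemVectors.map(_.toArray)
                 .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
  Vectors.dense(summed.map(_ / likedItemVectors.size))
}

// Recommend the K items nearest to the user, but only from the user's cluster.
def recommend(model: KMeansModel,
              items: Seq[(String, Int, Vector)],   // (asin, clusterId, featureVector)
              user: Vector,
              k: Int): Seq[String] = {
  def dist(a: Vector, b: Vector): Double =
    math.sqrt(a.toArray.zip(b.toArray).map { case (x, y) => (x - y) * (x - y) }.sum)

  val userCluster = model.predict(user)            // nearest centroid to the pretend item
  items.collect { case (asin, cluster, vec) if cluster == userCluster => (asin, dist(user, vec)) }
       .sortBy { case (_, d) => d }                // closest items first
       .take(k)
       .map { case (asin, _) => asin }
}

Only the items in the user's own cluster are scored, which is where the clustering pays off. Running the chapter's full example produces output like the following: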

$ sbt 'set fork := true' 'run-main chapter06.ContentBasedRSExample'
[info] asin   clusterID
[info] 0827229534 4    
[info] 0313230269 1    
[info] B00004W1W1 4    
[info] 1559362022 4    
[info] 1859677800 3    
[info] B000051T8A 4    
...
[info] 1577943082 4    
[info] 0486220125 9    
[info] B00000AU3R 0    
[info] 0231118597 5    
[info] 0375709363 9    
[info] 0939165252 5    
[info] Within Set Sum of Squared Errors = 2.604849376963733E14
[success] Total time: 28 s, completed 31 Jul, 2015 9:36:54 PM

Some points to note here are:

  • Spark 1.3 doesn't have a mechanism to store the K-Means model to disk. This feature is available in Spark 1.4 and is pretty easy to use.
  • To convert users to pretend items, you have many options, such as:
    1. Take the sum of the user's item feature vectors.
    2. Take the average of the user's item feature vectors.
    3. Take a weighted sum (using average ratings).
    4. Take a weighted sum (using the distance of an item from its cluster centroid).

However, keep in mind that if you scale or normalize the feature vectors while learning the model, you need to apply the same scaling/normalization to the pretend item vectors too.
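For instance, here is a sketch using MLlib's StandardScaler (an assumption; the book's code may use a different scaling approach). The key point is that the scaler is fitted once, on the item vectors, and the same fitted model then transforms both the items and the pretend user vector:

import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Fit the scaler on the item vectors only, then reuse it for the pretend item.
// Note: withMean = true requires dense vectors.
def scaleAll(itemVectors: RDD[Vector], userVector: Vector) = {
  val scaler      = new StandardScaler(withMean = true, withStd = true).fit(itemVectors)
  val scaledItems = scaler.transform(itemVectors)   // used to train K-Means
  val scaledUser  = scaler.transform(userVector)    // same transformation as the items
  (scaledItems, scaledUser)
}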
