We follow these steps to arrive at a model that makes content-based recommendations:
First, we take our items dataset and identify the features we want to encode for each item. Next, we generate a pretend item for each user, based on that user's interactions with items, such as clicks, likes, purchases, and reviews. Each user is now encoded with the same features as the items, and is represented just like any other item in the dataset. As a result, we have a set of feature vectors for all items, plus a pretend feature vector for the target user:
User -> Likes -> Item profile
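One common way to build this pretend item is to average the feature vectors of the items the user interacted with, weighted by interaction strength. Here is a minimal sketch of that idea; the weighting scheme (click = 1.0, like = 2.0, purchase = 3.0, say) is an illustrative assumption, not something prescribed by the dataset:

```scala
// Sketch: build a "pretend item" vector for a user as the weighted
// average of the feature vectors of the items they interacted with.
object PretendItem {
  // interactions: pairs of (item feature vector, interaction weight)
  def weightedAverage(interactions: Seq[(Array[Double], Double)]): Array[Double] = {
    val dim = interactions.head._1.length
    val sum = new Array[Double](dim)
    var totalWeight = 0.0
    for ((vec, w) <- interactions) {
      var i = 0
      while (i < dim) { sum(i) += vec(i) * w; i += 1 }
      totalWeight += w
    }
    // Normalize by the total interaction weight
    sum.map(_ / totalWeight)
  }
}
```

The resulting vector lives in the same feature space as the items, so any similarity function defined on items also applies to the user.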
Now the idea is simple. Apply a similarity function for the user features with all the items, and sort them by the most similar at the top. For example:
Let the item feature vectors be I1, I2, ..., In, let the pretend item vector for the user be U, and let our similarity function be sim(a, b). In pseudo-code form, finding the top K similar items amounts to computing sim(U, Ij) for every item Ij, sorting the items by that similarity in descending order, and taking the first K.
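The brute-force scan described above can be sketched in plain Scala as follows; cosine similarity is used here as one reasonable choice of sim, and the item IDs are hypothetical:

```scala
// Sketch: brute-force top-K content-based recommendation.
object TopK {
  // Cosine similarity between two feature vectors
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val na  = math.sqrt(a.map(x => x * x).sum)
    val nb  = math.sqrt(b.map(x => x * x).sum)
    if (na == 0 || nb == 0) 0.0 else dot / (na * nb)
  }

  // Score every item against the user's pretend vector, keep the best K
  def recommend(user: Array[Double],
                items: Map[String, Array[Double]],
                k: Int): Seq[String] =
    items.toSeq
      .map { case (id, vec) => (id, cosine(user, vec)) }
      .sortBy(-_._2)   // most similar first
      .take(k)
      .map(_._1)
}
```

Note that `recommend` touches every item once per query, which is exactly the cost problem discussed next.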
This gives us the top K items most similar to the pretend item we created for our target user, and those are exactly what we recommend. However, it is an expensive operation: every query scans every item. If you can afford the full scan, it is perfectly fine; when the data size is huge, it won't be an option.
To improve on this, we should understand what exactly is happening here. We will discuss that in the next section; can you think about it on your own for now?
Note here that the key ideas are: choosing a similarity function, finding the items nearest to the user under that function, and keeping only the K nearest of those items.
Let's now discuss how clustering can give us some performance benefits.
In the preceding items-and-user example, we have to scan all the items every time we want the top K recommendations for a user. First, we choose a similarity function; then we are essentially finding the items nearest to the user, and from these we keep only the K nearest. But we really don't need to compute the similarity between a user and every item on every query. We can pre-process the items and cluster them so that the most similar items are always grouped together. This is where we will use the K-Means clustering algorithm. K-Means gives us a nice model for finding the set of points closest to a given point using cluster centroids. Here is how it will look:
In this figure, there are three users, labeled A, B, and C. These three points are the pretend points based on the users' interactions with the system. First, we cluster all the items into three clusters: hexagon, star, and ring. Once we have done that, we can find the cluster a user most likely belongs to and recommend items only from that cluster.
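The payoff is that locating a user's cluster takes only K distance computations against the centroids, rather than a scan of all items. A minimal sketch of that lookup step (not the book's code):

```scala
// Sketch: with pre-computed K-Means centroids, finding a user's cluster
// costs only K distance computations; we then score items in that
// cluster alone instead of scanning the whole catalog.
object ClusterLookup {
  // Squared Euclidean distance, the metric K-Means optimizes
  def sqDist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the centroid closest to the user's pretend vector
  def nearestCluster(user: Array[Double], centroids: Seq[Array[Double]]): Int =
    centroids.zipWithIndex.minBy { case (c, _) => sqDist(user, c) }._2
}
```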
We will run our implementation of this recommendation approach on the Amazon dataset. So first let's extract the data for all the items from MongoDB into a CSV file.
$ mongo amazon_dataset --quiet < ../scripts/item-features.js > datasets/content-based-dataset-new.csv
Since we need to extract each item as a set of features, we have the following attributes for each item:
Let's look at a sample entry:
Our first task is to convert each item into a feature vector. For the preceding text attributes, we will use a HashingTF transformer. We set it up with a cardinality of 1024 (number of features) in the following code:
val dim = math.pow(2, 10).toInt
val hashingTF = new HashingTF(dim)
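For intuition about what HashingTF does, here is a minimal pure-Scala sketch of the underlying hashing trick. This is not Spark's actual implementation (Spark uses its own hash function and sparse vectors); it only illustrates how tokens are mapped into a fixed number of buckets:

```scala
// Sketch of the hashing trick: each token is hashed to one of `dim`
// buckets, and the bucket counts token occurrences. Collisions are
// possible and accepted as the price of a fixed-size vector.
object HashingSketch {
  def transform(tokens: Seq[String], dim: Int): Array[Double] = {
    val vec = new Array[Double](dim)
    for (t <- tokens) {
      // Non-negative bucket index in [0, dim)
      val idx = ((t.hashCode % dim) + dim) % dim
      vec(idx) += 1.0
    }
    vec
  }
}
```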
Notice how we create a user defined function to extract values and create a vector:
Now based on the limited and mixed data (both numeric and textual), we would still like to be able to extract numeric features, the reason being that K-Means (or any distance based clustering) works only with numeric data. Here is what we will do:
For this, we will create a Spark UDF (user-defined function) that operates nicely on a DataFrame. Once we have converted every item into a vector representation, we learn a K-Means model, evaluate it, update its hyperparameters, and so on. Finally, when we have obtained a good K-Means model, we label each item with the cluster ID to which it belongs.
The full code for this implementation is in the src/main/scala/chapter06/ContentBasedRSExample.scala file.
First, we load the CSV data into a DataFrame, and then transform it into feature vectors using the UDF we defined earlier. Next, we train the model and assign cluster IDs to all the items. Later, to make recommendations, we will also need to store the K-Means model to disk. The bad news is that this feature is only available from Spark 1.4, and we are using Spark 1.3 (see SPARK-5986 at https://issues.apache.org/jira/browse/SPARK-5986 ). But don't worry, we only need to save the K centroids (see spark/pull/4951 at https://github.com/apache/spark/pull/4951/files).
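Since the model is fully determined by its centroids, persisting them can be as simple as writing one comma-separated line per centroid and reading them back later. A sketch of that idea (the file format and object name here are illustrative, not the book's code):

```scala
import java.io.{File, PrintWriter}
import scala.io.Source

// Sketch: persist just the K centroid vectors as CSV lines, one centroid
// per line, and reload them to rebuild a K-Means model later.
object CentroidIO {
  def save(centroids: Seq[Array[Double]], path: String): Unit = {
    val pw = new PrintWriter(new File(path))
    try centroids.foreach(c => pw.println(c.mkString(",")))
    finally pw.close()
  }

  def load(path: String): Seq[Array[Double]] = {
    val src = Source.fromFile(path)
    try src.getLines().map(_.split(",").map(_.toDouble)).toList
    finally src.close()
  }
}
```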
Once we have a K-Means model, and the many clusters of items, we are only left with one task—picking a user and making item recommendations.
$ sbt 'set fork := true' 'run-main chapter06.ContentBasedRSExample'
[info] asin        clusterID
[info] 0827229534  4
[info] 0313230269  1
[info] B00004W1W1  4
[info] 1559362022  4
[info] 1859677800  3
[info] B000051T8A  4
...
[info] 1577943082  4
[info] 0486220125  9
[info] B00000AU3R  0
[info] 0231118597  5
[info] 0375709363  9
[info] 0939165252  5
[info] Within Set Sum of Squared Errors = 2.604849376963733E14
[success] Total time: 28 s, completed 31 Jul, 2015 9:36:54 PM
Some points to note here are:
However, keep in mind that if you scale or normalize the feature vectors while learning the model, you must apply the same scaling/normalization to the pretend item vectors as well.
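To make this concrete, here is a small sketch of per-feature standardization. The crucial point it illustrates is that the user's pretend vector is transformed with the means and standard deviations computed from the items, never with its own statistics:

```scala
// Sketch: standardize a vector with the SAME per-feature means and
// standard deviations that were used when scaling the item vectors
// before clustering.
object Scaling {
  def standardize(v: Array[Double],
                  means: Array[Double],
                  stds: Array[Double]): Array[Double] =
    v.indices.map { i =>
      // Guard against zero-variance features
      if (stds(i) == 0.0) 0.0 else (v(i) - means(i)) / stds(i)
    }.toArray
}
```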