Building a recommendation engine

To demonstrate both the content-based filtering and collaborative filtering approaches, we'll build a book-recommendation engine.

Book ratings dataset

In this chapter, we will work with the book ratings dataset (Ziegler et al., 2005) collected in a four-week crawl. It contains data on 278,858 members of the Book-Crossing website and 1,157,112 ratings, both implicit and explicit, referring to 271,379 distinct ISBNs. User data is anonymized but includes demographic information. The dataset is available at:

http://www2.informatik.uni-freiburg.de/~cziegler/BX/.

The Book-Crossing dataset comprises three files described at their website as follows:

  • BX-Users: This contains the users. Note that user IDs (User-ID) have been anonymized and mapped to integers. Demographic data is provided (Location and Age) if available. Otherwise, these fields contain NULL-values.
  • BX-Books: Books are identified by their respective ISBN. Invalid ISBNs have already been removed from the dataset. Moreover, some content-based information is given (Book-Title, Book-Author, Year-Of-Publication, and Publisher), obtained from Amazon Web Services. Note that in case of several authors, only the first author is provided. URLs linking to cover images are also given, appearing in three different flavors (Image-URL-S, Image-URL-M, and Image-URL-L), that is, small, medium, and large. These URLs point to the Amazon website.
  • BX-Book-Ratings: This contains the book rating information. Ratings (Book-Rating) are either explicit, expressed on a scale of 1-10 (higher values denoting higher appreciation), or implicit, expressed by 0.
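For illustration, BX-Book-Ratings.csv is semicolon-delimited with quoted fields, roughly like this (the values shown here are hypothetical); depending on the copy you download, you may need to strip the quotes before loading:

"User-ID";"ISBN";"Book-Rating"
"276725";"034545104X";"0"
"276726";"0155061224";"5"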

Loading the data

There are two approaches to loading the data, depending on where it is stored: in a file or in a database. First, we will take a detailed look at how to load the data from a file, including how to deal with custom formats. At the end, we will quickly look at how to load the data from a database.

Loading data from file

Loading data from a file can be achieved with the FileDataModel class, which expects a comma-delimited file where each line contains a userID, an itemID, an optional preference value, and an optional timestamp, in that order, as follows:

userID,itemID[,preference[,timestamp]]

The optional preference value accommodates applications with binary preferences, that is, where a user either expresses a preference for an item or not, without a degree of preference, for example, like/dislike.

Lines that begin with a hash (#) and empty lines are ignored. It is also acceptable for lines to contain additional fields, which will be ignored.
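For instance, a minimal input file in this default format could look as follows (the values are hypothetical):

# userID,itemID,preference,timestamp
1,101,3.0
1,102,4.5
2,101,2.0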

The DataModel class assumes the following types:

  • userID, itemID can be parsed as long
  • preference value can be parsed as double
  • timestamp can be parsed as long

If you are able to provide the dataset in the preceding format, you can simply use the following line to load the data:

DataModel model = new FileDataModel(new File(path));

This class is not intended to be used for very large amounts of data, for example, tens of millions of rows. For that, a JDBC-backed DataModel and a database are more appropriate.

In the real world, however, we cannot always ensure that the input data contains only integer values for userID and itemID. In our case, for example, itemID corresponds to the ISBN, which uniquely identifies a book, but ISBNs are not integers, so the default FileDataModel won't be able to process our data.

Now, let's consider how to deal with the case where our itemID is a string. We will define a custom data model by extending FileDataModel and overriding the long readItemIDFromString(String) method so that it converts the itemID string into a unique long value. To convert a String to a unique long, we'll extend Mahout's AbstractIDMigrator helper class, which is designed exactly for this task.

Now, let's first look at how FileDataModel is extended:

class StringItemIdFileDataModel extends FileDataModel {

  //initialize the migrator to convert String to unique long
  public ItemMemIDMigrator memIdMigtr;

  public StringItemIdFileDataModel(File dataFile, String regex) throws IOException {
    super(dataFile, regex);
  }

  @Override
  protected long readItemIDFromString(String value) {
    
    if (memIdMigtr == null) {
      memIdMigtr = new ItemMemIDMigrator();
    }
    
    // convert to long
    long retValue = memIdMigtr.toLongID(value);
    //store it to cache 
    if (null == memIdMigtr.toStringID(retValue)) {
      try {
        memIdMigtr.singleInit(value);
      } catch (TasteException e) {
        e.printStackTrace();
      }
    }
    return retValue;
  }
  
  // convert long back to String
  String getItemIDAsString(long itemId) {
    return memIdMigtr.toStringID(itemId);
  }
}

Other useful methods that can be overridden are as follows:

  • readUserIDFromString(String value) if user IDs are not numeric
  • readTimestampFromString(String value) to change how timestamp is parsed
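For example, if your file stored timestamps in seconds rather than milliseconds, a hypothetical override might look like this:

  @Override
  protected long readTimestampFromString(String value) {
    // convert seconds to the milliseconds that the data model expects
    return Long.parseLong(value) * 1000L;
  }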

Now, let's take a look at how AbstractIDMigrator is extended. The base class already implements the toLongID conversion (it derives a long from a hash of the string), so our subclass only needs to maintain the reverse mapping from long back to String:

class ItemMemIDMigrator extends AbstractIDMigrator {

  private FastByIDMap<String> longToString;

  public ItemMemIDMigrator() {
    this.longToString = new FastByIDMap<String>(10000);
  }

  public void storeMapping(long longID, String stringID) {
    longToString.put(longID, stringID);
  }

  public void singleInit(String stringID) throws TasteException {
    storeMapping(toLongID(stringID), stringID);
  }

  public String toStringID(long longID) {
    return longToString.get(longID);
  }
}

Now, we have everything in place and we can load our dataset with the following code:

StringItemIdFileDataModel model = new StringItemIdFileDataModel(
  new File("datasets/chap6/BX-Book-Ratings.csv"), ";");
System.out.println(
  "Total items: " + model.getNumItems() +
  "\nTotal users: " + model.getNumUsers());

This outputs the total number of users and items:

Total items: 340556
Total users: 105283

We are ready to move on and start making recommendations.

Loading data from database

Alternatively, we can load the data from a database using one of the JDBC data models. In this chapter, we will not dive into detailed instructions on how to set up a database, connections, and so on, but will just sketch how this can be done.

Database connectors have been moved to a separate package, mahout-integration, hence we first have to add this package to our dependency list. Open the pom.xml file and add the following dependency:

<dependency>
  <groupId>org.apache.mahout</groupId>
  <artifactId>mahout-integration</artifactId>
  <version>0.7</version>
</dependency>

Consider that we want to connect to a MySQL database. In this case, we will also need a package that handles database connections. Add the following to the pom.xml file:

<dependency>
  <groupId>mysql</groupId>
  <artifactId>mysql-connector-java</artifactId>
  <version>5.1.35</version>
</dependency>

Now, we have all the packages, so we can create a connection. First, let's initialize a DataSource class with connection details, as follows:

MysqlDataSource dbsource = new MysqlDataSource();
dbsource.setUser("user");
dbsource.setPassword("pass");
dbsource.setServerName("hostname.com");
dbsource.setDatabaseName("db");

Mahout integration implements a JDBCDataModel for various databases that can be accessed via JDBC. By default, this class assumes that a DataSource is available under the JNDI name jdbc/taste, which gives access to a database with a taste_preferences table with the following schema:

CREATE TABLE taste_preferences (
  user_id BIGINT NOT NULL,
  item_id BIGINT NOT NULL,
  preference REAL NOT NULL,
  PRIMARY KEY (user_id, item_id)
);
CREATE INDEX taste_preferences_user_id_index ON taste_preferences (user_id);
CREATE INDEX taste_preferences_item_id_index ON taste_preferences (item_id);

A database-backed data model is initialized as follows. In addition to the DB connection object, we can also specify the custom table name and table column names, as follows:

DataModel dataModel = new MySQLJDBCDataModel(dbsource, "taste_preferences", 
  "user_id", "item_id", "preference", "timestamp");

In-memory database

Last, but not least, the data model can be created on the fly and held in memory. A data model can be created from an array of preferences holding users' ratings for a set of items.

We can proceed as follows. First, we create a FastByIDMap hash map of preference arrays, PreferenceArray, which stores an array of preferences:

FastByIDMap<PreferenceArray> preferences = new FastByIDMap<PreferenceArray>();

Next, we can create a new preference array for a user that will hold their ratings. The array must be initialized with a size parameter that reserves that many slots in memory:

PreferenceArray prefsForUser1 =
  new GenericUserPreferenceArray(10);

Next, we set the user ID for the preference at position 0. This will actually set the user ID for all preferences in this array:

prefsForUser1.setUserID(0, 1L);

Set item ID for current preference at position 0:

prefsForUser1.setItemID(0, 101L);

Set preference value for preference at 0:

prefsForUser1.setValue(0, 3.0f);

Continue for other item ratings:

prefsForUser1.setItemID(1, 102L);
prefsForUser1.setValue(1, 4.5f);

Finally, add user preferences to the hash map:

preferences.put(1L, prefsForUser1); // use userID as the key

The preference hash map can be now used to initialize GenericDataModel:

DataModel dataModel = new GenericDataModel(preferences);

This code demonstrates how to add two preferences for a single user; in a practical application, you'll want to add multiple preferences for multiple users, as sketched below.
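As a sketch, the same steps can be put together for multiple users (the IDs and ratings below are hypothetical):

FastByIDMap<PreferenceArray> preferences = new FastByIDMap<PreferenceArray>();

// User 1 rates items 101 and 102
PreferenceArray prefsForUser1 = new GenericUserPreferenceArray(2);
prefsForUser1.setUserID(0, 1L);
prefsForUser1.setItemID(0, 101L);
prefsForUser1.setValue(0, 3.0f);
prefsForUser1.setItemID(1, 102L);
prefsForUser1.setValue(1, 4.5f);
preferences.put(1L, prefsForUser1);

// User 2 rates a single item
PreferenceArray prefsForUser2 = new GenericUserPreferenceArray(1);
prefsForUser2.setUserID(0, 2L);
prefsForUser2.setItemID(0, 101L);
prefsForUser2.setValue(0, 5.0f);
preferences.put(2L, prefsForUser2);

DataModel dataModel = new GenericDataModel(preferences);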

Collaborative filtering

Recommendation engines in Mahout can be built with the org.apache.mahout.cf.taste package, which was formerly a separate project called Taste and has continued development in Mahout.

A Mahout-based collaborative filtering engine takes the users' preferences for items (tastes) and returns estimated preferences for other items. For example, a site that sells books or CDs could easily use Mahout to figure out, from previous purchase data, which CDs a customer might be interested in listening to.

Top-level packages define the Mahout interfaces to the following key abstractions:

  • DataModel: This represents a repository of information about users and their preferences for items
  • UserSimilarity: This defines a notion of similarity between two users
  • ItemSimilarity: This defines a notion of similarity between two items
  • UserNeighborhood: This computes neighborhood users for a given user
  • Recommender: This recommends items for a user

A general structure of these concepts is shown in the following diagram:

[Figure: Collaborative filtering]

User-based filtering

The most basic user-based collaborative filtering can be implemented by initializing the previously described components as follows.

First, load the data model:

StringItemIdFileDataModel model = new StringItemIdFileDataModel(
    new File("datasets/chap6/BX-Book-Ratings.csv"), ";");

Next, define how to calculate how users are correlated, for example, using the Pearson correlation:

UserSimilarity similarity = 
  new PearsonCorrelationSimilarity(model);

Next, define which users are considered similar, that is, users that are close to each other according to their ratings:

UserNeighborhood neighborhood = 
  new ThresholdUserNeighborhood(0.1, similarity, model);
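Alternatively (a sketch, not required for our example), Mahout provides NearestNUserNeighborhood, which keeps a fixed number of the most similar users instead of applying a similarity threshold; the neighborhood size of 50 here is an arbitrary choice:

UserNeighborhood neighborhood =
  new NearestNUserNeighborhood(50, similarity, model);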

Now, we can initialize the default GenericUserBasedRecommender engine with the data model, neighborhood, and similarity objects, as follows:

UserBasedRecommender recommender = 
new GenericUserBasedRecommender(model, neighborhood, similarity);

That's it. Our first basic recommendation engine is ready. Let's discuss how to invoke recommendations: we will print the items the user has already rated, along with ten recommendations for this user. The listing uses a books map from ISBN to book title to print readable names; a sketch of building it is shown next.
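The books map is not part of Mahout; it simply maps ISBN strings to book titles. A minimal, hypothetical sketch that builds it from the BX-Books.csv file (assuming semicolon-delimited lines with the ISBN in the first column and the title in the second, and using java.io.BufferedReader, java.io.FileReader, and java.util.HashMap) might look like this:

Map<String, String> books = new HashMap<String, String>();
BufferedReader reader = new BufferedReader(
  new FileReader("datasets/chap6/BX-Books.csv"));
String line;
while ((line = reader.readLine()) != null) {
  String[] fields = line.split(";");
  if (fields.length >= 2) {
    // strip the surrounding quotes and map ISBN -> title
    books.put(fields[0].replace("\"", ""), fields[1].replace("\"", ""));
  }
}
reader.close();

With books in place, we can print the rated items and the recommendations: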

long userID = 80683;
int noItems = 10;

List<RecommendedItem> recommendations = recommender.recommend(
  userID, noItems);

System.out.println("Rated items by user:");
for(Preference preference : model.getPreferencesFromUser(userID)) {
  // convert long itemID back to ISBN
  String itemISBN = model.getItemIDAsString(
    preference.getItemID());
  System.out.println("Item: " + books.get(itemISBN) + 
    " | Item id: " + itemISBN + 
    " | Value: " + preference.getValue());
}

System.out.println("
Recommended items:");
for (RecommendedItem item : recommendations) {
  String itemISBN = model.getItemIDAsString(item.getItemID());
  System.out.println("Item: " + books.get(itemISBN) + 
    " | Item id: " + itemISBN + 
    " | Value: " + item.getValue());
}

This outputs the following recommendations along with their scores:

Rated items by user:
Item: The Handmaid's Tale | Item id: 0395404258 | Value: 0.0
Item: Get Clark Smart : The Ultimate Guide for the Savvy Consumer | Item id: 1563526298 | Value: 9.0
Item: Plum Island | Item id: 0446605409 | Value: 0.0
Item: Blessings | Item id: 0440206529 | Value: 0.0
Item: Edgar Cayce on the Akashic Records: The Book of Life | Item id: 0876044011 | Value: 0.0
Item: Winter Moon | Item id: 0345386108 | Value: 6.0
Item: Sarah Bishop | Item id: 059032120X | Value: 0.0
Item: Case of Lucy Bending | Item id: 0425060772 | Value: 0.0
Item: A Desert of Pure Feeling (Vintage Contemporaries) | Item id: 0679752714 | Value: 0.0
Item: White Abacus | Item id: 0380796155 | Value: 5.0
Item: The Land of Laughs : A Novel | Item id: 0312873115 | Value: 0.0
Item: Nobody's Son | Item id: 0152022597 | Value: 0.0
Item: Mirror Image | Item id: 0446353957 | Value: 0.0
Item: All I Really Need to Know | Item id: 080410526X | Value: 0.0
Item: Dreamcatcher | Item id: 0743211383 | Value: 7.0
Item: Perplexing Lateral Thinking Puzzles: Scholastic Edition | Item id: 0806917695 | Value: 5.0
Item: Obsidian Butterfly | Item id: 0441007813 | Value: 0.0

Recommended items:
Item: Keeper of the Heart | Item id: 0380774933 | Value: 10.0
Item: Bleachers | Item id: 0385511612 | Value: 10.0
Item: Salem's Lot | Item id: 0451125452 | Value: 10.0
Item: The Girl Who Loved Tom Gordon | Item id: 0671042858 | Value: 10.0
Item: Mind Prey | Item id: 0425152898 | Value: 10.0
Item: It Came From The Far Side | Item id: 0836220730 | Value: 10.0
Item: Faith of the Fallen (Sword of Truth, Book 6) | Item id: 081257639X | Value: 10.0
Item: The Talisman | Item id: 0345444884 | Value: 9.86375
Item: Hamlet | Item id: 067172262X | Value: 9.708363
Item: Untamed | Item id: 0380769530 | Value: 9.708363

Item-based filtering

The ItemSimilarity is the most important point to discuss here. Item-based recommenders are useful because they can take advantage of something very convenient: they base their computations on item similarity, not user similarity, and item similarity is relatively static. It can therefore be precomputed, instead of recomputed in real time.

Thus, it's strongly recommended that you use GenericItemSimilarity with precomputed similarities if you're going to use this class. You can use PearsonCorrelationSimilarity too, which computes similarities in real time, but you will probably find this painfully slow for large amounts of data:

StringItemIdFileDataModel model = new StringItemIdFileDataModel(
  new File("datasets/chap6/BX-Book-Ratings.csv"), ";");

ItemSimilarity itemSimilarity = new PearsonCorrelationSimilarity(model);

ItemBasedRecommender recommender = new GenericItemBasedRecommender(model, itemSimilarity);

String itemISBN = "0395272238";
long itemID = model.readItemIDFromString(itemISBN);
int noItems = 10;
List<RecommendedItem> recommendations = recommender.mostSimilarItems(itemID, noItems);

System.out.println("Recommendations for item: "+books.get(itemISBN));

System.out.println("
Most similar items:");
for (RecommendedItem item : recommendations) {
  itemISBN = model.getItemIDAsString(item.getItemID());
  System.out.println("Item: " + books.get(itemISBN) + " | Item id: " + itemISBN + " | Value: " + item.getValue());
}

This outputs the following:

Recommendations for item: Close to the Bone

Most similar items:
Item: Private Screening | Item id: 0345311396 | Value: 1.0
Item: Heartstone | Item id: 0553569783 | Value: 1.0
Item: Clockers / Movie Tie In | Item id: 0380720817 | Value: 1.0
Item: Rules of Prey | Item id: 0425121631 | Value: 1.0
Item: The Next President | Item id: 0553576666 | Value: 1.0
Item: Orchid Beach (Holly Barker Novels (Paperback)) | Item id: 0061013412 | Value: 1.0
Item: Winter Prey | Item id: 0425141233 | Value: 1.0
Item: Night Prey | Item id: 0425146413 | Value: 1.0
Item: Presumed Innocent | Item id: 0446359866 | Value: 1.0
Item: Dirty Work (Stone Barrington Novels (Paperback)) | Item id: 0451210158 | Value: 1.0

The resulting list returns a set of items similar to the particular item that we selected.
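As a sketch of the precomputation recommended above, the GenericItemSimilarity class can wrap another similarity metric and compute all pairwise item-item similarities up front; note that this can be memory-intensive for large item sets:

// Precompute all pairwise item-item similarities once
ItemSimilarity precomputed = new GenericItemSimilarity(
  new PearsonCorrelationSimilarity(model), model);
ItemBasedRecommender fastRecommender =
  new GenericItemBasedRecommender(model, precomputed);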

Adding custom rules to recommendations

It often happens that business rules require us to boost the score of selected items. In the book dataset, for example, if a book is recent, we want to give it a higher score. That's possible by implementing the IDRescorer interface, which defines the following methods:

  • rescore(long, double) that takes an itemId and its original score as arguments and returns a modified score
  • isFiltered(long) that may return true to exclude a specific item from the recommendations, or false otherwise

Our example could be implemented as follows:

class MyRescorer implements IDRescorer {

  public double rescore(long itemId, double originalScore) {
    double newScore = originalScore;
    if (bookIsNew(itemId)) {
      // boost the score of recently published books
      newScore = originalScore * 1.3;
    }
    return newScore;
  }

  public boolean isFiltered(long arg0) {
    return false;
  }

}

An instance of IDRescorer is provided when invoking recommender.recommend:

IDRescorer rescorer = new MyRescorer();
List<RecommendedItem> recommendations = 
recommender.recommend(userID, noItems, rescorer);

Evaluation

You might wonder how to make sure that the returned recommendations make any sense. The only way to be really sure about how effective recommendations are is to use A/B testing in a live system with real users. For example, group A receives a random item as a recommendation, while group B receives an item recommended by our engine.

As this is not always possible or practical, we can get an estimate with offline statistical evaluation. One way to proceed is to use the k-fold cross-validation introduced in Chapter 1, Applied Machine Learning Quick Start. We partition the dataset into multiple sets; some are used to train our recommendation engine and the rest are used to test how well it recommends items to unknown users.

Mahout implements the RecommenderEvaluator class that splits a dataset in two parts. The first part, 90% by default, is used to produce recommendations, while the rest of the data is compared against estimated preference values to test the match. The class does not accept a recommender object directly; instead, you need to build a class implementing the RecommenderBuilder interface, which builds a recommender for a given DataModel object that is then used for testing. Let's take a look at how this is implemented.

First, we create a class that implements the RecommenderBuilder interface. We need to implement the buildRecommender method, which will return a recommender, as follows:

public class BookRecommender implements RecommenderBuilder {
  public Recommender buildRecommender(DataModel dataModel)
      throws TasteException {
    UserSimilarity similarity =
      new PearsonCorrelationSimilarity(dataModel);
    UserNeighborhood neighborhood =
      new ThresholdUserNeighborhood(0.1, similarity, dataModel);
    UserBasedRecommender recommender =
      new GenericUserBasedRecommender(
        dataModel, neighborhood, similarity);
    return recommender;
  }
}

Now that we have a class that returns a recommender object, we can initialize a RecommenderEvaluator instance. The default implementation of this class is the AverageAbsoluteDifferenceRecommenderEvaluator class, which computes the average absolute difference between the predicted and actual ratings for users. The following code shows how to put the pieces together and run a hold-out test.

First, load a data model:

DataModel dataModel = new FileDataModel(
  new File("/path/to/dataset.csv"));

Next, initialize an evaluator instance, as follows:

RecommenderEvaluator evaluator = 
  new AverageAbsoluteDifferenceRecommenderEvaluator();

Initialize the BookRecommender object, implementing the RecommenderBuilder interface:

RecommenderBuilder builder = new BookRecommender();

Finally, call the evaluate() method, which accepts the following parameters:

  • RecommenderBuilder: This is the object implementing RecommenderBuilder that can build recommender to test
  • DataModelBuilder: DataModelBuilder to use, or if null, a default DataModel implementation will be used
  • DataModel: This is the dataset that will be used for testing
  • trainingPercentage: This indicates the percentage of each user's preferences to use to produce recommendations; the rest are compared to estimated preference values to evaluate the recommender's performance
  • evaluationPercentage: This is the percentage of users to be used in evaluation

The method is called as follows:

double result = evaluator.evaluate(builder, null, model, 0.9, 1.0);
System.out.println(result);

The method returns a double, where 0 represents the best possible evaluation, meaning that the recommender perfectly matches user preferences. In general, the lower the value, the better the match.
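Mahout also provides the RMSRecommenderEvaluator class, which computes the root-mean-square error instead of the average absolute difference, thereby penalizing large errors more heavily. It can be swapped in as follows (a sketch):

RecommenderEvaluator rmseEvaluator = new RMSRecommenderEvaluator();
double rmse = rmseEvaluator.evaluate(builder, null, model, 0.9, 1.0);
System.out.println(rmse);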

Online learning engine

What about the online aspect? The approach above works well for existing users, but what about new users who register with the service? We certainly want to provide reasonable recommendations for them as well. Creating a recommender instance is expensive (it definitely takes longer than a typical network request allows), so we can't simply create a new recommender for each request.

Luckily, Mahout supports adding temporary users to a data model. The general setup is as follows:

  • Periodically recreate the whole recommender using current data (for example, each day or hour, depending on how long it takes)
  • When doing a recommendation, check whether the user exists in the system
  • If yes, complete the recommendation as always
  • If no, create a temporary user, fill in the preferences, and do the recommendation

The first part (periodically recreating the recommender) may actually be quite tricky if you are limited in memory: while creating the new recommender, you need to hold two copies of the data in memory (in order to still be able to serve requests from the old one). However, as this doesn't really have anything to do with recommendations, I won't go into details here.

As for the temporary users, we can wrap our data model with a PlusAnonymousConcurrentUserDataModel instance. This class allows us to obtain a temporary user ID; the ID must later be released so that it can be reused (there is a limited number of such IDs). After obtaining the ID, we have to fill in the preferences, and then we can proceed with the recommendation as always:

class OnlineRecommendation{

  Recommender recommender;
  int concurrentUsers = 100;
  int noItems = 10;

  public OnlineRecommendation() throws IOException {
    
    
    DataModel model = new StringItemIdFileDataModel(
      new File("datasets/chap6/BX-Book-Ratings.csv"), ";");
    PlusAnonymousConcurrentUserDataModel plusModel = new PlusAnonymousConcurrentUserDataModel(model, concurrentUsers);
    recommender = ...;
    
  }
  
  public List<RecommendedItem> recommend(long userId, PreferenceArray preferences){
    
    if(userExistsInDataModel(userId)){
      return recommender.recommend(userId, noItems);
    }
    
    else{
      
      PlusAnonymousConcurrentUserDataModel plusModel =
        (PlusAnonymousConcurrentUserDataModel) recommender.getDataModel();
      
      // Take an available anonymous user from the pool
      Long anonymousUserID = plusModel.takeAvailableUser();
      
      // Set temporary preferences; position 0 sets the user ID
      // for the whole preference array
      PreferenceArray tempPrefs = preferences;
      tempPrefs.setUserID(0, anonymousUserID);
      plusModel.setTempPrefs(tempPrefs, anonymousUserID);
      
      List<RecommendedItem> results = recommender.recommend(anonymousUserID, noItems);
      
      // Release the user back to the pool
      plusModel.releaseUser(anonymousUserID);
      
      return results;

    }
    
  }
}