Loading data from a file

Data can be loaded from a file with the FileDataModel class. It expects a comma-delimited file in which each line contains a userID, an itemID, an optional preference value, and an optional timestamp, in that order:

userID,itemID[,preference[,timestamp]] 

Making the preference optional accommodates applications with binary preference values, that is, applications where the user either expresses a preference for an item or does not, without any degree of preference; for example, with a like or dislike.

Empty lines and lines that begin with a hash (#) are ignored. Lines may also contain additional fields, which are ignored as well.

The FileDataModel class assumes the following types:

  • The userID and itemID can be parsed as long
  • The preference value can be parsed as double
  • The timestamp can be parsed as long
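To make these type assumptions concrete, the following sketch parses a single line of the format above using only the standard Java library. It is not Mahout's parser; the RatingLine class and its fields are hypothetical names introduced for illustration, and the sketch assumes that a field, when present, is not empty.

```java
import java.util.OptionalDouble;
import java.util.OptionalLong;

// Hypothetical illustration only; this is not how FileDataModel is implemented.
final class RatingLine {
  final long userID;
  final long itemID;
  final OptionalDouble preference; // empty when the line carries no preference value
  final OptionalLong timestamp;    // empty when the line carries no timestamp

  private RatingLine(long userID, long itemID,
                     OptionalDouble preference, OptionalLong timestamp) {
    this.userID = userID;
    this.itemID = itemID;
    this.preference = preference;
    this.timestamp = timestamp;
  }

  // Returns null for lines that would be ignored (empty or starting with '#').
  static RatingLine parse(String line) {
    String trimmed = line.trim();
    if (trimmed.isEmpty() || trimmed.startsWith("#")) {
      return null;
    }
    String[] fields = trimmed.split(",");
    if (fields.length < 2) {
      throw new IllegalArgumentException("Expected at least userID,itemID: " + line);
    }
    long user = Long.parseLong(fields[0].trim());
    long item = Long.parseLong(fields[1].trim());
    OptionalDouble pref = fields.length > 2
        ? OptionalDouble.of(Double.parseDouble(fields[2].trim()))
        : OptionalDouble.empty();
    OptionalLong ts = fields.length > 3
        ? OptionalLong.of(Long.parseLong(fields[3].trim()))
        : OptionalLong.empty();
    // Any fields after the fourth are simply ignored.
    return new RatingLine(user, item, pref, ts);
  }
}
```

A line such as 1,2 yields a RatingLine with no preference and no timestamp, which corresponds to the binary preference case described earlier. A non-numeric userID or itemID makes Long.parseLong throw a NumberFormatException, which is the kind of input the default types cannot handle.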

If you are able to provide the dataset in the preceding format, you can simply use the following line to load the data:

DataModel model = new FileDataModel(new File(path)); 

This class is not intended to be used for very large amounts of data; for example, tens of millions of rows. For that, a JDBC-backed DataModel and a database are more appropriate.

In the real world, however, we cannot always ensure that the input data supplied to us contains only integer values for userID and itemID. In our case, for example, itemID corresponds to ISBN book numbers. These uniquely identify items, but they are not integers (an ISBN can end with the character X, for instance), so the default FileDataModel cannot process our data.

Now, let's consider how to deal with a case where our itemID is a string. We will define a custom data model by extending FileDataModel and overriding the long readItemIDFromString(String) method so that it reads the itemID as a string and returns a unique long value. To convert a String into a unique long, we'll extend another Mahout helper class, AbstractIDMigrator, which is designed exactly for this task.

Now, let's look at how FileDataModel is extended:

import java.io.File;
import java.io.IOException;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;

class StringItemIdFileDataModel extends FileDataModel {

  // migrator that converts a String to a unique long
  public ItemMemIDMigrator memIdMigtr;

  public StringItemIdFileDataModel(File dataFile, String regex)
      throws IOException {
    super(dataFile, regex);
  }

  @Override
  protected long readItemIDFromString(String value) {
    // created lazily: this method is first called while the
    // superclass constructor loads the file
    if (memIdMigtr == null) {
      memIdMigtr = new ItemMemIDMigrator();
    }
    // convert to long
    long retValue = memIdMigtr.toLongID(value);
    // store it to cache
    if (null == memIdMigtr.toStringID(retValue)) {
      try {
        memIdMigtr.singleInit(value);
      } catch (TasteException e) {
        e.printStackTrace();
      }
    }
    return retValue;
  }

  // convert long back to String
  String getItemIDAsString(long itemId) {
    return memIdMigtr.toStringID(itemId);
  }
}

Other useful methods that can be overridden are as follows:

  • readUserIDFromString(String value), if user IDs are not numeric
  • readTimestampFromString(String value), to change how timestamp is parsed

Now, let's take a look at how AbstractIDMigrator is extended:

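import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
import org.apache.mahout.cf.taste.impl.model.AbstractIDMigrator;

class ItemMemIDMigrator extends AbstractIDMigrator {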
 
  private FastByIDMap<String> longToString; 
 
  public ItemMemIDMigrator() { 
    this.longToString = new FastByIDMap<String>(10000); 
  } 
 
  public void storeMapping(long longID, String stringID) { 
    longToString.put(longID, stringID); 
  } 
 
  public void singleInit(String stringID) throws TasteException { 
    storeMapping(toLongID(stringID), stringID); 
  } 
 
  public String toStringID(long longID) { 
    return longToString.get(longID); 
  } 
} 
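The migrator relies on the toLongID method inherited from AbstractIDMigrator to turn a string into a long, and keeps its own map for the way back. The following standalone sketch illustrates the same idea with the standard Java library only. It assumes a hash-based scheme (the first eight bytes of an MD5 digest), which to our understanding resembles what Mahout does, but it should be read as an illustration rather than as Mahout's code; the StringIdMapper class and its register method are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an ID migrator; it does not depend on Mahout.
final class StringIdMapper {
  // Reverse lookup: a hash cannot be inverted, so the original strings are stored.
  private final Map<Long, String> longToString = new HashMap<>();

  // Deterministically maps a string to a long built from the
  // first 8 bytes of its MD5 digest (an assumption of this sketch).
  static long toLongID(String stringID) {
    try {
      byte[] digest = MessageDigest.getInstance("MD5")
          .digest(stringID.getBytes(StandardCharsets.UTF_8));
      long hash = 0L;
      for (int i = 0; i < 8; i++) {
        hash = (hash << 8) | (digest[i] & 0xFFL);
      }
      return hash;
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("MD5 is not available", e);
    }
  }

  // Computes the long ID and remembers the mapping for later reverse lookup.
  long register(String stringID) {
    long id = toLongID(stringID);
    longToString.put(id, stringID);
    return id;
  }

  // Returns the original string, or null if the ID was never registered.
  String toStringID(long longID) {
    return longToString.get(longID);
  }
}
```

Because the forward direction is a one-way hash, the reverse direction works only for strings that have been stored, which is why the data model above calls singleInit for every new itemID it reads. Two different strings could, in principle, hash to the same long, although this is very unlikely with a 64-bit value.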

Now, we have everything in place, and we can load our dataset with the following code:

StringItemIdFileDataModel model = new StringItemIdFileDataModel(
  new File("datasets/chap6/BX-Book-Ratings.csv"), ";");
System.out.println(
  "Total items: " + model.getNumItems() +
  "\nTotal users: " + model.getNumUsers());

This prints the total number of items and users:

    Total items: 340556
    Total users: 105283
  

We are ready to move on and start making recommendations.
