Chapter 5. Recommendation Engines and Where They Fit In

In the earlier chapters, we set the stage for building our own implementations of recommendation engines. In this chapter, we will implement our first recommender system on a products dataset, covering the following topics:

  • Populate an Amazon dataset
  • Create a web app with user/product pages
  • Add recommendation pages
  • Add product and customer trends

Populating the Amazon dataset

Let's start by downloading the SNAP Amazon dataset from https://snap.stanford.edu/data/amazon-meta.html. First go to the datasets folder, download the file, and decompress it. Note that this file is roughly 200 MB compressed and 933 MB uncompressed, so it may take some time to download:

$ cd datasets/
$ wget -c https://snap.stanford.edu/data/bigdata/amazon/amazon-meta.txt.gz
$ gunzip amazon-meta.txt.gz

A single product entry in this file looks like this:

Id:  1
ASIN: 0827229534
  title: Patterns of Preaching: A Sermon Sampler
  group: Book
  salesrank: 396585
  similar: 5 0804215715 156101074X 0687023955 0687074231 082721619X
  categories: 2
  |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]| Preaching[12368]
  |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Sermons[12370]
  reviews: total: 2 downloaded: 2 avg rating: 5
  2000-7-28 cutomer: A2JW67OY8U6HHK rating: 5 votes: 10 helpful:  9
  2003-12-14 cutomer: A2VE83MZF98ITY rating: 5 votes: 6 helpful:  5

The fields to note are Id, ASIN, title, salesrank, group, similar, categories, and reviews. We have already written Scala code to parse this file so that it is easy for you to process; it uses the ReactiveMongo driver, the Play JSON library, and a custom parser script that scans through the data file.

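The complete parser ships with this chapter's code bundle. As a rough idea of what the model layer looks like, here is a minimal sketch, assuming case class and field names that simply mirror the MongoDB document shown later in this section (the names are illustrative, not necessarily the exact code):

import play.api.libs.json._

// Illustrative sketch: fields mirror the document stored in the products
// collection; the chapter's actual classes may differ in detail.
case class Category(name: String, code: Long)
object Category { implicit val format: Format[Category] = Json.format[Category] }

case class Review(date: String, customer: String, rating: Int, votes: Int, helpful: Int)
object Review { implicit val format: Format[Review] = Json.format[Review] }

case class OverallReview(total: Int, downloaded: Int, averageRating: Double)
object OverallReview { implicit val format: Format[OverallReview] = Json.format[OverallReview] }

case class Product(
    id: Long,
    asin: String,
    title: String,
    group: String,
    salesrank: Long,
    similar: List[String],
    categories: List[List[Category]],
    reviews: List[Review],
    overallReview: OverallReview)

object Product {
  // The implicit Format in the companion object is what lets Play JSON
  // (and ReactiveMongo's JSON support) turn documents into case classes
  // and back without hand-written conversion code.
  implicit val format: Format[Product] = Json.format[Product]
}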

Notice how we have used the Play JSON API, ReactiveMongo macros, and implicit values in companion objects. This allows us to convert between JSON and case classes seamlessly, without writing any extra code.

Next we put all these entries into a MongoDB collection; we will call the database amazon_dataset and the collection products. As you can see in the chapter's code (the LoadAmazonDataset Scala object), we use ReactiveMongo and its JSON macros, which saves us from typing a lot of boilerplate JSON-conversion code. With a handful of regular expressions and a simple iteration we can process the full text file; we only need to be careful with the end of the file.

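The exact expressions live in the chapter's source; the following is only a minimal sketch of the approach, assuming the field layout of the product entry shown earlier (note that the raw file really does spell the reviewer field cutomer, so the regex has to as well):

import scala.io.Source

// Illustrative sketch only; the LoadAmazonDataset object in the code bundle
// is the real implementation and may differ in detail.
object AmazonMetaParserSketch {

  val IdPattern    = """^Id:\s+(\d+)""".r.unanchored
  val AsinPattern  = """^ASIN:\s+(\S+)""".r.unanchored
  val TitlePattern = """^\s+title:\s+(.+)""".r.unanchored
  val ReviewPattern =
    ("""^\s+(\d{4}-\d{1,2}-\d{1,2})\s+cutomer:\s+(\S+)\s+rating:\s+(\d+)""" +
     """\s+votes:\s+(\d+)\s+helpful:\s+(\d+)""").r.unanchored
  // ... similar patterns cover group, salesrank, similar, and categories.

  def main(args: Array[String]): Unit = {
    val source = Source.fromFile(args(0))
    for (line <- source.getLines()) line match {
      case IdPattern(id)     => println(s"product $id")   // a new entry begins
      case AsinPattern(asin) => println(s"  asin: $asin")
      case TitlePattern(t)   => println(s"  title: $t")
      case ReviewPattern(date, customer, rating, votes, helpful) =>
        println(s"  review by $customer on $date: rating $rating")
      case _                 => // other fields, blank lines, end of file
    }
    source.close()
  }
}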

Before executing the following command, make sure that you have started a MongoDB server on your local machine:

$ sbt "run-main chapter05.LoadAmazonDataset datasets/amazon-dataset/amazon-meta.txt"

It will take some time to load all the product entries into MongoDB. Meanwhile, you can also check your MongoDB instance:

$ mongo
> use amazon_dataset
> db.products.findOne()
{
    "_id" : ObjectId("5579f5bd1d00001d002109dd"),
    "id" : 1,
    "asin" : "0827229534",
    "title" : "Patterns of Preaching: A Sermon Sampler",
    "group" : "Book",
    "salesrank" : 396585,
    "similar" : ["0804215715", "156101074X", "0687023955", "0687074231", "082721619X"],
    "categories" : [
        [
            {"name" : "Books", "code" : 283155},
            {"name" : "Subjects", "code" : 1000},
            {"name" : "Religion & Spirituality", "code" : 22},
            {"name" : "Christianity", "code" : 12290},
            {"name" : "Clergy", "code" : 12360},
            {"name" : "Preaching", "code" : 12368}
        ],
        [
            {"name" : "Books", "code" : 283155},
            {"name" : "Subjects", "code" : 1000},
            {"name" : "Religion & Spirituality", "code" : 22},
            {"name" : "Christianity", "code" : 12290},
            {"name" : "Clergy", "code" : 12360},
            {"name" : "Sermons", "code" : 12370}
        ]
    ],
    "reviews" : [
        {"date" : "2000-7-28", "customer" : "A2JW67OY8U6HHK", "rating" : 5, "votes" : 10, "helpful" : 9},
        {"date" : "2003-12-14", "customer" : "A2VE83MZF98ITY", "rating" : 5, "votes" : 6, "helpful" : 5}
    ],
    "overallReview" : {"total" : 2, "downloaded" : 2, "averageRating" : 5}
}

You may also want to check the different product groups present in the whole dataset:

> db.runCommand({distinct: "products", key: "group"})
{
    "values" : [
        "Book",
        "Music",
        "DVD",
        "Video",
        "Toy",
        "Video Games",
        "Software",
        "Baby Product",
        "CE",
        "Sports"
    ],
    "stats" : {
        "n" : 533023,
        "nscanned" : 533023,
        "nscannedObjects" : 533023,
        "timems" : 614,
        "cursor" : "BasicCursor"
    },
    "ok" : 1
}

If all has gone well, our dataset is now populated into MongoDB. Next we will create a web interface to explore this data, and we will also add two specialized pages for the most popular and top-rated products.

Let's also populate some fake customer data. For this we will use the Fake Name Generator website, which is free and convenient to use. Visit http://www.fakenamegenerator.com/order.php (bulk order), select all the fields, and place an order for 50,000 entries. After a while you will receive a link to download the randomly generated data. Note that if you publish this data, you must follow the GPL v3 / Creative Commons license guidelines (see readme.txt in the file you receive).

Assuming the file you receive is named something like FakeNameGenerator.com_qw3r7y.zip (the suffix is randomly generated), we proceed as follows:

$ unzip FakeNameGenerator.com_qw3r7y.zip

This extracts a CSV file and a readme.txt (which contains the license information mentioned earlier). Now we import this data directly into a MongoDB collection called customers:

$ mongoimport -d amazon_dataset -c customers --type csv --file ./FakeNameGenerator.com_qw3r7y.csv --headerline
connected to: 127.0.0.1
2015-06-27T18:34:34.462+0530 check 9 50001
2015-06-27T18:34:34.705+0530 imported 50000 objects

Now that the data is in MongoDB, we can run a few queries against it. Let's see how gender is encoded in this dataset:

$ mongo
> use amazon_dataset
> db.runCommand({distinct: "customers", key: "Gender"})
{
    "values" : [
        "female",
        "male"
    ],
    "stats" : {
        "n" : 50000,
        "nscanned" : 50000,
        "nscannedObjects" : 50000,
        "timems" : 39,
        "cursor" : "BasicCursor"
    },
    "ok" : 1
}

As we can see, gender is encoded as the plain strings female and male; we can explore the other fields in the same way to decide how to encode them in Scala (a sketch of one possible encoding follows after the next query). Let's also look at the NameSet column, which essentially indicates what kind of name a record represents:

> db.runCommand({distinct: "customers", key: "NameSet"})
{
    "values" : [
        "Chinese (Traditional)",
        "Russian",
        "Danish",
        "Japanese (Anglicized)",
             ... OUTPUT SKIPPED ...
        "Norwegian",
        "Dutch",
        "Thai"
    ],
    "stats" : {
        "n" : 50000,
        "nscanned" : 50000,
        "nscannedObjects" : 50000,
        "timems" : 38,
        "cursor" : "BasicCursor"
    },
    "ok" : 1
}
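To give an idea of how these fields might be modelled on the Scala side, here is a minimal sketch of one possible encoding; only the Gender and NameSet values come from the queries above, and everything else (names, structure) is an illustrative assumption:

// One possible Scala encoding of the gender field; "male"/"female" are the
// only values present in the imported data, as the distinct query shows.
sealed trait Gender
case object Male extends Gender
case object Female extends Gender

object Gender {
  def fromString(value: String): Gender = value.trim.toLowerCase match {
    case "male"   => Male
    case "female" => Female
    case other    => sys.error(s"unexpected gender value: $other")
  }
}

// A hypothetical, simplified view of a customer row from the imported CSV.
case class Customer(gender: Gender, nameSet: String)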

Our customer data is now set up; however, we still need a way to map these random names to the Amazon dataset's customers. Remember that the Amazon dataset contains customer reviews keyed by customer ID, so first we need to figure out how many distinct customers there are. For that we will split the review data out into a separate collection called reviews.

Create a separate collection for reviews:

> db.reviews.drop()
true
> db.products.find({}).forEach(
...   function(doc) {
...     var rs = doc.reviews;
...     for (var r in rs) {
...       var elem = rs[r];
...       elem.asin = doc.asin;
...       elem.productId = doc.id;
...       elem.date = new Date(elem.date);
...       db.reviews.insert(elem);
...     }
...   });
> db.reviews.count()
6874336

As we can see from the output above, there are 6,874,336 reviews in total. We will also add an index on the customer field:

> db.reviews.createIndex( { "customer": 1 } )

All right, so now we have a separate reviews collection; next we create a mapping collection between Amazon customer IDs and our random users. First we extract a customer_mapping collection containing the actual customer IDs, and then we assign each of them a random number between 0 and 49,999. We do this because our random user dataset contains only 50,000 users, whereas there are far more actual customers, as we will see now:

> db.reviews.aggregate([{$group: {_id: "$customer"}}, {$out: "customer_mapping"}])
> db.customer_mapping.count()
1510224
> var totalCustomers = db.customers.count();
> db.customer_mapping.find().forEach(
...  function(doc) {
...   // pick a random fake-customer number in [0, totalCustomers)
...   var custNum = Math.floor(Math.random() * totalCustomers);
...   db.customer_mapping.update(
...      {_id: doc._id},
...      {$set: {customer_number: custNum}}
...     );
...  });

Finally, the mapping is complete. It is not perfect (with about 1.5 million distinct customers mapped onto only 50,000 fake users, many customers end up sharing the same identity), but it is good enough for building a nice web application for our first recommendation project. That's exactly what we will do next.
