Chapter 11. Example Applications

Throughout this text, almost all of the examples have been in JavaScript. This chapter explores using MongoDB with languages that are more likely to be used in a real application.

Chemical Search Engine: Java

The Java driver is the oldest MongoDB driver. It has been used in production for years and is a stable, popular choice for enterprise developers.

We’ll be using the Java driver to build a search engine for chemical compounds, heavily inspired by http://www.chemeo.com. This search engine has the chemical and physical properties of thousands of compounds on file, and its goal is to make this information fully searchable.

Installing the Java Driver

The Java driver comes as a JAR file that can be downloaded from Github. To install, add the JAR to your classpath.

All of the Java classes you will probably need to use in a normal application are in the com.mongodb and com.mongodb.gridfs packages. There are a number of other packages included in the JAR that are useful if you are planning on manipulating the driver’s internals or extending its functionality, but most applications can ignore them.

Using the Java Driver

Like most things in Java, the API is a bit verbose (especially compared to the other languages’ APIs). However, all of the concepts are similar to using the shell, and almost all of the method names are identical.

The com.mongodb.Mongo class creates a connection to a MongoDB server. You can access a database from the connection and then get a collection from the database:

import com.mongodb.Mongo;
import com.mongodb.DB;
import com.mongodb.DBCollection;

class ChemSearch {

    public static void main(String[] args) throws Exception {
        Mongo connection = new Mongo();
        DB db = connection.getDB("search");
        DBCollection chemicals = db.getCollection("chemicals");

        /* ... */
    }
}

This will connect to localhost:27017 and get the search.chemicals namespace.
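If the server is running somewhere other than localhost, the Mongo class also has a constructor that takes a host and port; the hostname below is just a placeholder:

Mongo connection = new Mongo("db.example.com", 27017);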

Documents in Java must be instances of com.mongodb.DBObject, an interface that is basically an ordered java.util.Map. While there are a few ways to create a document in Java, the simplest is to use the com.mongodb.BasicDBObject class. Thus, creating the document that could be represented in the shell as {"x" : 1, "y" : "foo"} would look like this:

BasicDBObject doc = new BasicDBObject();
doc.put("x", 1);
doc.put("y", "foo");

If we wanted to add an embedded document, such as "z" : {"hello" : "world"}, we would create another BasicDBObject and then put it in the top-level one:

BasicDBObject z = new BasicDBObject();
z.put("hello", "world");

doc.put("z", z);

Then we would have the document {"x" : 1, "y" : "foo", "z" : {"hello" : "world"}}.

From there, all of the other methods implemented by the Java driver are similar to the shell. For instance, we could say chemicals.insert(doc) or chemicals.find(doc). There is full API documentation for the Java driver at http://api.mongodb.org/java and some articles on specific areas of interest (concurrency, data types, etc.) at the MongoDB Java Language Center.
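For example, a quick round trip with the document we built earlier might look like this sketch:

// save the document to the chemicals collection
chemicals.insert(doc);

// fetch it back: find one document matching {"x" : 1}
DBObject result = chemicals.findOne(new BasicDBObject("x", 1));
System.out.println(result);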

Schema Design

The interesting thing about this problem is that there are thousands of possible properties for each chemical, and we want to be able to search for any of them efficiently. Take two simple examples: silicon and silicon nitride. A document representing silicon might look something like this:

{
    "name" : "silicon",
    "mw" : 32.1173
} 

mw stands for “molecular weight.”

Silicon nitride might have a couple other properties, so its document would look like this:

{
    "name" : "silicon nitride",
    "mw" : 42.0922,
    "ΔfH°gas" : {
        "value" : 372.38,
        "units" : "kJ/mol"
    },
    "S°gas" : {
        "value" : 216.81,
        "units" : "J/mol×K"
    }
}

MongoDB lets us store chemicals with any number or structure of properties, which makes this application nicely extensible, but there’s no efficient way to index it in its current form. To be able to quickly search for any property, we would need to index almost every key! As we learned in Chapter 5, this is a bad idea.

There is a solution. We can take advantage of the fact that MongoDB indexes every element of an array, so we can store all of the properties we want to search for in a single array of embedded documents with common key names. For example, for silicon nitride we can add an array, used just for indexing, that contains each searchable property of the chemical:

{
    "name" : "silicon nitride",
    "mw" : 42.0922,
    "ΔfH°gas" : {
        "value" : 372.38,
        "units" : "kJ/mol"
    },
    "S°gas" : {
        "value" : 216.81,
        "units" : "J/mol×K"
    },
    "index" : [
        {"name" : "mw", "value" : 42.0922},
        {"name" : "ΔfH°gas", "value" : 372.38},
        {"name" : "S°gas", "value" : 216.81}
    ]
}

Silicon, on the other hand, would have a single-element array with just the molecular weight:

{
    "name" : "silicon",
    "mw" : 32.1173,
    "index" : [
        {"name" : "mw", "value" : 32.1173}
    ]
}

Now, all we need to do is create a compound index on the "index.name" and "index.value" keys. Then we’ll be able to do a fairly quick search through the chemical compounds for any attribute.

Writing This in Java

Going back to our original Java code snippet, we’ll create a compound index with the ensureIndex function:

BasicDBObject index = new BasicDBObject();
index.put("index.name", 1);
index.put("index.value", 1);

chemicals.ensureIndex(index);

Creating a document for, say, silicon nitride is not difficult, but it is verbose:

public static DBObject createSiliconNitride() {
    BasicDBObject sn = new BasicDBObject();
    sn.put("name", "silicon nitride");
    sn.put("mw", 42.0922);

    BasicDBObject deltafHgas = new BasicDBObject();
    deltafHgas.put("value", 372.38);
    deltafHgas.put("units", "kJ/mol");

    sn.put("ΔfH°gas", deltafHgas);

    BasicDBObject sgas = new BasicDBObject();
    sgas.put("value", 216.81);
    sgas.put("units", "J/mol×K");

    sn.put("S°gas", sgas);

    List<DBObject> index = new ArrayList<DBObject>();
    index.add(BasicDBObjectBuilder.start()
              .add("name", "mw").add("value", 42.0922).get());
    index.add(BasicDBObjectBuilder.start()
              .add("name", "ΔfH°gas").add("value", 372.38).get());
    index.add(BasicDBObjectBuilder.start()
              .add("name", "S°gas").add("value", 216.81).get());

    sn.put("index", index);

    return sn;
}

Arrays can be represented by anything that implements java.util.List, so we create a java.util.ArrayList of embedded documents for the chemical’s properties. (BasicDBObjectBuilder’s get method returns a DBObject, which is why index is declared as a List<DBObject>.)

Issues

One issue with this structure is that, if we are querying for multiple criteria, search order matters. For example, suppose we are looking for all documents with a molecular weight of less than 1000, a boiling point greater than 0°, and a freezing point of -20°. Naively, we could do this query by concatenating the criteria in an $all conditional:

BasicDBObject criteria = new BasicDBObject();

List<DBObject> all = new ArrayList<DBObject>();

BasicDBObject mw = new BasicDBObject("name", "mw");
mw.put("value", new BasicDBObject("$lt", 1000));

BasicDBObject bp = new BasicDBObject("name", "bp");
bp.put("value", new BasicDBObject("$gt", 0));

BasicDBObject fp = new BasicDBObject("name", "fp");
fp.put("value", -20);

// $all takes a list of clauses; each is wrapped in $elemMatch so that
// it must match a single element of the "index" array
all.add(new BasicDBObject("$elemMatch", mw));
all.add(new BasicDBObject("$elemMatch", bp));
all.add(new BasicDBObject("$elemMatch", fp));
criteria.put("index", new BasicDBObject("$all", all));

chemicals.find(criteria);

The problem with this approach is that MongoDB can use an index only for the first item in an $all conditional. Suppose there are 1 million documents with a "mw" key whose value is less than 1,000. MongoDB can use the index for that part of the query, but then it will have to scan for the boiling and freezing points, which will take a long time.

If we know some of the characteristics of our data, for instance, that there are only 43 chemicals with a freezing point of -20°, we can rearrange the $all to do that query first:

all.put("$elemMatch", fp);
all.put("$elemMatch", mw);
all.put("$elemMatch", bp);
criteria.put("index", new BasicDBObject("$all", all));

Now the database can quickly find those 43 elements and, for the subsequent clauses, has to scan only 43 elements (instead of 1 million). Figuring out a good ordering for arbitrary searches is the real trick, of course. This could be done with pattern recognition and data aggregation algorithms that are beyond the scope of this book.

News Aggregator: PHP

We will be creating a basic news aggregation application: users submit links to interesting sites, and other users can comment and vote on the quality of the links (and other comments). This will involve creating a tree of comments and implementing a voting system.

Installing the PHP Driver

The MongoDB PHP driver is a PHP extension. It is easy to install on almost any system and should work anywhere PHP 5.1 or newer is installed.

Windows install

Look at the output of phpinfo() and determine the version of PHP you are running (PHP 5.2 and 5.3 are supported on Windows; 5.1 is not), including VC version, if shown. If you are using Apache, you should use VC6; otherwise, you’re probably running a VC9 build. Some obscure Zend installs use VC8. Also notice whether it is thread-safe (usually abbreviated “ts”).

While you’re looking at phpinfo(), make a note of the extension_dir value, which is where we’ll need to put the extension.

Now that you know what you’re looking for, go to Github. Download the package that matches your PHP version, VC version, and thread safety. Unzip the package, and move php_mongo.dll to the extension_dir directory.

Finally, add the following line to your php.ini file:

extension=php_mongo.dll

If you are running an application server (Apache, WAMP, and so on), restart it. The next time you start PHP, it will automatically load the Mongo extension.

Mac OS X Install

It is easiest to install the extension through PECL, if you have it available. Try running the following:

$ pecl install mongo

Some Macs do not, however, come with PECL or the correct PHP libraries to install extensions.

If PECL does not work, you can download binary builds for OS X, available at Github (http://www.github.com/mongodb/mongo-php-driver/downloads). Run php -i to see what version of PHP you are running and what the value of extension_dir is, and then download the correct version. (It will have “osx” in the filename.) Unarchive the extension, and move mongo.so to the directory specified by extension_dir.

After the extension is installed via either method, add the following line to your php.ini file:

extension=mongo.so

Restart any application server you might have running, and the Mongo extension will be loaded the next time PHP starts.

Linux and Unix install

Run the following:

$ pecl install mongo

Then add the following line to your php.ini file:

extension=mongo.so

Restart any application server you might have running, and the Mongo extension will be loaded the next time PHP is started.

Using the PHP Driver

The Mongo class is a connection to the database. By default, the constructor attempts to connect to a database server running locally on the default port.

You can use the __get function to get a database from the connection and a collection from the database (even a subcollection from a collection). For example, this connects to MongoDB and gets the bar collection in the foo database:

<?php

$connection = new Mongo();

$collection = $connection->foo->bar;

?>

You can continue chaining getters to access subcollections. For example, to get the bar.baz collection, you can say the following:

$collection = $connection->foo->bar->baz;

Documents are represented by associative arrays in PHP. Thus, something like {"foo" : "bar"} in JavaScript could be represented as array("foo" => "bar") in PHP. Arrays are also represented as arrays in PHP, which sometimes leads to confusion: ["foo", "bar", "baz"] in JavaScript is equivalent to array("foo", "bar", "baz").
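As a quick illustration (the field names here are invented), a document containing both an embedded document and an array would be built like this:

<?php

$doc = array(
    "title" => "hello",                   // a string
    "tags" => array("foo", "bar", "baz"), // an array
    "meta" => array("author" => "joe")    // an embedded document
);

$collection->insert($doc);

?>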

The PHP driver uses PHP’s native types for null, booleans, numbers, strings, and arrays. For all other types, there is a Mongo-prefixed type: MongoCollection is a collection, MongoDB is a database, and MongoRegex is a regular expression. There is extensive documentation in the PHP manual for all of these classes.
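For example, here is a sketch of a query using two of these types, MongoRegex and MongoDate (again with invented field names):

<?php

$cursor = $collection->find(array(
    "name" => new MongoRegex("/^si/i"),          // case-insensitive regex
    "created" => array('$lt' => new MongoDate()) // created before "now"
));

?>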

Designing the News Aggregator

We’ll be creating a simple news aggregator, where users can submit links to interesting stories and other users can vote and comment on them. We will just be covering two aspects of it: creating a tree of comments and handling votes.

To store the submissions and comments, we need only a single collection, posts. The initial posts linking to some article will look something like the following (the empty "ancestors" array is explained in the next section):

{
    "_id" : ObjectId(),
    "title" : "A Witty Title",
    "url" : "http://www.example.com",
    "date" : new Date(),
    "votes" : 0,
    "ancestors" : [],
    "author" : {
        "name" : "joe",
        "_id" : ObjectId()
    }
}

The comments will be almost identical, but they need a "content" key instead of a "url" key.

Trees of Comments

There are several different ways to represent a tree in MongoDB; the choice of which representation to use depends on the types of query being performed.

We’ll be storing the tree using an array of ancestors: each node will contain an array of its parent, grandparent, and so on. So, if we had the following comment structure:

original link
|- comment 1
|   |- comment 3 (reply to comment 1)
|   |- comment 4 (reply to comment 1)
|       |- comment 5 (reply to comment 4)
|- comment 2
|   |- comment 6 (reply to comment 2)

then comment 5’s array of ancestors would contain the original link’s _id, comment 1’s _id, and comment 4’s _id. Comment 2’s array of ancestors would contain just the original link’s _id. This allows us to easily search for “all comments for link X” or “the subtree of comment 2’s replies.”
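For example, comment 5’s document might look something like this sketch (ObjectId values elided, other fields omitted):

{
    "_id" : ObjectId("..."),
    "content" : "reply to comment 4",
    "ancestors" : [
        ObjectId("..."), // the original link
        ObjectId("..."), // comment 1
        ObjectId("...")  // comment 4
    ]
}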

This method of storing comments assumes that we are going to have a lot of them and that we might be interested in seeing just parts of a comment thread. If we knew that we always wanted to display all of the comments and there weren’t going to be thousands, we could store the entire tree of comments as an embedded document in the submitted link’s document.

Using the array of ancestors approach, when someone wants to create a new comment, we need to add a new document to the collection. To create this document, we build a leaf document containing the parent’s "_id" value and the parent’s array of ancestors, with the parent’s "_id" appended to that array.

function createLeaf($parent, $replyInfo) {
    $child = array(
        "_id" => new MongoId(),
        "content" => $replyInfo['content'],
        "date" => new MongoDate(),
        "votes" => 0,
        "author" => array(
            "name" => $replyInfo['name'],
            "name" => $replyInfo['name'],
        ),
        "ancestors" => $parent['ancestors'],
        "parent" => $parent['_id']
    );

    // add the parent's _id to the ancestors array
    $child['ancestors'][] = $parent['_id'];

    return $child;
}

Then we can add the new comment to the posts collection:

$comment = createLeaf($parent, $replyInfo);

$posts = $connection->news->posts;
$posts->insert($comment);

We can get a list of the latest submissions (sans comments) with the following:

$cursor = $posts->find(array("ancestors" => array('$size' => 0)));
$cursor = $cursor->sort(array("date" => -1));

If someone wants to see the comments for a given post, we can find them all with the following:

$cursor = $posts->find(array("ancestors" => $postId));

In fact, we can use this query to access any subtree of comments. If the root of the subtree is passed in as $postId, every child will contain $postId in its ancestors array and be returned.
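For instance, to fetch the subtree of comment 2’s replies, newest first (assuming $comment2Id holds comment 2’s "_id"):

$cursor = $posts->find(array("ancestors" => $comment2Id));
$cursor = $cursor->sort(array("date" => -1));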

To make these queries fast, we should index the "date" and "ancestors" keys:

$posts->ensureIndex(array("date" => -1, "ancestors" => 1));

Now we can quickly query for the main page, a tree of comments, or a subtree of comments.

Voting

There are many ways of implementing voting, depending on the functionality and information you want: do you allow up and down votes? Will you prevent users from voting more than once? Will you allow them to switch their vote? Do you care when people voted, to see if a link is trending? Each of these requires a different solution with far more complex coding than the simplest way of doing it: using "$inc":

$posts->update(array("_id" => $postId), array('$inc' => array("votes", 1)));

For a controversial or popular link, we wouldn’t want people to be able to vote hundreds of times, so we want to limit users to one vote each. A simple way to do this is to add a "voters" array to keep track of who has voted on this post, keeping an array of user "_id" values. When someone tries to vote, we do an update that checks the user "_id" against the array of "_id" values:

$posts->update(array("_id" => $postId, "voters" => array('$ne' => $userId)),
    array('$inc' => array("votes", 1), '$push' => array("voters" => $userId)));

This will work for up to a couple million users. For larger voting pools, we would hit the 4MB limit, and we would have to special-case the most popular links by putting spillover votes into a new document.

Custom Submission Forms: Ruby

MongoDB is a popular choice for Ruby developers, likely because the document-oriented approach meshes well with Ruby’s dynamism and flexibility. In this example we’ll use the MongoDB Ruby driver to build a framework for custom form submissions, inspired by a New York Times blog post about how it uses MongoDB to handle submission forms (http://open.blogs.nytimes.com/2010/05/25/building-a-better-submission-form/). For even more documentation on using MongoDB from Ruby, check out the Ruby Language Center.

Installing the Ruby Driver

The Ruby driver is available as a RubyGem, hosted at http://rubygems.org. Installation using the gem is the easiest way to get up and running. Make sure you’re using an up-to-date version of RubyGems (with gem update --system) and then install the mongo gem:

$ gem install mongo
Successfully installed bson-1.0.2
Successfully installed mongo-1.0.2
2 gems installed
Installing ri documentation for bson-1.0.2...
Building YARD (yri) index for bson-1.0.2...
Installing ri documentation for mongo-1.0.2...
Building YARD (yri) index for mongo-1.0.2...
Installing RDoc documentation for bson-1.0.2...
Installing RDoc documentation for mongo-1.0.2...

Installing the mongo gem will also install the bson gem on which it depends. The bson gem handles all of the BSON encoding and decoding for the driver (for more on BSON, see Appendix C). The bson gem will also make use of C extensions available in the bson_ext gem to improve performance, if that gem has been installed. For maximum performance, be sure to install bson_ext:

$ gem install bson_ext
Building native extensions.  This could take a while...
Successfully installed bson_ext-1.0.1
1 gem installed

If bson_ext is on the load path, it will be used automatically.

Using the Ruby Driver

To connect to an instance of MongoDB, use the Mongo::Connection class. Once we have an instance of Mongo::Connection, we can get an individual database (here we use the stuffy database) using bracket notation:

> require 'rubygems'
 => true
> require 'mongo'
 => true
> db = Mongo::Connection.new["stuffy"]

The Ruby driver uses hashes to represent documents. Aside from that, the API is similar to that of the shell, with most method names being the same (although the Ruby driver uses underscore_naming, whereas the shell often uses camelCase). To insert the document {"x" : 1} into the bar collection and query for the result, we would do the following:

> db["bar"].insert :x => 1
 => BSON::ObjectID('4c168343e6fb1b106f000001')
> db["bar"].find_one
 => {"_id"=>BSON::ObjectID('4c168343e6fb1b106f000001'), "x"=>1}

There are some important gotchas about documents in Ruby that you need to be aware of:

  • Hashes are ordered in Ruby 1.9, which matches how documents work in MongoDB. In Ruby 1.8, however, hashes are unordered. The driver provides a special type, BSON::OrderedHash, which must be used instead of a regular hash whenever key order is important (see the example following this list).

  • Hashes being saved to MongoDB can have symbols as either keys or values. Hashes returned from MongoDB will have symbol values wherever they were present in the input, but any symbol keys will be returned as strings. So, {:x => :y} will become {"x" => :y}. This is a side effect of the way documents are represented in BSON (see Appendix C for more on BSON).
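To illustrate the first point, here’s a sketch of building a document with BSON::OrderedHash so that its keys are serialized in a guaranteed order:

require 'rubygems'
require 'mongo'

db = Mongo::Connection.new["stuffy"]

# keys are stored (and sent to the server) in insertion order
doc = BSON::OrderedHash.new
doc["name"] = "joe"
doc["x"] = 1

db["bar"].insert doc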

Custom Form Submission

The problem at hand is to generate custom forms for user-submitted data and to handle user submissions for those forms. Forms are created by editors and can contain arbitrary fields, each with different types and rules for validation. Here we’ll leverage the ability to embed documents and store each field as a separate document within a form. A form document for a comment submission form might look like this:

comment_form = {
    :_id => "comments",
    :fields => [
        {
            :name => "name",
            :label => "Your Name",
            :help_text => "Required",
            :required => true,
            :type => "string",
            :max_length => 200
        },
        {
            :name => "email",
            :label => "Your E-mail Address",
            :help_text => "Required, but will not be displayed",
            :required => true,
            :type => "email"
        },
        {
            :name => "comment",
            :label => "Your Comment",
            :help_text => "Comments will be moderated",
            :required => true,
            :type => "string",
            :word_limit => 200
        }
    ]
}

This form shows some of the benefits of working with a document-oriented database like MongoDB. First, we’re able to embed the form’s fields directly within the form document. We don’t need to store them separately and do a join; we can get the entire representation for a form by querying for a single document. We’re also able to specify different keys for different types of fields. In the previous example, the name field has a :max_length key and the comment field has a :word_limit key, while the email field has neither.

In this example we use "_id" to store a human-readable name for our form. This works well because we need to index on the form name anyway to make queries efficient. Because the "_id" index is a unique index, we’re also guaranteed that form names will be unique across the system.

When an editor adds a new form, we simply save the resultant document. To save the comment_form document that we created, we’d do the following:

db["forms"].save comment_form

Each time we want to render a page with the comment form, we can query for the form document by its name:

db["forms"].find_one :_id => "comments"

The single document returned contains all the information we need in order to render the form, including the name, label, and type for each input field that needs to be rendered. When a form needs to be changed, editors can easily add a field or specify additional constraints for an existing field.
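As a rough sketch of that rendering step (a real application would use a template engine and dispatch on each field’s :type), we could iterate over the fields like this:

form = db["forms"].find_one :_id => "comments"

form["fields"].each do |field|
  puts "<label>#{field['label']}</label>"
  puts "<input name='#{field['name']}' title='#{field['help_text']}'/>"
end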

When we get a user submission for a form, we can run the same query as earlier to get the relevant form document. We’ll need this in order to validate that the user’s submission includes values for all required fields and meets any other requirements specified in our form. After validation, we can save the submission as a separate document in a submissions collection. A submission for our comment form might look like this:

comment_submission = {
    :form_id => "comments",
    :name => "Mike D.",
    :email => "[email protected]",
    :comment => "MongoDB is flexible!"
}

We’re again leveraging the document model by including custom keys for each submission (here we use :name, :email, and :comment). The only key that we require in each submission is :form_id. This allows us to efficiently retrieve all submissions for a certain form:

db["submissions"].find :form_id => "comments"

To perform this query, we should have an index on :form_id:

db["submissions"].create_index :form_id

We can also use :form_id to retrieve the form document for a given submission.
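A minimal sketch of that validation step might look like the following; the helper name and the exact rules enforced are our own:

def validate_submission(db, submission)
  form = db["forms"].find_one :_id => submission[:form_id]
  errors = []

  form["fields"].each do |field|
    value = submission[field["name"].to_sym]

    if field["required"] && (value.nil? || value.to_s.empty?)
      errors << "#{field['label']} is required"
    elsif field["max_length"] && value.to_s.length > field["max_length"]
      errors << "#{field['label']} is too long"
    end
  end

  errors
end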

Ruby Object Mappers and Using MongoDB with Rails

There are several libraries written on top of the basic Ruby driver to provide things like models, validations, and associations for MongoDB documents. The most popular of these tools seem to be MongoMapper and Mongoid. If you’re used to working with tools like ActiveRecord or DataMapper, you might consider using one of these object mappers in addition to the basic Ruby driver.

MongoDB also works nicely with Ruby on Rails, especially when working with one of the previously mentioned mappers. There are up-to-date instructions on integrating MongoDB with Rails on the MongoDB site.

Real-Time Analytics: Python

The Python driver for MongoDB is called PyMongo. In this section, we’ll use PyMongo to implement some real-time tracking of metrics for a web application. The most up-to-date documentation on PyMongo is available at http://api.mongodb.org/python.

Installing PyMongo

PyMongo is available in the Python Package Index and can be installed using easy_install (http://pypi.python.org/pypi/setuptools):

$ easy_install pymongo
Searching for pymongo
Reading http://pypi.python.org/simple/pymongo/
Reading http://github.com/mongodb/mongo-python-driver
Best match: pymongo 1.6
Downloading ...
Processing pymongo-1.6-py2.6-macosx-10.6-x86_64.egg
Moving ...
Adding pymongo 1.6 to easy-install.pth file

Installed ...
Processing dependencies for pymongo
Finished processing dependencies for pymongo

This will install PyMongo and will attempt to install an optional C extension as well. If the C extension fails to build or install, everything will continue to work, but performance will suffer. An error message will be printed during install in that case.

As an alternative to easy_install, PyMongo can also be installed by running python setup.py install from a source checkout.

Using PyMongo

We use the pymongo.connection.Connection class to connect to a MongoDB server. Here we create a new Connection and use attribute-style access to get the analytics database:

from pymongo import Connection
db = Connection().analytics

The rest of the PyMongo API is similar to that of the MongoDB shell, although (like the Ruby driver) PyMongo uses underscore_naming instead of camelCase. Documents are represented using dictionaries in PyMongo, so to insert and retrieve the document {"a" : [1, 2, 3]}, we do the following:

db.test.insert({"a": [1, 2, 3]})
db.test.find_one()

Dictionaries in Python are unordered, so PyMongo provides an ordered subclass of dict, pymongo.son.SON. In most places where ordering is required, PyMongo provides APIs that hide it from the user. If not, applications can use SON instances instead of dictionaries to ensure their documents maintain key order.
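For example, here is a sketch of inserting a document whose key order is guaranteed:

from pymongo.son import SON

# keys are kept in insertion order, unlike a plain dict
doc = SON([("x", 1), ("y", "foo")])
db.test.insert(doc)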

MongoDB for Real-Time Analytics

MongoDB is a great tool for tracking metrics in real time for a couple of reasons:

  • Upsert operations (see Chapter 3) allow us to send a single message to either create a new tracking document or increment the counters on an existing document.

  • The upsert we send does not wait for a response; it’s fire-and-forget. This allows our application code to avoid blocking on each analytics update. We don’t need to wait and see whether the operation is successful, because an error in analytics code wouldn’t get reported to a user anyway.

  • We can use an $inc update to increment a counter without having to do a separate query and update operation. We also eliminate any contention issues if multiple updates are happening simultaneously.

  • MongoDB’s update performance is very good, so doing one or more updates per request for analytics is reasonable.

Schema

In our example we will be tracking page views for our site, with hourly roll-ups. We’ll track both total page views as well as page views for each individual URL. The goal is to end up with a collection, hourly, containing documents like this:

{ "hour" : "Tue Jun 15 2010 9:00:00 GMT-0400 (EDT)", "url" : "/foo", "views" : 5 }
{ "hour" : "Tue Jun 15 2010 9:00:00 GMT-0400 (EDT)", "url" : "/bar", "views" : 5 }
{ "hour" : "Tue Jun 15 2010 10:00:00 GMT-0400 (EDT)", "url" : "/", "views" : 12 }
{ "hour" : "Tue Jun 15 2010 10:00:00 GMT-0400 (EDT)", "url" : "/bar", "views" : 3 }
{ "hour" : "Tue Jun 15 2010 10:00:00 GMT-0400 (EDT)", "url" : "/foo", "views" : 10 }
{ "hour" : "Tue Jun 15 2010 11:00:00 GMT-0400 (EDT)", "url" : "/foo", "views" : 21 }
{ "hour" : "Tue Jun 15 2010 11:00:00 GMT-0400 (EDT)", "url" : "/", "views" : 3 }
...

Each document represents all of the page views for a single URL in a given hour. If a URL gets no page views in an hour, there is no document for it. To track total page views for the entire site, we’ll use a separate collection, hourly_totals, which has the following documents:

{ "hour" : "Tue Jun 15 2010 9:00:00 GMT-0400 (EDT)", "views" : 10 }
{ "hour" : "Tue Jun 15 2010 10:00:00 GMT-0400 (EDT)", "views" : 25 }
{ "hour" : "Tue Jun 15 2010 11:00:00 GMT-0400 (EDT)", "views" : 24 }
...

The difference here is just that we don’t need a "url" key, because we’re doing site-wide tracking. If our entire site doesn’t get any page views during an hour, there will be no document for that hour.

Handling a Request

Each time our application receives a request, we need to update our analytics collections appropriately. We need to add a page view both to the hourly collection for the specific URL requested and to the hourly_totals collection. Let’s define a function that takes a URL and updates our analytics appropriately:

from datetime import datetime

def track(url):
    hour = datetime.utcnow().replace(minute=0, second=0, microsecond=0)
    db.hourly.update({"hour": hour, "url": url},
                     {"$inc": {"views": 1}}, upsert=True)
    db.hourly_totals.update({"hour": hour},
                            {"$inc": {"views": 1}}, upsert=True)

We’ll also want to make sure that we have indexes in place to be able to perform these updates efficiently:

from pymongo import ASCENDING

db.hourly.create_index([("url", ASCENDING), ("hour", ASCENDING)], unique=True)
db.hourly_totals.create_index("hour", unique=True)

For the hourly collection, we create a compound index on "url" and "hour", while for hourly_totals we just index on "hour". Both of the indexes are created as unique, because we want only one document for each of our roll-ups.

Now, each time we get a request, we just call track a single time with the requested URL. It will perform two upserts; each will create a new roll-up document if necessary or increment the "views" for an existing roll-up.
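For example, a handler serving the page at /foo just needs to call:

track("/foo")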

Using Analytics Data

Now that we’re tracking page views, we need a way to query that data and put it to use. Here we print the hourly page view totals for the last 10 hours:

from pymongo import DESCENDING

for rollup in db.hourly_totals.find().sort("hour", DESCENDING).limit(10):
    pretty_date = rollup["hour"].strftime("%Y/%m/%d %H")
    print "%s - %d" % (pretty_date, rollup["views"])

This query will be able to leverage the index we’ve already created on "hour". We can perform a similar operation for an individual url:

for rollup in db.hourly.find({"url": url}).sort("hour", DESCENDING).limit(10):
    pretty_date = rollup["hour"].strftime("%Y/%m/%d %H")
    print "%s - %d" % (pretty_date, rollup["views"])

The only difference is that here we add a query document for selecting an individual "url". Again, this will leverage the compound index we’ve already created on "url" and "hour".

Other Considerations

One thing we might want to consider is running a periodic cleaning task to remove old analytics documents. If we’re displaying only the last 10 hours of data, then we can conserve space by not keeping around a month’s worth of documents. To remove all documents older than 24 hours, we can do the following, which could be run using cron or a similar mechanism:

from datetime import datetime, timedelta

remove_before = datetime.utcnow() - timedelta(hours=24)

db.hourly.remove({"hour": {"$lt": remove_before}})
db.hourly_totals.remove({"hour": {"$lt": remove_before}})

In this example, the first remove will actually need to do a table scan because we haven’t defined an index on "hour". If we need to perform this operation efficiently (or any other operation querying by "hour" for all URLs), we should consider adding a second index on "hour" to the hourly collection.
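If we decide we need it, adding that second index is a one-liner:

db.hourly.create_index("hour")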

Another important note about this example is that it would be easy to add tracking for other metrics besides page views or to do roll-ups on a window other than hourly (or even to do roll-ups on multiple windows at once). All we need to do is to tweak the track function to perform upserts tracking whatever metric we’re interested in, at whatever granularity we want.
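As a sketch of that flexibility, here is a variant of track that can increment an arbitrary counter at the same hourly granularity; the track_metric name and the metric parameter are our own additions:

def track_metric(url, metric):
    hour = datetime.utcnow().replace(minute=0, second=0, microsecond=0)
    # same two upserts as track, but incrementing the requested counter
    db.hourly.update({"hour": hour, "url": url},
                     {"$inc": {metric: 1}}, upsert=True)
    db.hourly_totals.update({"hour": hour},
                            {"$inc": {metric: 1}}, upsert=True)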
