Throughout this text, almost all of the examples have been in JavaScript. This chapter explores using MongoDB with languages that are more likely to be used in a real application.
The Java driver is the oldest MongoDB driver. It has been used in production for years, is stable, and is a popular choice for enterprise developers.
We’ll be using the Java driver to build a search engine for chemical compounds, heavily inspired by http://www.chemeo.com. This search engine has the chemical and physical properties of thousands of compounds on file, and its goal is to make this information fully searchable.
The Java driver comes as a JAR file that can be downloaded from GitHub. To install it, add the JAR to your classpath.
All of the Java classes you will probably need to use in a normal
application are in the com.mongodb
and
com.mongodb.gridfs
packages. There are a number of
other packages included in the JAR that are useful if you are planning
on manipulating the driver’s internals or expanding its functionality,
but most applications can ignore them.
Like most things in Java, the API is a bit verbose (especially compared to the other languages’ APIs). However, all of the concepts are similar to using the shell, and almost all of the method names are identical.
The com.mongodb.Mongo
class creates a
connection to a MongoDB server. You can access a database from the
connection and then get a collection from the database:
import com.mongodb.Mongo;
import com.mongodb.DB;
import com.mongodb.DBCollection;

class ChemSearch {
    // creating a connection can throw a checked exception, so main declares it
    public static void main(String[] args) throws Exception {
        Mongo connection = new Mongo();
        DB db = connection.getDB("search");
        DBCollection chemicals = db.getCollection("chemicals");
        /* ... */
    }
}
This will connect to localhost:27017
and get
the search.chemicals
namespace.
Documents in Java must be instances of
com.mongodb.DBObject
, an interface that is basically
an ordered java.util.Map. While there are a few ways
to create a document in Java, the simplest one is to use the
com.mongodb.BasicDBObject
class. Thus, creating the
document that could be represented by the shell as {"x" : 1,
"y" : "foo"}
would look like this:
BasicDBObject doc = new BasicDBObject();
doc.put("x", 1);
doc.put("y", "foo");
If we wanted to add an embedded document, such as "z" :
{"hello" : "world"}
, we would create another BasicDBObject and
then put
it in the top-level one:
BasicDBObject z = new BasicDBObject();
z.put("hello", "world");
doc.put("z", z);
Then we would have the document {"x" : 1, "y" : "foo",
"z" : {"hello" : "world"}}
.
From there, all of the other methods implemented by the Java
driver are similar to the shell. For instance, we could say
chemicals.insert(doc)
or chemicals.find(doc)
. There is full API
documentation for the Java driver at http://api.mongodb.org/java and some articles on specific
areas of interest (concurrency, data types, etc.) at the MongoDB
Java Language Center.
The interesting thing about this problem is that there are thousands of possible properties for each chemical, and we want to be able to search for any of them efficiently. Take two simple examples: silicon and silicon nitride. A document representing silicon might look something like this:
{ "name" : "silicon", "mw" : 32.1173 }
mw
stands for “molecular weight.”
Silicon nitride might have a couple other properties, so its document would look like this:
{
    "name" : "silicon nitride",
    "mw" : 42.0922,
    "ΔfH°gas" : {
        "value" : 372.38,
        "units" : "kJ/mol"
    },
    "S°gas" : {
        "value" : 216.81,
        "units" : "J/mol×K"
    }
}
MongoDB lets us store chemicals with any number or structure of properties, which makes this application nicely extensible, but there’s no efficient way to index it in its current form. To be able to quickly search for any property, we would need to index almost every key! As we learned in Chapter 5, this is a bad idea.
There is a solution. We can take advantage of the fact that MongoDB indexes every element of an array, so we can store all of the properties we want to search for in an array with common key names. For example, with silicon nitride we can add an array just for indexing containing each property of the given chemical:
{
    "name" : "silicon nitride",
    "mw" : 42.0922,
    "ΔfH°gas" : {
        "value" : 372.38,
        "units" : "kJ/mol"
    },
    "S°gas" : {
        "value" : 216.81,
        "units" : "J/mol×K"
    },
    "index" : [
        {"name" : "mw", "value" : 42.0922},
        {"name" : "ΔfH°gas", "value" : 372.38},
        {"name" : "S°gas", "value" : 216.81}
    ]
}
Silicon, on the other hand, would have a single-element array with just the molecular weight:
{ "name" : "silicon", "mw" : 32.1173, "index" : [ {"name" : "mw", "value" : 32.1173} ] }
Now, all we need to do is create a compound index on the
"index.name"
and "index.value"
keys. Then we’ll be able to do a fairly quick search through the
chemical compounds for any attribute.
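As a rough illustration of how such an "index" array might be derived from a chemical's other keys, here is a minimal Python sketch. The helper name build_index and its rule for picking values (plain numbers are used directly; embedded documents contribute their "value" key) are assumptions made for this example and are not part of any driver.

```python
def build_index(chemical):
    """Return a list of {"name", "value"} pairs for every searchable property.

    Hypothetical helper: skips "name" and "index", and treats an embedded
    document as a property whose searchable value lives under "value".
    """
    entries = []
    for key, prop in chemical.items():
        if key in ("name", "index"):
            continue
        if isinstance(prop, dict):
            if "value" not in prop:
                continue  # nothing searchable in this embedded document
            value = prop["value"]
        else:
            value = prop
        entries.append({"name": key, "value": value})
    return entries


silicon_nitride = {
    "name": "silicon nitride",
    "mw": 42.0922,
    "ΔfH°gas": {"value": 372.38, "units": "kJ/mol"},
    "S°gas": {"value": 216.81, "units": "J/mol×K"},
}
silicon_nitride["index"] = build_index(silicon_nitride)
```

Generating the array in one place like this would keep the indexed copy of each property in step with the original key, at the cost of storing every searchable value twice.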
Going back to our original Java code snippet, we’ll create a
compound index with the ensureIndex
function:
BasicDBObject index = new BasicDBObject();
index.put("index.name", 1);
index.put("index.value", 1);
chemicals.ensureIndex(index);
Creating a document for, say, silicon nitride is not difficult, but it is verbose:
public static DBObject createSiliconNitride() {
    BasicDBObject sn = new BasicDBObject();
    sn.put("name", "silicon nitride");
    sn.put("mw", 42.0922);

    BasicDBObject deltafHgas = new BasicDBObject();
    deltafHgas.put("value", 372.38);
    deltafHgas.put("units", "kJ/mol");
    sn.put("ΔfH°gas", deltafHgas);

    BasicDBObject sgas = new BasicDBObject();
    sgas.put("value", 216.81);
    sgas.put("units", "J/mol×K");
    sn.put("S°gas", sgas);

    // BasicDBObjectBuilder.get() returns a DBObject, so the list holds DBObjects
    ArrayList<DBObject> index = new ArrayList<DBObject>();
    index.add(BasicDBObjectBuilder.start()
        .add("name", "mw").add("value", 42.0922).get());
    index.add(BasicDBObjectBuilder.start()
        .add("name", "ΔfH°gas").add("value", 372.38).get());
    index.add(BasicDBObjectBuilder.start()
        .add("name", "S°gas").add("value", 216.81).get());
    sn.put("index", index);

    return sn;
}
Arrays can be represented by anything that implements
java.util.List
, so we create a
java.util.ArrayList
of embedded documents for the
chemical’s properties.
One issue with this structure is that, if we are querying for
multiple criteria, search order matters. For example, suppose we are
looking for all documents with a molecular weight of less than 1000, a
boiling point greater than 0°, and a freezing point of -20°. Naively, we
could do this query by concatenating the criteria in an
$all
conditional:
BasicDBObject mw = new BasicDBObject("name", "mw");
mw.put("value", new BasicDBObject("$lt", 1000));

BasicDBObject bp = new BasicDBObject("name", "bp");
bp.put("value", new BasicDBObject("$gt", 0));

BasicDBObject fp = new BasicDBObject("name", "fp");
fp.put("value", -20);

// $all takes an array, so the $elemMatch clauses go in a list;
// putting them in one BasicDBObject would overwrite the same key
ArrayList<DBObject> all = new ArrayList<DBObject>();
all.add(new BasicDBObject("$elemMatch", mw));
all.add(new BasicDBObject("$elemMatch", bp));
all.add(new BasicDBObject("$elemMatch", fp));

BasicDBObject criteria = new BasicDBObject();
criteria.put("index", new BasicDBObject("$all", all));
chemicals.find(criteria);
The problem with this approach is that MongoDB can use an index
only for the first item in an $all conditional.
Suppose there are 1 million documents with a "mw" key
whose value is less than 1,000. MongoDB can use the index for that part
of the query, but then it will have to scan for the boiling and freezing
points, which will take a long time.
If we know some of the characteristics of our data, for instance,
that there are only 43 chemicals with a freezing point of -20°, we can
rearrange the $all to do that query first:
ArrayList<DBObject> all = new ArrayList<DBObject>();
all.add(new BasicDBObject("$elemMatch", fp));
all.add(new BasicDBObject("$elemMatch", mw));
all.add(new BasicDBObject("$elemMatch", bp));
criteria.put("index", new BasicDBObject("$all", all));
Now the database can quickly find those 43 documents and, for the subsequent clauses, has to scan only 43 documents (instead of 1 million). Figuring out a good ordering for arbitrary searches is the real trick, of course. This could be done with pattern recognition and data aggregation algorithms that are beyond the scope of this book.
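Short of those algorithms, one simple heuristic is to sort the clauses by a rough estimate of how many documents each one matches. The Python sketch below builds the query document that way. The function name build_all_query and the idea of keeping per-property estimates are assumptions for illustration; the counts for "mw" and "fp" come from the discussion above, while the figure for "bp" is invented.

```python
def build_all_query(clauses, estimated_matches):
    """Build an $all query with the most selective clause first.

    clauses: dict mapping property name -> condition on its value
    estimated_matches: dict mapping property name -> rough document count
    """
    # Fewest expected matches first, so the indexed clause narrows the most.
    ordered = sorted(clauses, key=lambda name: estimated_matches[name])
    return {
        "index": {
            "$all": [
                {"$elemMatch": {"name": name, "value": clauses[name]}}
                for name in ordered
            ]
        }
    }


clauses = {"mw": {"$lt": 1000}, "bp": {"$gt": 0}, "fp": -20}
# "mw" and "fp" counts are taken from the text; the "bp" count is made up.
estimates = {"mw": 1000000, "bp": 500000, "fp": 43}
query = build_all_query(clauses, estimates)
```

Here the freezing-point clause ends up first in the $all array. How to obtain good estimates in the first place is left open, as it is in the text.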
We will be creating a basic news aggregation application: users submit links to interesting sites, and other users can comment and vote on the quality of the links (and other comments). This will involve creating a tree of comments and implementing a voting system.
The MongoDB PHP driver is a PHP extension. It is easy to install and should work on almost any system with PHP 5.1 or newer.
Look at the output of phpinfo()
and
determine the version of PHP you are running (PHP 5.2 and 5.3 are
supported on Windows; 5.1 is not), including VC version, if shown. If
you are using Apache, you should use VC6; otherwise, you’re probably
running a VC9 build. Some obscure Zend installs use VC8. Also notice
whether it is thread-safe (usually abbreviated “ts”).
While you’re looking at phpinfo()
, make a
note of the extension_dir
value, which is where
we’ll need to put the extension.
Now that you know what you’re looking for, go to GitHub.
Download the package that matches your PHP version, VC version, and
thread safety. Unzip the package, and move
php_mongo.dll to the
extension_dir
directory.
Finally, add the following line to your php.ini file:
extension=php_mongo.dll
If you are running an application server (Apache, WAMPP, and so on), restart it. The next time you start PHP, it will automatically load the Mongo extension.
It is easiest to install the extension through PECL, if you have it available. Try running the following:
$ pecl install mongo
Some Macs do not, however, come with PECL or the correct PHP libraries to install extensions.
If PECL does not work, you can download binary builds for OS X,
available at GitHub (http://www.github.com/mongodb/mongo-php-driver/downloads).
Run php -i
to see what version of PHP you are
running and what the value of extension_dir
is, and
then download the correct version. (It will have “osx” in the
filename.) Unarchive the extension, and move
mongo.so to the directory specified by
extension_dir
.
After the extension is installed via either method, add the following line to your php.ini file:
extension=mongo.so
Restart any application server you might have running, and the Mongo extension will be loaded the next time PHP starts.
The Mongo
class is a connection to the
database. By default, the constructor attempts to connect to a database
server running locally on the default port.
You can use the __get
function to get a
database from the connection and a collection from the database (even a
subcollection from a collection). For example, this connects to MongoDB
and gets the bar
collection in the
foo
database:
<?php

$connection = new Mongo();
$collection = $connection->foo->bar;

?>
You can continue chaining getters to access subcollections. For
example, to get the bar.baz
collection, you can say
the following:
$collection = $connection->foo->bar->baz;
Documents are represented by associative arrays in PHP. Thus,
something like {"foo" : "bar"}
in JavaScript could be
represented as array("foo" => "bar")
in PHP.
Arrays are also represented as arrays in PHP, which sometimes leads to
confusion: ["foo", "bar", "baz"]
in JavaScript is
equivalent to array("foo", "bar", "baz")
.
The PHP driver uses PHP’s native types for null, booleans,
numbers, strings, and arrays. For all other types, there is a
Mongo-prefixed type: MongoCollection
is a
collection, MongoDB
is a database, and
MongoRegex
is a regular expression. There is
extensive documentation in the
PHP
manual for all of these classes.
We’ll be creating a simple news aggregator, where users can submit links to interesting stories and other users can vote and comment on them. We will just be covering two aspects of it: creating a tree of comments and handling votes.
To store the submissions and comments, we need only a single collection, posts. The initial posts linking to some article will look something like the following:
{
    "_id" : ObjectId(),
    "title" : "A Witty Title",
    "url" : "http://www.example.com",
    "date" : new Date(),
    "votes" : 0,
    "author" : {
        "name" : "joe",
        "_id" : ObjectId()
    }
}
The comments will be almost identical, but they need a
"content"
key instead of a "url"
key.
There are several different ways to represent a tree in MongoDB; the choice of which representation to use depends on the types of query being performed.
We’ll store the tree as an array of ancestors: each node will contain an array holding the _id of its parent, grandparent, and so on. So, if we had the following comment structure:
original link
|- comment 1
|  |- comment 3 (reply to comment 1)
|  |- comment 4 (reply to comment 1)
|     |- comment 5 (reply to comment 4)
|- comment 2
|  |- comment 6 (reply to comment 2)
then comment 5’s array of ancestors would contain the original
link’s _id
, comment 1’s _id
, and
comment 4’s _id
. Comment 6’s ancestors would be the
original link’s _id
and comment 2’s
_id
. This allows us to easily search for “all
comments for link X”
or “the subtree of
comment 2’s replies.”
This method of storing comments assumes that we are going to have a lot of them and that we might be interested in seeing just parts of a comment thread. If we knew that we always wanted to display all of the comments and there weren’t going to be thousands, we could store the entire tree of comments as an embedded document in the submitted link’s document.
Using the array of ancestors approach, when someone wants to
create a new comment, we need to add a new document to the collection.
To create this document, we create a leaf document by linking it to the
parent’s "_id"
value and its array of
ancestors.
function createLeaf($parent, $replyInfo) {
    $child = array(
        "_id" => new MongoId(),
        "content" => $replyInfo['content'],
        "date" => new MongoDate(),
        "votes" => 0,
        "author" => array(
            "name" => $replyInfo['name'],
        ),
        "ancestors" => $parent['ancestors'],
        "parent" => $parent['_id']
    );

    // add the parent's _id to the ancestors array
    $child['ancestors'][] = $parent['_id'];

    return $child;
}
Then we can add the new comment to the posts collection:
$comment = createLeaf($parent, $replyInfo);

$posts = $connection->news->posts;
$posts->insert($comment);
We can get a list of the latest submissions (sans comments) with the following:
$cursor = $posts->find(array("ancestors" => array('$size' => 0)));
$cursor = $cursor->sort(array("date" => -1));
If someone wants to see the comments for a given post, we can find them all with the following:
$cursor = $posts->find(array("ancestors" => $postId));
In fact, we can use this query to access any subtree of comments.
If the root of the subtree is passed in as $postId
,
every child will contain $postId
in its ancestors
array and be returned.
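To see why a single equality match on the ancestors array returns a whole subtree, it can help to model the structure without a database. The following Python sketch rebuilds the comment tree shown earlier in plain lists and dictionaries. The integer _id values, the subtree function, and the in-memory posts list are stand-ins invented for this example; only the ancestors logic mirrors createLeaf.

```python
import itertools

_ids = itertools.count(1)  # stand-in for generated ObjectId values


def create_leaf(parent, content):
    """Create a comment whose ancestors are the parent's ancestors plus the parent."""
    return {
        "_id": next(_ids),
        "content": content,
        "parent": parent["_id"],
        "ancestors": parent["ancestors"] + [parent["_id"]],
    }


def subtree(posts, root_id):
    """Mimic find({"ancestors": root_id}): match any document listing root_id."""
    return [p for p in posts if root_id in p["ancestors"]]


link = {"_id": next(_ids), "title": "A Witty Title", "ancestors": []}
c1 = create_leaf(link, "comment 1")
c2 = create_leaf(link, "comment 2")
c3 = create_leaf(c1, "comment 3")
c4 = create_leaf(c1, "comment 4")
c5 = create_leaf(c4, "comment 5")
c6 = create_leaf(c2, "comment 6")
posts = [link, c1, c2, c3, c4, c5, c6]

replies_to_c1 = subtree(posts, c1["_id"])  # comments 3, 4, and 5
```

Because every descendant carries the root's _id somewhere in its array, no recursion is needed to collect a subtree, which is the property the MongoDB query relies on.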
To make these queries fast, we should index the
"date"
and "ancestors"
keys:
$posts->ensureIndex(array("date" => -1, "ancestors" => 1));
Now we can quickly query for the main page, a tree of comments, or a subtree of comments.
There are many ways of implementing voting, depending on the
functionality and information you want: do you allow up and down votes?
Will you prevent users from voting more than once? Will you allow them
to switch their vote? Do you care when people
voted, to see if a link is trending? Each of these requires a different
solution with far more complex coding than the simplest way of doing it:
using "$inc"
:
$posts->update(array("_id" => $postId), array('$inc' => array("votes" => 1)));
For a controversial or popular link, we wouldn’t want people to be
able to vote hundreds of times, so we want to limit users to one vote
each. A simple way to do this is to add a "voters"
array to keep track of who has voted on this post, keeping an array of
user "_id"
values. When someone tries to vote, we do
an update that checks the user "_id"
against the
array of "_id"
values:
$posts->update(
    array("_id" => $postId, "voters" => array('$ne' => $userId)),
    array('$inc' => array("votes" => 1), '$push' => array("voters" => $userId))
);
This will work for up to a couple million users. For larger voting pools, we would hit the 4MB document size limit, and we would have to special-case the most popular links by putting spillover votes into a new document.
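The logic of that update can be modeled in a few lines of plain Python. This is only an in-memory sketch of the behavior: the vote function and the dictionary standing in for a post are assumptions for illustration. In MongoDB the check and the modification happen as one atomic update on the document, whereas this version offers no such guarantee under concurrency.

```python
def vote(post, user_id):
    """Apply one up-vote unless user_id has already voted; return True if counted.

    Mirrors the update above: the "voters" $ne check is the query, and the
    $inc and $push are the modifications applied only when it matches.
    """
    if user_id in post["voters"]:
        return False  # query would not match, so nothing is modified
    post["votes"] += 1
    post["voters"].append(user_id)
    return True


post = {"_id": 1, "votes": 0, "voters": []}
first = vote(post, "joe")   # counted
second = vote(post, "joe")  # ignored: joe is already in voters
third = vote(post, "ann")   # counted
```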
MongoDB is a popular choice for Ruby developers, likely because the document-oriented approach meshes well with Ruby’s dynamism and flexibility. In this example we’ll use the MongoDB Ruby driver to build a framework for custom form submissions, inspired by a New York Times blog post about how it uses MongoDB to handle submission forms (http://open.blogs.nytimes.com/2010/05/25/building-a-better-submission-form/). For even more documentation on using MongoDB from Ruby, check out the Ruby Language Center.
The Ruby driver is available as a RubyGem, hosted at http://rubygems.org. Installation using the gem is the
easiest way to get up and running. Make sure you’re using an up-to-date
version of RubyGems (with gem update --system
) and
then install the mongo
gem:
$ gem install mongo
Successfully installed bson-1.0.2
Successfully installed mongo-1.0.2
2 gems installed
Installing ri documentation for bson-1.0.2...
Building YARD (yri) index for bson-1.0.2...
Installing ri documentation for mongo-1.0.2...
Building YARD (yri) index for mongo-1.0.2...
Installing RDoc documentation for bson-1.0.2...
Installing RDoc documentation for mongo-1.0.2...
Installing the mongo gem will also install the bson gem on which it depends. The bson gem handles all of the BSON encoding and decoding for the driver (for more on BSON, see Appendix C). The bson gem will also make use of C extensions available in the bson_ext gem to improve performance, if that gem has been installed. For maximum performance, be sure to install bson_ext:
$ gem install bson_ext
Building native extensions. This could take a while...
Successfully installed bson_ext-1.0.1
1 gem installed
If bson_ext is on the load path, it will be used automatically.
To connect to an instance of MongoDB, use the
Mongo::Connection
class. Once we have an instance of
Mongo::Connection
, we can get an individual database
(here we use the stuffy database) using bracket
notation:
> require 'rubygems'
=> true
> require 'mongo'
=> true
> db = Mongo::Connection.new["stuffy"]
The Ruby driver uses hashes to represent documents. Aside from
that, the API is similar to that of the shell with most method names
being the same. (Although the Ruby driver uses underscore_naming,
whereas the shell often uses camelCase.) To insert the document
{"x" : 1}
into the bar
collection and query for the result, we would do the following:
> db["bar"].insert :x => 1
=> BSON::ObjectID('4c168343e6fb1b106f000001')
> db["bar"].find_one
=> {"_id"=>BSON::ObjectID('4c168343e6fb1b106f000001'), "x"=>1}
There are some important gotchas about documents in Ruby that you need to be aware of:
Hashes are ordered in Ruby 1.9, which matches how documents
work in MongoDB. In Ruby 1.8,
however, hashes are unordered. The driver provides a special type,
BSON::OrderedHash
, which must be used instead of
a regular hash whenever key order is important.
Hashes being saved to MongoDB can have symbols as either keys
or values. Hashes returned from MongoDB will have symbol values
wherever they were present in the input, but any symbol keys will be
returned as strings. So, {:x => :y}
will
become {"x" => :y}
. This is a side effect of
the way documents are represented in BSON (see Appendix C for more on BSON).
The problem at hand is to generate custom forms for user-submitted data and to handle user submissions for those forms. Forms are created by editors and can contain arbitrary fields, each with different types and rules for validation. Here we’ll leverage the ability to embed documents and store each field as a separate document within a form. A form document for a comment submission form might look like this:
comment_form = {
  :_id => "comments",
  :fields => [
    {
      :name => "name",
      :label => "Your Name",
      :help_text => "Required",
      :required => true,
      :type => "string",
      :max_length => 200
    },
    {
      :name => "email",
      :label => "Your E-mail Address",
      :help_text => "Required, but will not be displayed",
      :required => true,
      :type => "email"
    },
    {
      :name => "comment",
      :label => "Your Comment",
      :help_text => "Comments will be moderated",
      :required => true,
      :type => "string",
      :word_limit => 200
    }
  ]
}
This form shows some of the benefits of working with a
document-oriented database like MongoDB. First, we’re able to embed the
form’s fields directly within the form document. We don’t need to store
them separately and do a join—we can get the entire representation for a
form by querying for a single document. We’re also able to specify
different keys for different types of fields. In the previous example,
the name field has a :max_length
key, and the comment
field has a :word_limit
key, while the email field
has neither.
In this example we use "_id"
to store a
human-readable name for our form. This works well because we need to
index on the form name anyway to make queries efficient. Because the
"_id"
index is a unique index, we’re also guaranteed
that form names will be unique across the system.
When an editor adds a new form, we simply save the resultant
document. To save the comment_form
document that we
created, we’d do the following:
db["forms"].save comment_form
Each time we want to render a page with the comment form, we can query for the form document by its name:
db["forms"].find_one :_id => "comments"
The single document returned contains all the information we need in order to render the form, including the name, label, and type for each input field that needs to be rendered. When a form needs to be changed, editors can easily add a field or specify additional constraints for an existing field.
When we get a user submission for a form, we can run the same query as earlier to get the relevant form document. We’ll need this in order to validate that the user’s submission includes values for all required fields and meets any other requirements specified in our form. After validation, we can save the submission as a separate document in a submissions collection. A submission for our comment form might look like this:
comment_submission = { :form_id => "comments", :name => "Mike D.", :email => "[email protected]", :comment => "MongoDB is flexible!" }
We’re again leveraging the document model by including custom keys
for each submission (here we use :name
,
:email
, and :comment
). The only
key that we require in each submission is :form_id
.
This allows us to efficiently retrieve all submissions for a certain
form:
db["submissions"].find :form_id => "comments"
To perform this query, we should have an index on
:form_id
:
db["submissions"].create_index :form_id
We can also use :form_id
to retrieve the form
document for a given submission.
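The validation step mentioned earlier is not shown in the text, so here is one way it might look, sketched in Python with plain dictionaries standing in for the form and submission documents. The validate function, its error messages, and the very loose e-mail check are all assumptions for illustration, and the e-mail address in the sample submission is a made-up placeholder.

```python
def validate(form, submission):
    """Return a list of error strings; an empty list means the submission is valid."""
    errors = []
    for field in form["fields"]:
        name = field["name"]
        value = submission.get(name, "")
        if field.get("required") and not value:
            errors.append("%s is required" % name)
            continue
        # Optional constraints are applied only if the field defines them.
        if "max_length" in field and len(value) > field["max_length"]:
            errors.append("%s is too long" % name)
        if "word_limit" in field and len(value.split()) > field["word_limit"]:
            errors.append("%s has too many words" % name)
        if field["type"] == "email" and value and "@" not in value:
            errors.append("%s is not an e-mail address" % name)
    return errors


comment_form = {
    "_id": "comments",
    "fields": [
        {"name": "name", "required": True, "type": "string", "max_length": 200},
        {"name": "email", "required": True, "type": "email"},
        {"name": "comment", "required": True, "type": "string", "word_limit": 200},
    ],
}
good = {"form_id": "comments", "name": "Mike D.",
        "email": "mike@example.com", "comment": "MongoDB is flexible!"}
bad = {"form_id": "comments", "name": "x" * 201, "email": "not-an-address"}

good_errors = validate(form=comment_form, submission=good)
bad_errors = validate(form=comment_form, submission=bad)
```

Because the rules live in the form document, editors can tighten or relax a constraint by editing data, and the validation code does not need to change.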
There are several libraries written on top of the basic Ruby driver to provide things like models, validations, and associations for MongoDB documents. The most popular of these tools seem to be MongoMapper and Mongoid. If you’re used to working with tools like ActiveRecord or DataMapper, you might consider using one of these object mappers in addition to the basic Ruby driver.
MongoDB also works nicely with Ruby on Rails, especially when working with one of the previously mentioned mappers. There are up-to-date instructions on integrating MongoDB with Rails on the MongoDB site.
The Python driver for MongoDB is called PyMongo. In this section, we’ll use PyMongo to implement some real-time tracking of metrics for a web application. The most up-to-date documentation on PyMongo is available at http://api.mongodb.org/python.
PyMongo is available in the Python Package Index
and can be installed using easy_install
(http://pypi.python.org/pypi/setuptools):
$ easy_install pymongo
Searching for pymongo
Reading http://pypi.python.org/simple/pymongo/
Reading http://github.com/mongodb/mongo-python-driver
Best match: pymongo 1.6
Downloading ...
Processing pymongo-1.6-py2.6-macosx-10.6-x86_64.egg
Moving ...
Adding pymongo 1.6 to easy-install.pth file
Installed ...
Processing dependencies for pymongo
Finished processing dependencies for pymongo
This will install PyMongo and will attempt to install an optional C extension as well. If the C extension fails to build or install, everything will continue to work, but performance will suffer. An error message will be printed during install in that case.
As an alternative to easy_install
, PyMongo can
also be installed by running python setup.py install
from a source checkout.
We use the pymongo.connection.Connection
class
to connect to a MongoDB server. Here we create a new
Connection
and use attribute-style access to get the
analytics database:
from pymongo import Connection

db = Connection().analytics
The rest of the API for PyMongo is similar to the API of the
MongoDB shell; like the Ruby driver, PyMongo uses underscore_naming
instead of camelCase, however. Documents are represented using
dictionaries in PyMongo, so to insert and retrieve the document
{"a" : [1, 2, 3]}
, we do the following:
db.test.insert({"a": [1, 2, 3]})
db.test.find_one()
Dictionaries in Python are unordered, so PyMongo provides an
ordered subclass of dict
,
pymongo.son.SON
. In most places where ordering is
required, PyMongo provides APIs that hide it from the user. If not,
applications can use SON
instances instead of
dictionaries to ensure their documents maintain key order.
MongoDB is a great tool for tracking metrics in real time for a couple of reasons:
Upsert operations (see Chapter 3) allow us to send a single message to either create a new tracking document or increment the counters on an existing document.
The upsert we send does not wait for a response; it’s fire-and-forget. This allows our application code to avoid blocking on each analytics update. We don’t need to wait and see whether the operation is successful, because an error in analytics code wouldn’t get reported to a user anyway.
We can use an $inc
update to increment a
counter without having to do a separate query and update operation.
We also eliminate any contention issues if multiple updates are
happening simultaneously.
MongoDB’s update performance is very good, so doing one or more updates per request for analytics is reasonable.
In our example we will be tracking page views for our site, with hourly roll-ups. We’ll track both total page views as well as page views for each individual URL. The goal is to end up with a collection, hourly, containing documents like this:
{ "hour" : "Tue Jun 15 2010 9:00:00 GMT-0400 (EDT)", "url" : "/foo", "views" : 5 }
{ "hour" : "Tue Jun 15 2010 9:00:00 GMT-0400 (EDT)", "url" : "/bar", "views" : 5 }
{ "hour" : "Tue Jun 15 2010 10:00:00 GMT-0400 (EDT)", "url" : "/", "views" : 12 }
{ "hour" : "Tue Jun 15 2010 10:00:00 GMT-0400 (EDT)", "url" : "/bar", "views" : 3 }
{ "hour" : "Tue Jun 15 2010 10:00:00 GMT-0400 (EDT)", "url" : "/foo", "views" : 10 }
{ "hour" : "Tue Jun 15 2010 11:00:00 GMT-0400 (EDT)", "url" : "/foo", "views" : 21 }
{ "hour" : "Tue Jun 15 2010 11:00:00 GMT-0400 (EDT)", "url" : "/", "views" : 3 }
...
Each document represents all of the page views for a single URL in a given hour. If a URL gets no page views in an hour, there is no document for it. To track total page views for the entire site, we’ll use a separate collection, hourly_totals, which has the following documents:
{ "hour" : "Tue Jun 15 2010 9:00:00 GMT-0400 (EDT)", "views" : 10 }
{ "hour" : "Tue Jun 15 2010 10:00:00 GMT-0400 (EDT)", "views" : 25 }
{ "hour" : "Tue Jun 15 2010 11:00:00 GMT-0400 (EDT)", "views" : 24 }
...
The difference here is just that we don’t need a
"url"
key, because we’re doing site-wide tracking. If
our entire site doesn’t get any page views during an hour, there will be
no document for that hour.
Each time our application receives a request, we need to update our analytics collections appropriately. We need to add a page view both to the hourly collection for the specific URL requested and to the hourly_totals collection. Let’s define a function that takes a URL and updates our analytics appropriately:
from datetime import datetime

def track(url):
    hour = datetime.utcnow().replace(minute=0, second=0, microsecond=0)
    db.hourly.update({"hour": hour, "url": url},
                     {"$inc": {"views": 1}}, upsert=True)
    db.hourly_totals.update({"hour": hour},
                            {"$inc": {"views": 1}}, upsert=True)
We’ll also want to make sure that we have indexes in place to be able to perform these updates efficiently:
from pymongo import ASCENDING

db.hourly.create_index([("url", ASCENDING), ("hour", ASCENDING)], unique=True)
db.hourly_totals.create_index("hour", unique=True)
For the hourly collection, we create a
compound index on "url"
and
"hour"
, while for hourly_totals
we just index on "hour"
. Both of the indexes are
created as unique, because we want only one document for each of our
roll-ups.
Now, each time we get a request, we just call
track
a single time with the requested URL. It will
perform two upserts; each will create a new roll-up document if
necessary or increment the "views"
for an existing
roll-up.
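The effect of those two upserts can be reproduced without a server, which makes the hourly bucketing easier to check. In this sketch two Counter objects stand in for the hourly and hourly_totals collections, and track takes the current time as a second argument so that the example is repeatable. Both choices, along with the truncate_to_hour helper, are assumptions for illustration and differ from the track function above.

```python
from collections import Counter
from datetime import datetime

hourly = Counter()         # keyed by (hour, url), like the hourly collection
hourly_totals = Counter()  # keyed by hour, like hourly_totals


def truncate_to_hour(moment):
    """Drop minutes, seconds, and microseconds to get the start of the hour."""
    return moment.replace(minute=0, second=0, microsecond=0)


def track(url, now):
    """In-memory equivalent of the two upserts with $inc."""
    hour = truncate_to_hour(now)
    hourly[(hour, url)] += 1  # missing keys start at 0, much like an upsert
    hourly_totals[hour] += 1


track("/foo", datetime(2010, 6, 15, 9, 5))
track("/foo", datetime(2010, 6, 15, 9, 59, 59))
track("/bar", datetime(2010, 6, 15, 9, 30))
track("/foo", datetime(2010, 6, 15, 10, 0))
```

The first two calls land in the same roll-up because both times truncate to 9:00, while the last call starts a new roll-up for 10:00.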
Now that we’re tracking page views, we need a way to query that data and put it to use. Here we print the hourly page view totals for the last 10 hours:
from pymongo import DESCENDING

for rollup in db.hourly_totals.find().sort("hour", DESCENDING).limit(10):
    pretty_date = rollup["hour"].strftime("%Y/%m/%d %H")
    print "%s - %d" % (pretty_date, rollup["views"])
This query will be able to leverage the index we’ve already
created on "hour"
. We can perform a similar operation
for an individual url
:
for rollup in db.hourly.find({"url": url}).sort("hour", DESCENDING).limit(10):
    pretty_date = rollup["hour"].strftime("%Y/%m/%d %H")
    print "%s - %d" % (pretty_date, rollup["views"])
The only difference is that here we add a query document for
selecting an individual "url"
. Again, this will
leverage the compound index we’ve already created on
"url"
and "hour"
.
One thing we might want to consider is running a periodic cleaning task to remove old analytics documents. If we’re displaying only the last 10 hours of data, then we can conserve space by not keeping around a month’s worth of documents. To remove all documents older than 24 hours, we can do the following, which could be run using cron or a similar mechanism:
from datetime import timedelta

remove_before = datetime.utcnow() - timedelta(hours=24)
db.hourly.remove({"hour": {"$lt": remove_before}})
db.hourly_totals.remove({"hour": {"$lt": remove_before}})
In this example, the first remove
will
actually need to do a table scan because we haven’t defined an index on
"hour"
. If we need to perform this operation
efficiently (or any other operation querying by
"hour"
for all URLs), we should consider adding a
second index on "hour"
to the
hourly collection.
Another important note about this example is that it would be easy
to add tracking for other metrics besides page views or to do roll-ups
on a window other than hourly (or even to do roll-ups on multiple
windows at once). All we need to do is to tweak the
track
function to perform upserts tracking whatever
metric we’re interested in, at whatever granularity we want.
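As a sketch of what rolling up on several windows at once could involve, the Python below computes the start of each window for a given moment and lists the query documents a multi-window track might upsert against. The window names, the "window" and "start" keys, and both function names are hypothetical; the text does not prescribe a schema for this.

```python
from datetime import datetime


def window_start(moment, window):
    """Return the start of the roll-up window containing moment.

    window is one of "minute", "hour", or "day" (names chosen for this sketch).
    """
    if window == "minute":
        return moment.replace(second=0, microsecond=0)
    if window == "hour":
        return moment.replace(minute=0, second=0, microsecond=0)
    if window == "day":
        return moment.replace(hour=0, minute=0, second=0, microsecond=0)
    raise ValueError("unknown window: %r" % window)


def rollup_keys(url, moment, windows=("hour", "day")):
    """List the query documents a multi-window track() would upsert against."""
    return [
        {"window": w, "start": window_start(moment, w), "url": url}
        for w in windows
    ]


keys = rollup_keys("/foo", datetime(2010, 6, 15, 9, 42, 17))
```

Each of these query documents would get its own upsert with the same $inc, so the cost per request grows with the number of windows being tracked.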