MongoDB provides a number of aggregation tools that go beyond basic query functionality. These range from simply counting the number of documents in a collection to using MapReduce to do complex data analysis.
The simplest aggregation tool is count
, which
returns the number of documents in the collection:
> db.foo.count() 0 > db.foo.insert({"x" : 1}) > db.foo.count() 1
Counting the total number of documents in a collection is fast regardless of collection size.
You can also pass in a query, and Mongo will count the number of results for that query:
> db.foo.insert({"x" : 2}) > db.foo.count() 2 > db.foo.count({"x" : 1}) 1
This can be useful for getting a total for pagination: “displaying results 0–10 of 439.” Adding criteria does make the count slower, however.
The distinct
command finds all of the distinct
values for a given key. You must specify a collection and key:
> db.runCommand({"distinct" : "people", "key" : "age"})
For example, suppose we had the following documents in our collection:
{"name" : "Ada", "age" : 20} {"name" : "Fred", "age" : 35} {"name" : "Susan", "age" : 60} {"name" : "Andy", "age" : 35}
If you call distinct on the "age"
key, you will
get back all of the distinct ages:
> db.runCommand({"distinct" : "people", "key" : "age"}) {"values" : [20, 35, 60], "ok" : 1}
A common question at this point is if there’s a way to get all of the distinct keys in a collection. There is no built-in way of doing this, although you can write something to do it yourself using MapReduce (described in a moment).
group
allows you to perform more complex
aggregation. You choose a key to group by, and MongoDB divides the
collection into separate groups for each value of the chosen key. For each
group, you can create a result document by aggregating the documents that
are members of that group.
If you are familiar with SQL, group
is
similar to SQL’s GROUP BY
.
Suppose we have a site that keeps track of stock prices. Every few
minutes from 10 a.m. to 4 p.m., it gets the latest price for a stock,
which it stores in MongoDB. Now, as part of a reporting application, we
want to find the closing price for the past 30 days. This can be easily
accomplished using group
.
The collection of stock prices contains thousands of documents with the following form:
{"day" : "2010/10/03", "time" : "10/3/2010 03:57:01 GMT-400", "price" : 4.23} {"day" : "2010/10/04", "time" : "10/4/2010 11:28:39 GMT-400", "price" : 4.27} {"day" : "2010/10/03", "time" : "10/3/2010 05:00:23 GMT-400", "price" : 4.10} {"day" : "2010/10/06", "time" : "10/6/2010 05:27:58 GMT-400", "price" : 4.30} {"day" : "2010/10/04", "time" : "10/4/2010 08:34:50 GMT-400", "price" : 4.01}
You should never store money amounts as floating-point numbers because of inexactness concerns, but we’ll do it anyway in this example for simplicity.
We want our results to be a list of the latest time and price for each day, something like this:
[ {"time" : "10/3/2010 05:00:23 GMT-400", "price" : 4.10}, {"time" : "10/4/2010 11:28:39 GMT-400", "price" : 4.27}, {"time" : "10/6/2010 05:27:58 GMT-400", "price" : 4.30} ]
We can accomplish this by splitting the collection into sets of documents grouped by day then finding the document with the latest timestamp for each day and adding it to the result set. The whole function might look something like this:
> db.runCommand({"group" : { ... "ns" : "stocks", ... "key" : "day", ... "initial" : {"time" : 0}, ... "$reduce" : function(doc, prev) { ... if (doc.time > prev.time) { ... prev.price = doc.price; ... prev.time = doc.time; ... } ... }}})
Let’s break this command down into its component keys:
"ns" : "stocks"
This determines which collection we’ll be running the group on.
"key" : "day"
This specifies the key on which to group the documents in
the collection. In this case, that would be the
"day"
key. All of the documents with a
"day"
key of a given value will be grouped
together.
"initial" : {"time" : 0}
The first time the reduce
function is
called for a given group, it will be passed the initialization
document. This same accumulator will be used for each member of a
given group, so any changes made to it can be persisted.
"$reduce" : function(doc, prev) { ...
}
This will be called once for each document in the
collection. It is passed the current document and an accumulator
document: the result so far for that group. In this example, we
want the reduce
function to compare the current
document’s time with the accumulator’s time. If the current
document has a later time, we’ll set the accumulator’s day and
price to be the current document’s values. Remember that there is
a separate accumulator for each group, so there is no need to
worry about different days using the same accumulator.
In the initial statement of the problem, we said that we wanted only
the last 30 days worth of prices. Our current solution is iterating over
the entire collection, however. This is why you can include a
"condition"
that documents must satisfy in order to be
processed by the group command at all:
> db.runCommand({"group" : { ... "ns" : "stocks", ... "key" : "day", ... "initial" : {"time" : 0}, ... "$reduce" : function(doc, prev) { ... if (doc.time > prev.time) { ... prev.price = doc.price; ... prev.time = doc.time; ... }}, ... "condition" : {"day" : {"$gt" : "2010/09/30"}} ... }})
Some documentation refers to a "cond"
or
"q"
key, both of which are identical to the
"condition"
key (just less descriptive).
Now the command will return an array of 30 documents, each of which
is a group. Each group has the key on which the group was based (in this
case, "day" :
) and the final
value of string
prev
for that group. If some of the documents
do not contain the key, these will be grouped into a single group with a
day : null
element. You can eliminate this group by
adding "day" : {"$exists" : true}
to the
"condition"
. The group
command also
returns the total number of documents used and the number of distinct
values for "key"
:
> db.runCommand({"group" : {...}}) { "retval" : [ { "day" : "2010/10/04", "time" : "Mon Oct 04 2010 11:28:39 GMT-0400 (EST)" "price" : 4.27 }, ... ], "count" : 734, "keys" : 30, "ok" : 1 }
We explicitly set the "price"
for each group, and
the "time"
was set by the initializer and then updated.
The "day"
is included because the key being grouped by
is included by default in each "retval"
embedded
document. If you don’t want to return this key, you can use a finalizer to
change the final accumulator document into anything, even a nondocument
(e.g., a number or string).
Finalizers can be used to minimize the amount of data that needs
to be transferred from the database to the user, which is important,
because the group
command’s output needs to fit in a
single database response. To demonstrate this, we’ll take the example of
a blog where each post has tags. We want to find the most popular tag
for each day. We can group by day (again) and keep a count for each tag.
This might look something like this:
> db.posts.group({ ... "key" : {"tags" : true}, ... "initial" : {"tags" : {}}, ... "$reduce" : function(doc, prev) { ... for (i in doc.tags) { ... if (doc.tags[i] in prev.tags) { ... prev.tags[doc.tags[i]]++; ... } else { ... prev.tags[doc.tags[i]] = 1; ... } ... } ... }})
This will return something like this:
[ {"day" : "2010/01/12", "tags" : {"nosql" : 4, "winter" : 10, "sledding" : 2}}, {"day" : "2010/01/13", "tags" : {"soda" : 5, "php" : 2}}, {"day" : "2010/01/14", "tags" : {"python" : 6, "winter" : 4, "nosql": 15}} ]
Then we could find the largest value in the
"tags"
document on the client side. However, sending
the entire tags document for every day is a lot of extra overhead to
send to the client: an entire set of key/value pairs for each day, when
all we want is a single string. This is why group
takes an optional "finalize"
key.
"finalize"
can contain a function that is run on each
group once, right before the result is sent back to the client. We can
use a "finalize"
function to trim out all of the
cruft from our results:
> db.runCommand({"group" : { ... "ns" : "posts", ... "key" : {"tags" : true}, ... "initial" : {"tags" : {}}, ... "$reduce" : function(doc, prev) { ... for (i in doc.tags) { ... if (doc.tags[i] in prev.tags) { ... prev.tags[doc.tags[i]]++; ... } else { ... prev.tags[doc.tags[i]] = 1; ... } ... }, ... "finalize" : function(prev) { ... var mostPopular = 0; ... for (i in prev.tags) { ... if (prev.tags[i] > mostPopular) { ... prev.tag = i; ... mostPopular = prev.tags[i]; ... } ... } ... delete prev.tags ... }}})
Now, we’re only getting the information we want; the server will send back something like this:
[ {"day" : "2010/01/12", "tag" : "winter"}, {"day" : "2010/01/13", "tag" : "soda"}, {"day" : "2010/01/14", "tag" : "nosql"} ]
finalize
can either modify the argument passed
in or return a new value.
Sometimes you may have more complicated criteria that you want to
group by, not just a single key. Suppose you are using
group
to count how many blog posts are in each
category. (Each blog post is in a single category.) Post authors were
inconsistent, though, and categorized posts with haphazard
capitalization. So, if you group by category name, you’ll end up with
separate groups for “MongoDB” and “mongodb.” To make sure any variation
of capitalization is treated as the same key, you can define a function
to determine documents’ grouping key.
To define a grouping function, you must use a
$keyf
key (instead of "key"
).
Using "$keyf"
makes the group
command look something like this:
> db.posts.group({"ns" : "posts", ... "$keyf" : function(x) { return x.category.toLowerCase(); }, ... "initializer" : ... })
"$keyf"
allows you can group by arbitrarily
complex criteria.
MapReduce is the Uzi of aggregation tools. Everything described with
count
, distinct
, and group
can
be done with MapReduce, and more. It is a method of aggregation that can
be easily parallelized across multiple servers. It splits up a problem,
sends chunks of it to different machines, and lets each machine solve its
part of the problem. When all of the machines are finished, they merge all
of the pieces of the solution back into a full solution.
MapReduce has a couple of steps. It starts with the map step, which
maps an operation onto every document in a
collection. That operation could be either “do nothing” or “emit these
keys with X
values.” There is then an
intermediary stage called the shuffle step: keys are grouped and lists of
emitted values are created for each key. The reduce takes this list of
values and reduces it to a single element. This
element is returned to the shuffle step until each key has a list
containing a single value: the result.
The price of using MapReduce is speed: group
is
not particularly speedy, but MapReduce is slower and is not supposed to
be used in “real time.” You run MapReduce as a background job, it creates a
collection of results, and then you can query that collection in real
time.
We’ll go through a couple of MapReduce examples because it is incredibly useful and powerful but also a somewhat complex tool.
Using MapReduce for this problem might be overkill, but it is a good way to get familiar with how MapReduce works. If you already understand MapReduce, feel free to skip ahead to the last part of this section, where we cover MongoDB-specific MapReduce considerations.
MongoDB is schemaless, so it does not keep track of the keys in
each document. The best way, in general, to find all the keys across all
the documents in a collection is to use MapReduce. In this example,
we’ll also get a count of how many times each key appears in the
collection. This example doesn’t include keys for embedded documents,
but it would be a simple addition to the map
function
to do so.
For the mapping step, we want to get every key of every document
in the collection. The map
function uses a special
function to “return” values that we want to process later:
emit
. emit
gives MapReduce a
key (like the one used by group
earlier) and a value.
In this case, we emit a count of how many times a given key appeared in
the document (once: {count : 1}
). We want a separate
count for each key, so we’ll call emit for every key in the document.
this
is a reference to the current document we are
mapping:
> map = function() { ... for (var key in this) { ... emit(key, {count : 1}); ... }};
Now we have a ton of little {count : 1}
documents floating around, each associated with a key from the
collection. An array of one or more of these {count :
1}
documents will be passed to the reduce
function. The reduce
function is passed two
arguments: key
, which is the first argument from
emit
, and an array of one or more {count :
1}
documents that were emitted for that key:
> reduce = function(key, emits) { ... total = 0; ... for (var i in emits) { ... total += emits[i].count; ... } ... return {"count" : total}; ... }
reduce
must be able to be called repeatedly on
results from either the map phase or previous reduce phases. Therefore,
reduce
must return a document that can be re-sent to
reduce
as an element of its second argument. For
example, say we have the key x
mapped to three
documents: {count : 1, id : 1}
, {count : 1,
id : 2}
, and {count : 1, id : 3}
. (The ID
keys are just for identification purposes.) MongoDB might call
reduce
in the following pattern:
> r1 = reduce("x", [{count : 1, id : 1}, {count : 1, id : 2}]) {count : 2} > r2 = reduce("x", [{count : 1, id : 3}]) {count : 1} > reduce("x", [r1, r2]) {count : 3}
You cannot depend on the second argument always holding one of the
initial documents ({count : 1}
in this case) or being
a certain length. reduce
should be able to be run on
any combination of emit
documents and
reduce
return values.
Altogether, this MapReduce function would look like this:
> mr = db.runCommand({"mapreduce" : "foo", "map" : map, "reduce" : reduce}) { "result" : "tmp.mr.mapreduce_1266787811_1", "timeMillis" : 12, "counts" : { "input" : 6 "emit" : 14 "output" : 5 }, "ok" : true }
The document MapReduce returns gives you a bunch of metainformation about the operation:
"result" :
"tmp.mr.mapreduce_1266787811_1"
This is the name of the collection the MapReduce results were stored in. This is a temporary collection that will be deleted when the connection that did the MapReduce is closed. We will go over how to specify a nicer name and make the collection permanent in a later part of this chapter.
"timeMillis" : 12
How long the operation took, in milliseconds.
"counts" : { ... }
This embedded document contains three keys:
"input" : 6
The number of documents sent to the
map
function.
"emit" : 14
The number of times emit
was
called in the map
function.
"output" : 5
The number of documents created in the result collection.
"counts"
is mostly useful for
debugging.
If we do a find on the resulting collection, we can see all of the keys and their counts from our original collection:
> db[mr.result].find() { "_id" : "_id", "value" : { "count" : 6 } } { "_id" : "a", "value" : { "count" : 4 } } { "_id" : "b", "value" : { "count" : 2 } } { "_id" : "x", "value" : { "count" : 1 } } { "_id" : "y", "value" : { "count" : 1 } }
Each of the key values becomes an "_id"
, and
the final result of the reduce step(s) becomes the
"value"
.
Suppose we have a site where people can submit links to other pages, such as reddit.com. Submitters can tag a link as related to certain popular topics, e.g., “politics,” “geek,” or “icanhascheezburger.” We can use MapReduce to figure out which topics are the most popular, as a combination of recent and most-voted-for.
First, we need a map
function that emits tags
with a value based on the popularity and recency of a document:
map = function() { for (var i in this.tags) { var recency = 1/(new Date() - this.date); var score = recency * this.score; emit(this.tags[i], {"urls" : [this.url], "score" : score}); } };
Now we need to reduce all of the emitted values for a tag into a single score for that tag:
reduce = function(key, emits) { var total = {urls : [], score : 0} for (var i in emits) { emits[i].urls.forEach(function(url) { total.urls.push(url); } total.score += emits[i].score; } return total; };
The final collection will end up with a full list of URLs for each tag and a score showing how popular that particular tag is.
Both of the previous examples used only the
mapreduce
, map
, and
reduce
keys. These three keys are required, but there
are many optional keys that can be passed to the MapReduce
command.
"finalize" :
function
A final step to send reduce
’s output
to.
"keeptemp" :
boolean
If the temporary result collection should be saved when the connection is closed.
"output" :
string
Name for the output collection. Setting this option implies
keeptemp : true
.
"query" :
document
Query to filter documents by before sending to the
map
function.
"sort" :
document
Sort to use on documents before sending to the map (useful
in conjunction with the limit
option).
"limit" :
integer
Maximum number of documents to send to the
map
function.
"scope" :
document
Variables that can be used in any of the JavaScript code.
"verbose" :
boolean
Whether to use more verbose output in the server logs.
As with the previous group
command, MapReduce
can be passed a finalize
function that will be run
on the last reduce
’s output before it is saved to a
temporary collection.
Returning large result sets is less critical with MapReduce than
group
, because the whole result doesn’t have to fit
in 4MB. However, the information will be passed over the wire
eventually, so finalize
is a good chance to take
averages, chomp arrays, and remove extra information in
general.
By default, Mongo creates a temporary collection while it is
processing the MapReduce with a name that you are unlikely to choose
for a collection: a dot-separated string containing
mr, the name of the collection you’re
MapReducing, a timestamp, and the job’s ID with the database. It ends
up looking something like mr.stuff.18234210220.2.
MongoDB will automatically destroy this collection when the connection
that did the MapReduce is closed. (You can also drop it manually when
you’re done with it.) If you want to persist this collection even
after disconnecting, you can specify keeptemp :
true
as an option.
If you’ll be using the temporary collection regularly, you may
want to give it a better name. You can specify a more human-readable
name with the out
option, which takes a string. If
you specify out
, you need not specify
keeptemp : true
, because it is implied. Even if you
specify a “pretty” name for the collection, MongoDB will use the
autogenerated collection name for intermediate steps of the MapReduce.
When it has finished, it will automatically and atomically rename the
collection from the autogenerated name to your chosen name. This means
that if you run MapReduce multiple times with the same target
collection, you will never be using an incomplete collection for
operations.
The output collection created by MapReduce is a normal collection, which means that there is no problem with doing a MapReduce on it, or a MapReduce on the results from that MapReduce, ad infinitum!
Sometimes you need to run MapReduce on only part of a
collection. You can add a query to filter the documents before they
are passed to the map
function.
Every document passed to the map
function
needs to be deserialized from BSON into a JavaScript object, which is
a fairly expensive operation. If you know that you will need to run
MapReduce only on a subset of the documents in the collection, adding
a filter can greatly speed up the command. The filter is specified by
the "query"
, "limit"
, and
"sort"
keys.
The "query"
key takes a query document as a
value. Any documents that would ordinarily be returned by that query
will be passed to the map
function. For example, if
we have an application tracking analytics and want a summary for the
last week, we can use MapReduce on only the most recent week’s
documents with the following command:
> db.runCommand({"mapreduce" : "analytics", "map" : map, "reduce" : reduce, "query" : {"date" : {"$gt" : week_ago}}})
The sort
option is mostly useful in
conjunction with limit
. limit
can be used on its own, as well, to simply provide a cutoff on the
number of documents sent to the map
function.
If, in the previous example, we wanted an analysis of the last
10,000 page views (instead of the last week), we could use
limit
and sort
:
> db.runCommand({"mapreduce" : "analytics", "map" : map, "reduce" : reduce, "limit" : 10000, "sort" : {"date" : -1}})
query
, limit
, and
sort
can be used in any combination, but
sort
isn’t useful if limit
isn’t
present.
MapReduce can take a code type for the map
,
reduce
, and finalize
functions,
and, in most languages, you can specify a scope to be passed with
code. However, MapReduce ignores this scope. It has its own scope key,
"scope"
, and you must use that if there are
client-side values you want to use in your MapReduce. You can set them
using a plain document of the form variable_name :
value
, and they will be available in your
map
, reduce
, and
finalize
functions. The scope is immutable from
within these functions.
For instance, in the example in the previous section, we
calculated the recency of a page using 1/(new Date() -
this.date)
. We could, instead, pass in the current date as
part of the scope with the following code:
> db.runCommand({"mapreduce" : "webpages", "map" : map, "reduce" : reduce, "scope" : {now : new Date()}})
Then, in the map
function, we could say
1/(now - this.date)
.
18.118.200.154