Chapter 6. Aggregation

MongoDB provides a number of aggregation tools that go beyond basic query functionality. These range from simply counting the number of documents in a collection to using MapReduce to do complex data analysis.

count

The simplest aggregation tool is count, which returns the number of documents in the collection:

> db.foo.count()
0
> db.foo.insert({"x" : 1})
> db.foo.count()
1

Counting the total number of documents in a collection is fast regardless of collection size.

You can also pass in a query, and Mongo will count the number of results for that query:

> db.foo.insert({"x" : 2})
> db.foo.count()
2
> db.foo.count({"x" : 1})
1

This can be useful for getting a total for pagination: “displaying results 0–10 of 439.” Adding criteria does make the count slower, however.
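
For instance, the “0–10 of 439” header could be produced with a count plus a limited find. This is only a sketch; the articles collection and the author query are made up for illustration:

> db.articles.count({"author" : "joe"})
439
> db.articles.find({"author" : "joe"}).skip(0).limit(10)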

distinct

The distinct command finds all of the distinct values for a given key. You must specify a collection and key:

> db.runCommand({"distinct" : "people", "key" : "age"})

For example, suppose we had the following documents in our collection:

{"name" : "Ada", "age" : 20}
{"name" : "Fred", "age" : 35}
{"name" : "Susan", "age" : 60}
{"name" : "Andy", "age" : 35}

If you call distinct on the "age" key, you will get back all of the distinct ages:

> db.runCommand({"distinct" : "people", "key" : "age"})
{"values" : [20, 35, 60], "ok" : 1}

A common question at this point is whether there’s a way to get all of the distinct keys in a collection. There is no built-in way of doing this, although you can write something to do it yourself using MapReduce (described in a moment).

group

group allows you to perform more complex aggregation. You choose a key to group by, and MongoDB divides the collection into separate groups for each value of the chosen key. For each group, you can create a result document by aggregating the documents that are members of that group.

Note

If you are familiar with SQL, group is similar to SQL’s GROUP BY.

Suppose we have a site that keeps track of stock prices. Every few minutes from 10 a.m. to 4 p.m., it gets the latest price for a stock, which it stores in MongoDB. Now, as part of a reporting application, we want to find the closing price for the past 30 days. This can be easily accomplished using group.

The collection of stock prices contains thousands of documents with the following form:

{"day" : "2010/10/03", "time" : "10/3/2010 03:57:01 GMT-400", "price" : 4.23}
{"day" : "2010/10/04", "time" : "10/4/2010 11:28:39 GMT-400", "price" : 4.27}
{"day" : "2010/10/03", "time" : "10/3/2010 05:00:23 GMT-400", "price" : 4.10}
{"day" : "2010/10/06", "time" : "10/6/2010 05:27:58 GMT-400", "price" : 4.30}
{"day" : "2010/10/04", "time" : "10/4/2010 08:34:50 GMT-400", "price" : 4.01}

Note

You should never store money amounts as floating-point numbers because of inexactness concerns, but we’ll do it anyway in this example for simplicity.

We want our results to be a list of the latest time and price for each day, something like this:

[
    {"time" : "10/3/2010 05:00:23 GMT-400", "price" : 4.10},
    {"time" : "10/4/2010 11:28:39 GMT-400", "price" : 4.27},
    {"time" : "10/6/2010 05:27:58 GMT-400", "price" : 4.30}
]

We can accomplish this by splitting the collection into sets of documents grouped by day, then finding the document with the latest timestamp for each day and adding it to the result set. The whole command might look something like this:

> db.runCommand({"group" : {
... "ns" : "stocks",
... "key" : "day",
... "initial" : {"time" : 0},
... "$reduce" : function(doc, prev) {
...     if (doc.time > prev.time) {
...         prev.price = doc.price;
...         prev.time = doc.time;
...     }
... }}})

Let’s break this command down into its component keys:

"ns" : "stocks"

This determines which collection we’ll be running the group on.

"key" : "day"

This specifies the key on which to group the documents in the collection. In this case, that would be the "day" key. All of the documents with a "day" key of a given value will be grouped together.

"initial" : {"time" : 0}

The first time the reduce function is called for a given group, it will be passed the initialization document. This same accumulator will be used for each member of a given group, so any changes made to it will persist.

"$reduce" : function(doc, prev) { ... }

This will be called once for each document in the collection. It is passed the current document and an accumulator document: the result so far for that group. In this example, we want the reduce function to compare the current document’s time with the accumulator’s time. If the current document has a later time, we’ll set the accumulator’s time and price to be the current document’s values. Remember that there is a separate accumulator for each group, so there is no need to worry about different days using the same accumulator.

In the initial statement of the problem, we said that we wanted only the last 30 days’ worth of prices. Our current solution iterates over the entire collection, however. This is why you can include a "condition" that documents must satisfy in order to be processed by the group command at all:

> db.runCommand({"group" : {
... "ns" : "stocks",
... "key" : "day",
... "initial" : {"time" : 0},
... "$reduce" : function(doc, prev) {
...     if (doc.time > prev.time) {
...         prev.price = doc.price;
...         prev.time = doc.time;
...     }},
... "condition" : {"day" : {"$gt" : "2010/09/30"}}
... }})

Note

Some documentation refers to a "cond" or "q" key, both of which are identical to the "condition" key (just less descriptive).

Now the command will return an array of 30 documents, each of which is a group. Each group has the key on which the group was based (in this case, "day" : string) and the final value of prev for that group. If some of the documents do not contain the key, these will be grouped into a single group with a day : null element. You can eliminate this group by adding "day" : {"$exists" : true} to the "condition". The group command also returns the total number of documents used and the number of distinct values for "key":

> db.runCommand({"group" : {...}})
{
    "retval" :
        [
            {
                "day" : "2010/10/04",
                "time" : "Mon Oct 04 2010 11:28:39 GMT-0400 (EST)"
                "price" : 4.27
            },
            ...
        ],
    "count" : 734,
    "keys" : 30,
    "ok" : 1
}

We explicitly set the "price" for each group, and the "time" was set by the initializer and then updated. The "day" is included because the key being grouped by is included by default in each "retval" embedded document. If you don’t want to return this key, you can use a finalizer to change the final accumulator document into anything, even a nondocument (e.g., a number or string).

Using a Finalizer

Finalizers can be used to minimize the amount of data that needs to be transferred from the database to the user, which is important, because the group command’s output needs to fit in a single database response. To demonstrate this, we’ll take the example of a blog where each post has tags. We want to find the most popular tag for each day. We can group by day (again) and keep a count for each tag. This might look something like this:

> db.posts.group({
... "key" : {"tags" : true},
... "initial" : {"tags" : {}},
... "$reduce" : function(doc, prev) {
...     for (i in doc.tags) {
...         if (doc.tags[i] in prev.tags) {
...             prev.tags[doc.tags[i]]++;
...         } else {
...             prev.tags[doc.tags[i]] = 1;
...         }
...     }
... }})

This will return something like this:

[
    {"day" : "2010/01/12", "tags" : {"nosql" : 4, "winter" : 10, "sledding" : 2}},
    {"day" : "2010/01/13", "tags" : {"soda" : 5, "php" : 2}},
    {"day" : "2010/01/14", "tags" : {"python" : 6, "winter" : 4, "nosql": 15}}
]

Then we could find the largest value in the "tags" document on the client side. However, sending the entire tags document for every day to the client is a lot of extra overhead: a whole set of key/value pairs per day, when all we want is a single string. This is why group takes an optional "finalize" key. "finalize" can contain a function that is run on each group once, right before the result is sent back to the client. We can use a "finalize" function to trim out all of the cruft from our results:

> db.runCommand({"group" : {
... "ns" : "posts",
... "key" : {"tags" : true},
... "initial" : {"tags" : {}},
... "$reduce" : function(doc, prev) {
...     for (i in doc.tags) {
...         if (doc.tags[i] in prev.tags) {
...             prev.tags[doc.tags[i]]++;
...         } else {
...             prev.tags[doc.tags[i]] = 1;
...         }
...     }
... },
... "finalize" : function(prev) {
...     var mostPopular = 0;
...     for (i in prev.tags) {
...         if (prev.tags[i] > mostPopular) {
...             prev.tag = i;
...             mostPopular = prev.tags[i];
...         }
...     }
...     delete prev.tags
... }}})

Now, we’re only getting the information we want; the server will send back something like this:

[
    {"day" : "2010/01/12", "tag" : "winter"},
    {"day" : "2010/01/13", "tag" : "soda"},
    {"day" : "2010/01/14", "tag" : "nosql"}
]

finalize can either modify the argument passed in or return a new value.
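
For example, the finalizer above modifies prev in place; a version that returns a new value instead might look like the following sketch. Because it returns a bare string, each group’s result would be just the winning tag, which also drops the grouping key, as noted at the end of the previous section:

> finalize = function(prev) {
... var mostPopular = 0;
... var best = null;
... for (i in prev.tags) {
...     if (prev.tags[i] > mostPopular) {
...         best = i;
...         mostPopular = prev.tags[i];
...     }
... }
... return best;
... };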

Using a Function as a Key

Sometimes you may have more complicated criteria that you want to group by, not just a single key. Suppose you are using group to count how many blog posts are in each category. (Each blog post is in a single category.) Post authors were inconsistent, though, and categorized posts with haphazard capitalization. So, if you group by category name, you’ll end up with separate groups for “MongoDB” and “mongodb.” To make sure any variation of capitalization is treated as the same key, you can define a function to determine documents’ grouping key.

To define a grouping function, you must use a $keyf key (instead of "key"). Using "$keyf" makes the group command look something like this:

> db.posts.group({"ns" : "posts",
... "$keyf" : function(x) { return {"category" : x.category.toLowerCase()}; },
... "initial" : ... })

"$keyf" allows you can group by arbitrarily complex criteria.

MapReduce

MapReduce is the Uzi of aggregation tools. Everything described with count, distinct, and group can be done with MapReduce, and more. It is a method of aggregation that can be easily parallelized across multiple servers. It splits up a problem, sends chunks of it to different machines, and lets each machine solve its part of the problem. When all of the machines are finished, they merge all of the pieces of the solution back into a full solution.

MapReduce has a couple of steps. It starts with the map step, which maps an operation onto every document in a collection. That operation could be either “do nothing” or “emit these keys with X values.” There is then an intermediary stage called the shuffle step: keys are grouped and lists of emitted values are created for each key. The reduce takes this list of values and reduces it to a single element. This element is returned to the shuffle step until each key has a list containing a single value: the result.

The price of using MapReduce is speed: group is not particularly speedy, but MapReduce is slower and is not supposed to be used in “real time.” You run MapReduce as a background job, it creates a collection of results, and then you can query that collection in real time.

We’ll go through a couple of MapReduce examples because it is incredibly useful and powerful but also a somewhat complex tool.

Example 1: Finding All Keys in a Collection

Using MapReduce for this problem might be overkill, but it is a good way to get familiar with how MapReduce works. If you already understand MapReduce, feel free to skip ahead to the last part of this section, where we cover MongoDB-specific MapReduce considerations.

MongoDB is schemaless, so it does not keep track of the keys in each document. The best way, in general, to find all the keys across all the documents in a collection is to use MapReduce. In this example, we’ll also get a count of how many times each key appears in the collection. This example doesn’t include keys for embedded documents, but it would be a simple addition to the map function to do so.

For the mapping step, we want to get every key of every document in the collection. The map function uses a special function to “return” values that we want to process later: emit. emit gives MapReduce a key (like the one used by group earlier) and a value. In this case, we emit a count of how many times a given key appeared in the document (once: {count : 1}). We want a separate count for each key, so we’ll call emit for every key in the document. Within the map function, this is a reference to the document we are currently mapping:

> map = function() {
... for (var key in this) {
...     emit(key, {count : 1});
... }};

Now we have a ton of little {count : 1} documents floating around, each associated with a key from the collection. An array of one or more of these {count : 1} documents will be passed to the reduce function. The reduce function is passed two arguments: key, which is the first argument from emit, and an array of one or more {count : 1} documents that were emitted for that key:

> reduce = function(key, emits) {
... total = 0;
... for (var i in emits) {
...     total += emits[i].count;
... }
... return {"count" : total};
... }

reduce must be able to be called repeatedly on results from either the map phase or previous reduce phases. Therefore, reduce must return a document that can be re-sent to reduce as an element of its second argument. For example, say we have the key x mapped to three documents: {count : 1, id : 1}, {count : 1, id : 2}, and {count : 1, id : 3}. (The ID keys are just for identification purposes.) MongoDB might call reduce in the following pattern:

> r1 = reduce("x", [{count : 1, id : 1}, {count : 1, id : 2}])
{count : 2}
> r2 = reduce("x", [{count : 1, id : 3}])
{count : 1}
> reduce("x", [r1, r2])
{count : 3}

You cannot depend on the second argument always holding one of the initial documents ({count : 1} in this case) or being a certain length. reduce should be able to be run on any combination of emit documents and reduce return values.

Altogether, this MapReduce function would look like this:

> mr = db.runCommand({"mapreduce" : "foo", "map" : map, "reduce" : reduce})
{
    "result" : "tmp.mr.mapreduce_1266787811_1",
    "timeMillis" : 12,
    "counts" : {
        "input" : 6
        "emit" : 14
        "output" : 5
    },
    "ok" : true
}

The document MapReduce returns gives you a bunch of metainformation about the operation:

"result" : "tmp.mr.mapreduce_1266787811_1"

This is the name of the collection the MapReduce results were stored in. This is a temporary collection that will be deleted when the connection that did the MapReduce is closed. We will go over how to specify a nicer name and make the collection permanent in a later part of this chapter.

"timeMillis" : 12

How long the operation took, in milliseconds.

"counts" : { ... }

This embedded document contains three keys:

"input" : 6

The number of documents sent to the map function.

"emit" : 14

The number of times emit was called in the map function.

"output" : 5

The number of documents created in the result collection.

"counts" is mostly useful for debugging.

If we do a find on the resulting collection, we can see all of the keys and their counts from our original collection:

> db[mr.result].find()
{ "_id" : "_id", "value" : { "count" : 6 } }
{ "_id" : "a", "value" : { "count" : 4 } }
{ "_id" : "b", "value" : { "count" : 2 } }
{ "_id" : "x", "value" : { "count" : 1 } }
{ "_id" : "y", "value" : { "count" : 1 } }

Each of the key values becomes an "_id", and the final result of the reduce step(s) becomes the "value".
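
As mentioned earlier, this example ignores keys of embedded documents. One way to include them is a recursive map function; this is only a sketch (the dotted-key naming is our own convention, and other BSON objects, such as ObjectIds, would also be descended into). The reduce function stays exactly the same:

> map = function() {
... function emitKeys(prefix, obj) {
...     for (var key in obj) {
...         emit(prefix + key, {count : 1});
...         if (obj[key] && typeof obj[key] == "object" && !(obj[key] instanceof Array)) {
...             emitKeys(prefix + key + ".", obj[key]);
...         }
...     }
... }
... emitKeys("", this);
... };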

Example 2: Categorizing Web Pages

Suppose we have a site where people can submit links to other pages, such as reddit.com. Submitters can tag a link as related to certain popular topics, e.g., “politics,” “geek,” or “icanhascheezburger.” We can use MapReduce to figure out which topics are the most popular, as a combination of recent and most-voted-for.

First, we need a map function that emits tags with a value based on the popularity and recency of a document:

map = function() {
    for (var i in this.tags) {
        var recency = 1/(new Date() - this.date);
        var score = recency * this.score;

        emit(this.tags[i], {"urls" : [this.url], "score" : score});
    }
};

Now we need to reduce all of the emitted values for a tag into a single score for that tag:

reduce = function(key, emits) {
    var total = {urls : [], score : 0}
    for (var i in emits) {
        emits[i].urls.forEach(function(url) {
            total.urls.push(url);
        });
        total.score += emits[i].score;
    }
    return total;
};

The final collection will end up with a full list of URLs for each tag and a score showing how popular that particular tag is.
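
Running it looks just like Example 1. The collection name here ("webpages") is an assumption, borrowed from the scope example later in this chapter:

> mr = db.runCommand({"mapreduce" : "webpages", "map" : map, "reduce" : reduce})
> db[mr.result].find()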

MongoDB and MapReduce

Both of the previous examples used only the mapreduce, map, and reduce keys. These three keys are required, but there are many optional keys that can be passed to the MapReduce command.

"finalize" : function

A final step to send reduce’s output to.

"keeptemp" : boolean

Whether the temporary result collection should be saved when the connection is closed.

"output" : string

Name for the output collection. Setting this option implies keeptemp : true.

"query" : document

Query to filter documents by before sending to the map function.

"sort" : document

Sort to use on documents before sending to the map (useful in conjunction with the limit option).

"limit" : integer

Maximum number of documents to send to the map function.

"scope" : document

Variables that can be used in any of the JavaScript code.

"verbose" : boolean

Whether to use more verbose output in the server logs.

The finalize function

As with the previous group command, MapReduce can be passed a finalize function that will be run on the last reduce’s output before it is saved to a temporary collection.

Returning large result sets is less critical with MapReduce than group, because the whole result doesn’t have to fit in 4MB. However, the information will be passed over the wire eventually, so finalize is a good chance to take averages, chomp arrays, and remove extra information in general.
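
For instance, the per-tag URL lists from Example 2 could be trimmed before they are stored; this is just a sketch, and the cutoff of ten URLs is arbitrary:

> finalize = function(key, value) {
...     value.urls = value.urls.slice(0, 10);
...     return value;
... }
> db.runCommand({"mapreduce" : "webpages", "map" : map, "reduce" : reduce,
...               "finalize" : finalize})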

Keeping output collections

By default, Mongo creates a temporary collection while it is processing the MapReduce, with a name that you are unlikely to choose for a collection: a dot-separated string containing mr, the name of the collection you’re MapReducing, a timestamp, and the job’s ID within the database. It ends up looking something like mr.stuff.18234210220.2. MongoDB will automatically destroy this collection when the connection that did the MapReduce is closed. (You can also drop it manually when you’re done with it.) If you want to persist this collection even after disconnecting, you can specify keeptemp : true as an option.

If you’ll be using the temporary collection regularly, you may want to give it a better name. You can specify a more human-readable name with the out option, which takes a string. If you specify out, you need not specify keeptemp : true, because it is implied. Even if you specify a “pretty” name for the collection, MongoDB will use the autogenerated collection name for intermediate steps of the MapReduce. When it has finished, it will automatically and atomically rename the collection from the autogenerated name to your chosen name. This means that if you run MapReduce multiple times with the same target collection, you will never be using an incomplete collection for operations.
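
Using Example 1 again, saving the key counts to a permanently named collection might look like this (the name "keyCounts" is made up for the example):

> db.runCommand({"mapreduce" : "foo", "map" : map, "reduce" : reduce,
...               "out" : "keyCounts"})
> db.keyCounts.find()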

The output collection created by MapReduce is a normal collection, which means that there is no problem with doing a MapReduce on it, or a MapReduce on the results from that MapReduce, ad infinitum!

MapReduce on a subset of documents

Sometimes you need to run MapReduce on only part of a collection. You can add a query to filter the documents before they are passed to the map function.

Every document passed to the map function needs to be deserialized from BSON into a JavaScript object, which is a fairly expensive operation. If you know that you will need to run MapReduce only on a subset of the documents in the collection, adding a filter can greatly speed up the command. The filter is specified by the "query", "limit", and "sort" keys.

The "query" key takes a query document as a value. Any documents that would ordinarily be returned by that query will be passed to the map function. For example, if we have an application tracking analytics and want a summary for the last week, we can use MapReduce on only the most recent week’s documents with the following command:

> db.runCommand({"mapreduce" : "analytics", "map" : map, "reduce" : reduce, 
                 "query" : {"date" : {"$gt" : week_ago}}})

The sort option is mostly useful in conjunction with limit. limit can be used on its own, as well, to simply provide a cutoff on the number of documents sent to the map function.

If, in the previous example, we wanted an analysis of the last 10,000 page views (instead of the last week), we could use limit and sort:

> db.runCommand({"mapreduce" : "analytics", "map" : map, "reduce" : reduce, 
                 "limit" : 10000, "sort" : {"date" : -1}})

query, limit, and sort can be used in any combination, but sort isn’t useful if limit isn’t present.

Using a scope

MapReduce can take a code type for the map, reduce, and finalize functions, and, in most languages, you can specify a scope to be passed with code. However, MapReduce ignores this scope. It has its own scope key, "scope", and you must use that if there are client-side values you want to use in your MapReduce. You can set them using a plain document of the form variable_name : value, and they will be available in your map, reduce, and finalize functions. The scope is immutable from within these functions.

For instance, in the example in the previous section, we calculated the recency of a page using 1/(new Date() - this.date). We could, instead, pass in the current date as part of the scope with the following code:

> db.runCommand({"mapreduce" : "webpages", "map" : map, "reduce" : reduce, 
                 "scope" : {now : new Date()}})

Then, in the map function, we could say 1/(now - this.date).
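
The map function from Example 2, rewritten to use the scope variable, might look like this sketch:

map = function() {
    for (var i in this.tags) {
        var recency = 1/(now - this.date);   // "now" comes from the scope document
        var score = recency * this.score;

        emit(this.tags[i], {"urls" : [this.url], "score" : score});
    }
};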

Getting more output

There is also a verbose option for debugging. If you would like to see the progress of your MapReduce as it runs, you can specify "verbose" : true.

You can also use print to see what’s happening in the map, reduce, and finalize functions. print will print to the server log.
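
For example, Example 1’s map function could log each key as it goes by; the output shows up in the server’s log, not in the shell:

> map = function() {
... for (var key in this) {
...     print("emitting " + key);
...     emit(key, {count : 1});
... }};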
