CHAPTER 8

Advanced Queries

The chapters so far have covered most of the basic query mechanisms to find one or a series of documents by given criteria. There are a number of mechanisms for finding given documents to bring them back to your application so they can be processed. But sometimes these normal query mechanisms fall short and you want to perform complex operations over most or all documents in your collection. Many developers, when queries or operations of this kind are required, either iterate through all documents in the collection or write a series of queries to be executed in sequence to perform the necessary calculations. Although this is a valid way of doing things, it can be burdensome to write and maintain, as well as inefficient. It is for these reasons that MongoDB has some advanced query mechanics that you can use to drive the most from your data. The advanced MongoDB features we’ll examine in this chapter are full-text search, the aggregation framework, and the MapReduce framework.

Full-text search is one of the most-requested features to be added to MongoDB. It represents the ability to create specialized text indexes in MongoDB and then perform text searches on those indexes to locate documents that contain matching text elements. The MongoDB full-text search feature goes beyond simple string matching to include a fully stemmed approach based on the language you have selected for your documents, and it is an incredibly powerful tool for performing language queries on your documents. This recently introduced feature is marked as “experimental” in the 2.4 releases of MongoDB because the development team is still working hard to improve it, which means you must manually activate it for use in your MongoDB environment.

The second feature this chapter will cover is the MongoDB aggregation framework. Introduced in chapters 4 and 6, this feature provides a whole host of query features that let you iterate over selected documents, or all of them, gathering or manipulating information. These query functions are then arranged into a pipeline of operations which are performed one after another on your collection to gather information from your queries.

The third and final feature we will cover is called MapReduce, which will sound familiar to those of you who have worked with Hadoop. MapReduce is a powerful mechanism that makes use of MongoDB’s built-in JavaScript engine to perform abstract code executions in real time. It is an incredibly powerful tool that uses two JavaScript functions, one to map your data and another to transform and pull information out from the mapped data.

Probably the most important thing to remember throughout this chapter is that these are truly advanced features, and it is possible to cause serious performance problems for your MongoDB nodes if they are misused, so whenever possible you should test any of these features in a testing environment before deploying them to important systems.

Text Search

MongoDB’s text search works by first creating a full text index and specifying the fields that you wish to be indexed to facilitate text searching. This text index will go over every document in your collection and tokenize and stem each string of text. This process of tokenizing and stemming involves breaking down the text into tokens, which conceptually are close to words. MongoDB then stems each token to find the root concept for the token. For example, suppose that breaking down a string reaches the token fishing. This token is then stemmed back to the root word fish, so MongoDB creates an index entry of fish for that document. This same process of tokenizing and stemming is applied to the search parameters a user enters to perform a given text search. The parameters are then compared against each document, and a relevance score is calculated. The documents are then returned to the user based on their score.
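The tokenize-and-stem process can be sketched in plain JavaScript. This is a toy illustration with a made-up suffix list, not MongoDB's actual Snowball-based stemmer:

```javascript
// Toy tokenizer and stemmer: breaks text into word tokens, then strips a
// few common suffixes. Illustrative only; MongoDB uses Snowball stemming.
function tokenize(text) {
  // Lowercase the text and pull out runs of letters as tokens.
  return text.toLowerCase().match(/[a-z]+/g) || [];
}

function stem(token) {
  // Crude suffix stripping: "fishing" -> "fish", "fishes" -> "fish".
  for (const suffix of ["ing", "es", "s"]) {
    if (token.length > suffix.length + 2 && token.endsWith(suffix)) {
      return token.slice(0, -suffix.length);
    }
  }
  return token;
}

function indexTerms(text) {
  // The index entries MongoDB would conceptually store for this text.
  return tokenize(text).map(stem);
}

console.log(indexTerms("I like to go fishing")); // -> [ 'i', 'like', 'to', 'go', 'fish' ]
```

Note that both "fishing" here and "fish" in another document end up indexed under the same root term, which is what lets a search for one find the other.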

You are probably wondering how a word like the or it would be stemmed, and what happens if the documents aren’t in English. The answer is that those and similar words would not be stemmed, and MongoDB text search supports many languages.

The MongoDB text search engine is a proprietary engine written by the MongoDB, Inc. team for text data retrieval. MongoDB text search also takes advantage of the Snowball string-processing language, which provides support for the stemming of words and for stop words, those words that are skipped during indexing because they don’t represent any valuable concepts in terms of indexing or searching.

The thing to take away from this is that MongoDB’s text search is incredibly complex and is designed to be as flexible and accurate as possible.

Text Search Costs and Limitations

As you can imagine from what you’ve learned about how text search functions, there are some costs associated with using MongoDB text search. First, creating a text index changes the storage allocation for future documents in that collection to the usePowerOf2Sizes option, which instructs MongoDB to allocate storage in a way that allows more efficient reuse of free space. Second, text indexes are large and can grow very quickly depending on the number of documents you store and the number of tokens within each indexed field. Third, building a text index on existing documents is time-consuming, and adding new entries to a field that has a text index is also more costly. Fourth, like everything in MongoDB, text indexes work better when they fit in RAM. Finally, because of their complexity and size, text indexes are currently limited to one per collection.
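To make the first of these costs concrete, usePowerOf2Sizes rounds each record's allocation up to a power of two so that slots freed by deleted or moved documents can be reused by later documents of similar size. A rough sketch of that rounding (illustrative arithmetic only, with an assumed minimum slot size, not MongoDB's actual allocator):

```javascript
// Round a document's on-disk size up to the next power of two, the
// allocation strategy the usePowerOf2Sizes option enables. The minimum
// slot size of 32 bytes is an assumption for this sketch.
function powerOf2Allocation(sizeInBytes) {
  let allocation = 32;
  while (allocation < sizeInBytes) {
    allocation *= 2; // double until the document fits
  }
  return allocation;
}

console.log(powerOf2Allocation(1000)); // -> 1024
console.log(powerOf2Allocation(1025)); // -> 2048
```

The trade-off is some wasted space per document in exchange for free slots that come in a small number of standard sizes, which makes them far easier to reuse.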

Enabling Text Search

As mentioned earlier, text search was introduced in MongoDB 2.4 as an experimental or beta feature. As such, you need to explicitly enable the text search functions on every MongoDB instance (and every MongoS, if you are sharding) that will use this feature in your cluster. There are three ways you can enable text search. The first is to add the following option to the command you use to start your MongoDB processes:

--setParameter textSearchEnabled=true

The second method is to add the following option to your MongoDB instance’s configuration file:

setParameter = textSearchEnabled=true

The third and final method to get text search working on a MongoDB instance is to run the following command via the Mongo shell:

db.adminCommand({ setParameter: 1, textSearchEnabled : true });

With this set you can now work with the MongoDB full text search features on this node.

Note  The fact that this feature is in beta does not mean it doesn’t work. The MongoDB, Inc. team has put a considerable amount of effort into trying to get this feature right. By using the feature and reporting any issues you have on the MongoDB, Inc. JIRA (jira.mongodb.org), you can help them get this feature ready for full release.

By now you should have enabled the text search features on your MongoDB instance and be ready to take advantage of them! Let’s look at how to create a text search index and perform text searches.

Using Text Search

Despite all the complexity we’ve described, MongoDB text search is surprisingly easy to use; you create a text index in the same way as any other index. For example, to create a text index on the content element of our theoretical blog collection, we would run the following:

db.blog.ensureIndex( { content : "text" } );

And that’s it. MongoDB will take care of the rest and insert a text index into your database, and all future documents that have a content field will be processed and have entries added to the text index to be searched. But just creating an index isn’t enough; we also need a suitable set of text data to work with and query.

Loading Text Data

Originally, we had planned to use a live stream of data from Twitter, but the documents were too ungainly to work with. So we have instead created a small batch of eight documents mimicking Twitter feeds to take text search out for a spin.

Go ahead and mongoimport the data from the twitter.tgz archive into your database:

$ mongoimport test.json -d test -c texttest
connected to: 127.0.0.1
Sat Jul  6 17:52:19 imported 8 objects

Now that we have the data imported, go ahead and enable text indexing if it isn’t already enabled:

db.adminCommand({ setParameter: 1, textSearchEnabled : true });
{ "was" : false, "ok" : 1 }

Now that we have text indexing enabled, we should create a text index on the twitter data.

Creating a Text Index

In the case of the Twitter data, the portion we are concerned with is the body field, which holds the text of the tweet. To set up the text index we run the following command:

use test;
db.texttest.ensureIndex( { body : "text" } );

If you see the error message “text search not enabled,” you need to ensure that text search is enabled, using the commands shown earlier. Now if you review your logs you will see the following, which shows the text index being built:

Sat Jul  6 17:54:16.078 [conn41] build index test.texttest { _fts: "text", _ftsx: 1 }
Sat Jul  6 17:54:16.089 [conn41] build index done. scanned 8 total records. 0.01 secs

We can also check the indexes for the collection:

db.texttest.getIndexes()
[
     {
          "v" : 1,
          "key" : {
               "_id" : 1
          },
          "ns" : "test.texttest",
          "name" : "_id_"
     },
     {
          "v" : 1,
          "key" : {
               "_fts" : "text",
               "_ftsx" : 1
          },
          "ns" : "test.texttest",
          "name" : "body_text",
          "weights" : {
               "body" : 1
          },
          "default_language" : "english",
          "language_override" : "language",
          "textIndexVersion" : 1
     }
]

Okay, we’ve enabled text search, created our index, and confirmed that it’s there; now let’s run our text search command.

Running the Text Search Command

In the version of MongoDB we are using there is no shell helper for the text command, so we execute it with the runCommand syntax as follows:

> db.texttest.runCommand( "text", { search :"fish" } )

This command will return any documents that match the query string of "fish". In this case it has returned two documents. The output shows quite a bit of debug information, along with a "results" array containing a number of documents. Each result combines the score for the matching document with the matching document itself, returned as obj. You can see that the text portions of both matching documents contain the word fish or fishing, either of which matches our query! It’s also worth noting that MongoDB text indexes are case-insensitive, which is an important consideration when performing your text queries.

Note  Remember that all entries in text search are tokenized and stemmed. This means that words like fishy or fishing will be stemmed down to the word fish.

In addition, you can see the score, which was 0.75 or 0.666, indicating the relevance of each result to your query; the higher the value, the better the match. You can also see the stats for the query, including the number of objects returned (2) and the time taken, which was 112 microseconds.

{
     "queryDebugString" : "fish||||||",
     "language" : "english",
     "results" : [
          {
               "score" : 0.75,
               "obj" : {
                    "_id" : ObjectId("51d7ccb36bc6f959debe5514"),
                    "number" : 1,
                    "body" : "i like fish",
                    "about" : "food"
               }
          },
          {
               "score" : 0.6666666666666666,
               "obj" : {
                    "_id" : ObjectId("51d7ccb36bc6f959debe5516"),
                    "number" : 3,
                    "body" : "i like to go fishing",
                    "about" : "recreation"
               }
          }
     ],
     "stats" : {
          "nscanned" : 2,
          "nscannedObjects" : 0,
          "n" : 2,
          "nfound" : 2,
          "timeMicros" : 112
     },
     "ok" : 1
}
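One way scores like 0.75 and 0.666 can arise is from a per-term formula of 0.5 plus 0.5 scaled by the term's frequency over the document's indexed (non-stop-word) term count. The sketch below reproduces the two scores above under that assumption, but it is a simplification, not the engine's exact algorithm, and the stop-word list and prefix matching are invented stand-ins:

```javascript
// Simplified relevance scoring: each matched term contributes a base of
// 0.5 plus 0.5 scaled by its frequency over the document's indexed term
// count. An approximation for illustration, not MongoDB's exact formula.
const STOP_WORDS = new Set(["i", "a", "to", "the"]); // tiny assumed stop list

function indexedTerms(body) {
  return body.toLowerCase().split(/\s+/).filter(w => !STOP_WORDS.has(w));
}

function score(body, searchTerm) {
  const terms = indexedTerms(body);
  // Prefix matching is a crude stand-in for stemming ("fishing" matches "fish").
  const freq = terms.filter(t => t.startsWith(searchTerm)).length;
  if (freq === 0) return 0;
  return 0.5 + 0.5 * (freq / terms.length);
}

console.log(score("i like fish", "fish"));          // -> 0.75
console.log(score("i like to go fishing", "fish")); // -> 0.6666666666666666
```

Under this model, "i like fish" has two indexed terms and one match (0.5 + 0.5 * 1/2 = 0.75), while "i like to go fishing" has three indexed terms and one match (0.5 + 0.5 * 1/3 = 0.666), matching the output above.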

Now let’s examine some other text search features that we can use to enhance our text queries.

Filtering Text Queries

The first thing we can do is to filter the text queries. To refine our fish query, let’s say we only want documents that refer to fish as food, and not any that match “fishing” the activity. To add this additional parameter, we use the filter option and provide a document with a normal query. So in order to find our fish as food, we run the following:

> db.texttest.runCommand( "text", { search : "fish", filter : { about : "food" } })
{
     "queryDebugString" : "fish||||||",
     "language" : "english",
     "results" : [
          {
               "score" : 0.75,
               "obj" : {
                    "_id" : ObjectId("51d7ccb36bc6f959debe5514"),
                    "number" : 1,
                    "body" : "i like fish",
                    "about" : "food"
               }
          }
     ],
     "stats" : {
          "nscanned" : 2,
          "nscannedObjects" : 2,
          "n" : 1,
          "nfound" : 1,
          "timeMicros" : 101
     },
     "ok" : 1
}

That’s perfect; we’ve returned only the one item we wanted, without the unrelated “fishing” document. Notice that the nscanned and nscannedObjects values are 2, which denotes that this query scanned two documents from the index (nscanned) and then had to retrieve both documents to review their contents (nscannedObjects) in order to return one matching document (n). Now let’s look at another example.

More Involved Text Searching

First run the following query, which will return two documents. The results have been cut down to just the text fields for brevity.

db.texttest.runCommand( "text", { search : "cook" })
"body" : "i want to cook dinner",
"body" : "i am to cooking lunch",

As you can see, we have two documents, both of which are about cooking a meal. Let’s say we want to exclude lunch from our search and only return dinner. We can do this by adding -lunch to exclude the text lunch from our search.

> db.texttest.runCommand( "text", { search : "cook -lunch" })
{
     "queryDebugString" : "cook||lunch||||",
     "language" : "english",
     "results" : [
          {
               "score" : 0.6666666666666666,
               "obj" : {
                    "_id" : ObjectId("51d7ccb36bc6f959debe5518"),
                    "number" : 5,
                    "body" : "i want to cook dinner",
                    "about" : "activities"
               }
          }
     ],
     "stats" : {
          "nscanned" : 2,
          "nscannedObjects" : 0,
          "n" : 1,
          "nfound" : 1,
          "timeMicros" : 150
     },
     "ok" : 1
}

Notice first that the queryDebugString contains both cook and lunch, as these are the search terms we used. Also note that two entries were scanned, but only one was returned. The search works by first finding all matches and then eliminating nonmatches.
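That find-then-eliminate behavior can be sketched as two filter passes. This is a simplification that matches raw strings rather than walking the text index:

```javascript
// Sketch of negated-term handling: match on the positive terms first,
// then eliminate documents containing any negated term. Simplified; the
// real engine works against the tokenized, stemmed text index.
function textSearch(docs, query) {
  const terms = query.split(/\s+/);
  const positive = terms.filter(t => !t.startsWith("-"));
  const negated = terms.filter(t => t.startsWith("-")).map(t => t.slice(1));

  return docs
    .filter(d => positive.every(t => d.body.includes(t)))  // find all matches
    .filter(d => !negated.some(t => d.body.includes(t)));  // eliminate nonmatches
}

const docs = [
  { body: "i want to cook dinner" },
  { body: "i am to cooking lunch" },
];
console.log(textSearch(docs, "cook -lunch")); // only the dinner document
```

Both documents pass the first filter (each contains "cook"), but the lunch document is eliminated by the second pass, mirroring the nscanned of 2 and n of 1 above.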

The last search function that people may find valuable is string-literal searching, which can be used to match specific words or phrases without stemming. As it stands, each element of our search is tokenized and stemmed, and then each resulting term is evaluated. Take the following query:

> db.texttest.runCommand( "text", { search : "mongodb text search" })
{
     "queryDebugString" : "mongodb|search|text||||||",
     "language" : "english",
     "results" : [
          {
               "score" : 3.875,
               "obj" : {
                    "_id" : ObjectId("51d7ccb36bc6f959debe551a"),
                    "number" : 7,
                    "body" : "i like mongodb text search",
                    "about" : "food"
               }
          },
          {
               "score" : 3.8000000000000003,
               "obj" : {
                    "_id" : ObjectId("51d7ccb36bc6f959debe551b"),
                    "number" : 8,
                    "body" : "mongodb has a new text search feature",
                    "about" : "food"
               }
          }
     ],
     "stats" : {
          "nscanned" : 6,
          "nscannedObjects" : 0,
          "n" : 2,
          "nfound" : 2,
          "timeMicros" : 537
     },
     "ok" : 1
}

You can see in the queryDebugString that each element was evaluated and queried against. You can also see that this query evaluated and found two documents. Now notice the difference when we run the same query with escaped quote marks to make it a string literal:

> db.texttest.runCommand( "text", { search : "\"mongodb text search\"" })
{
     "queryDebugString" : "mongodb|search|text||||mongodb text search||",
     "language" : "english",
     "results" : [
          {
               "score" : 3.875,
               "obj" : {
                    "_id" : ObjectId("51d7ccb36bc6f959debe551a"),
                    "number" : 7,
                    "body" : "i like mongodb text search",
                    "about" : "food"
               }
          }
     ],
     "stats" : {
          "nscanned" : 6,
          "nscannedObjects" : 0,
          "n" : 1,
          "nfound" : 1,
          "timeMicros" : 134
     },
     "ok" : 1
}

You can see that only one document is returned, the document that actually contains the text in question. You can also see that in the queryDebugString the final element is the string itself rather than just the three tokenized and stemmed elements.

Additional Options

In addition to those we have discussed so far, there are three other options you can add into the text function. The first is limit, which limits the number of documents returned. It can be used as follows:

> db.texttest.runCommand( "text", { search :"fish", limit : 1 } )

The second option is project, which allows you to set the fields that will be displayed as the result of the query. This option takes a document describing which fields you wish to display, with 0 being off and 1 being on. By default when specifying this option all elements are off except for _id, which is on.

> db.texttest.runCommand( "text", { search :"fish", project :  { _id : 0, body : 1 } } )

The third and final option is language, which allows you to specify which language the text search will use. If no language is specified, then the index’s default language is used. The language must be specified all in lower case. It can be invoked as follows:

> db.texttest.runCommand( "text", { search :"fish", language :  "french" } )

Currently text search supports the following languages:

  • Danish
  • Dutch
  • English
  • Finnish
  • French
  • German
  • Hungarian
  • Italian
  • Norwegian
  • Portuguese
  • Romanian
  • Russian
  • Spanish
  • Swedish
  • Turkish

For more details on what is currently supported within MongoDB’s text search, see the page http://docs.mongodb.org/manual/reference/command/text/.

Text Indexes in Other Languages

We originally created a simple text index earlier in order to get started with our text work. But there are a number of additional techniques you can use to make your text index better suited to your workload. You may recall from earlier that the logic for how words are stemmed changes based on the language MongoDB uses to perform the stemming. By default, all indexes are created in English, but this is not suitable for many people, as their data may not be in English and the rules differ from language to language. You can specify the language to be used within each query, but that isn’t exactly friendly when you already know which language your documents are in. Instead, you can specify the default language by adding that option to the index creation:

db.texttest.ensureIndex( { content : "text" }, { default_language : "french" } );

This will create a text index with the French language as the default. Now remember that you can only have one text index per collection, so you will need to drop any other indexes before creating this one.

But what if we have multiple languages in one collection? The text index feature offers a solution, but it requires you to tag all your documents with the correct language. You may think it would be better for MongoDB to determine which language a given document is in, but there is no programmatic way to make an exact linguistic match. Instead, MongoDB allows you to work with documents that specify their own language. For example, take the following four documents:

{ _id : 1, content : "cheese", lingvo : "english" }
{ _id : 2, content : "fromage", lingvo: "french" }
{ _id : 3, content : "queso", lingvo: "spanish" }
{ _id : 4, content : "ost", lingvo: "swedish" }

They include four languages (in the lingvo fields), and if we kept any single default language we would need to specify the language in every search. Because each document specifies the language its content is in, we can use this field as a language override, and the given language will be used rather than the default. We can create an index with this as follows:

db.texttest.ensureIndex( { content : "text" }, { language_override : "lingvo" } );

Thus the language used for each of those documents will be the one provided in its lingvo field, and any documents lacking a lingvo field will use the default language, in this case English.
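The per-document language resolution boils down to: use the value of the override field if present, otherwise fall back to the index default. A minimal sketch:

```javascript
// Sketch of language_override resolution: a document's own override field
// (here lingvo) wins; otherwise the index's default language applies.
// Illustrative only, not MongoDB's internal code.
function indexLanguage(doc, overrideField = "lingvo", defaultLanguage = "english") {
  return doc[overrideField] || defaultLanguage;
}

console.log(indexLanguage({ _id : 2, content : "fromage", lingvo : "french" })); // -> french
console.log(indexLanguage({ _id : 9, content : "cheese" }));                     // -> english
```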

Compound Indexing with Text Indexes

Although it is true that you can only have one text index on a collection, you can have that text index cover a number of fields in a document, or even all fields. You can specify extra fields just as you would for a normal index. Let’s say we want to index both the content and any comments; we can do that as follows, which lets us make text searches on both fields.

db.texttest.ensureIndex( { content : "text", comments : "text" });

You may even want to create a text index on all the fields in a document. MongoDB has a wildcard specifier that can be used to reference all text elements of all documents; the notation is "$**". If you use this form for your text index, you will need to add the name option to the index creation. This way, the automatically generated name will not be used, avoiding problems caused by the generated name being too long. The maximum length of an index name is 121 characters, which includes the names of the collection, the database, and the fields to be indexed.

Note  It is strongly recommended that you specify a name with any compound index that has a text field to avoid running into issues caused by the name length.

This gives us the following syntax for creating a text index named alltextindex on all string elements of all documents in the texttest collection:

db.texttest.ensureIndex( { "$**": "text" }, { name: "alltextindex" } )

The next thing you can do with a compound text index is specify weights for its different text fields. You do this by adding weight values above the default of 1 to each field that you will index. The values will then increase the importance of results from the given field in a ratio of N:1. Take the following example index:

db.texttest.ensureIndex( { content : "text", comments : "text" }, { weights : { content : 10, comments : 5 } } );

This index means that matches in the content portion of the document are given twice the precedence (10 versus 5) of matches in the comments values. Any other field has the default weight of 1, compared to the weight of 5 for comments and 10 for content. You can also combine weights and the wildcard text search parameter to weight specific fields.
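To illustrate the effect of weights, here is a toy ranking function in which a match in a field contributes that field's weight (default 1) to the document's score. A real engine multiplies a per-term relevance score by the weight rather than adding constants; this is only a sketch of the ratio at work:

```javascript
// Toy weighted scoring: a term match in a field contributes that field's
// weight, so content matches count 10 and comments matches count 5.
// Illustrative only, not MongoDB's scoring code.
const weights = { content: 10, comments: 5 }; // from the index definition above

function weightedScore(doc, term) {
  let total = 0;
  for (const field of Object.keys(doc)) {
    const weight = weights[field] || 1; // unweighted fields default to 1
    if (typeof doc[field] === "string" && doc[field].includes(term)) {
      total += weight;
    }
  }
  return total;
}

const doc = { content: "mongodb text search", comments: "text search is neat" };
console.log(weightedScore(doc, "search")); // content (10) + comments (5) = 15
```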

Note  Be aware that if you try to create an additional text index on a collection that already has one, you will get a “too many text indexes” error. If that happens, you should drop the existing text index to allow you to create the new one.

In addition to creating compound indexes with other text fields, you can create compound indexes with other non-text fields. You can build these indexes just as you would add any other index, as in this example:

db.texttest.ensureIndex( { content : "text", username : 1 });

This command creates a text index on the content portion of the document and a normal index on the username portion. This can be especially useful when using the filter parameter, as the filter is effectively a normal query run against each document matched by the text search. Those filter fields, too, will need to be read either from an index or by reading the document itself. Let’s look at our example from earlier:

db.texttest.runCommand( "text", { search : "fish", filter : { about : "food" } })

Given the filter on this query, we will need to index the about portion of the document; otherwise, every potentially matching document would need to be fully read and then validated against the filter, which is a costly process. We can avoid those reads with an index that includes the about element:

db.texttest.ensureIndex( { about : 1, content : "text" });

Now let’s run the find command again:

> db.texttest.runCommand( "text", { search : "fish", filter : { about : "food" } })
{
     "queryDebugString" : "fish||||||",
     "language" : "english",
     "results" : [
          {
               "score" : 0.75,
               "obj" : {
                    "_id" : ObjectId("51d7ccb36bc6f959debe5514"),
                    "number" : 1,
                    "body" : "i like fish",
                    "about" : "food"
               }
          }
     ],
     "stats" : {
          "nscanned" : 1,
          "nscannedObjects" : 0,
          "n" : 1,
          "nfound" : 1,
          "timeMicros" : 95
     },
     "ok" : 1
}

You can see that there are no scanned objects, which should improve the overall efficiency of the query. With these options you should be able to drive some real flexibility and power into your text searching.

You should now see the enormous potential of MongoDB’s newest searching feature, and you should have the knowledge to put text search to work on your own data.

The Aggregation Framework

The aggregation framework in MongoDB represents the ability to perform a selection of operations on all of the data in your collection. This is done by creating a pipeline of aggregation operations that will be executed in order: the first operation works on the raw data, and each subsequent operation works on the results of the previous one. Those of you familiar with the Linux or Unix shell will recognize this as a shell pipeline of operations.
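The pipeline idea can be modeled in a few lines of JavaScript: each stage is a function that takes the previous stage's output as its input. This is a toy model for building intuition, not MongoDB's implementation:

```javascript
// Toy aggregation pipeline: each stage consumes the previous stage's
// output, like shell commands joined by pipes. Not MongoDB's engine.
function runPipeline(docs, stages) {
  return stages.reduce((current, stage) => stage(current), docs);
}

// Stage constructors mimicking $match and $limit.
const match = pred => docs => docs.filter(pred);
const limit = n => docs => docs.slice(0, n);

const data = [{ num: 700 }, { num: 200 }, { num: 900 }, { num: 600 }];
const result = runPipeline(data, [
  match(d => d.num > 500), // like { $match : { num : { $gt : 500 } } }
  limit(2),                // like { $limit : 2 }
]);
console.log(result); // [ { num: 700 }, { num: 900 } ]
```

The key property is ordering: swapping the two stages here would limit the raw documents first and then filter, which can produce a different (and usually unintended) result.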

Within the aggregation framework there is a plethora of operators that can be used as part of your aggregations to corral your data. Here we will cover some of the high-level pipeline operators and run through some examples of how to use them. This means we will be covering the following operators:

  • $group
  • $limit
  • $match
  • $sort
  • $unwind
  • $project
  • $skip

For further details about the full suite of operators, check out the aggregation documentation, available at http://docs.mongodb.org/manual/aggregation/. We’ve created an example collection you can use to test some of the aggregation commands. Extract the archive with the following command:

$ tar -xvf test.tgz
x test/
x test/aggregation.bson
x test/aggregation.metadata.json
x test/mapreduce.bson
x test/mapreduce.metadata.json

The next thing to do is to run the mongorestore command to restore the test database:

$ mongorestore test
connected to: 127.0.0.1
Sun Jul 21 19:26:21.342 test/aggregation.bson
Sun Jul 21 19:26:21.342      going into namespace [test.aggregation]
1000 objects found
Sun Jul 21 19:26:21.350      Creating index: { key: { _id: 1 }, ns: "test.aggregation", name: "_id_" }
Sun Jul 21 19:26:21.688 test/mapreduce.bson
Sun Jul 21 19:26:21.689      going into namespace [test.mapreduce]
1000 objects found
Sun Jul 21 19:26:21.695      Creating index: { key: { _id: 1 }, ns: "test.mapreduce", name: "_id_" }

Now that we have a collection of data to work with, we need to look at how to run an aggregation command and how to build an aggregation pipeline. To run an aggregation query we use the aggregate command and provide it a single document that contains the pipeline. For our tests we will run the following aggregation command with various pipeline documents:

> db.aggregation.aggregate({pipeline document})

So, without further ado, let’s start working through our aggregation examples.

$group

The $group command does what its name suggests; it groups documents together so you can create an aggregate of the results. Let’s start by creating a simple group command that will list out all the different colors within our “aggregation” collection. To begin, we create an _id document that will list all the elements from our collection that we want to group. So, we start our pipeline document with the $group command and add to it our _id document:

{ $group : { _id : "$color" } }

Now you can see we have the _id value of "$color". Note that there is a $ sign in front of the name color; this indicates that the element is a reference from a field in our documents. That gives us our basic document structure, so let’s execute the aggregation:

> db.aggregation.aggregate( { $group : { _id : "$color" } } )
{
     "result" : [
          {
               "_id" : "red"
          },
          {
               "_id" : "maroon"
          },
...
          {
               "_id" : "grey"
          },
          {
               "_id" : "blue"
          }
     ],
     "ok" : 1
}

$sum

From the results of the $group operator you can see that we have a number of different colors in our result stack. The result is an array of elements, which contain a number of documents, each with an _id value of one of the colors in the "color" field from a document. This doesn’t really tell us much, so let’s expand what we do with our $group command. We can add a count to our group with the $sum operator, which can increment a value for each instance of the value found. To do this, we add an extra value to our $group command by providing a name for the new field and what its value should be. In this case, we want a field called "count", as it represents the number of times each color occurs; its value is to be {$sum : 1}, which means that we want to create a sum per document and increase it by 1 each time. This gives us the following document:

{ $group : { _id : "$color", count : { $sum : 1 } } }

Let’s run our aggregation with this new document:

> db.aggregation.aggregate( { $group : { _id : "$color", count : { $sum : 1 } } } )
{
     "result" : [
          {
               "_id" : "red",
               "count" : 90
          },
          {
               "_id" : "maroon",
               "count" : 91
          },
...
          {
               "_id" : "grey",
               "count" : 91
          },
          {
               "_id" : "blue",
               "count" : 91
          }
     ],
     "ok" : 1
}
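The counting that { $sum : 1 } performs inside $group can be sketched in plain JavaScript as bucketed counting (a toy model, not MongoDB's engine):

```javascript
// Plain-JavaScript sketch of { $group : { _id : "$color", count : { $sum : 1 } } }:
// bucket documents by their color value, adding 1 per document seen.
function groupCount(docs, keyField) {
  const buckets = {};
  for (const doc of docs) {
    const key = doc[keyField];
    buckets[key] = (buckets[key] || 0) + 1; // the { $sum : 1 } step
  }
  // Emit one result document per bucket, shaped like the shell output.
  return Object.entries(buckets).map(([_id, count]) => ({ _id, count }));
}

const sample = [{ color: "red" }, { color: "blue" }, { color: "red" }];
console.log(groupCount(sample, "color")); // [ { _id: 'red', count: 2 }, { _id: 'blue', count: 1 } ]
```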

Now you can see how often each color occurs. We can further expand what we are grouping by adding extra elements to the _id document. Let’s say we want to find groups of "color" and "transport". To do that, we can change _id to be a document that contains a subdocument of items as follows:

{ $group : { _id : { color: "$color", transport: "$transport"} , count : { $sum : 1 } } }

If we run this we get a result that is about 50 elements long, far too long to display here. There is a solution to this: the $limit operator.

$limit

The $limit operator is the next pipeline operator we will work with. As its name implies, $limit is used to limit the number of results returned. In our case we want to make the results of our existing pipeline more manageable, so let’s add a limit of 5 to the results. To add this limit, we need to turn our one document into an array of pipeline documents.

[
        { $group : { _id : { color: "$color", transport: "$transport"} , count : { $sum : 1 } } },
        { $limit : 5 }
]

This will give us the following results:

> db.aggregation.aggregate( [ { $group : { _id : { color: "$color", transport: "$transport"} , count : { $sum : 1 } } }, { $limit : 5 } ] )
{
     "result" : [
          {
               "_id" : {
                    "color" : "maroon",
                    "transport" : "motorbike"
               },
               "count" : 18
          },
          {
               "_id" : {
                    "color" : "orange",
                    "transport" : "autombile"
               },
               "count" : 18
          },
          {
               "_id" : {
                    "color" : "green",
                    "transport" : "train"
               },
               "count" : 18
          },
          {
               "_id" : {
                    "color" : "purple",
                    "transport" : "train"
               },
               "count" : 18
          },
          {
               "_id" : {
                    "color" : "grey",
                    "transport" : "plane"
               },
               "count" : 18
          }
     ],
     "ok" : 1
}

You can now see the extra fields from the transport element added to _id, and we have limited the results to only five. You should now see how we can build pipelines from multiple operators to draw aggregated information from our collection.

$match

The next operator we will review is $match, which is used to return the results of a normal MongoDB query within your aggregation pipeline. The $match operator is best used at the start of the pipeline to limit the number of documents that are initially put into the pipeline; by limiting the number of documents processed, we significantly reduce performance overhead. For example, suppose we want to perform our pipeline operations on only those documents that have a num value greater than 500. We can use the query { num : { $gt : 500 } } to return all documents matching this criterion. If we add this query as a $match to our existing aggregation, we get the following:

[
        { $match : { num : { $gt : 500 } } },
        { $group : { _id : { color: "$color", transport: "$transport"} , count : { $sum : 1 } } },
        { $limit : 5 }
]

This returns the following result:

{
     "result" : [
          {
               "_id" : {
                    "color" : "white",
                    "transport" : "boat"
               },
               "count" : 9
          },
          {
               "_id" : {
                    "color" : "black",
                    "transport" : "motorbike"
               },
               "count" : 9
          },
          {
               "_id" : {
                    "color" : "maroon",
                    "transport" : "train"
               },
               "count" : 9
          },
          {
               "_id" : {
                    "color" : "blue",
                    "transport" : "autombile"
               },
               "count" : 9
          },
          {
               "_id" : {
                    "color" : "green",
                    "transport" : "autombile"
               },
               "count" : 9
          }
     ],
     "ok" : 1
}

You will notice that the results returned are almost completely different from those of the previous examples. This is because adding the $match stage changes which documents enter the pipeline, and with it the order in which groups are produced; combined with the limit, this removes some of the groups we saw earlier. You will also see that our counts are roughly half the earlier values. This is because we have cut the set of data we aggregate over to about half its previous size. If we want consistency among our returned results, we need to invoke another pipeline operator, $sort.

$sort

As you’ve just seen, the $limit operator can change which documents are returned in the result, because it simply takes the first documents in whatever order the preceding stage produced them. This can be fixed with the $sort operator. We simply need to apply a sort on a particular field before applying the limit in order to return the same set of limited results. The $sort syntax is the same as it is for a normal query: you specify the fields you wish to sort by, with a positive value for ascending and a negative value for descending order. To show how this works, let’s run our pipeline with and without the match, each with a limit of 1. You will see that with the $sort prior to the $limit, both runs return the same document.

This gives us the first query of

[
        { $group : { _id : { color: "$color", transport: "$transport"} , count : { $sum : 1 } } },
        { $sort : { _id : 1 } },
        { $limit : 1 }
]

The result of this query is:

{
     "result" : [
          {
               "_id" : {
                    "color" : "black",
                    "transport" : "autombile"
               },
               "count" : 18
          }
     ],
     "ok" : 1
}

The second query looks like this:

[
        { $match : { num : { $gt : 500 } } },
        { $group : { _id : { color: "$color", transport: "$transport"} , count : { $sum : 1 } } },
        { $sort : { _id : 1 } },
        { $limit : 1 }
]

The result of this query is

{
     "result" : [
          {
               "_id" : {
                    "color" : "black",
                    "transport" : "autombile"
               },
               "count" : 9
          }
     ],
     "ok" : 1
}

You will notice that both queries now return the same document, differing only in the count. This means that our sort is applied before the limit, giving us a consistent result. These operators should give you an idea of the power you can wield by building a pipeline of operators that manipulate your documents until you get the desired result.
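The sort-before-limit principle can be sketched in a few lines of plain JavaScript. The documents and the sortThenLimit helper below are hypothetical stand-ins for a [{ $sort }, { $limit }] pipeline stage pair:

```javascript
// Why sorting before limiting gives stable results: whatever order the
// documents arrive in, the limited output is the same.
var docs = [
  { _id: "maroon", count: 18 },
  { _id: "black",  count: 18 },
  { _id: "grey",   count: 18 }
];

function sortThenLimit(input, n) {
  // Copy, sort ascending by _id (the { _id : 1 } spec), then take the first n.
  return input.slice().sort(function (a, b) {
    return a._id < b._id ? -1 : a._id > b._id ? 1 : 0;
  }).slice(0, n);
}

var shuffled = [docs[2], docs[0], docs[1]];   // same docs, different order
var first = sortThenLimit(docs, 1);
var second = sortThenLimit(shuffled, 1);
```

Both calls return the "black" document first, regardless of input order, which is exactly why adding $sort before $limit made our two aggregation runs agree.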

$unwind

The next operator we will look at is $unwind. It takes an array and splits out each element into a new document (in memory, not added to your collection). As with a Unix shell pipeline, the best way to understand what the $unwind operator outputs is simply to run it on its own and evaluate the result. Let’s check out the results of $unwind:

> db.aggregation.aggregate({ $unwind : "$vegetables" });
{
     "result" : [
          {
               "_id" : ObjectId("51de841747f3a410e3000001"),
               "num" : 1,
               "color" : "blue",
               "transport" : "train",
               "fruits" : [
                    "orange",
                    "banana",
                    "kiwi"
               ],
               "vegetables" : "corn"
          },
          {
               "_id" : ObjectId("51de841747f3a410e3000001"),
               "num" : 1,
               "color" : "blue",
               "transport" : "train",
               "fruits" : [
                    "orange",
                    "banana",
                    "kiwi"
               ],
               "vegetables" : "brocoli"
          },
          {
               "_id" : ObjectId("51de841747f3a410e3000001"),
               "num" : 1,
               "color" : "blue",
               "transport" : "train",
               "fruits" : [
                    "orange",
                    "banana",
                    "kiwi"
               ],
               "vegetables" : "potato"
          },
...
     ],
     "ok" : 1
}

We now have 3,000 documents in our result array: one copy of each source document per vegetable, with the vegetables field replaced by a single value and the rest of the original document intact. You can see the power of $unwind, and how with a very large collection of giant documents you could get yourself into trouble. Always remember that if you run your $match first, you can cut down the number of documents before running the other, more intensive aggregation operations.
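A plain-JavaScript sketch makes the mechanics of $unwind concrete. The sample document below is hypothetical, modeled on the output above; MongoDB does this work server-side:

```javascript
// Sketch of $unwind: one output document per array element, with the array
// field replaced by that single element.
var doc = {
  num: 1,
  color: "blue",
  vegetables: ["corn", "brocoli", "potato"]
};

function unwind(d, field) {
  return d[field].map(function (value) {
    // Shallow-copy the source document and overwrite the array field.
    var copy = {};
    for (var k in d) { copy[k] = d[k]; }
    copy[field] = value;
    return copy;
  });
}

var unwound = unwind(doc, "vegetables");
```

One input document with a three-element array becomes three output documents, which is how 1,000 source documents turned into 3,000 results above.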

$project

Our next operator, $project, is used to limit the fields or to rename fields returned as part of a document. This is just like the field-limiting arguments that can be set on find commands. It’s the perfect way to cut down on any excess fields returned by your aggregations. Let’s say we want to see only the fruit and vegetables for each of our documents; we can provide a document that shows which elements we want to be displayed (or not) just as we would add to our find command. Take the following example:

[
{ $unwind : "$vegetables" },
{ $project : { _id: 0, fruits:1, vegetables:1 } }
]

This projection returns the following result:

{
     "result" : [
          {
               "fruits" : [
                    "orange",
                    "banana",
                    "kiwi"
               ],
               "vegetables" : "corn"
          },
          {
               "fruits" : [
                    "orange",
                    "banana",
                    "kiwi"
               ],
               "vegetables" : "brocoli"
          },
          {
               "fruits" : [
                    "orange",
                    "banana",
                    "kiwi"
               ],
               "vegetables" : "potato"
          },
...
     ],
     "ok" : 1
}

That’s better than before, as now our documents are not as big. Better still would be to cut down on the number of documents returned. Our next operator will help with that.

$skip

$skip is a pipeline operator complementary to the $limit operator; instead of limiting results to the first X documents, it skips over the first X documents and returns all the remaining ones. We can use it to cut down on the number of documents returned. If we add it to our previous pipeline with a value of 2995, only the last five of the 3,000 unwound documents will be returned. This gives us the following pipeline:

[
{ $unwind : "$vegetables" },
{ $project : { _id: 0, fruits:1, vegetables:1 } },
{ $skip : 2995 }
]

With a result of

{
     "result" : [
          {
               "fruits" : [
                    "kiwi",
                    "pear",
                    "lemon"
               ],
               "vegetables" : "pumpkin"
          },
          {
               "fruits" : [
                    "kiwi",
                    "pear",
                    "lemon"
               ],
               "vegetables" : "mushroom"
          },
          {
               "fruits" : [
                    "pear",
                    "lemon",
                    "cherry"
               ],
               "vegetables" : "pumpkin"
          },
          {
               "fruits" : [
                    "pear",
                    "lemon",
                    "cherry"
               ],
               "vegetables" : "mushroom"
          },
          {
               "fruits" : [
                    "pear",
                    "lemon",
                    "cherry"
               ],
               "vegetables" : "capsicum"
          }
     ],
     "ok" : 1
}

And that’s how you can use the $skip operator to reduce the number of entries returned. You can also use the complementary $limit operator to limit the number of results in the same manner, and even combine them to pick out a set number of results in the middle of a collection. Let’s say we wanted results 1501–1510 of our 3,000-entry data set. We could provide a $skip value of 1500 and a $limit of 10, which would return only the 10 results we wanted.
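The skip-then-limit paging idea can be sketched in plain JavaScript. The generated documents below are hypothetical stand-ins for the 3,000 unwound results:

```javascript
// Sketch of paging with [{ $skip : 1500 }, { $limit : 10 }]: skip drops the
// first 1500 documents, then limit keeps the next 10.
var results = [];
for (var i = 0; i < 3000; i++) { results.push({ n: i }); }

function skipThenLimit(input, skip, limit) {
  return input.slice(skip).slice(0, limit);
}

var page = skipThenLimit(results, 1500, 10);
```

Skipping 1500 documents means the page starts at the 1501st document, so the combination returns results 1501–1510 of the set.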

We’ve reviewed just a few of the top-level pipeline operators available within the MongoDB aggregation framework. There is also a whole host of smaller operators that can be used within the top-level pipeline operators as pipeline expressions, including geographic functions, mathematical functions such as average, accumulators such as first and last, and a number of date/time and other operations. All of these can be combined to perform aggregation operations like the ones we have covered. Just remember that each operation in a pipeline is performed on the results of the previous operation, and that you can output and step through them to create your desired result.

MapReduce

MapReduce is one of the most complex query mechanisms within MongoDB. It works by taking two JavaScript functions, map and reduce. These two functions are completely user-defined, which gives you an incredible amount of flexibility in what you can do! A few short examples will demonstrate some of the things you can do with MapReduce.

How MapReduce Works

Before we dive into the examples, it’s a good idea to go over what MapReduce is and how it works. In MongoDB’s implementation of MapReduce, we issue a specialized query to a given collection, and all matching documents from that query are input into our map function. The map function is designed to generate key/value pairs. Any key that has multiple values is then input to the reduce function, which returns the aggregated result of the input data. After this there is one remaining, optional step in which the data can be polished for presentation by a finalize function.
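The flow just described can be sketched in plain JavaScript. This is an illustrative in-memory model, not MongoDB’s implementation: the real map uses this and a global emit, whereas this sketch passes the document and an emit callback explicitly, and all the data is invented.

```javascript
// In-memory sketch of the MapReduce flow: map emits key/value pairs, values
// sharing a key are passed to reduce, and an optional finalize polishes the
// result.
var docs = [
  { color: "red",  num: 10 },
  { color: "red",  num: 20 },
  { color: "blue", num: 5 }
];

function mapReduceSketch(input, map, reduce, finalize) {
  var emitted = {};
  input.forEach(function (doc) {
    // emit() collects values under their key.
    map(doc, function emit(key, value) {
      (emitted[key] = emitted[key] || []).push(value);
    });
  });
  return Object.keys(emitted).map(function (key) {
    var reduced = emitted[key].length > 1
      ? reduce(key, emitted[key])
      : emitted[key][0];          // keys with a single value skip reduce
    return { _id: key, value: finalize ? finalize(key, reduced) : reduced };
  });
}

var out = mapReduceSketch(
  docs,
  function (doc, emit) { emit(doc.color, doc.num); },
  function (key, values) {
    return values.reduce(function (a, b) { return a + b; }, 0);
  }
);
```

Note that reduce only runs for keys with multiple values, which is why the emit and reduce counts differ in the shell output later in this section.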

Setting Up Testing Documents

To begin with, we need to set up some documents to test with. We’ve created a mapreduce collection that is part of the test database you restored earlier. If you haven’t restored it yet, extract the archive with the following command:

$ tar -xvf test.tgz
x test/
x test/aggregation.bson
x test/aggregation.metadata.json
x test/mapreduce.bson
x test/mapreduce.metadata.json

Then run the mongorestore command to restore the test database:

$ mongorestore test
connected to: 127.0.0.1
Sun Jul 21 19:26:21.342 test/aggregation.bson
Sun Jul 21 19:26:21.342      going into namespace [test.aggregation]
1000 objects found
Sun Jul 21 19:26:21.350      Creating index: { key: { _id: 1 }, ns: "test.aggregation", name: "_id_" }
Sun Jul 21 19:26:21.688 test/mapreduce.bson
Sun Jul 21 19:26:21.689      going into namespace [test.mapreduce]
1000 objects found
Sun Jul 21 19:26:21.695      Creating index: { key: { _id: 1 }, ns: "test.mapreduce", name: "_id_" }

This will give you a collection of documents to use in working with MapReduce. To begin, let’s look at the world’s simplest map function.

Working with Map functions

This function will “emit” the color and the num value from each document in the mapreduce collection. These two fields will be output in key/value form, with the first argument (color) as the key and the second argument (num) as the value. This may be a lot to take in at first, so take a look at the simple map function that performs this emit:

var map = function() {
    emit(this.color, this.num);
};

In order to run a MapReduce we also need a reduce function, but before doing anything fancy, let’s see what an empty reduce function produces to get an idea of what happens.

var reduce = function(color, numbers) { };

Enter both these commands into your shell, and you’ll have just about all you need to run our MapReduce.

The last thing you will need to provide is an output string for the MapReduce to use. This string defines where the output for this MapReduce command should be put. The two most common options are

  • To a collection
  • To the console (inline)

For our current purposes, let’s output to the screen so we can see exactly what is going on. To do this, we pass a document with the out option that has a value of { inline : 1 }, like this:

{ out : { inline : 1 } }

This gives us the following command:

db.mapreduce.mapReduce(map,reduce,{ out: { inline : 1 } });

The result looks like this:

{
     "results" : [
          {
               "_id" : "black",
               "value" : null
          },
          {
               "_id" : "blue",
               "value" : null
          },
          {
               "_id" : "brown",
               "value" : null
          },
          {
               "_id" : "green",
               "value" : null
          },
          {
               "_id" : "grey",
               "value" : null
          },
          {
               "_id" : "maroon",
               "value" : null
          },
          {
               "_id" : "orange",
               "value" : null
          },
          {
               "_id" : "purple",
               "value" : null
          },
          {
               "_id" : "red",
               "value" : null
          },
          {
               "_id" : "white",
               "value" : null
          },
          {
               "_id" : "yellow",
               "value" : null
          }
     ],
     "timeMillis" : 95,
     "counts" : {
          "input" : 1000,
          "emit" : 1000,
          "reduce" : 55,
          "output" : 11
     },
     "ok" : 1
}

This shows that each “key” color value is split out individually and becomes the unique _id value for each result document. Because our empty reduce function returned nothing for the value portion of each document, that is set to null. We can modify this by having the reduce function return something useful in place of null. In this case, let’s return the sum of all num values for each color. To do this we can create a function that returns the sum of the array of numbers passed into the reduce function for each color. Thankfully, we can use a handy shell function called Array.sum to sum all the values of an array. This gives us the following reduce function:

var reduce = function(color, numbers) {
    return Array.sum(numbers);
};

Perfect. In addition to our inline output we can also have MapReduce write to a collection; to do so, we simply have to replace that { inline : 1 } with the name of the collection we wish to output to. So let’s output to a collection called mrresult. This gives us the following command:

db.mapreduce.mapReduce(map,reduce,{ out: "mrresult" });

When executed with our new reduce function, it gives us the following:

{
     "result" : "mrresult",
     "timeMillis" : 111,
     "counts" : {
          "input" : 1000,
          "emit" : 1000,
          "reduce" : 55,
          "output" : 11
     },
     "ok" : 1
}

If you now want to see the document results, you need to query them from the mrresult collection, as follows:

> db.mrresult.findOne();
{ "_id" : "black", "value" : 45318 }

Now that we have a basic system working we can get more advanced!

Advanced MapReduce

Let’s say that instead of the sum of all values we want the average! This becomes far harder, as we need to add another variable—the number of objects we have! But how can we pass two variables out from the map function? After all, the emit takes only two arguments. We can perform a “cheat” of sorts; we return a JSON document, which can have as many fields as we want! So let’s expand our original map function to return a document that contains the color value and a counter value. First we define the document as a new variable, fill in the JSON document, and then emit that document.

var map = function() {
     var value = {
          num : this.num,
          count : 1
     };
     emit(this.color, value);
};

Notice that we set the counter value to 1, in order to count each document only once! Now for the reduce function. It will need to deal with an array of the value documents we created in the map function. One last thing to note is that our reduce function must return a value in the same form as the documents our map function sends to emit, because reduce may be called again on its own output.

image Note  You could also accomplish all of the things we are doing here by using the length of the array containing all the numbers. But this way you get to see more of what you can do with MapReduce.

To deal with this array, we’ve created a simple for loop over the length of the array; we iterate over each member and add its num and count onto our new return variable, called reduceValue. We then simply return this value and we have our result.

var reduce = function(color, val) {
     var reduceValue = { num : 0, count : 0 };
     for (var i = 0; i < val.length; i++) {
          reduceValue.num += val[i].num;
          reduceValue.count += val[i].count;
     }
     return reduceValue;
};

At this point, you may be wondering how this gets us our average. We have the count and the number, but no actual average! Run the MapReduce again and you can see the results for yourself. Be warned that each time you output to a collection, MapReduce will drop that collection before writing to it! For us that’s a good thing, as we only want this run’s results, but it could come back to haunt you in the future. If you instead want to merge the results of two runs, you can make an output document that looks like { out : { merge : "mrresult" } }.

db.mapreduce.mapReduce(map,reduce,{ out: "mrresult" });

Now let’s check those results quickly:

> db.mrresult.findOne();
{
     "_id" : "black",
     "value" : {
          "num" : 45318,
          "count" : 91
     }
}

As you can see, there is still no average value. This means we have more work to do; but how do we calculate the average, given that reduce must return a document matching the form of the emit? We need a third function! MapReduce provides one, called the finalize function. This lets you do any final cleanup before returning your MapReduce results. Let’s write a function that takes the result from reduce and calculates the average for us:

var finalize = function (key, value) {
     value.avg = value.num/value.count;
     return value;
};

Yes, it’s that simple. So now, with our map, reduce, and finalize functions ready, we simply add them to our call. The finalize option is set in the last document along with the out option, which gives us the following command:

db.mapreduce.mapReduce(map,reduce,{ out: "mrresult", finalize : finalize });

And from here let’s query one of our example documents:

> db.mrresult.findOne();
{
     "_id" : "black",
     "value" : {
          "num" : 45318,
          "count" : 91,
          "avg" : 498
     }
}

Now that’s better! We have our number, our count, and our average!

Debugging MapReduce

Debugging MapReduce is quite a time-consuming task, but there are a few little tricks to make your life easier. First let’s look at debugging a map. You can debug a map by overriding emit with your own function, as shown here:

var emit = function(key, value) {
     print("emit results - key: " + key + "  value: " + tojson(value));
};

This emit function will print the key and value results just as a real emit would pass them on. You can then test your map function against an example document from your collection using apply(), as follows:

> map.apply(db.mapreduce.findOne());
emit results - key: blue  value: { "num" : 1, "count" : 1 }

Now that you know what to expect out of your map, you can look at debugging your reduce. First you need to confirm that your map and reduce return values in the same format—that’s critical. The next thing you can do is create a short array with a few values, just like the ones passed into your reduce, as shown here:

a = [{ "num" : 1, "count" : 1 },{ "num" : 2, "count" : 1 },{ "num" : 3, "count" : 1 }]

Now you can call reduce on this array, as follows. This allows you to see the values returned by your reduce:

> reduce("blue", a);
{ "num" : 6, "count" : 3 }

If all else fails and you’re confused about what’s going on inside your functions, don’t forget that you can use the printjson() function to print any JSON value to the mongod logfile for reading. This is always a valuable tool when debugging software.
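The emit-override trick can also be used to capture results rather than print them, which makes it easy to assert on them. Here is an illustrative plain-JavaScript version; the sample document is invented, and in the mongo shell you would print instead of pushing to an array:

```javascript
// Stub out emit so the map function can be exercised against a single
// sample document, recording what it would have emitted.
var emitted = [];
var emit = function (key, value) {
  emitted.push({ key: key, value: value });
};

var map = function () {
  emit(this.color, { num: this.num, count: 1 });
};

// map.apply(doc) runs map with `doc` bound as `this`, just as the shell
// example db.mapreduce.findOne() call does.
var sampleDoc = { color: "blue", num: 1 };
map.apply(sampleDoc);
```

Inspecting the emitted array then shows exactly the key/value pairs your reduce function will receive.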

Summary

By now you should have an idea of exactly how much power and flexibility there is within MongoDB, having worked through three of its most powerful and flexible query systems. From this chapter you should know how to use text indexes to perform powerful text searches in a number of languages. You should be able to create highly complex and flexible aggregations using the MongoDB aggregation framework. Finally, you should now be able to use the powerful JavaScript-backed MapReduce framework to write groupings and transformations of your data.
