This chapter looks at querying in detail. The main areas covered are as follows:
You can perform ad hoc queries on the database using the
find
or findOne
functions and
a query document.
You can query for ranges, set inclusion, inequalities, and more by
using $
conditionals.
Some queries cannot be expressed as query documents, even using $
conditionals. For these types of complex queries, you can use a
$where
clause to harness the full expressive power of
JavaScript.
Queries return a database cursor, which lazily returns batches of documents as you need them.
There are a lot of metaoperations you can perform on a cursor, including skipping a certain number of results, limiting the number of results returned, and sorting results.
The find
method is used to perform queries in
MongoDB. Querying returns a subset of documents in a collection, from no
documents at all to the entire collection. Which documents get returned is
determined by the first argument to find
, which is a
document specifying the query to be performed.
An empty query document (i.e., {}
) matches
everything in the collection. If find
isn’t given a
query document, it defaults to {}
. For example, the
following:
> db.c.find()
returns everything in the collection c.
When we start adding key/value pairs to the query document, we begin
restricting our search. This works in a straightforward way for most
types. Integers match integers, booleans match booleans, and strings match
strings. Querying for a simple type is as easy as specifying the value
that you are looking for. For example, to find all documents where the
value for "age"
is 27, we can add that key/value pair
to the query document:
> db.users.find({"age" : 27})
If we have a string we want to match, such as a
"username"
key with the value "joe"
,
we use that key/value pair instead:
> db.users.find({"username" : "joe"})
Multiple conditions can be strung together by adding more key/value pairs to the query document, which gets interpreted as “condition1 AND condition2 AND … AND conditionN.” For instance, to get all users who are 27-year-olds with the username “joe,” we can query for the following:
> db.users.find({"username" : "joe", "age" : 27})
Sometimes, you do not need all of the key/value pairs in a
document returned. If this is the case, you can pass a second argument
to find
(or findOne
) specifying
the keys you want. This reduces both the amount of data sent over the
wire and the time and memory used to decode documents on the client
side.
For example, if you have a user collection and you are interested
only in the "username"
and "email"
keys, you could return just those keys with the following query:
> db.users.find({}, {"username" : 1, "email" : 1}) { "_id" : ObjectId("4ba0f0dfd22aa494fd523620"), "username" : "joe", "email" : "[email protected]" }
As you can see from the previous output, the
"_id"
key is always returned, even if it isn’t
specifically listed.
You can also use this second parameter to exclude specific
key/value pairs from the results of a query. For instance, you may have
documents with a variety of keys, and the only thing you know is that
you never want to return the "fatal_weakness"
key:
> db.users.find({}, {"fatal_weakness" : 0})
This can even prevent "_id"
from being
returned:
> db.users.find({}, {"username" : 1, "_id" : 0}) { "username" : "joe", }
There are some restrictions on queries. The value of a query
document must be a constant as far as the database is concerned. (It can
be a normal variable in your own code.) That is, it cannot refer to the
value of another key in the document. For example, if we were keeping
inventory and we had both "in_stock"
and
"num_sold"
keys, we could compare their values by
querying the following:
> db.stock.find({"in_stock" : "this.num_sold"}) // doesn't work
There are ways to do this (see $where Queries),
but you will usually get better performance by restructuring your
document slightly, such that a normal query will suffice. In this
example, we could instead use the keys
"initial_stock"
and "in_stock"
.
Then, every time someone buys an item, we decrement the value of the
"in_stock"
key by one. Finally, we can do a simple
query to check which items are out of stock:
> db.stock.find({"in_stock" : 0})
Queries can go beyond the exact matching described in the previous section; they can match more complex criteria, such as ranges, OR-clauses, and negation.
"$lt"
, "$lte"
,
"$gt"
, and "$gte"
are all
comparison operators, corresponding to <, <=, >, and >=,
respectively. They can be combined to look for a range of values. For
example, to look for users who are between the ages of 18 and 30
inclusive, we can do this:
> db.users.find({"age" : {"$gte" : 18, "$lte" : 30}})
These types of range queries are often useful for dates. For example, to find people who registered before January 1, 2007, we can do this:
> start = new Date("01/01/2007") > db.users.find({"registered" : {"$lt" : start}})
An exact match on a date is less useful, because dates are only stored with millisecond precision. Often you want a whole day, week, or month, making a range query necessary.
To query for documents where a key’s value is not equal to a
certain value, you must use another conditional operator,
"$ne"
, which stands for “not equal.” If you want to
find all users who do not have the username “joe,” you can query for
them using this:
> db.users.find({"username" : {"$ne" : "joe"}})
"$ne"
can be used with any type.
There are two ways to do an OR query in MongoDB.
"$in"
can be used to query for a variety of values
for a single key. "$or"
is more general; it can be
used to query for any of the given values across multiple
keys.
If you have more than one possible value to match for a single
key, use an array of criteria with "$in"
. For
instance, suppose we were running a raffle and the winning ticket
numbers were 725, 542, and 390. To find all three of these documents, we
can construct the following query:
> db.raffle.find({"ticket_no" : {"$in" : [725, 542, 390]}})
"$in"
is very flexible and allows you to
specify criteria of different types as well as values. For example, if
we are gradually migrating our schema to use usernames instead of user
ID numbers, we can query for either by using this:
> db.users.find({"user_id" : {"$in" : [12345, "joe"]})
This matches documents with a "user_id"
equal
to 12345, and documents with a "user_id"
equal to
"joe"
.
If "$in"
is given an array with a single value,
it behaves the same as directly matching the value. For instance,
{ticket_no : {$in : [725]}}
matches the same
documents as {ticket_no : 725}
.
The opposite of "$in"
is
"$nin"
, which returns documents that don’t match any
of the criteria in the array. If we want to return all of the people who
didn’t win anything in the raffle, we can query for them with
this:
> db.raffle.find({"ticket_no" : {"$nin" : [725, 542, 390]}})
This query returns everyone who did not have tickets with those numbers.
"$in"
gives you an OR query for a single key,
but what if we need to find documents where
"ticket_no"
is 725 or "winner"
is
true
? For this type of query, we’ll need to use the
"$or"
conditional. "$or"
takes an
array of possible criteria. In the raffle case, using
"$or"
would look like this:
> db.raffle.find({"$or" : [{"ticket_no" : 725}, {"winner" : true}]})
"$or"
can contain other conditionals. If, for
example, we want to match any of the three
"ticket_no"
values or the "winner"
key, we can use this:
> db.raffle.find({"$or" : [{"ticket_no" : {"$in" : [725, 542, 390]}}, {"winner" : true}]})
With a normal AND-type query, you want to narrow your results down as far as possible in as few arguments as possible. OR-type queries are the opposite: they are most efficient if the first arguments match as many documents as possible.
"$not"
is a metaconditional: it can be applied
on top of any other criteria. As an example, let’s consider the modulus
operator, "$mod"
. "$mod"
queries
for keys whose values, when divided by the first value given, have a
remainder of the second value:
> db.users.find({"id_num" : {"$mod" : [5, 1]}})
The previous query returns users with "id_num"
s
of 1, 6, 11, 16, and so on. If we want, instead, to return users with
"id_num"
s of 2, 3, 4, 5, 7, 8, 9, 10, 12, and so on,
we can use "$not"
:
> db.users.find({"id_num" : {"$not" : {"$mod" : [5, 1]}}})
"$not"
can be particularly useful in
conjunction with regular expressions to find all documents that don’t
match a given pattern (regular expression usage is described in the
section Regular Expressions).
If you look at the update modifiers in the previous chapter and
previous query documents, you’ll notice that the $-prefixed keys are in
different positions. In the query, "$lt"
is in the
inner document; in the update, "$inc"
is the key for
the outer document. This generally holds true: conditionals are an inner
document key, and modifiers are always a key in the outer
document.
Multiple conditions can be put on a single key. For example, to
find all users between the ages of 20 and 30, we can query for both
"$gt"
and "$lt"
on the
"age"
key:
> db.users.find({"age" : {"$lt" : 30, "$gt" : 20}})
Any number of conditionals can be used with a single key. Multiple
update modifiers cannot be used on a single key,
however. For example, you cannot have a modifier document such as
{"$inc" : {"age" : 1}, "$set" : {age : 40}}
because
it modifies
"age"
twice. With query conditionals, no
such rule applies.
As covered in Chapter 2, MongoDB has a wide variety of types that can be used in a document. Some of these behave specially in queries.
null
behaves a bit strangely. It does match
itself, so if we have a collection with the following
documents:
> db.c.find() { "_id" : ObjectId("4ba0f0dfd22aa494fd523621"), "y" : null } { "_id" : ObjectId("4ba0f0dfd22aa494fd523622"), "y" : 1 } { "_id" : ObjectId("4ba0f148d22aa494fd523623"), "y" : 2 }
we can query for documents whose "y"
key is
null
in the expected way:
> db.c.find({"y" : null}) { "_id" : ObjectId("4ba0f0dfd22aa494fd523621"), "y" : null }
However, null
not only matches itself but also
matches “does not exist.” Thus, querying for a key with the value
null
will return all documents lacking that
key:
> db.c.find({"z" : null}) { "_id" : ObjectId("4ba0f0dfd22aa494fd523621"), "y" : null } { "_id" : ObjectId("4ba0f0dfd22aa494fd523622"), "y" : 1 } { "_id" : ObjectId("4ba0f148d22aa494fd523623"), "y" : 2 }
If we only want to find keys whose value is
null
, we can check that the key is
null
and exists using the
"$exists"
conditional:
> db.c.find({"z" : {"$in" : [null], "$exists" : true}})
Unfortunately, there is no "$eq"
operator,
which makes this a little awkward, but "$in"
with one
element is equivalent.
Regular expressions are useful for flexible string matching. For example, if we want to find all users with the name Joe or joe, we can use a regular expression to do case-insensitive matching:
> db.users.find({"name" : /joe/i})
Regular expression flags (i
) are allowed but
not required. If we want to match not only various capitalizations of
joe, but also joey, we can continue to improve our regular
expression:
> db.users.find({"name" : /joey?/i})
MongoDB uses the Perl Compatible Regular Expression (PCRE) library to match regular expressions; any regular expression syntax allowed by PCRE is allowed in MongoDB. It is a good idea to check your syntax with the JavaScript shell before using it in a query to make sure it matches what you think it matches.
MongoDB can leverage an index for queries on prefix regular
expressions (e.g., /^joey/
), so queries of that
kind can be fast.
Regular expressions can also match themselves. Very few people insert regular expressions into the database, but if you insert one, you can match it with itself:
> db.foo.insert({"bar" : /baz/}) > db.foo.find({"bar" : /baz/}) { "_id" : ObjectId("4b23c3ca7525f35f94b60a2d"), "bar" : /baz/ }
Querying for elements of an array is simple. An array can mostly be treated as though each element is the value of the overall key. For example, if the array is a list of fruits, like this:
> db.food.insert({"fruit" : ["apple", "banana", "peach"]})
the following query:
> db.food.find({"fruit" : "banana"})
will successfully match the document. We can query for it in much
the same way as though we had a document that looked like the (illegal)
document: {"fruit" : "apple", "fruit" : "banana", "fruit" :
"peach"}
.
If you need to match arrays by more than one element, you can
use "$all"
. This allows you to match a list of
elements. For example, suppose we created a collection with three
elements:
> db.food.insert({"_id" : 1, "fruit" : ["apple", "banana", "peach"]}) > db.food.insert({"_id" : 2, "fruit" : ["apple", "kumquat", "orange"]}) > db.food.insert({"_id" : 3, "fruit" : ["cherry", "banana", "apple"]})
Then we can find all documents with both
"apple"
and "banana"
elements by
querying with "$all"
:
> db.food.find({fruit : {$all : ["apple", "banana"]}}) {"_id" : 1, "fruit" : ["apple", "banana", "peach"]} {"_id" : 3, "fruit" : ["cherry", "banana", "apple"]}
Order does not matter. Notice "banana"
comes
before "apple"
in the second result. Using a
one-element array with "$all"
is equivalent to not
using "$all"
. For instance, {fruit : {$all
: ['apple']}
will match the same documents as
{fruit : 'apple'}
.
You can also query by exact match using the entire array. However, exact match will not match a document if any elements are missing or superfluous. For example, this will match the first document shown previously:
> db.food.find({"fruit" : ["apple", "banana", "peach"]})
But this will not:
> db.food.find({"fruit" : ["apple", "banana"]})
and neither will this:
> db.food.find({"fruit" : ["banana", "apple", "peach"]})
If you want to query for a specific element of an array, you can
specify an index using the syntax
key
.index
:
> db.food.find({"fruit.2" : "peach"})
Arrays are always 0-indexed, so this would match the third array
element against the string "peach"
.
A useful conditional for querying arrays is
"$size"
, which allows you to query for arrays of a
given size. Here’s an example:
> db.food.find({"fruit" : {"$size" : 3}})
One common query is to get a range of sizes.
"$size"
cannot be combined with another $
conditional (in this example, "$gt"
), but this
query can be accomplished by adding a "size"
key to
the document. Then, every time you add an element to the array,
increment the value of "size"
. If the original
update looked like this:
> db.food.update({"$push" : {"fruit" : "strawberry"}})
it can simply be changed to this:
> db.food.update({"$push" : {"fruit" : "strawberry"}, "$inc" : {"size" : 1}})
Incrementing is extremely fast, so any performance penalty is negligible. Storing documents like this allows you to do queries such as this:
> db.food.find({"size" : {"$gt" : 3}})
Unfortunately, this technique doesn’t work as well with the
"$addToSet"
operator.
As mentioned earlier in this chapter, the optional second
argument to find
specifies the keys to be
returned. The special "$slice"
operator can be used
to return a subset of elements for an array key.
For example, suppose we had a blog post document and we wanted to return the first 10 comments:
> db.blog.posts.findOne(criteria, {"comments" : {"$slice" : 10}})
Alternatively, if we wanted the last 10 comments, we could use -10:
> db.blog.posts.findOne(criteria, {"comments" : {"$slice" : -10}})
"$slice"
can also return pages in the middle
of the results by taking an offset and the number of elements to
return:
> db.blog.posts.findOne(criteria, {"comments" : {"$slice" : [23, 10]}})
This would skip the first 23 elements and return the 24th through 34th. If there are fewer than 34 elements in the array, it will return as many as possible.
Unless otherwise specified, all keys in a document are returned
when "$slice"
is used. This is unlike the other key
specifiers, which suppress unmentioned keys from being returned. For
instance, if we had a blog post document that looked like this:
{ "_id" : ObjectId("4b2d75476cc613d5ee930164"), "title" : "A blog post", "content" : "...", "comments" : [ { "name" : "joe", "email" : "[email protected]", "content" : "nice post." }, { "name" : "bob", "email" : "[email protected]", "content" : "good post." } ] }
and we did a "$slice"
to get the last
comment, we’d get this:
> db.blog.posts.findOne(criteria, {"comments" : {"$slice" : -1}}) { "_id" : ObjectId("4b2d75476cc613d5ee930164"), "title" : "A blog post", "content" : "...", "comments" : [ { "name" : "bob", "email" : "[email protected]", "content" : "good post." } ] }
Both "title"
and "content"
are still returned, even though they weren’t explicitly included in
the key specifier.
There are two ways of querying for an embedded document: querying for the whole document or querying for its individual key/value pairs.
Querying for an entire embedded document works identically to a normal query. For example, if we have a document that looks like this:
{ "name" : { "first" : "Joe", "last" : "Schmoe" }, "age" : 45 }
we can query for someone named Joe Schmoe with the following:
> db.people.find({"name" : {"first" : "Joe", "last" : "Schmoe"}})
However, if Joe decides to add a middle name key, suddenly this
query won’t work anymore; it doesn’t match the entire embedded document!
This type of query is also order-sensitive; {"last" : "Schmoe",
"first" : "Joe"}
would not be a match.
If possible, it’s usually a good idea to query for just a specific key or keys of an embedded document. Then, if your schema changes, all of your queries won’t suddenly break because they’re no longer exact matches. You can query for embedded keys using dot-notation:
> db.people.find({"name.first" : "Joe", "name.last" : "Schmoe"})
Now, if Joe adds more keys, this query will still match his first and last names.
This dot-notation is the main difference between query documents and other document types. Query documents can contain dots, which mean “reach into an embedded document.” Dot-notation is also the reason that documents to be inserted cannot contain the . character. Oftentimes people run into this limitation when trying to save URLs as keys. One way to get around it is to always perform a global replace before inserting or after retrieving, substituting a character that isn’t legal in URLs for the dot (.) character.
Embedded document matches can get a little tricky as the document structure gets more complicated. For example, suppose we are storing blog posts and we want to find comments by Joe that were scored at least a five. We could model the post as follows:
> db.blog.find() { "content" : "...", "comments" : [ { "author" : "joe", "score" : 3, "comment" : "nice post" }, { "author" : "mary", "score" : 6, "comment" : "terrible post" } ] }
Now, we can’t query using db.blog.find({"comments" :
{"author" : "joe", "score" : {"$gte" : 5}}})
. Embedded
document matches have to match the whole document, and this doesn’t
match the "comment"
key. It also wouldn’t work to do
db.blog.find({"comments.author" : "joe", "comments.score" :
{"$gte" : 5}})
, because
the author criteria could match a different comment than the score
criteria. That is, it would return the document shown earlier; it would
match "author" : "joe"
in the first comment and
"score" : 6
in the second comment.
To correctly group criteria without needing to specify every key,
use "$elemMatch"
. This vaguely named conditional
allows you to partially specify criteria to match a single embedded
document in an array. The correct query looks like this:
> db.blog.find({"comments" : {"$elemMatch" : {"author" : "joe", "score" : {"$gte" : 5}}}})
"$elemMatch"
allows us to “group” our criteria.
As such, it’s only needed when you have more than one key you want to
match on in an embedded document.
Key/value pairs are a fairly expressive way to query, but there are
some queries that they cannot represent. For queries that cannot be done
any other way, there are "$where"
clauses, which allow
you to execute arbitrary JavaScript as part of your query. This allows you
to do (almost) anything within a query.
The most common case for this is wanting to compare the values for two keys in a document, for instance, if we had a list of items and wanted to return documents where any two of the values are equal. Here’s an example:
> db.foo.insert({"apple" : 1, "banana" : 6, "peach" : 3}) > db.foo.insert({"apple" : 8, "spinach" : 4, "watermelon" : 4})
In the second document, "spinach"
and
"watermelon"
have the same value, so we’d like that
document returned. It’s unlikely MongoDB will ever have a $ conditional
for this, so we can use a "$where"
clause to do it with
JavaScript:
> db.foo.find({"$where" : function () { ... for (var current in this) { ... for (var other in this) { ... if (current != other && this[current] == this[other]) { ... return true; ... } ... } ... } ... return false; ... }});
If the function returns true
, the document will
be part of the result set; if it returns false
, it
won’t be.
We used a function earlier, but you can also use strings to specify
a "$where"
query; the following two
"$where"
queries are equivalent:
> db.foo.find({"$where" : "this.x + this.y == 10"}) > db.foo.find({"$where" : "function() { return this.x + this.y == 10; }"})
"$where"
queries should not be used unless
strictly necessary: they are much slower than regular queries. Each
document has to be converted from BSON to a JavaScript object and then run
through the "$where"
expression. Indexes cannot be used
to satisfy a "$where"
, either. Hence, you should use
"$where"
only when there is no other way of doing the
query. You can cut down on the penalty by using other query filters in
combination with "$where"
. If possible, an index will
be used to filter based on the non-$where
clauses; the
"$where"
expression will be used only to fine-tune the
results.
Another way of doing complex queries is to use MapReduce, which is covered in the next chapter.
The database returns results from find
using a
cursor. The client-side implementations of cursors
generally allow you to control a great deal about the eventual output of a
query. You can limit the number of results, skip over some number of
results, sort results by any combination of keys in any direction, and
perform a number of other powerful operations.
To create a cursor with the shell, put some documents into a
collection, do a query on them, and assign the results to a local variable
(variables defined with "var"
are local). Here, we
create a very simple collection and query it, storing the results in the
cursor
variable:
> for(i=0; i<100; i++) { ... db.c.insert({x : i}); ... } > var cursor = db.collection.find();
The advantage of doing this is that you can look at one result at a time. If you store the results in a global variable or no variable at all, the MongoDB shell will automatically iterate through and display the first couple of documents. This is what we’ve been seeing up until this point, and it is often the behavior you want for seeing what’s in a collection but not for doing actual programming with the shell.
To iterate through the results, you can use the
next
method on the cursor. You can use
hasNext
to check whether there is another result. A
typical loop through results looks like the following:
> while (cursor.hasNext()) { ... obj = cursor.next(); ... // do stuff ... }
cursor.hasNext()
checks that the next result
exists, and cursor.next()
fetches it.
The cursor
class also implements the iterator
interface, so you can use it in a forEach
loop:
> var cursor = db.people.find(); > cursor.forEach(function(x) { ... print(x.name); ... }); adam matt zak
When you call find
, the shell does not query
the database immediately. It waits until you actually start requesting
results to send the query, which allows you to chain additional options
onto a query before it is performed. Almost every method on a cursor
object returns the cursor itself so that you can chain them in any order.
For instance, all of the following are equivalent:
> var cursor = db.foo.find().sort({"x" : 1}).limit(1).skip(10); > var cursor = db.foo.find().limit(1).sort({"x" : 1}).skip(10); > var cursor = db.foo.find().skip(10).limit(1).sort({"x" : 1});
At this point, the query has not been executed yet. All of these functions merely build the query. Now, suppose we call the following:
> cursor.hasNext()
At this point, the query will be sent to the server. The shell
fetches the first 100 results or first 4MB of results (whichever is
smaller) at once so that the next calls to next
or
hasNext
will not have to make trips to the server.
After the client has run through the first set of results, the shell will
again contact the database and ask for more results. This process
continues until the cursor is exhausted and all results have been
returned.
The most common query options are limiting the number of results returned, skipping a number of results, and sorting. All of these options must be added before a query is sent to the database.
To set a limit, chain the limit
function onto
your call to find
. For example, to only return three
results, use this:
> db.c.find().limit(3)
If there are fewer than three documents matching your query in the
collection, only the number of matching documents will be returned;
limit
sets an upper limit, not a lower
limit.
skip
works similarly to
limit
:
> db.c.find().skip(3)
This will skip the first three matching documents and return the rest of the matches. If there are less than three documents in your collection, it will not return any documents.
sort
takes an object: a set of key/value
pairs where the keys are key names and the values are the sort
directions. Sort direction can be 1 (ascending) or -1 (descending). If
multiple keys are given, the results will be sorted in that order. For
instance, to sort the results by "username"
ascending
and "age"
descending, we do the following:
> db.c.find().sort({username : 1, age : -1})
These three methods can be combined. This is often handy for pagination. For example, suppose that you are running an online store and someone searches for mp3. If you want 50 results per page sorted by price from high to low, you can do the following:
> db.stock.find({"desc" : "mp3"}).limit(50).sort({"price" : -1})
If they click Next Page to see more results, you can simply add a skip to the query, which will skip over the first 50 matches (which the user already saw on page 1):
> db.stock.find({"desc" : "mp3"}).limit(50).skip(50).sort({"price" : -1})
However, large skips are not very performant, so there are suggestions on avoiding them in a moment.
MongoDB has a hierarchy as to how types compare. Sometimes you will have a single key with multiple types, for instance, integers and booleans, or strings and nulls. If you do a sort on a key with a mix of types, there is a predefined order that they will be sorted in. From least to greatest value, this ordering is as follows:
Minimum value
null
Numbers (integers, longs, doubles)
Strings
Object/document
Array
Binary data
Object ID
Boolean
Date
Timestamp
Regular expression
Maximum value
Using skip
for a small number of documents is
fine. For a large number of results, skip
can be
slow (this is true in nearly every database, not just MongoDB) and
should be avoided. Usually you can build criteria into the documents
themselves to avoid having to do large skips, or you can calculate the
next query based on the result from the previous one.
The easiest way to do pagination is to return the first page of results using limit and then return each subsequent page as an offset from the beginning.
> // do not use: slow for large skips > var page1 = db.foo.find(criteria).limit(100) > var page2 = db.foo.find(criteria).skip(100).limit(100) > var page3 = db.foo.find(criteria).skip(200).limit(100) ...
However, depending on your query, you can usually find a way to
paginate without skip
s. For example, suppose we
want to display documents in descending order based on
"date"
. We can get the first page of results with
the following:
> var page1 = db.foo.find().sort({"date" : -1}).limit(100)
Then, we can use the "date"
value of the last
document as the criteria for fetching the next page:
var latest = null; // display first page while (page1.hasNext()) { latest = page1.next(); display(latest); } // get next page var page2 = db.foo.find({"date" : {"$gt" : latest.date}}); page2.sort({"date" : -1}).limit(100);
Now the query does not need to include a
skip
.
One fairly common problem is how to get a random document from a
collection. The naive (and slow) solution is to count the number of
documents and then do a find
, skipping a random
number of documents between 0 and the size of the
collection.
> // do not use > var total = db.foo.count() > var random = Math.floor(Math.random()*total) > db.foo.find().skip(random).limit(1)
It is actually highly inefficient to get a random element this way: you have to do a count (which can be expensive if you are using criteria), and skipping large numbers of elements can be time-consuming.
It takes a little forethought, but if you know you’ll be looking
up a random element on a collection, there’s a much more efficient way
to do so. The trick is to add an extra random key to each document
when it is inserted. For instance, if we’re using the shell, we could
use the Math.random()
function (which creates a
random number between 0 and 1):
> db.people.insert({"name" : "joe", "random" : Math.random()}) > db.people.insert({"name" : "john", "random" : Math.random()}) > db.people.insert({"name" : "jim", "random" : Math.random()})
Now, when we want to find a random document from the collection,
we can calculate a random number and use that as query criteria,
instead of doing a skip
:
> var random = Math.random() > result = db.foo.findOne({"random" : {"$gt" : random}})
There is a slight chance that random
will be
greater than any of the "random"
values in the
collection, and no results will be returned. We can guard against this
by simply returning a document in the other direction:
> if (result == null) { ... result = db.foo.findOne({"random" : {"$lt" : random}}) ... }
If there aren’t any documents in the collection, this technique
will end up returning null
, which makes
sense.
This technique can be used with arbitrarily complex queries;
just make sure to have an index that includes the random key. For
example, if we want to find a random plumber in California, we can
create an index on "profession"
,
"state"
, and "random"
:
> db.people.ensureIndex({"profession" : 1, "state" : 1, "random" : 1})
This allows us to quickly find a random result (see Chapter 5 for more information on indexing).
There are two types of queries: wrapped and plain. A plain query is something like this:
> var cursor = db.foo.find({"foo" : "bar"})
There are a couple options that “wrap” the query. For example, suppose we perform a sort:
> var cursor = db.foo.find({"foo" : "bar"}).sort({"x" : 1})
Instead of sending {"foo" : "bar"}
to the
database as the query, the query gets wrapped in a larger document. The
shell converts the query from {"foo" : "bar"}
to
{"$query" : {"foo" : "bar"}, "$orderby" : {"x" :
1}}
.
Most drivers provide helpers for adding arbitrary options to queries. Other helpful options include the following:
$maxscan :
integer
Specify the maximum number of documents that should be scanned for the query.
$min :
document
Start criteria for querying.
$max :
document
End criteria for querying.
$hint :
document
Tell the server which index to use for the query.
$explain :
boolean
Get an explanation of how the query will be executed (indexes used, number of results, how long it takes, etc.), instead of actually doing the query.
$snapshot :
boolean
Ensure that the query’s results will be a consistent snapshot from the point in time when the query was executed. See the next section for details.
A fairly common way of processing data is to pull it out of MongoDB, change it in some way, and then save it again:
cursor = db.foo.find(); while (cursor.hasNext()) { var doc = cursor.next(); doc = process(doc); db.foo.save(doc); }
This is fine for a small number of results, but it breaks down for large numbers of documents. To see why, imagine how the documents are actually being stored. You can picture a collection as a list of documents that looks something like Figure 4-1. Snowflakes represent documents, because every document is beautiful and unique.
Now, when we do a find
, it starts returning
results from the beginning of the collection and moves right. Your
program grabs the first 100 documents and processes them. When you save
them back to the database, if a document does not have the padding
available to grow to its new size, like in Figure 4-2, it needs to be relocated. Usually, a
document will be relocated to the end of a collection (Figure 4-3).
Now our program continues to fetch batches of documents. When it gets toward the end, it will return the relocated documents again (Figure 4-4)!
The solution to this problem is to snapshot
your query. If you add the "$snapshot"
option, the
query will be run against an unchanging view of the collection. All
queries that return a single batch of results are effectively
snapshotted. Inconsistencies arise only when the collection changes
under a cursor while it is waiting to get another batch of
results.
There are two sides to a cursor: the client-facing cursor and the database cursor that the client-side one represents. We have been talking about the client-side one up until now, but we are going to take a brief look at what’s happening on the server side.
On the server side, a cursor takes up memory and resources. Once a cursor runs out of results or the client sends a message telling it to die, the database can free the resources it was using. Freeing these resources lets the database use them for other things, which is good, so we want to make sure that cursors can be freed quickly (within reason).
There are a couple of conditions that can cause the death (and subsequent cleanup) of a cursor. First, when a cursor finishes iterating through the matching results, it will clean itself up. Another way is that, when a cursor goes out of scope on the client side, the drivers send the database a special message to let it know that it can kill that cursor. Finally, even if the user hasn’t iterated through all the results and the cursor is still in scope, after 10 minutes of inactivity, a database cursor will automatically “die.”
This “death by timeout” is usually the desired behavior: very few
applications expect their users to sit around for minutes at a time
waiting for results. However, sometimes you might know that you need a
cursor to last for a long time. In that case, many drivers have
implemented a function called immortal
, or a similar
mechanism, which tells the database not to time out the cursor. If you
turn off a cursor’s timeout, you must iterate through all of its results
or make sure it gets closed. Otherwise, it will sit around in the database
hogging resources forever.
3.16.75.165