Indexes are the way to make queries go vroom. Database indexes are similar to a book’s index: instead of looking through the whole book, the database takes a shortcut and just looks in the index, allowing it to do queries orders of magnitude faster. Once it finds the entry in the index, it can jump right to the location of the desired document.
Extending this metaphor to the breaking point, creating database indexes is like deciding how a book’s index will be organized. You have the advantage of knowing what kinds of queries you’ll be doing and thus what types of information the database will need to find quickly. If all of your queries involve the "date" key, for example, you probably need an index on "date" (at least). If you will be querying for usernames, you don’t need to index the "user_num" key, because you aren’t querying on it.
It can be tricky to figure out what the optimal index for your queries is, but it is well worth the effort. Queries that otherwise take minutes can be instantaneous with the proper indexes.
MongoDB’s indexes work almost identically to typical relational database indexes, so if you are familiar with those, you can skim this section for syntax specifics. We’ll go over some indexing basics, but keep in mind that it’s an extensive topic and most of the material out there on index optimization for MySQL/Oracle/SQLite will apply equally well to MongoDB.
Suppose that you are querying for a single key, such as the following:
> db.people.find({"username" : "mark"})
When only a single key is used in the query, that key can be indexed to improve the query’s speed. In this case, you would create an index on "username". To create the index, use the ensureIndex method:
> db.people.ensureIndex({"username" : 1})
An index needs to be created only once for a collection. If you try to create the same index again, nothing will happen.
An index on a key will make queries on that key fast. However, it may not make other queries fast, even if they contain the indexed key. For example, this wouldn’t be very performant with the previous index:
> db.people.find({"date" : date1}).sort({"date" : 1, "username" : 1})
The server has to “look through the whole book” to find the desired dates. This process is called a table scan, which is basically what you’d do if you were looking for information in a book without an index: you start at page 1 and read through the whole thing. In general, you want to avoid making the server do table scans, because it is very slow for large collections.
As a rule of thumb, you should create an index that contains all of the keys in your query. For instance, to optimize the previous query, you should have an index on "date" and "username":
> db.people.ensureIndex({"date" : 1, "username" : 1})
The document you pass to ensureIndex is of the same form as the document passed to the sort function: a set of keys with value 1 or -1, depending on the direction you want the index to go. If you have only a single key in the index, direction is irrelevant. A single-key index is analogous to a book’s index that is organized in alphabetical order: whether it goes from A–Z or Z–A, it’s going to be fairly obvious where to find entries starting with M.
If you have more than one key, you need to start thinking about index direction. For example, suppose we have a collection of users:
{ "_id" : ..., "username" : "smith", "age" : 48, "user_id" : 0 }
{ "_id" : ..., "username" : "smith", "age" : 30, "user_id" : 1 }
{ "_id" : ..., "username" : "john", "age" : 36, "user_id" : 2 }
{ "_id" : ..., "username" : "john", "age" : 18, "user_id" : 3 }
{ "_id" : ..., "username" : "joe", "age" : 36, "user_id" : 4 }
{ "_id" : ..., "username" : "john", "age" : 7, "user_id" : 5 }
{ "_id" : ..., "username" : "simon", "age" : 3, "user_id" : 6 }
{ "_id" : ..., "username" : "joe", "age" : 27, "user_id" : 7 }
{ "_id" : ..., "username" : "jacob", "age" : 17, "user_id" : 8 }
{ "_id" : ..., "username" : "sally", "age" : 52, "user_id" : 9 }
{ "_id" : ..., "username" : "simon", "age" : 59, "user_id" : 10 }
Let’s say we index them with {"username" : 1, "age" : -1}. MongoDB will organize the users as follows:
{ "_id" : ..., "username" : "jacob", "age" : 17, "user_id" : 8 }
{ "_id" : ..., "username" : "joe", "age" : 36, "user_id" : 4 }
{ "_id" : ..., "username" : "joe", "age" : 27, "user_id" : 7 }
{ "_id" : ..., "username" : "john", "age" : 36, "user_id" : 2 }
{ "_id" : ..., "username" : "john", "age" : 18, "user_id" : 3 }
{ "_id" : ..., "username" : "john", "age" : 7, "user_id" : 5 }
{ "_id" : ..., "username" : "sally", "age" : 52, "user_id" : 9 }
{ "_id" : ..., "username" : "simon", "age" : 59, "user_id" : 10 }
{ "_id" : ..., "username" : "simon", "age" : 3, "user_id" : 6 }
{ "_id" : ..., "username" : "smith", "age" : 48, "user_id" : 0 }
{ "_id" : ..., "username" : "smith", "age" : 30, "user_id" : 1 }
The usernames are in strictly increasing alphabetical order, and within each name group the ages are in decreasing order. This optimizes sorting by {"username" : 1, "age" : -1} but is less efficient at sorting by {"username" : 1, "age" : 1}. If we wanted to optimize {"username" : 1, "age" : 1}, we would create an index on {"username" : 1, "age" : 1} to organize ages in ascending order.
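The ordering a {"username" : 1, "age" : -1} index imposes can be reproduced outside the database. This is a plain JavaScript sketch, not MongoDB code: it simply sorts the sample users the way that index would store them.

```javascript
// Sample users from the collection above (only the relevant keys).
const users = [
  { username: "smith", age: 48, user_id: 0 },
  { username: "smith", age: 30, user_id: 1 },
  { username: "john", age: 36, user_id: 2 },
  { username: "john", age: 18, user_id: 3 },
  { username: "joe", age: 36, user_id: 4 },
  { username: "john", age: 7, user_id: 5 },
  { username: "simon", age: 3, user_id: 6 },
  { username: "joe", age: 27, user_id: 7 },
  { username: "jacob", age: 17, user_id: 8 },
  { username: "sally", age: 52, user_id: 9 },
  { username: "simon", age: 59, user_id: 10 },
];

// Order as {"username" : 1, "age" : -1} would: usernames ascending,
// and within each username, ages descending.
const indexed = [...users].sort((a, b) => {
  if (a.username !== b.username) {
    return a.username < b.username ? -1 : 1;
  }
  return b.age - a.age; // descending age
});

console.log(indexed.map((u) => u.user_id).join(",")); // → 8,4,7,2,3,5,9,10,6,0,1
```

The printed user_ids match the ordering of the index listing above.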
The index on username and age also makes queries on username fast. In general, if an index has N keys, it will make queries on any prefix of those keys fast. For instance, if we have an index that looks like {"a" : 1, "b" : 1, "c" : 1, ..., "z" : 1}, we effectively have an index on {"a" : 1}, {"a" : 1, "b" : 1}, {"a" : 1, "b" : 1, "c" : 1}, and so on. Queries that would use the indexes {"b" : 1}, {"a" : 1, "c" : 1}, and so on will not be optimized: only queries that can use a prefix of the index can take advantage of it.
The MongoDB query optimizer will reorder query terms to take advantage of indexes: if you query for {"x" : "foo", "y" : "bar"} and you have an index on {"y" : 1, "x" : 1}, MongoDB will figure it out.
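These two rules (a query can use any prefix of an index, and the optimizer reorders query terms) can be sketched as a small helper. This is plain JavaScript for illustration, not a MongoDB API, and canUseIndex is an invented name:

```javascript
// Hypothetical helper: can a query on the given keys use this compound
// index? The query's keys (in any order, since the optimizer reorders
// terms) must cover a prefix of the index's keys.
function canUseIndex(indexKeys, queryKeys) {
  const querySet = new Set(queryKeys);
  if (querySet.size !== queryKeys.length) return false; // no duplicate keys
  // The first queryKeys.length index keys must all appear in the query.
  for (let i = 0; i < queryKeys.length; i++) {
    if (!querySet.has(indexKeys[i])) return false;
  }
  return true;
}

const index = ["a", "b", "c"];
console.log(canUseIndex(index, ["a"]));      // → true:  prefix {a}
console.log(canUseIndex(index, ["b", "a"])); // → true:  reordered prefix {a, b}
console.log(canUseIndex(index, ["b"]));      // → false: not a prefix
console.log(canUseIndex(index, ["a", "c"])); // → false: skips "b"
```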
The disadvantage to creating an index is that it puts a little bit of overhead on every insert, update, and remove. This is because the database not only needs to do the operation but also needs to make a note of it in any indexes on the collection. Thus, the absolute minimum number of indexes should be created. There is a built-in maximum of 64 indexes per collection, which is more than almost any application should need.
Do not index every key. This will make inserts slow, take up lots of space, and probably not speed up your queries very much. Figure out what queries you are running and what the best indexes are for these queries, and make sure that the server is using the indexes you’ve created by using the explain and hint tools described in the next section.
Sometimes the most efficient solution is actually not to use an index. In general, if a query is returning a half or more of the collection, it will be more efficient for the database to just do a table scan instead of having to look up the index and then the value for almost every single document. Thus, for queries such as checking whether a key exists or determining whether a boolean value is true or false, it may actually be better to not use an index at all.
Suppose we have a collection of status messages from users. We want to query by user and date to pull up all of a user’s recent statuses. Using what we’ve learned so far, we might create an index that looks like the following:
> db.status.ensureIndex({user : 1, date : -1})
This will make the query for user and date efficient, but it is not actually the best index choice.
Imagine this as a book index again. We would have a list of documents sorted by user and then subsorted by date, so it would look something like the following:
User 123 on March 13, 2010
User 123 on March 12, 2010
User 123 on March 11, 2010
User 123 on March 5, 2010
User 123 on March 4, 2010
User 124 on March 12, 2010
User 124 on March 11, 2010
...
This looks fine at this scale, but imagine if the application has millions of users who have dozens of status updates per day. If the index entries for each user’s status messages take up a page’s worth of space on disk, then for every “latest statuses” query, the database will have to load a different page into memory. This will be very slow if the site becomes popular enough that not all of the index fits into memory.
If we flip the index order to {date : -1, user : 1}, the database can keep the last couple of days of the index in memory, swap less, and thus query for the latest statuses for any user much more quickly.
Thus, there are several questions to keep in mind when deciding what indexes to create:
What are the queries you are doing? Some of these keys will need to be indexed.
What is the correct direction for each key?
How is this going to scale? Is there a different ordering of keys that would keep more of the frequently used portions of the index in memory?
If you can answer these questions, you are ready to index your data.
Indexes can be created on keys in embedded documents in the same way that they are created on normal keys. For example, if we want to be able to search blog post comments by date, we can create an index on the "date" key in the array of embedded "comments" documents:
> db.blog.ensureIndex({"comments.date" : 1})
Indexes on keys in embedded documents are identical to those on top-level keys, and the two can be combined in compound indexes.
As your collection grows, you’ll need to create indexes for any large sorts your queries are doing. If you call sort on a key that is not indexed, MongoDB needs to pull all of that data into memory to sort it. Thus, there’s a limit on the amount you can sort without an index: you can’t sort a terabyte of data in memory. Once your collection is too big to sort in memory, MongoDB will just return an error for queries that try.
Indexing the sort allows MongoDB to pull the sorted data in order, allowing you to sort any amount of data without running out of memory.
Each index in a collection has a string name that uniquely identifies the index and is used by the server to delete or manipulate it. Index names are, by default, keyname1_dir1_keyname2_dir2_..._keynameN_dirN, where keynameX is the index’s key and dirX is the index’s direction (1 or -1). This can get unwieldy if indexes contain more than a couple of keys, so you can specify your own name as one of the options to ensureIndex:
> db.foo.ensureIndex({"a" : 1, "b" : 1, "c" : 1, ..., "z" : 1}, {"name" : "alphabet"})
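The default naming scheme described above can be sketched in plain JavaScript. This mirrors the convention, not the server’s code, and defaultIndexName is an invented name:

```javascript
// Build a default index name from [keyname, direction] pairs,
// joining each key and its direction with underscores.
function defaultIndexName(keys) {
  return keys.map(([name, dir]) => name + "_" + dir).join("_");
}

console.log(defaultIndexName([["username", 1]]));               // → username_1
console.log(defaultIndexName([["date", 1], ["username", -1]])); // → date_1_username_-1
```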
There is a limit to the number of characters in an index name, so complex indexes may need custom names to be created. A call to getLastError will show whether the index creation succeeded or why it didn’t.
Unique indexes guarantee that, for a given key, every document in the collection will have a unique value. For example, if you want to make sure no two documents can have the same value in the "username" key, you can create a unique index:
> db.people.ensureIndex({"username" : 1}, {"unique" : true})
Keep in mind that insert, by default, does not check whether the document was actually inserted. Therefore, you may want to use safe inserts if you are inserting documents that might have a duplicate value for a unique key. This way, you will get a duplicate key error when you attempt to insert such a document.
A unique index that you are probably already familiar with is the index on "_id", which is created whenever you create a normal collection. This is a normal unique index, aside from the fact that it cannot be deleted.
If a key does not exist, the index stores its value as null. Therefore, if you create a unique index on a key and try to insert more than one document that is missing the indexed key, the inserts will fail because you already have a document with a value of null.
When building unique indexes for an existing collection, some values may be duplicates. If there are any duplicates, the index building will fail. In some cases, you may just want to drop all of the documents with duplicate values. The dropDups option will save the first document found and remove any subsequent documents with duplicate values:
> db.people.ensureIndex({"username" : 1}, {"unique" : true, "dropDups" : true})
This is a bit of a drastic option; it might be better to create a script to preprocess the data if it is important.
You can also create a compound unique index. If you do this, individual keys can have the same values, but the combination of values for all keys must be unique.
GridFS, the standard method for storing large files in MongoDB (see Chapter 7), uses a compound unique index. The collection that holds the file content has a unique index on {files_id : 1, n : 1}, which allows documents that look like (in part) the following:
{files_id : ObjectId("4b23c3ca7525f35f94b60a2d"), n : 1}
{files_id : ObjectId("4b23c3ca7525f35f94b60a2d"), n : 2}
{files_id : ObjectId("4b23c3ca7525f35f94b60a2d"), n : 3}
{files_id : ObjectId("4b23c3ca7525f35f94b60a2d"), n : 4}
Note that all of the values for "files_id" are the same, but "n" is different. If you attempted to insert {files_id : ObjectId("4b23c3ca7525f35f94b60a2d"), n : 1} again, the database would complain about the duplicate key.
explain is an incredibly handy tool that will give you lots of information about your queries. You can run it on any query by tacking it onto a cursor. Unlike most cursor methods, explain returns a document, not the cursor itself:
> db.foo.find().explain()
explain will return information about the indexes used for the query (if any) and stats about timing and the number of documents scanned.
Let’s take an example. The index {"username" : 1} works well to speed up a simple key/value pair lookup, but many queries are more complicated. For example, suppose you are querying and sorting the following:
> db.people.find({"age" : 18}).sort({"username" : 1})
You may be unsure if the database is using the index you created or how effective it is. If you run explain on the query, it will return the index currently being used for the query, how long the query took, and how many documents the database needed to scan to find the results.
For a very simple query ({}) on a database with no indexes (other than the index on "_id") and 64 documents, the output of explain looks like this:
> db.people.find().explain()
{
    "cursor" : "BasicCursor",
    "indexBounds" : [ ],
    "nscanned" : 64,
    "nscannedObjects" : 64,
    "n" : 64,
    "millis" : 0,
    "allPlans" : [
        {
            "cursor" : "BasicCursor",
            "indexBounds" : [ ]
        }
    ]
}
The important parts of this result are as follows:
"cursor" : "BasicCursor"
This means that the query did not use an index (unsurprisingly, because there was no query criteria). We’ll see what this value looks like for an indexed query in a moment.
"nscanned" : 64
This is the number of documents that the database looked through. You want to make sure this is as close to the number returned as possible.
"n" : 64
This is the number of documents returned. We’re doing pretty well here, because the number of documents scanned exactly matches the number returned. Of course, given that we’re returning the entire collection, it would be difficult to do otherwise.
"millis" : 0
The number of milliseconds it took the database to execute the query. 0 is a good time to shoot for.
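A quick way to read these numbers is to compare nscanned to n. The sketch below is a hypothetical helper, not part of MongoDB, that summarizes an explain() document along those lines:

```javascript
// Summarize an explain() result: how many documents were scanned per
// document returned, and whether an index was used at all.
function scanEfficiency(explainResult) {
  return {
    scannedPerReturned:
      explainResult.n === 0 ? Infinity : explainResult.nscanned / explainResult.n,
    usedIndex: explainResult.cursor !== "BasicCursor",
  };
}

// The table-scan output above: 64 scanned, 64 returned, no index.
const report = scanEfficiency({ cursor: "BasicCursor", nscanned: 64, n: 64 });
console.log(report.scannedPerReturned, report.usedIndex); // → 1 false
```

A ratio close to 1 means the query touched little more than what it returned; a large ratio suggests a better index is needed.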
Now, let’s suppose we have an index on the "age" key and we’re querying for users in their 20s. If we run explain on this query, we’ll see the following:
> db.c.find({age : {$gt : 20, $lt : 30}}).explain()
{
    "cursor" : "BtreeCursor age_1",
    "indexBounds" : [
        [
            { "age" : 20 },
            { "age" : 30 }
        ]
    ],
    "nscanned" : 14,
    "nscannedObjects" : 12,
    "n" : 12,
    "millis" : 1,
    "allPlans" : [
        {
            "cursor" : "BtreeCursor age_1",
            "indexBounds" : [
                [
                    { "age" : 20 },
                    { "age" : 30 }
                ]
            ]
        }
    ]
}
There are some differences here, because this query uses an index. Thus, some of explain’s outputted keys have changed:
"cursor" : "BtreeCursor age_1"
This query is not using a BasicCursor, like the first query was. Indexes are stored in data structures called B-trees, so, when a query uses an index, it uses a special type of cursor called a BtreeCursor. This value also identifies the name of the index being used: age_1. Using this name, we could query the system.indexes collection to find out more about this index (e.g., whether it is unique or what keys it includes):
> db.system.indexes.find({"ns" : "test.c", "name" : "age_1"})
{
    "_id" : ObjectId("4c0d211478b4eaaf7fb28565"),
    "ns" : "test.c",
    "key" : { "age" : 1 },
    "name" : "age_1"
}
"allPlans" : [ ... ]
This key lists all of the plans MongoDB considered for the query. The choice here was fairly obvious, because we had an index on "age" and we were querying for "age". If we had multiple overlapping indexes and a more complex query, "allPlans" would contain information about all of the possible plans that could have been used.
Let’s take a slightly more complicated example: suppose you have an index on {"username" : 1, "age" : 1} and an index on {"age" : 1, "username" : 1}. What happens if you query for username and age? Well, it depends on the query:
> db.c.find({age : {$gt : 10}, username : "sally"}).explain()
{
    "cursor" : "BtreeCursor username_1_age_1",
    "indexBounds" : [
        [
            { "username" : "sally", "age" : 10 },
            { "username" : "sally", "age" : 1.7976931348623157e+308 }
        ]
    ],
    "nscanned" : 13,
    "nscannedObjects" : 13,
    "n" : 13,
    "millis" : 5,
    "allPlans" : [
        {
            "cursor" : "BtreeCursor username_1_age_1",
            "indexBounds" : [
                [
                    { "username" : "sally", "age" : 10 },
                    { "username" : "sally", "age" : 1.7976931348623157e+308 }
                ]
            ]
        }
    ],
    "oldPlan" : {
        "cursor" : "BtreeCursor username_1_age_1",
        "indexBounds" : [
            [
                { "username" : "sally", "age" : 10 },
                { "username" : "sally", "age" : 1.7976931348623157e+308 }
            ]
        ]
    }
}
We are querying for an exact match on "username" and a range of values for "age", so the database chooses to use the {"username" : 1, "age" : 1} index, reversing the terms of the query. If, on the other hand, we query for an exact age and a range of names, MongoDB will use the other index:
> db.c.find({"age" : 14, "username" : /.*/}).explain()
{
    "cursor" : "BtreeCursor age_1_username_1 multi",
    "indexBounds" : [
        [
            { "age" : 14, "username" : "" },
            { "age" : 14, "username" : { } }
        ],
        [
            { "age" : 14, "username" : /.*/ },
            { "age" : 14, "username" : /.*/ }
        ]
    ],
    "nscanned" : 2,
    "nscannedObjects" : 2,
    "n" : 2,
    "millis" : 2,
    "allPlans" : [
        {
            "cursor" : "BtreeCursor age_1_username_1 multi",
            "indexBounds" : [
                [
                    { "age" : 14, "username" : "" },
                    { "age" : 14, "username" : { } }
                ],
                [
                    { "age" : 14, "username" : /.*/ },
                    { "age" : 14, "username" : /.*/ }
                ]
            ]
        }
    ]
}
If you find that MongoDB is using different indexes than you want it to for a query, you can force it to use a certain index by using hint. For instance, if you want to make sure MongoDB uses the {"username" : 1, "age" : 1} index on the previous query, you could say the following:
> db.c.find({"age" : 14, "username" : /.*/}).hint({"username" : 1, "age" : 1})
Hinting is usually unnecessary. MongoDB has a query optimizer and is very clever about choosing which index to use. When you first do a query, the query optimizer tries out a number of query plans concurrently. The first one to finish will be used, and the rest of the query executions are terminated. That query plan will be remembered for future queries on the same keys. The query optimizer periodically retries other plans, in case you’ve added new data and the previously chosen plan is no longer best. The only part you should need to worry about is giving the query optimizer useful indexes to choose from.
Metainformation about indexes is stored in the system.indexes collection of each database. This is a reserved collection, so you cannot insert documents into it or remove documents from it. You can manipulate its documents only through ensureIndex and the dropIndexes database command.
The system.indexes collection has detailed information about each index, but the system.namespaces collection also lists their names. If you look at this collection, you can see that there are at least two documents for each collection: one for the collection itself and one for each index it contains. For a collection with just the standard "_id" index, its system.namespaces entries would look like this:
{ "name" : "test.foo" }
{ "name" : "test.foo.$_id_" }
If we add a compound index on the name and age keys, a new document is added to system.namespaces with the following form:
{ "name" : "test.foo.$name_1_age_1" }
In Chapter 2, we mentioned that collection names are limited to 121 bytes in length. This somewhat strange limit is because the "_id" index’s namespace needs 6 extra bytes (".$_id_"), bringing the namespace length for that index up to a more sensical 127 bytes.
It is important to remember that the size of the collection name plus the index name cannot exceed 127 bytes. If you approach the maximum namespace size or have large index names, you may have to use custom index names to avoid creating namespaces that are too long. It’s generally easier just to keep database, collection, and key names to reasonable lengths.
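That check can be sketched in plain JavaScript; the 127-byte figure comes from the text, namespaceTooLong is an invented name, and length equals bytes here only for ASCII names:

```javascript
// An index's full namespace is "<db>.<collection>.$<indexname>";
// per the limit above, it must fit in 127 bytes.
function namespaceTooLong(collectionNamespace, indexName) {
  return (collectionNamespace + ".$" + indexName).length > 127;
}

console.log(namespaceTooLong("test.foo", "_id_")); // → false ("test.foo.$_id_" is 14 bytes)
console.log(namespaceTooLong("test." + "x".repeat(117), "_id_")); // → true (128 bytes)
```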
As you and your application grow old together, you may find that your data or queries have changed and that the indexes that used to work well no longer do. You can add new indexes to existing collections at any time with ensureIndex:
> db.people.ensureIndex({"username" : 1}, {"background" : true})
Building indexes is time-consuming and resource-intensive. Using the {"background" : true} option builds the index in the background, while handling incoming requests. If you do not include the background option, the database will block all other requests while the index is being built.
Blocking lets index creation go faster but means your application will be unresponsive during the build. Even building an index in the background can take a toll on normal operations, so it is best to do it during an “off” time; building indexes in the background will still add extra load to your database, but it will not bring it grinding to a halt.
Creating indexes on existing documents is slightly faster than creating the index first and then inserting all of the documents. Of course, creating an index beforehand is an option only if you’re populating a new collection from scratch.
If an index is no longer necessary, you can remove it with the dropIndexes command and the index name. Often, you may have to look at the system.indexes collection to figure out what the index name is, because even the autogenerated names vary somewhat from driver to driver:
> db.runCommand({"dropIndexes" : "foo", "index" : "alphabet"})
You can drop all indexes on a collection by passing in * as the value for the index key:
> db.runCommand({"dropIndexes" : "foo", "index" : "*"})
Another way to delete all indexes is to drop the collection. This will also delete the index on _id (and all of the documents in the collection). Removing all of the documents in a collection (with remove) does not affect the indexes; they will be repopulated when new documents are inserted.
There is another type of query that is becoming increasingly common (especially with the emergence of mobile devices): finding the nearest N things to a current location. MongoDB provides a special type of index for coordinate plane queries, called a geospatial index.
Suppose we wanted to find the nearest coffee shop to where we are now, given a latitude and longitude. We need to create a special index to do this type of query efficiently, because it needs to search in two dimensions. A geospatial index can be created using the ensureIndex function, but by passing "2d" as a value instead of 1 or -1:
> db.map.ensureIndex({"gps" : "2d"})
The "gps" key must be some type of pair value, that is, a two-element array or an embedded document with two keys. These would all be valid:
{ "gps" : [ 0, 100 ] }
{ "gps" : { "x" : -30, "y" : 30 } }
{ "gps" : { "latitude" : -180, "longitude" : 180 } }
The keys can be anything; for example, {"gps" : {"foo" : 0, "bar" : 1}} is valid.
By default, geospatial indexing assumes that your values are going to range from -180 to 180 (which is convenient for latitude and longitude). If you are using it for other values, you can specify what the minimum and maximum values will be as options to ensureIndex:
> db.star.trek.ensureIndex({"light-years" : "2d"}, {"min" : -1000, "max" : 1000})
This will create a spatial index calibrated for a 2,000 × 2,000 light-year square.
Geospatial queries can be performed in two ways: as a normal query (using find) or as a database command. The find approach is similar to performing a nongeospatial query, except the conditional "$near" is used. It takes an array of the two target values:
> db.map.find({"gps" : {"$near" : [40, -73]}})
This finds all of the documents in the map collection, in order by distance from the point (40, -73). A default limit of 100 documents is applied if no limit is specified. If you don’t need that many results, you should set a limit to conserve server resources. For example, the following code returns the 10 nearest documents to (40, -73):
> db.map.find({"gps" : {"$near" : [40, -73]}}).limit(10)
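On a flat plane, "$near" with a limit amounts to sorting by distance from the target and truncating. A plain JavaScript sketch, with sample points invented for illustration:

```javascript
// Order points by flat-plane distance from a target and keep the closest.
const target = [40, -73];
const shops = [
  { name: "a", gps: [41, -73] }, // distance 1
  { name: "b", gps: [45, -70] }, // distance ~5.8
  { name: "c", gps: [40, -73] }, // distance 0
];

const dist = (p) => Math.hypot(p.gps[0] - target[0], p.gps[1] - target[1]);
const nearest = [...shops].sort((x, y) => dist(x) - dist(y)).slice(0, 2);

console.log(nearest.map((p) => p.name).join(",")); // → c,a
```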
This same type of query can be performed using the geoNear command:
> db.runCommand({geoNear : "map", near : [40, -73], num : 10});
geoNear also returns the distance of each result document from the query point. The distance is in whatever units you inserted the data in; if you inserted degrees of longitude and latitude, the distance is in degrees. find with "$near" doesn’t give you the distance for each point but must be used if you have more than 4MB of results.
MongoDB also allows you to find all of the documents within a shape, as well as near a point. To find points in a shape, we use the "$within" conditional instead of the "$near" conditional. "$within" can take an increasing number of shapes; check the online documentation for geospatial indexing to see the most up-to-date list of what’s supported (http://www.mongodb.org/display/DOCS/Geospatial+Indexing). As of this writing, there are two options: you can query for all points within a rectangle or within a circle. To use a rectangle, use the "$box" option:
> db.map.find({"gps" : {"$within" : {"$box" : [[10, 20], [15, 30]]}}})
"$box" takes a two-element array: the first element specifies the coordinates of the lower-left corner, and the second specifies the coordinates of the upper-right corner.
You can also find all points within a circle with "$center", which takes an array with the center point and then a radius:
> db.map.find({"gps" : {"$within" : {"$center" : [[12, 25], 5]}}})
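The membership tests behind these two shapes are simple on a flat plane. These are illustrative JavaScript sketches, not server code:

```javascript
// Flat-plane membership tests for the two "$within" shapes:
// a box given as [lowerLeft, upperRight], and a circle given as
// a center point plus a radius.
function inBox([x, y], [lowerLeft, upperRight]) {
  return x >= lowerLeft[0] && x <= upperRight[0] &&
         y >= lowerLeft[1] && y <= upperRight[1];
}

function inCircle([x, y], [cx, cy], radius) {
  return Math.hypot(x - cx, y - cy) <= radius;
}

console.log(inBox([12, 25], [[10, 20], [15, 30]])); // → true
console.log(inCircle([16, 25], [12, 25], 5));       // → true (distance 4)
console.log(inCircle([18, 25], [12, 25], 5));       // → false (distance 6)
```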
Applications often need to search for more complex criteria than just a location. For example, a user might want to find all coffee shops or pizza parlors near where they are. To facilitate this type of query, you can combine geospatial indexes with normal indexes. In this situation, for instance, we might want to query on both the "location" key and a "desc" key, so we’d create a compound index:
> db.map.ensureIndex({"location" : "2d", "desc" : 1})
Then we can quickly find the nearest coffee shop:
> db.map.find({"location" : {"$near" : [-70, 30]}, "desc" : "coffeeshop"}).limit(1)
{ "_id" : ObjectId("4c0d1348928a815a720a0000"), "name" : "Mud", "location" : [x, y], "desc" : ["coffee", "coffeeshop", "muffins", "espresso"] }
Note that creating an array of keywords is a good way to perform user-defined searches.
MongoDB’s geospatial indexes assume that whatever you’re indexing is a flat plane. This means that results aren’t perfect for spherical surfaces, like the earth’s, especially near the poles. The problem is that the distance between lines of longitude in the Yukon is much shorter than the distance between them at the equator. There are various projections that can be used to convert the earth to a 2D plane, with varying levels of accuracy and corresponding levels of complexity.
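The distortion is easy to quantify: a degree of longitude shrinks roughly in proportion to the cosine of the latitude, which a flat-plane index ignores. A plain JavaScript sketch with approximate figures:

```javascript
// A degree of longitude spans roughly cos(latitude) times its length
// at the equator. The 111.32 km figure is approximate.
function longitudeDegreeKm(latitudeDegrees) {
  const kmPerDegreeAtEquator = 111.32;
  return kmPerDegreeAtEquator * Math.cos((latitudeDegrees * Math.PI) / 180);
}

console.log(longitudeDegreeKm(0).toFixed(1));  // → 111.3 (equator)
console.log(longitudeDegreeKm(60).toFixed(1)); // → 55.7 (Yukon-like latitude)
```

At 60° north, a degree of longitude covers only about half the distance it does at the equator, so treating degrees as uniform units overstates east–west distances there.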