This chapter covers the basics of moving data into and out of the database, including the following:
Adding new documents to a collection
Removing documents from a collection
Updating existing documents
Choosing the correct level of safety versus speed for all of these operations
Inserts are the basic method for adding data to MongoDB. To insert a document into a collection, use the collection’s insert method:
> db.foo.insert({"bar" : "baz"})
This will add an "_id" key to the document (if one does not already exist) and save it to MongoDB.
If you need to insert multiple documents into a collection, you can make the inserts faster by using batch inserts. Batch inserts allow you to pass an array of documents to the database.
Sending dozens, hundreds, or even thousands of documents at a time can make inserts significantly faster. A batch insert is a single TCP request, meaning that you do not incur the overhead of doing hundreds of individual requests. It can also cut insert time by eliminating a lot of the header processing that gets done for each message. When an individual document is sent to the database, it is prefixed by a header that tells the database to do an insert operation on a certain collection. By using batch insert, the database doesn’t need to reprocess this information for each document.
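The batching idea can be sketched in plain Python: group documents greedily so that each batch’s serialized size stays under the server’s message limit (16MB in current versions). Here json.dumps stands in for BSON encoding, and split_into_batches is a hypothetical helper for illustration, not part of any driver:

```python
import json

MAX_MESSAGE_BYTES = 16 * 1024 * 1024  # the server's message size limit

def split_into_batches(docs, max_bytes=MAX_MESSAGE_BYTES):
    """Greedily group documents into batches whose rough serialized
    size stays under max_bytes (json.dumps stands in for BSON here)."""
    batches, current, current_size = [], [], 0
    for doc in docs:
        size = len(json.dumps(doc))
        if current and current_size + size > max_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(doc)
        current_size += size
    if current:
        batches.append(current)
    return batches

docs = [{"n": i} for i in range(1000)]
batches = split_into_batches(docs, max_bytes=200)
assert sum(batches, []) == docs  # every document survives, in order
```

Real drivers measure actual BSON bytes rather than JSON, but the grouping logic is the same shape.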
Batch inserts are intended to be used in applications, such as for inserting a couple hundred sensor data points into an analytics collection at once. They are useful only if you are inserting multiple documents into a single collection: you cannot use batch inserts to insert into multiple collections with a single request. If you are just importing raw data (for example, from a data feed or MySQL), there are command-line tools like mongoimport that can be used instead of batch insert. On the other hand, it is often handy to munge data before saving it to MongoDB (converting dates to the date type or adding a custom "_id"), so batch inserts can be used for importing data as well.
Current versions of MongoDB do not accept messages longer than 16MB, so there is a limit to how much can be inserted in a single batch insert.
When you perform an insert, the driver you are using converts the data structure into BSON, which it then sends to the database (see Appendix C for more on BSON). The database understands BSON and checks for an "_id" key and that the document’s size does not exceed 4MB, but other than that, it doesn’t do data validation; it just saves the document to the database as is. This has a couple of side effects, most notably that you can insert invalid data and that your database is fairly secure from injection attacks.
All of the drivers for major languages (and most of the minor ones, too) check for a variety of invalid data (documents that are too large, contain non-UTF-8 strings, or use unrecognized types) before sending anything to the database. If you are running a driver that you are not sure about, you can start the database server with the --objcheck option, and it will examine each document’s structural validity before inserting it (at the cost of slower performance).
Documents larger than 4MB (when converted to BSON) cannot be saved to the database. This is a somewhat arbitrary limit (and may be raised in the future); it is mostly to prevent bad schema design and ensure consistent performance. To see the BSON size (in bytes) of the document doc, run Object.bsonsize(doc) from the shell.
To give you an idea of how much 4MB is, the entire text of War and Peace is just 3.14MB.
MongoDB does not do any sort of code execution on inserts, so they are not vulnerable to injection attacks. Traditional injection attacks are impossible with MongoDB, and alternative injection-type attacks are easy to guard against in general, but inserts are particularly invulnerable to them.
Now that there’s data in our database, let’s delete it.
> db.users.remove()
This will remove all of the documents in the users collection. This doesn’t actually remove the collection, and any indexes created on it will still exist.
The remove function optionally takes a query document as a parameter. When it’s given, only documents that match the criteria will be removed. Suppose, for instance, that we want to remove everyone from the mailing.list collection where the value for "opt-out" is true:
> db.mailing.list.remove({"opt-out" : true})
Once data has been removed, it is gone forever. There is no way to undo the remove or recover deleted documents.
Removing documents is usually a fairly quick operation, but if you want to clear an entire collection, it is faster to drop it (and then re-create any indexes).
For example, in Python, suppose we insert a million dummy elements with the following:
for i in range(1000000):
    collection.insert({"foo": "bar", "baz": i, "z": 10 - i})
Now we’ll try to remove all of the documents we just inserted, measuring the time it takes. First, here’s a simple remove:
import time
from pymongo import Connection

db = Connection().foo
collection = db.bar

start = time.time()

collection.remove()
collection.find_one()

total = time.time() - start
print "%.2f seconds" % total
On a MacBook Air, this script prints “46.08 seconds.”
If the remove and find_one are replaced by db.drop_collection("bar"), the time drops to .01 seconds! This is obviously a vast improvement, but it comes at the expense of granularity: we cannot specify any criteria. The whole collection is dropped, and all of its indexes are deleted.
Once a document is stored in the database, it can be changed using the update method. update takes two parameters: a query document, which locates documents to update, and a modifier document, which describes the changes to make to the documents found.
Updates are atomic: if two updates happen at the same time, whichever one reaches the server first will be applied, and then the next one will be applied. Thus, conflicting updates can safely be sent in rapid-fire succession without any documents being corrupted: the last update will “win.”
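This last-write-wins behavior for full replacements can be sketched in a few lines of Python. Dicts stand in for documents, and the replace helper is hypothetical, not a driver call:

```python
def replace(doc, new_doc):
    """Full-document replacement sketch: the new document wins,
    keeping the old "_id" if the replacement doesn't supply one."""
    result = dict(new_doc)
    result.setdefault("_id", doc.get("_id"))
    return result

doc = {"_id": 1, "score": 10}
# two conflicting replacements arrive; the server applies them one
# after the other, so the later one simply wins
doc = replace(doc, {"score": 20})
doc = replace(doc, {"score": 30})
assert doc == {"_id": 1, "score": 30}
```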
The simplest type of update fully replaces a matching document with a new one. This can be useful to do a dramatic schema migration. For example, suppose we are making major changes to a user document, which looks like the following:
{
    "_id" : ObjectId("4b2b9f67a1f631733d917a7a"),
    "name" : "joe",
    "friends" : 32,
    "enemies" : 2
}
We want to change that document into the following:
{
    "_id" : ObjectId("4b2b9f67a1f631733d917a7a"),
    "username" : "joe",
    "relationships" : {
        "friends" : 32,
        "enemies" : 2
    }
}
We can make this change by replacing the document using an update:
> var joe = db.users.findOne({"name" : "joe"});
> joe.relationships = {"friends" : joe.friends, "enemies" : joe.enemies};
{
    "friends" : 32,
    "enemies" : 2
}
> joe.username = joe.name;
"joe"
> delete joe.friends;
true
> delete joe.enemies;
true
> delete joe.name;
true
> db.users.update({"name" : "joe"}, joe);
Now, doing a findOne shows that the structure of the document has been updated.
A common mistake is matching more than one document with the criteria and then creating a duplicate "_id" value with the second parameter. The database will throw an error for this, and nothing will be changed.
For example, suppose we create several documents with the same "name", but we don’t realize it:
> db.people.find()
{"_id" : ObjectId("4b2b9f67a1f631733d917a7b"), "name" : "joe", "age" : 65}
{"_id" : ObjectId("4b2b9f67a1f631733d917a7c"), "name" : "joe", "age" : 20}
{"_id" : ObjectId("4b2b9f67a1f631733d917a7d"), "name" : "joe", "age" : 49}
Now, if it’s Joe #2’s birthday, we want to increment the value of his "age" key, so we might say this:
> joe = db.people.findOne({"name" : "joe", "age" : 20});
{
    "_id" : ObjectId("4b2b9f67a1f631733d917a7c"),
    "name" : "joe",
    "age" : 20
}
> joe.age++;
> db.people.update({"name" : "joe"}, joe);
E11001 duplicate key on update
What happened? When you call update, the database will look for a document matching {"name" : "joe"}. The first one it finds will be the 65-year-old Joe. It will attempt to replace that document with the one in the joe variable, but there’s already a document in this collection with the same "_id". Thus, the update will fail, because "_id" values must be unique. The best way to avoid this situation is to make sure that your update always specifies a unique document, perhaps by matching on a key like "_id".
Usually only certain portions of a document need to be updated. Partial updates can be done extremely efficiently by using atomic update modifiers. Update modifiers are special keys that can be used to specify complex update operations, such as altering, adding, or removing keys, and even manipulating arrays and embedded documents.
Suppose we were keeping website analytics in a collection and wanted to increment a counter each time someone visited a page. We can use update modifiers to do this increment atomically. Each URL and its number of page views is stored in a document that looks like this:
{
    "_id" : ObjectId("4b253b067525f35f94b60a31"),
    "url" : "www.example.com",
    "pageviews" : 52
}
Every time someone visits a page, we can find the page by its URL and use the "$inc" modifier to increment the value of the "pageviews" key.
> db.analytics.update({"url" : "www.example.com"},
... {"$inc" : {"pageviews" : 1}})
Now, if we do a find, we see that "pageviews" has increased by one.
> db.analytics.find()
{
    "_id" : ObjectId("4b253b067525f35f94b60a31"),
    "url" : "www.example.com",
    "pageviews" : 53
}
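The semantics of "$inc" on a plain counter can be sketched in Python. A dict stands in for the stored document, and apply_inc is a hypothetical helper, not part of pymongo:

```python
def apply_inc(doc, spec):
    """A sketch of "$inc" semantics: missing keys are created and set
    to the increment amount; existing numeric values are added to."""
    out = dict(doc)
    for key, amount in spec.items():
        out[key] = out.get(key, 0) + amount
    return out

page = {"url": "www.example.com"}
page = apply_inc(page, {"pageviews": 1})  # key created, set to 1
page = apply_inc(page, {"pageviews": 1})  # existing value incremented
assert page["pageviews"] == 2
```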
Perl and PHP programmers are probably thinking that any character would have been a better choice than $. Both of these languages use $ as a variable prefix and will replace $-prefixed strings with their variable value in double-quoted strings. However, MongoDB started out as a JavaScript database, and $ is a special character that isn’t interpreted differently in JavaScript, so it was used. It is an annoying historical relic from MongoDB’s primordial soup.
There are several options for Perl and PHP programmers. First, you could just escape the $: "\$foo". You can use single quotes, which don’t do variable interpolation: '$foo'. Finally, both drivers allow you to define your own character that will be used instead of $. In Perl, set $MongoDB::BSON::char, and in PHP set mongo.cmd_char in php.ini to =, :, ?, or any other character that you would like to use instead of $. Then, if you choose, say, ~, you would use ~inc instead of $inc and ~gt instead of $gt.
Good choices for the special character are characters that will not naturally appear in key names (don’t use _ or x) and are not characters that have to be escaped themselves, which will gain you nothing and be confusing (such as \ or, in Perl, @).
When using modifiers, the value of "_id" cannot be changed. (Note that "_id" can be changed by using whole-document replacement.) Values for any other key, including other uniquely indexed keys, can be modified.
"$set" sets the value of a key. If the key does not yet exist, it will be created. This can be handy for updating schema or adding user-defined keys. For example, suppose you have a simple user profile stored as a document that looks something like the following:
> db.users.findOne()
{
    "_id" : ObjectId("4b253b067525f35f94b60a31"),
    "name" : "joe",
    "age" : 30,
    "sex" : "male",
    "location" : "Wisconsin"
}
This is a pretty bare-bones user profile. If the user wanted to store his favorite book in his profile, he could add it using "$set":
> db.users.update({"_id" : ObjectId("4b253b067525f35f94b60a31")},
... {"$set" : {"favorite book" : "war and peace"}})
Now the document will have a “favorite book” key:
> db.users.findOne()
{
    "_id" : ObjectId("4b253b067525f35f94b60a31"),
    "name" : "joe",
    "age" : 30,
    "sex" : "male",
    "location" : "Wisconsin",
    "favorite book" : "war and peace"
}
If the user decides that he actually enjoys a different book, "$set" can be used again to change the value:
> db.users.update({"name" : "joe"},
... {"$set" : {"favorite book" : "green eggs and ham"}})
"$set" can even change the type of the key it modifies. For instance, if our fickle user decides that he actually likes quite a few books, he can change the value of the “favorite book” key into an array:
> db.users.update({"name" : "joe"},
... {"$set" : {"favorite book" :
...     ["cat's cradle", "foundation trilogy", "ender's game"]}})
If the user realizes that he actually doesn’t like reading, he can remove the key altogether with "$unset":
> db.users.update({"name" : "joe"},
... {"$unset" : {"favorite book" : 1}})
Now the document will be the same as it was at the beginning of this example.
You can also use "$set" to reach in and change embedded documents:
> db.blog.posts.findOne()
{
    "_id" : ObjectId("4b253b067525f35f94b60a31"),
    "title" : "A Blog Post",
    "content" : "...",
    "author" : {
        "name" : "joe",
        "email" : "joe@example.com"
    }
}
> db.blog.posts.update({"author.name" : "joe"},
... {"$set" : {"author.name" : "joe schmoe"}})
> db.blog.posts.findOne()
{
    "_id" : ObjectId("4b253b067525f35f94b60a31"),
    "title" : "A Blog Post",
    "content" : "...",
    "author" : {
        "name" : "joe schmoe",
        "email" : "joe@example.com"
    }
}
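The way "$set" follows a dotted path into an embedded document can be sketched in Python. apply_set is a hypothetical helper over plain dicts, and for simplicity it assumes the intermediate documents already exist:

```python
def apply_set(doc, path, value):
    """"$set" sketch supporting dotted paths into embedded documents.
    Intermediate documents along the path are assumed to exist."""
    out = dict(doc)
    parts = path.split(".")
    target = out
    for part in parts[:-1]:
        target[part] = dict(target[part])  # copy so the input isn't mutated
        target = target[part]
    target[parts[-1]] = value
    return out

post = {"title": "A Blog Post",
        "author": {"name": "joe", "email": "joe@example.com"}}
post = apply_set(post, "author.name", "joe schmoe")
assert post["author"]["name"] == "joe schmoe"
assert post["author"]["email"] == "joe@example.com"  # siblings untouched
```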
You must always use a $ modifier for adding, changing, or removing keys. A common error people often make when starting out is to try to set the value of "foo" to "bar" by doing an update that looks like this:
> db.coll.update(criteria, {"foo" : "bar"})
This will not function as intended. It actually does a full-document replacement, replacing the matched document with {"foo" : "bar"}. Always use $ operators for modifying individual key/value pairs.
The "$inc" modifier can be used to change the value for an existing key or to create a new key if it does not already exist. It is very useful for updating analytics, karma, votes, or anything else that has a changeable, numeric value.
Suppose we are creating a game collection where we want to save games and update scores as they change. When a user starts playing, say, a game of pinball, we can insert a document that identifies the game by name and user playing it:
> db.games.insert({"game" : "pinball", "user" : "joe"})
When the ball hits a bumper, the game should increment the player’s score. As points in pinball are given out pretty freely, let’s say that the base unit of points a player can earn is 50. We can use the "$inc" modifier to add 50 to the player’s score:
> db.games.update({"game" : "pinball", "user" : "joe"},
... {"$inc" : {"score" : 50}})
If we look at the document after this update, we’ll see the following:
> db.games.findOne()
{
    "_id" : ObjectId("4b2d75476cc613d5ee930164"),
    "game" : "pinball",
    "user" : "joe",
    "score" : 50
}
The "score" key did not already exist, so it was created by "$inc" and set to the increment amount: 50.
If the ball lands in a “bonus” slot, we want to add 10,000 to the score. This can be accomplished by passing a different value to "$inc":
> db.games.update({"game" : "pinball", "user" : "joe"},
... {"$inc" : {"score" : 10000}})
Now if we look at the game, we’ll see the following:
> db.games.find()
{
    "_id" : ObjectId("4b2d75476cc613d5ee930164"),
    "game" : "pinball",
    "user" : "joe",
    "score" : 10050
}
The "score" key existed and had a numeric value, so the server added 10,000 to it.
"$inc" is similar to "$set", but it is designed for incrementing (and decrementing) numbers. "$inc" can be used only on values of type integer, long, or double. If it is used on any other type of value, it will fail. This includes types that many languages will automatically cast into numbers, like nulls, booleans, or strings of numeric characters:
> db.foo.insert({"count" : "1"})
> db.foo.update({}, {$inc : {count : 1}})
Cannot apply $inc modifier to non-number
Also, the value of the "$inc" key must be a number. You cannot increment by a string, array, or other non-numeric value. Doing so will give a “Modifier "$inc" allowed for numbers only” error message. To modify other types, use "$set" or one of the array operations described in a moment.
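These type rules can be sketched in Python. checked_inc is a hypothetical helper; the real server enforces the same checks on its own BSON types:

```python
def checked_inc(doc, key, amount):
    """"$inc" type-rule sketch: both the target value and the
    increment must be numbers (booleans don't count)."""
    if not isinstance(amount, (int, float)) or isinstance(amount, bool):
        raise TypeError('Modifier "$inc" allowed for numbers only')
    current = doc.get(key, 0)
    if not isinstance(current, (int, float)) or isinstance(current, bool):
        raise TypeError("Cannot apply $inc modifier to non-number")
    out = dict(doc)
    out[key] = current + amount
    return out

# incrementing a numeric string fails, just as in the shell example
try:
    checked_inc({"count": "1"}, "count", 1)
except TypeError as e:
    assert "non-number" in str(e)
```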
An extensive class of modifiers exists for manipulating arrays. Arrays are common and powerful data structures: not only are they lists that can be referenced by index, but they can also double as sets.
Array operators can be used only on keys with array values. For example, you cannot push on to an integer or pop off of a string. Use "$set" or "$inc" to modify scalar values.
"$push" adds an element to the end of an array if the specified key already exists and creates a new array if it does not. For example, suppose that we are storing blog posts and want to add a "comments" key containing an array. We can push a comment onto the nonexistent "comments" array, which will create the array and add the comment:
> db.blog.posts.findOne()
{
    "_id" : ObjectId("4b2d75476cc613d5ee930164"),
    "title" : "A blog post",
    "content" : "..."
}
> db.blog.posts.update({"title" : "A blog post"}, {$push : {"comments" :
... {"name" : "joe", "email" : "joe@example.com", "content" : "nice post."}}})
> db.blog.posts.findOne()
{
    "_id" : ObjectId("4b2d75476cc613d5ee930164"),
    "title" : "A blog post",
    "content" : "...",
    "comments" : [
        {
            "name" : "joe",
            "email" : "joe@example.com",
            "content" : "nice post."
        }
    ]
}
Now, if we want to add another comment, we can simply use "$push" again:
> db.blog.posts.update({"title" : "A blog post"}, {$push : {"comments" :
... {"name" : "bob", "email" : "bob@example.com", "content" : "good post."}}})
> db.blog.posts.findOne()
{
    "_id" : ObjectId("4b2d75476cc613d5ee930164"),
    "title" : "A blog post",
    "content" : "...",
    "comments" : [
        {
            "name" : "joe",
            "email" : "joe@example.com",
            "content" : "nice post."
        },
        {
            "name" : "bob",
            "email" : "bob@example.com",
            "content" : "good post."
        }
    ]
}
A common use is wanting to add a value to an array only if the value is not already present. This can be done using a "$ne" in the query document. For example, to push an author onto a list of citations, but only if he isn’t already there, use the following:
> db.papers.update({"authors cited" : {"$ne" : "Richie"}},
... {$push : {"authors cited" : "Richie"}})
This can also be done with "$addToSet", which is useful for cases where "$ne" won’t work or where "$addToSet" describes what is happening better.
For instance, suppose you have a document that represents a user. You might have a set of email addresses that they have added:
> db.users.findOne({"_id" : ObjectId("4b2d75476cc613d5ee930164")})
{
    "_id" : ObjectId("4b2d75476cc613d5ee930164"),
    "username" : "joe",
    "emails" : [
        "joe@example.com",
        "joe@gmail.com",
        "joe@yahoo.com"
    ]
}
When adding another address, you can use "$addToSet" to prevent duplicates:
> db.users.update({"_id" : ObjectId("4b2d75476cc613d5ee930164")},
... {"$addToSet" : {"emails" : "joe@gmail.com"}})
> db.users.findOne({"_id" : ObjectId("4b2d75476cc613d5ee930164")})
{
    "_id" : ObjectId("4b2d75476cc613d5ee930164"),
    "username" : "joe",
    "emails" : [
        "joe@example.com",
        "joe@gmail.com",
        "joe@yahoo.com"
    ]
}
> db.users.update({"_id" : ObjectId("4b2d75476cc613d5ee930164")},
... {"$addToSet" : {"emails" : "joe@hotmail.com"}})
> db.users.findOne({"_id" : ObjectId("4b2d75476cc613d5ee930164")})
{
    "_id" : ObjectId("4b2d75476cc613d5ee930164"),
    "username" : "joe",
    "emails" : [
        "joe@example.com",
        "joe@gmail.com",
        "joe@yahoo.com",
        "joe@hotmail.com"
    ]
}
You can also use "$addToSet" in conjunction with "$each" to add multiple unique values, which cannot be done with the "$ne"/"$push" combination. For instance, we could use these modifiers if the user wanted to add more than one email address:
> db.users.update({"_id" : ObjectId("4b2d75476cc613d5ee930164")}, {"$addToSet" :
... {"emails" : {"$each" :
...     ["joe@php.net", "joe@example.com", "joe@python.org"]}}})
> db.users.findOne({"_id" : ObjectId("4b2d75476cc613d5ee930164")})
{
    "_id" : ObjectId("4b2d75476cc613d5ee930164"),
    "username" : "joe",
    "emails" : [
        "joe@example.com",
        "joe@gmail.com",
        "joe@yahoo.com",
        "joe@hotmail.com",
        "joe@php.net",
        "joe@python.org"
    ]
}
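The difference between "$push" and "$addToSet" with "$each" can be sketched in Python with hypothetical helpers over plain dicts and lists:

```python
def apply_push(doc, key, value):
    """"$push" sketch: append, creating the array if it is missing."""
    out = dict(doc)
    out[key] = list(out.get(key, [])) + [value]
    return out

def apply_add_to_set(doc, key, values):
    """"$addToSet"/"$each" sketch: append only values not already
    present, preserving the order of the existing array."""
    out = dict(doc)
    arr = list(out.get(key, []))
    for v in values:
        if v not in arr:
            arr.append(v)
    out[key] = arr
    return out

user = {"username": "joe", "emails": ["a@example.com"]}
user = apply_push(user, "emails", "b@example.com")
# "a@example.com" is already present, so only "c@example.com" is added
user = apply_add_to_set(user, "emails", ["a@example.com", "c@example.com"])
assert user["emails"] == ["a@example.com", "b@example.com", "c@example.com"]
```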
There are a few ways to remove elements from an array. If you want to treat the array like a queue or a stack, you can use "$pop", which can remove elements from either end. {$pop : {key : 1}} removes an element from the end of the array; {$pop : {key : -1}} removes it from the beginning.
Sometimes an element should be removed based on specific criteria, rather than its position in the array. "$pull" is used to remove elements of an array that match the given criteria. For example, suppose we have a list of things that need to be done but not in any specific order:
> db.lists.insert({"todo" : ["dishes", "laundry", "dry cleaning"]})
If we do the laundry first, we can remove it from the list with the following:
> db.lists.update({}, {"$pull" : {"todo" : "laundry"}})
Now if we do a find, we’ll see that there are only two elements remaining in the array:
> db.lists.find()
{
    "_id" : ObjectId("4b2d75476cc613d5ee930164"),
    "todo" : [
        "dishes",
        "dry cleaning"
    ]
}
Pulling removes all matching documents, not just a single match. If you have an array that looks like [1, 1, 2, 1] and pull 1, you’ll end up with a single-element array, [2].
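This remove-every-match behavior can be sketched in a few lines of Python (apply_pull is a hypothetical helper):

```python
def apply_pull(doc, key, value):
    """"$pull" sketch: remove every matching element, not just one."""
    out = dict(doc)
    out[key] = [v for v in out.get(key, []) if v != value]
    return out

# all three 1s are pulled, leaving a single-element array
assert apply_pull({"nums": [1, 1, 2, 1]}, "nums", 1) == {"nums": [2]}
```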
Array manipulation becomes a little trickier when we have multiple values in an array and want to modify some of them. There are two ways to manipulate values in arrays: by position or by using the positional operator (the "$" character).
Arrays use 0-based indexing, and elements can be selected as though their index were a document key. For example, suppose we have a document containing an array with a few embedded documents, such as a blog post with comments:
> db.blog.posts.findOne()
{
    "_id" : ObjectId("4b329a216cc613d5ee930192"),
    "content" : "...",
    "comments" : [
        {
            "comment" : "good post",
            "author" : "John",
            "votes" : 0
        },
        {
            "comment" : "i thought it was too short",
            "author" : "Claire",
            "votes" : 3
        },
        {
            "comment" : "free watches",
            "author" : "Alice",
            "votes" : -1
        }
    ]
}
If we want to increment the number of votes for the first comment, we can say the following:
> db.blog.update({"post" : post_id},
... {"$inc" : {"comments.0.votes" : 1}})
In many cases, though, we don’t know what index of the array to modify without querying for the document first and examining it. To get around this, MongoDB has a positional operator, "$", that figures out which element of the array the query document matched and updates that element. For example, if we have a user named John who updates his name to Jim, we can replace it in the comments by using the positional operator:
db.blog.update({"comments.author" : "John"},
... {"$set" : {"comments.$.author" : "Jim"}})
The positional operator updates only the first match. Thus, if John had left more than one comment, his name would be changed only for the first comment he left.
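The first-match-only behavior can be sketched in Python. apply_positional_set is a hypothetical helper over plain dicts, standing in for the query-plus-"$" mechanics:

```python
def apply_positional_set(doc, array_key, match, field, value):
    """Positional-operator sketch: find the first array element that
    matches the criteria, then set one field on that element only."""
    out = dict(doc)
    arr = [dict(e) for e in out[array_key]]
    for element in arr:
        if all(element.get(k) == v for k, v in match.items()):
            element[field] = value
            break  # only the first match is updated
    out[array_key] = arr
    return out

post = {"comments": [{"author": "John", "votes": 0},
                     {"author": "John", "votes": 3}]}
post = apply_positional_set(post, "comments", {"author": "John"},
                            "author", "Jim")
# John left two comments, but only the first was renamed
assert [c["author"] for c in post["comments"]] == ["Jim", "John"]
```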
Some modifiers are faster than others. $inc modifies a document in place: it does not have to change the size of a document, only the value of a key, so it is very efficient. On the other hand, array modifiers might change the size of a document and can be slow. ("$set" can modify documents in place if the size isn’t changing but otherwise is subject to the same performance limitations as array operators.)
MongoDB leaves some padding around a document to allow for changes in size (and, in fact, figures out how much documents usually change in size and adjusts the amount of padding it leaves accordingly), but it will eventually have to allocate new space for a document if you make it much larger than it was originally. Compounding this slowdown, as arrays get longer, it takes MongoDB a longer amount of time to traverse the whole array, slowing down each array modification.
A simple program in Python can demonstrate the speed difference. This program inserts a single key and increments its value 100,000 times.
from pymongo import Connection
import time

db = Connection().performance_test
db.drop_collection("updates")
collection = db.updates

collection.insert({"x": 1})

# make sure the insert is complete before we start timing
collection.find_one()

start = time.time()

for i in range(100000):
    collection.update({}, {"$inc" : {"x" : 1}})

# make sure the updates are complete before we stop timing
collection.find_one()

print time.time() - start
On a MacBook Air this took 7.33 seconds. That’s more than 13,000 updates per second (which is pretty good for a fairly anemic machine). Now, let’s try it with a document with a single array key, pushing new values onto that array 100,000 times:
for i in range(100000):
    collection.update({}, {'$push' : {'x' : 1}})
This program took 67.58 seconds to run, which is less than 1,500 updates per second.
Using "$push" and other array modifiers is encouraged and often necessary, but it is good to keep in mind the trade-offs of such updates. If "$push" becomes a bottleneck, it may be worth pulling an embedded array out into a separate collection.
An upsert is a special type of update. If no document is found that matches the update criteria, a new document will be created by combining the criteria and update documents. If a matching document is found, it will be updated normally. Upserts can be very handy because they eliminate the need to “seed” your collection: you can have the same code create and update documents.
Let’s go back to our example recording the number of views for each page of a website. Without an upsert, we might try to find the URL and increment the number of views or create a new document if the URL doesn’t exist. If we were to write this out as a JavaScript program (instead of a series of shell commands; scripts can be run with mongo scriptname.js), it might look something like the following:
// check if we have an entry for this page
blog = db.analytics.findOne({url : "/blog"})

// if we do, add one to the number of views and save
if (blog) {
    blog.pageviews++;
    db.analytics.save(blog);
}
// otherwise, create a new document for this page
else {
    db.analytics.save({url : "/blog", pageviews : 1})
}
This means we are making a round-trip to the database, plus sending an update or insert, every time someone visits a page. If we are running this code in multiple processes, we are also subject to a race condition where more than one document can be inserted for a given URL.
We can eliminate the race condition and cut down on the amount of code by just sending an upsert (the third parameter to update specifies that this should be an upsert):
db.analytics.update({"url" : "/blog"}, {"$inc" : {"pageviews" : 1}}, true)
This line does exactly what the previous code block does, except it’s faster and atomic! The new document is created using the criteria document as a base and applying any modifier documents to it. For example, if you do an upsert that matches a key and has an increment to the value of that key, the increment will be applied to the match:
> db.math.remove()
> db.math.update({"count" : 25}, {"$inc" : {"count" : 3}}, true)
> db.math.findOne()
{
    "_id" : ObjectId("4b3295f26cc613d5ee93018f"),
    "count" : 28
}
The remove empties the collection, so there are no documents. The upsert creates a new document with a "count" of 25 and then increments that by 3, giving us a document where "count" is 28. If the upsert option were not specified, {"count" : 25} would not match any documents, so nothing would happen.
If we run the upsert again (with the criteria {count : 25}), it will create another new document. This is because the criteria does not match the only document in the collection. (Its "count" is 28.)
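The seed-from-criteria behavior can be sketched in Python for "$inc"-style upserts. A list of dicts stands in for the collection, and upsert is a hypothetical helper:

```python
def upsert(collection, criteria, inc_spec):
    """Upsert sketch for "$inc" modifiers: update the first match, or
    seed a new document from the criteria and then apply the modifier."""
    for doc in collection:
        if all(doc.get(k) == v for k, v in criteria.items()):
            for key, amount in inc_spec.items():
                doc[key] = doc.get(key, 0) + amount
            return
    new_doc = dict(criteria)  # the criteria document is the base
    for key, amount in inc_spec.items():
        new_doc[key] = new_doc.get(key, 0) + amount
    collection.append(new_doc)

math = []
upsert(math, {"count": 25}, {"count": 3})
assert math == [{"count": 28}]       # seeded with 25, then incremented
upsert(math, {"count": 25}, {"count": 3})
assert len(math) == 2                # 28 doesn't match, so a second doc
```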
save is a shell function that lets you insert a document if it doesn’t exist and update it if it does. It takes one argument: a document. If the document contains an "_id" key, save will do an upsert. Otherwise, it will do an insert. This is just a convenience function so that programmers can quickly modify documents in the shell:
> var x = db.foo.findOne()
> x.num = 42
42
> db.foo.save(x)
Without save, the last line would have been a more cumbersome db.foo.update({"_id" : x._id}, x).
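save’s insert-or-upsert decision can be sketched in Python. A list stands in for the collection, a counter stands in for ObjectId generation, and the save helper itself is hypothetical:

```python
def save(collection, doc):
    """Sketch of the shell's save(): upsert by "_id" when one is
    present, otherwise insert with a freshly generated id."""
    if "_id" in doc:
        for i, existing in enumerate(collection):
            if existing.get("_id") == doc["_id"]:
                collection[i] = doc  # replace the existing document
                return
        collection.append(doc)       # upsert with no match: insert
    else:
        doc = dict(doc, _id=len(collection) + 1)  # counter, not ObjectId
        collection.append(doc)

coll = []
save(coll, {"num": 40})              # no "_id": plain insert
x = dict(coll[0], num=42)
save(coll, x)                        # has "_id": replaces in place
assert coll == [{"num": 42, "_id": 1}]
```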
Updates, by default, update only the first document found that matches the criteria. If there are more matching documents, they will remain unchanged. To modify all of the documents matching the criteria, you can pass true as the fourth parameter to update.
update’s behavior may be changed in the future (the server may update all matching documents by default and update one only if false is passed as the fourth parameter), so it is recommended that you always specify whether you want a multiple update.
Not only is it more obvious what the update should be doing, but your program won’t break if the default is ever changed.
Multiupdates are a great way of performing schema migrations or rolling out new features to certain users. Suppose, for example, we want to give a gift to every user who has a birthday on a certain day. We can use multiupdate to add a "gift" to their account:
> db.users.update({birthday : "10/13/1978"},
... {$set : {gift : "Happy Birthday!"}}, false, true)
This would add the "gift" key to all user documents with birthdays on October 13, 1978.
To see the number of documents updated by a multiple update, you can run the getLastError database command (which might be better named "getLastOpStatus"). The "n" key will contain the number of documents affected by the update:
> db.count.update({x : 1}, {$inc : {x : 1}}, false, true)
> db.runCommand({getLastError : 1})
{
    "err" : null,
    "updatedExisting" : true,
    "n" : 5,
    "ok" : true
}
"n" is 5, meaning that five documents were affected by the update. "updatedExisting" is true, meaning that the update modified existing document(s). For more on database commands and their responses, see Chapter 7.
You can get some limited information about what was updated by calling getLastError, but it does not actually return the updated document. For that, you’ll need the findAndModify command.
findAndModify is called differently than a normal update and is a bit slower, because it must wait for a database response. It is handy for manipulating queues and performing other operations that need get-and-set style atomicity.
Suppose we have a collection of processes run in a certain order. Each is represented with a document that has the following form:
{
    "_id" : ObjectId(),
    "status" : state,
    "priority" : N
}
"status" is a string that can be “READY,” “RUNNING,” or “DONE.” We need to find the job with the highest priority in the “READY” state, run the process function, and then update the status to “DONE.” We might try querying for the ready processes, sorting by priority, and updating the status of the highest-priority process to mark it as “RUNNING.” Once we have processed it, we update the status to “DONE.” This looks something like the following:
ps = db.processes.find({"status" : "READY"}).sort({"priority" : -1}).limit(1).next()
db.processes.update({"_id" : ps._id}, {"$set" : {"status" : "RUNNING"}})
do_something(ps);
db.processes.update({"_id" : ps._id}, {"$set" : {"status" : "DONE"}})
This algorithm isn’t great, because it is subject to a race condition. Suppose we have two threads running. If one thread (call it A) retrieved the document and another thread (call it B) retrieved the same document before A had updated its status to “RUNNING,” then both threads would be running the same process. We can avoid this by checking the status as part of the update query, but this becomes complex:
var cursor = db.processes.find({"status" : "READY"}).sort({"priority" : -1}).limit(1);
while ((ps = cursor.next()) != null) {
    db.processes.update({"_id" : ps._id, "status" : "READY"},
        {"$set" : {"status" : "RUNNING"}});

    var lastOp = db.runCommand({getlasterror : 1});
    if (lastOp.n == 1) {
        do_something(ps);
        db.processes.update({"_id" : ps._id}, {"$set" : {"status" : "DONE"}})
        break;
    }
    cursor = db.processes.find({"status" : "READY"}).sort({"priority" : -1}).limit(1);
}
Also, depending on timing, one thread may end up doing all the work while another thread is uselessly trailing it. Thread A could always grab the process, and then B would try to get the same process, fail, and leave A to do all the work. Situations like this are perfect for findAndModify. findAndModify can return the item and update it in a single operation. In this case, it looks like the following:
> ps = db.runCommand({"findAndModify" : "processes",
... "query" : {"status" : "READY"},
... "sort" : {"priority" : -1},
... "update" : {"$set" : {"status" : "RUNNING"}}})
{
    "ok" : 1,
    "value" : {
        "_id" : ObjectId("4b3e7a18005cab32be6291f7"),
        "priority" : 1,
        "status" : "READY"
    }
}
Notice that the status is still “READY” in the returned document: the document is returned before the modifier document is applied. If you do a find on the collection, though, you will see that the document’s "status" has been updated to “RUNNING”:
> db.processes.findOne({"_id" : ps.value._id})
{
    "_id" : ObjectId("4b3e7a18005cab32be6291f7"),
    "priority" : 1,
    "status" : "RUNNING"
}
Thus, the program becomes the following:
> ps = db.runCommand({"findAndModify" : "processes",
... "query" : {"status" : "READY"},
... "sort" : {"priority" : -1},
... "update" : {"$set" : {"status" : "RUNNING"}}}).value
> do_something(ps)
> db.processes.update({"_id" : ps._id}, {"$set" : {"status" : "DONE"}})
findAndModify can have either an "update" key or a "remove" key. A "remove" key indicates that the matching document should be removed from the collection. For instance, if we wanted to simply remove the job instead of updating its status, we could run the following:
> ps = db.runCommand({"findAndModify" : "processes",
... "query" : {"status" : "READY"},
... "sort" : {"priority" : -1},
... "remove" : true}).value
> do_something(ps)
The values for each key in the findAndModify command are as follows:
findAndModify
A string, the collection name.
query
A query document, the criteria with which to search for documents.
sort
Criteria by which to sort results.
update
A modifier document, the update to perform on the document found.
remove
Boolean specifying whether the document should be removed.
new
Boolean specifying whether the document returned should be the updated document or the preupdate document. Defaults to the preupdate document.
Either "update"
or "remove"
must be included, but not both. If no matching document is found, the
command will return an error.
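To make the interaction of these keys concrete, here is a small in-memory model in plain JavaScript (runnable in Node, not the mongo shell). The findAndModify function below merely mimics the command’s behavior for illustration; it handles only exact-match queries and $set-style modifiers, and it returns null when nothing matches (where the real command reports an error).

```javascript
// Sketch (plain Node.js): an in-memory model of findAndModify's keys --
// query, sort, update, remove, and new. Illustrative only, not the
// server implementation.
function findAndModify(coll, opts) {
    // Filter by the query document (exact-match fields only, for brevity).
    var matches = coll.filter(function (doc) {
        return Object.keys(opts.query).every(function (k) {
            return doc[k] === opts.query[k];
        });
    });
    if (matches.length === 0) return null;   // no match (the real command errors)

    // Sort and take the first document, like sort + limit(1).
    var key = Object.keys(opts.sort)[0];
    matches.sort(function (a, b) {
        return (a[key] - b[key]) * opts.sort[key];
    });
    var doc = matches[0];
    var before = Object.assign({}, doc);     // snapshot of the preupdate document

    if (opts.remove) {
        coll.splice(coll.indexOf(doc), 1);   // "remove" : true pulls it out
        return before;
    }
    Object.assign(doc, opts.update.$set);    // apply a $set-style modifier
    return opts.new ? doc : before;          // "new" picks which copy you get
}

var jobs = [{_id: 1, status: "READY", priority: 1},
            {_id: 2, status: "READY", priority: 3}];
var got = findAndModify(jobs, {
    query: {status: "READY"},
    sort: {priority: -1},
    update: {$set: {status: "RUNNING"}}
});
console.log(got.status);  // prints: READY -- the preupdate document by default
```

Passing "new" : true instead returns the post-update copy, and swapping the update key for "remove" : true returns the document while deleting it from the collection.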
findAndModify has a few limitations. First, it can update or remove only one document at a time. There is also no way to use it for an upsert; it can update only existing documents.
The price of using findAndModify over a traditional update is speed: it is a bit slower. That said, it is no slower than one might expect: it takes roughly the same amount of time as a find, update, and getLastError command performed in serial.
The three operations that this chapter focused on (inserts, removes, and updates) seem instantaneous because none of them waits for a database response. They are not asynchronous; they can be thought of as “fire-and-forget” functions: the client sends the documents to the server and immediately continues. The client never receives an “OK, got that” or a “not OK, could you send that again?” response.
The benefit to this is that the speed at which you can perform these operations is terrific. You are often only limited by the speed at which your client can send them and the speed of your network. This works well most of the time; however, sometimes something goes wrong: a server crashes, a rat chews through a network cable, or a data center is in a flood zone. If the server disappears, the client will happily send some writes to a server that isn’t there, entirely unaware of its absence. For some applications, this is acceptable. Losing a couple of seconds of log messages, user clicks, or analytics in a hardware failure is not the end of the world. For others, this is not the behavior the programmer wants (payment-processing systems spring to mind).
Suppose you are writing an ecommerce application. If someone orders something, the application should probably take a little extra time to make sure the order goes through. That is why you can do a “safe” version of these operations, where you check whether there was an error in execution and attempt to redo them.
MongoDB developers made unchecked operations the default because of their experience with relational databases. Many applications written on top of relational databases do not care about or check the return codes, yet they incur the performance penalty of waiting for those return codes to arrive. MongoDB pushes this choice to the user. This way, programs that collect log messages or real-time analytics don’t have to wait for return codes that they don’t care about.
The safe version of these operations runs a getLastError command immediately following the operation to check whether it succeeded (see Database Commands for more on commands). The driver waits for the database response and then handles errors appropriately, throwing a catchable exception in most cases. This way, developers can catch and handle database errors in whatever way feels “natural” for their language. When an operation is successful, the getLastError response also contains some additional information (e.g., for an update or remove, it includes the number of documents affected).
The same getLastError command that powers safe mode also contains functionality for checking that operations have been successfully replicated. For more on this feature, see Blocking for Replication.
The price of performing “safe” operations is performance: waiting for a database response takes an order of magnitude longer than sending the message, ignoring the client-side cost of handling exceptions. (This cost varies by language but is usually fairly heavyweight.) Thus, applications should weigh the importance of their data (and the consequences if some of it is lost) versus the speed needed.
When in doubt, use safe operations. If they aren’t fast enough, start making less important operations fire-and-forget.
More specifically:
If you live dangerously, use fire-and-forget operations exclusively.
If you want to live longer, save valuable user input (account sign-ups, credit card numbers, emails) with safe operations and do everything else with fire-and-forget operations.
If you are cautious, use safe operations exclusively. If your application is automatically generating hundreds of little pieces of information to save (e.g., page, user, or advertising statistics), these can still use the fire-and-forget operation.
Safe operations are also a good way to debug “strange” database behavior, not just for preventing the apocalyptic scenarios described earlier. Safe operations should be used extensively while developing, even if they are later removed before going into production. They can protect against many common database usage errors, most commonly duplicate key errors.
Duplicate key errors often occur when users try to insert a document with a duplicate value for the "_id" key. MongoDB does not allow multiple documents with the same "_id" in the same collection. If you do a safe insert and a duplicate key error occurs, the server error will be picked up by the safety check, and an exception will be thrown. In unsafe mode, there is no database response, and you might not be aware that the insert failed.
For example, using the shell, you can see that inserting two documents with the same "_id" will not work:
> db.foo.insert({"_id" : 123, "x" : 1})
> db.foo.insert({"_id" : 123, "x" : 2})
E11000 duplicate key error index: test.foo.$_id_  dup key: { : 123.0 }
If we examine the collection, we can see that only the first document was successfully inserted. Note that this error can occur with any unique index, not just the one on "_id". The shell always checks for errors; in the drivers it is optional.
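The difference between the two modes can be sketched in plain JavaScript (runnable in Node, not a real driver). The mock server below rejects duplicate "_id" values; the helper names (insert, safeInsert, serverInsert) are illustrative, not a driver API. The point is only the shape of the two code paths: fire-and-forget sends and moves on, while the safe version reads back an error indicator and raises an exception.

```javascript
// Sketch (plain Node.js): fire-and-forget insert versus a "safe" insert
// that checks the equivalent of getLastError. Illustrative mock only.
var docs = {};          // mock collection keyed by _id
var lastError = null;   // what getLastError would report

function serverInsert(doc) {
    if (docs.hasOwnProperty(doc._id)) {
        lastError = "E11000 duplicate key error";
        return;
    }
    lastError = null;
    docs[doc._id] = doc;
}

// Fire-and-forget: send and immediately continue; errors vanish.
function insert(doc) {
    serverInsert(doc);
}

// Safe: follow the insert with an error check and throw.
function safeInsert(doc) {
    serverInsert(doc);
    if (lastError !== null) throw new Error(lastError);
}

insert({_id: 123, x: 1});
insert({_id: 123, x: 2});       // silently fails -- no response is read
var threw = false;
try {
    safeInsert({_id: 123, x: 3});
} catch (e) {
    threw = true;               // duplicate key surfaced as an exception
}
console.log(threw);             // prints: true
```

With fire-and-forget, the second insert fails without the application ever knowing; the safe version turns the same server error into a catchable exception.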
For each connection to a MongoDB server, the database creates a queue for that connection’s requests. When the client sends a request, it will be placed at the end of its connection’s queue. Any subsequent requests on the connection will occur after the enqueued operation is processed. Thus, a single connection has a consistent view of the database and can always read its own writes.
Note that this is a per-connection queue: if we open two shells, we will have two connections to the database. If we perform an insert in one shell, a subsequent query in the other shell might not return the inserted document. However, within a single shell, if we query for the document after inserting, the document will be returned. This behavior can be difficult to duplicate by hand, but on a busy server, interleaved inserts/queries are very likely to occur. Often developers run into this when they insert data in one thread and then check that it was successfully inserted in another. For a second or two, it looks like the data was not inserted, and then it suddenly appears.
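The per-connection ordering can be sketched in plain JavaScript (runnable in Node). The queue model below is purely illustrative, not the wire protocol: each connection keeps a FIFO of pending requests, so a query on a connection is processed after that connection’s earlier writes (guaranteeing read-your-own-writes), while a query on a different connection may run before the first connection’s insert has been processed.

```javascript
// Sketch (plain Node.js): per-connection request queues. A connection's
// requests run in order, so it always reads its own writes; another
// connection may query before the first one's insert is processed.
// Illustrative model only.
var store = [];   // shared server-side data

function Connection() {
    this.queue = [];   // this connection's pending requests, in order
}
Connection.prototype.insert = function (doc) {
    this.queue.push(function () { store.push(doc); });
};
Connection.prototype.find = function () {
    // The find sits behind this connection's earlier writes, so
    // draining the queue first guarantees read-your-own-writes.
    this.queue.forEach(function (op) { op(); });
    this.queue = [];
    return store.slice();
};

var connA = new Connection();
var connB = new Connection();

connA.insert({x: 1});          // queued on A, not yet processed
var seenByB = connB.find();    // B's queue has no such write; may miss it
var seenByA = connA.find();    // A's find runs after A's insert
console.log(seenByB.length, seenByA.length);  // prints: 0 1
```

This is the “data suddenly appears” effect described above: connection B eventually sees the document once the server processes A’s queue, just not necessarily on its first query.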
This behavior is especially worth keeping in mind when using the Ruby, Python, and Java drivers, because all three drivers use connection pooling. For efficiency, these drivers open multiple connections (a pool) to the server and distribute requests across them. They all, however, have mechanisms to guarantee that a series of requests is processed by a single connection. There is detailed documentation on connection pooling in various languages on the MongoDB wiki.