Chapter 3. Creating, Updating, and Deleting Documents

This chapter covers the basics of moving data into and out of the database, including the following:

  • Adding new documents to a collection

  • Removing documents from a collection

  • Updating existing documents

  • Choosing the correct level of safety versus speed for all of these operations

Inserting and Saving Documents

Inserts are the basic method for adding data to MongoDB. To insert a document into a collection, use the collection’s insert method:

> db.foo.insert({"bar" : "baz"})

This will add an "_id" key to the document (if one does not already exist) and save it to MongoDB.

Batch Insert

If you are inserting multiple documents into a collection, you can make the inserts faster by using batch inserts. Batch inserts allow you to pass an array of documents to the database.

Sending dozens, hundreds, or even thousands of documents at a time can make inserts significantly faster. A batch insert is a single TCP request, meaning that you do not incur the overhead of doing hundreds of individual requests. It can also cut insert time by eliminating a lot of the header processing that gets done for each message. When an individual document is sent to the database, it is prefixed by a header that tells the database to do an insert operation on a certain collection. By using batch insert, the database doesn’t need to reprocess this information for each document.

Batch inserts are intended to be used in applications, such as for inserting a couple hundred sensor data points into an analytics collection at once. They are useful only if you are inserting multiple documents into a single collection: you cannot use batch inserts to insert into multiple collections with a single request. If you are just importing raw data (for example, from a data feed or MySQL), there are command-line tools like mongoimport that can be used instead of batch insert. On the other hand, it is often handy to munge data before saving it to MongoDB (converting dates to the date type or adding a custom "_id") so batch inserts can be used for importing data, as well.

Current versions of MongoDB do not accept messages longer than 16MB, so there is a limit to how much can be inserted in a single batch insert.
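Because of that message cap, a driver inserting a very large list of documents has to split it into batches that each fit under the limit. A rough sketch of that chunking logic in plain Python is shown below; `encoded_size` is a stand-in for the real BSON size calculation, and `make_batches` is a hypothetical helper, not a pymongo API:

```python
def encoded_size(doc):
    """Stand-in for the document's encoded BSON size, in bytes."""
    return len(repr(doc))

def make_batches(docs, max_bytes=16 * 1024 * 1024):
    """Group docs into batches whose total encoded size stays under max_bytes."""
    batches, current, current_size = [], [], 0
    for doc in docs:
        size = encoded_size(doc)
        # start a new batch if adding this doc would exceed the cap
        if current and current_size + size > max_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(doc)
        current_size += size
    if current:
        batches.append(current)
    return batches

docs = [{"n": i} for i in range(1000)]
batches = make_batches(docs, max_bytes=200)
assert sum(len(b) for b in batches) == 1000  # no documents lost in chunking
```

Each batch would then be sent as a single insert message, preserving the per-message savings described above.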

Inserts: Internals and Implications

When you perform an insert, the driver you are using converts the data structure into BSON, which it then sends to the database (see Appendix C for more on BSON). The database understands BSON and checks for an "_id" key and that the document’s size does not exceed 4MB, but other than that, it doesn’t do data validation; it just saves the document to the database as is. This has a couple of side effects, most notably that you can insert invalid data and that your database is fairly secure from injection attacks.

All of the drivers for major languages (and most of the minor ones, too) check for a variety of invalid data (documents that are too large, contain non-UTF-8 strings, or use unrecognized types) before sending anything to the database. If you are running a driver that you are not sure about, you can start the database server with the --objcheck option, and it will examine each document’s structural validity before inserting it (at the cost of slower performance).

Note

Documents larger than 4MB (when converted to BSON) cannot be saved to the database. This is a somewhat arbitrary limit (and may be raised in the future); it is mostly to prevent bad schema design and ensure consistent performance. To see the BSON size (in bytes) of the document doc, run Object.bsonsize(doc) from the shell.

To give you an idea of how much 4MB is, the entire text of War and Peace is just 3.14MB.

MongoDB does not do any sort of code execution on inserts, so they are not vulnerable to injection attacks. Traditional injection attacks are impossible with MongoDB, and alternative injection-type attacks are generally easy to guard against; inserts, in particular, are invulnerable to them.

Removing Documents

Now that there’s data in our database, let’s delete it.

> db.users.remove()

This will remove all of the documents in the users collection. This doesn’t actually remove the collection, and any indexes created on it will still exist.

The remove function optionally takes a query document as a parameter. When it’s given, only documents that match the criteria will be removed. Suppose, for instance, that we want to remove everyone from the mailing.list collection where the value for "opt-out" is true:

> db.mailing.list.remove({"opt-out" : true})

Once data has been removed, it is gone forever. There is no way to undo the remove or recover deleted documents.

Remove Speed

Removing documents is usually a fairly quick operation, but if you want to clear an entire collection, it is faster to drop it (and then re-create any indexes).

For example, in Python, suppose we insert a million dummy elements with the following:

for i in range(1000000):
    collection.insert({"foo": "bar", "baz": i, "z": 10 - i})

Now we’ll try to remove all of the documents we just inserted, measuring the time it takes. First, here’s a simple remove:

import time

from pymongo import Connection

db = Connection().foo
collection = db.bar

start = time.time()

collection.remove()
collection.find_one()

total = time.time() - start
print "%.2f seconds" % total

On a MacBook Air, this script prints “46.08 seconds.”

If the remove and find_one are replaced by db.drop_collection("bar"), the time drops to .01 seconds! This is obviously a vast improvement, but it comes at the expense of granularity: we cannot specify any criteria. The whole collection is dropped, and all of its indexes are deleted.

Updating Documents

Once a document is stored in the database, it can be changed using the update method. update takes two parameters: a query document, which locates documents to update, and a modifier document, which describes the changes to make to the documents found.

Updates are atomic: if two updates happen at the same time, whichever one reaches the server first will be applied, and then the next one will be applied. Thus, conflicting updates can safely be sent in rapid-fire succession without any documents being corrupted: the last update will “win.”

Document Replacement

The simplest type of update fully replaces a matching document with a new one. This can be useful to do a dramatic schema migration. For example, suppose we are making major changes to a user document, which looks like the following:

{
    "_id" : ObjectId("4b2b9f67a1f631733d917a7a"),
    "name" : "joe",
    "friends" : 32,
    "enemies" : 2
}

We want to change that document into the following:

{
    "_id" : ObjectId("4b2b9f67a1f631733d917a7a"),
    "username" : "joe",
    "relationships" :
        {
            "friends" : 32,
            "enemies" : 2
        }
}

We can make this change by replacing the document using an update:

> var joe = db.users.findOne({"name" : "joe"});
> joe.relationships = {"friends" : joe.friends, "enemies" : joe.enemies};
{
    "friends" : 32,
    "enemies" : 2
}
> joe.username = joe.name;
"joe"
> delete joe.friends;
true
> delete joe.enemies;
true
> delete joe.name;
true
> db.users.update({"name" : "joe"}, joe);

Now, doing a findOne shows that the structure of the document has been updated.

A common mistake is matching more than one document with the criteria and then creating a duplicate "_id" value with the second parameter. The database will throw an error for this, and nothing will be changed.

For example, suppose we create several documents with the same "name", but we don’t realize it:

> db.people.find()
{"_id" : ObjectId("4b2b9f67a1f631733d917a7b"), "name" : "joe", "age" : 65}
{"_id" : ObjectId("4b2b9f67a1f631733d917a7c"), "name" : "joe", "age" : 20}
{"_id" : ObjectId("4b2b9f67a1f631733d917a7d"), "name" : "joe", "age" : 49}

Now, if it’s Joe #2’s birthday, we want to increment the value of his "age" key, so we might say this:

> joe = db.people.findOne({"name" : "joe", "age" : 20});
{
    "_id" : ObjectId("4b2b9f67a1f631733d917a7c"),
    "name" : "joe",
    "age" : 20
}
> joe.age++;
> db.people.update({"name" : "joe"}, joe);
E11001 duplicate key on update

What happened? When you call update, the database will look for a document matching {"name" : "joe"}. The first one it finds will be the 65-year-old Joe. It will attempt to replace that document with the one in the joe variable, but there’s already a document in this collection with the same "_id". Thus, the update will fail, because "_id" values must be unique. The best way to avoid this situation is to make sure that your update always specifies a unique document, perhaps by matching on a key like "_id".

Using Modifiers

Usually only certain portions of a document need to be updated. Partial updates can be done extremely efficiently by using atomic update modifiers. Update modifiers are special keys that can be used to specify complex update operations, such as altering, adding, or removing keys, and even manipulating arrays and embedded documents.

Suppose we were keeping website analytics in a collection and wanted to increment a counter each time someone visited a page. We can use update modifiers to do this increment atomically. Each URL and its number of page views is stored in a document that looks like this:

{
    "_id" : ObjectId("4b253b067525f35f94b60a31"),
    "url" : "www.example.com",
    "pageviews" : 52
}

Every time someone visits a page, we can find the page by its URL and use the "$inc" modifier to increment the value of the "pageviews" key.

> db.analytics.update({"url" : "www.example.com"},
... {"$inc" : {"pageviews" : 1}})

Now, if we do a find, we see that "pageviews" has increased by one.

> db.analytics.find()
{
    "_id" : ObjectId("4b253b067525f35f94b60a31"),
    "url" : "www.example.com",
    "pageviews" : 53
}

Tip

Perl and PHP programmers are probably thinking that any character would have been a better choice than $. Both of these languages use $ as a variable prefix and will replace $-prefixed strings with their variable value in double-quoted strings. However, MongoDB started out as a JavaScript database, and $ has no special meaning in JavaScript, so it was used. It is an annoying historical relic from MongoDB’s primordial soup.

There are several options for Perl and PHP programmers. First, you could just escape the $: "$foo". You can use single quotes, which don’t do variable interpolation: '$foo'. Finally, both drivers allow you to define your own character that will be used instead of $. In Perl, set $MongoDB::BSON::char, and in PHP set mongo.cmd_char in php.ini to =, :, ?, or any other character that you would like to use instead of $. Then, if you choose, say, ~, you would use ~inc instead of $inc and ~gt instead of $gt.

Good choices for the special character are characters that will not naturally appear in key names (don’t use _ or x) and are not characters that have to be escaped themselves, which will gain you nothing and be confusing (such as \ or, in Perl, @).

When using modifiers, the value of "_id" cannot be changed. (Note that "_id" can be changed by using whole-document replacement.) Values for any other key, including other uniquely indexed keys, can be modified.

Getting started with the “$set” modifier

"$set" sets the value of a key. If the key does not yet exist, it will be created. This can be handy for updating schema or adding user-defined keys. For example, suppose you have a simple user profile stored as a document that looks something like the following:

> db.users.findOne()
{
    "_id" : ObjectId("4b253b067525f35f94b60a31"),
    "name" : "joe",
    "age" : 30,
    "sex" : "male",
    "location" : "Wisconsin"
}

This is a pretty bare-bones user profile. If the user wanted to store his favorite book in his profile, he could add it using "$set":

> db.users.update({"_id" : ObjectId("4b253b067525f35f94b60a31")},
... {"$set" : {"favorite book" : "war and peace"}})

Now the document will have a “favorite book” key:

> db.users.findOne()
{
    "_id" : ObjectId("4b253b067525f35f94b60a31"),
    "name" : "joe",
    "age" : 30,
    "sex" : "male",
    "location" : "Wisconsin",
    "favorite book" : "war and peace"
}

If the user decides that he actually enjoys a different book, "$set" can be used again to change the value:

> db.users.update({"name" : "joe"},
... {"$set" : {"favorite book" : "green eggs and ham"}})

"$set" can even change the type of the key it modifies. For instance, if our fickle user decides that he actually likes quite a few books, he can change the value of the “favorite book” key into an array:

> db.users.update({"name" : "joe"},
... {"$set" : {"favorite book" :
...     ["cat's cradle", "foundation trilogy", "ender's game"]}})

If the user realizes that he actually doesn’t like reading, he can remove the key altogether with "$unset":

> db.users.update({"name" : "joe"},
... {"$unset" : {"favorite book" : 1}})

Now the document will be the same as it was at the beginning of this example.

You can also use "$set" to reach in and change embedded documents:

> db.blog.posts.findOne()
{
    "_id" : ObjectId("4b253b067525f35f94b60a31"),
    "title" : "A Blog Post",
    "content" : "...",
    "author" : {
        "name" : "joe",
        "email" : "[email protected]"
    }
}
> db.blog.posts.update({"author.name" : "joe"}, {"$set" : {"author.name" : "joe schmoe"}})
> db.blog.posts.findOne()
{
    "_id" : ObjectId("4b253b067525f35f94b60a31"),
    "title" : "A Blog Post",
    "content" : "...",
    "author" : {
        "name" : "joe schmoe",
        "email" : "[email protected]"
    }
}

You must always use a $ modifier for adding, changing, or removing keys. A common error when starting out is to try to set the value of "foo" to "bar" by doing an update that looks like this:

> db.coll.update(criteria, {"foo" : "bar"})

This will not function as intended. It actually does a full-document replacement, replacing the matched document with {"foo" : "bar"}. Always use $ operators for modifying individual key/value pairs.
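The distinction can be sketched in plain Python (no server involved): an update document containing $-operators modifies fields, while one without them replaces the whole document, preserving only "_id". `apply_update` is a hypothetical helper that handles only "$set", not a pymongo API:

```python
def apply_update(doc, update):
    """Sketch of server-side update semantics for a single document."""
    if any(key.startswith("$") for key in update):
        result = dict(doc)
        for op, fields in update.items():
            if op == "$set":
                result.update(fields)   # set/overwrite the named keys
            else:
                raise ValueError("unsupported operator: " + op)
        return result
    # no operators: whole-document replacement (the "_id" is preserved)
    replacement = dict(update)
    if "_id" in doc:
        replacement["_id"] = doc["_id"]
    return replacement

doc = {"_id": 1, "name": "joe", "age": 30}
# with "$set", other keys survive:
assert apply_update(doc, {"$set": {"foo": "bar"}}) == \
    {"_id": 1, "name": "joe", "age": 30, "foo": "bar"}
# without it, the document is replaced wholesale:
assert apply_update(doc, {"foo": "bar"}) == {"_id": 1, "foo": "bar"}
```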

Incrementing and decrementing

The "$inc" modifier can be used to change the value for an existing key or to create a new key if it does not already exist. It is very useful for updating analytics, karma, votes, or anything else that has a changeable, numeric value.

Suppose we are creating a game collection where we want to save games and update scores as they change. When a user starts playing, say, a game of pinball, we can insert a document that identifies the game by name and user playing it:

> db.games.insert({"game" : "pinball", "user" : "joe"})

When the ball hits a bumper, the game should increment the player’s score. As points in pinball are given out pretty freely, let’s say that the base unit of points a player can earn is 50. We can use the "$inc" modifier to add 50 to the player’s score:

> db.games.update({"game" : "pinball", "user" : "joe"},
... {"$inc" : {"score" : 50}})

If we look at the document after this update, we’ll see the following:

> db.games.findOne()
{
     "_id" : ObjectId("4b2d75476cc613d5ee930164"),
     "game" : "pinball",
     "user" : "joe",
     "score" : 50
}

The score key did not already exist, so it was created by "$inc" and set to the increment amount: 50.

If the ball lands in a “bonus” slot, we want to add 10,000 to the score. This can be accomplished by passing a different value to "$inc":

> db.games.update({"game" : "pinball", "user" : "joe"},
... {"$inc" : {"score" : 10000}})

Now if we look at the game, we’ll see the following:

> db.games.find()
{
     "_id" : ObjectId("4b2d75476cc613d5ee930164"),
     "game" : "pinball",
     "user" : "joe",
     "score" : 10050
}

The "score" key existed and had a numeric value, so the server added 10,000 to it.

"$inc" is similar to "$set", but it is designed for incrementing (and decrementing) numbers. "$inc" can be used only on values of type integer, long, or double. If it is used on any other type of value, it will fail. This includes types that many languages will automatically cast into numbers, like nulls, booleans, or strings of numeric characters:

> db.foo.insert({"count" : "1"})
> db.foo.update({}, {$inc : {count : 1}})
Cannot apply $inc modifier to non-number

Also, the value of the "$inc" key must be a number. You cannot increment by a string, array, or other non-numeric value. Doing so will give a “Modifier "$inc" allowed for numbers only” error message. To modify other types, use "$set" or one of the array operations described in a moment.
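These type rules can be mimicked in plain Python. In the sketch below, `inc_field` is a hypothetical helper (not a real API) that enforces the same constraints: both the stored value and the increment must be numeric, and a missing key starts from 0:

```python
def inc_field(doc, key, amount):
    """Sketch of "$inc": numbers only, missing keys are created at 0."""
    if not isinstance(amount, (int, float)) or isinstance(amount, bool):
        raise TypeError('Modifier "$inc" allowed for numbers only')
    current = doc.get(key, 0)   # a missing key is created, starting from 0
    if not isinstance(current, (int, float)) or isinstance(current, bool):
        raise TypeError("Cannot apply $inc modifier to non-number")
    doc[key] = current + amount
    return doc

game = {"game": "pinball", "user": "joe"}
inc_field(game, "score", 50)      # key created and set to 50
inc_field(game, "score", 10000)   # existing numeric value incremented
assert game["score"] == 10050
```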

Array modifiers

An extensive class of modifiers exists for manipulating arrays. Arrays are common and powerful data structures: not only are they lists that can be referenced by index, but they can also double as sets.

Array operators can be used only on keys with array values: you cannot push onto an integer or pop off of a string, for example. Use "$set" or "$inc" to modify scalar values.

"$push" adds an element to the end of an array if the specified key already exists and creates a new array if it does not. For example, suppose that we are storing blog posts and want to add a "comments" key containing an array. We can push a comment onto the nonexistent "comments" array, which will create the array and add the comment:

> db.blog.posts.findOne()
{
    "_id" : ObjectId("4b2d75476cc613d5ee930164"),
    "title" : "A blog post",
    "content" : "..."
}
> db.blog.posts.update({"title" : "A blog post"}, {$push : {"comments" :
... {"name" : "joe", "email" : "[email protected]", "content" : "nice post."}}})
> db.blog.posts.findOne()
{
    "_id" : ObjectId("4b2d75476cc613d5ee930164"),
    "title" : "A blog post",
    "content" : "...",
    "comments" : [
        {
            "name" : "joe",
            "email" : "[email protected]",
            "content" : "nice post."
        }
    ]
}

Now, if we want to add another comment, we can simply use "$push" again:

> db.blog.posts.update({"title" : "A blog post"}, {$push : {"comments" :
... {"name" : "bob", "email" : "[email protected]", "content" : "good post."}}})
> db.blog.posts.findOne()
{
    "_id" : ObjectId("4b2d75476cc613d5ee930164"),
    "title" : "A blog post",
    "content" : "...",
    "comments" : [
        {
            "name" : "joe",
            "email" : "[email protected]",
            "content" : "nice post."
        },
        {
            "name" : "bob",
            "email" : "[email protected]",
            "content" : "good post."
        }
    ]
}

It is common to want to add a value to an array only if the value is not already present. This can be done using a "$ne" in the query document. For example, to push an author onto a list of citations, but only if he isn’t already there, use the following:

> db.papers.update({"authors cited" : {"$ne" : "Richie"}},
... {$push : {"authors cited" : "Richie"}})

This can also be done with "$addToSet", which is useful for cases where "$ne" won’t work or where "$addToSet" describes what is happening better.

For instance, suppose you have a document that represents a user. You might have a set of email addresses that they have added:

> db.users.findOne({"_id" : ObjectId("4b2d75476cc613d5ee930164")})
{
    "_id" : ObjectId("4b2d75476cc613d5ee930164"),
    "username" : "joe",
    "emails" : [
        "[email protected]",
        "[email protected]",
        "[email protected]"
    ]
}

When adding another address, you can use "$addToSet" to prevent duplicates:

> db.users.update({"_id" : ObjectId("4b2d75476cc613d5ee930164")},
... {"$addToSet" : {"emails" : "[email protected]"}})
> db.users.findOne({"_id" : ObjectId("4b2d75476cc613d5ee930164")})
{
    "_id" : ObjectId("4b2d75476cc613d5ee930164"),
    "username" : "joe",
    "emails" : [
        "[email protected]",
        "[email protected]",
        "[email protected]",
    ]
}
> db.users.update({"_id" : ObjectId("4b2d75476cc613d5ee930164")},
... {"$addToSet" : {"emails" : "[email protected]"}})
> db.users.findOne({"_id" : ObjectId("4b2d75476cc613d5ee930164")})
{
    "_id" : ObjectId("4b2d75476cc613d5ee930164"),
    "username" : "joe",
    "emails" : [
        "[email protected]",
        "[email protected]",
        "[email protected]",
        "[email protected]"
    ]
}

You can also use "$addToSet" in conjunction with "$each" to add multiple unique values, which cannot be done with the "$ne"/"$push" combination. For instance, we could use these modifiers if the user wanted to add more than one email address:

> db.users.update({"_id" : ObjectId("4b2d75476cc613d5ee930164")}, {"$addToSet" :
... {"emails" : {"$each" : ["[email protected]", "[email protected]", "[email protected]"]}}})
> db.users.findOne({"_id" : ObjectId("4b2d75476cc613d5ee930164")})
{
    "_id" : ObjectId("4b2d75476cc613d5ee930164"),
    "username" : "joe",
    "emails" : [
        "[email protected]",
        "[email protected]",
        "[email protected]",
        "[email protected]"
        "[email protected]"
        "[email protected]"
    ]
}

There are a few ways to remove elements from an array. If you want to treat the array like a queue or a stack, you can use "$pop", which can remove elements from either end. {$pop : {key : 1}} removes an element from the end of the array. {$pop : {key : -1}} removes it from the beginning.
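The two directions of "$pop" can be sketched in plain Python; `pop_element` is a hypothetical helper illustrating the semantics, not a real API:

```python
def pop_element(doc, key, direction):
    """Sketch of "$pop": 1 removes from the end, -1 from the beginning."""
    if direction == 1:
        doc[key].pop()       # stack-style: remove the last element
    elif direction == -1:
        doc[key].pop(0)      # queue-style: remove the first element
    return doc

queue = {"items": ["a", "b", "c"]}
pop_element(queue, "items", 1)    # drops "c"
pop_element(queue, "items", -1)   # drops "a"
assert queue["items"] == ["b"]
```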

Sometimes an element should be removed based on specific criteria, rather than its position in the array. "$pull" is used to remove elements of an array that match the given criteria. For example, suppose we have a list of things that need to be done but not in any specific order:

> db.lists.insert({"todo" : ["dishes", "laundry", "dry cleaning"]})

If we do the laundry first, we can remove it from the list with the following:

> db.lists.update({}, {"$pull" : {"todo" : "laundry"}})

Now if we do a find, we’ll see that there are only two elements remaining in the array:

> db.lists.find()
{
    "_id" : ObjectId("4b2d75476cc613d5ee930164"),
    "todo" : [
        "dishes",
        "dry cleaning"
    ]
}

Pulling removes all matching elements, not just a single match. If you have an array that looks like [1, 1, 2, 1] and pull 1, you’ll end up with a single-element array, [2].
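In effect, "$pull" behaves like a filter over the array, as this plain-Python sketch shows (`pull_value` is a hypothetical helper, not a real API):

```python
def pull_value(doc, key, value):
    """Sketch of "$pull": remove every element equal to value, not just one."""
    doc[key] = [element for element in doc[key] if element != value]
    return doc

todo = {"todo": ["dishes", "laundry", "dry cleaning"]}
pull_value(todo, "todo", "laundry")
assert todo["todo"] == ["dishes", "dry cleaning"]

nums = {"x": [1, 1, 2, 1]}
pull_value(nums, "x", 1)   # all three 1s are removed
assert nums["x"] == [2]
```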

Positional array modifications

Array manipulation becomes a little trickier when we have multiple values in an array and want to modify some of them. There are two ways to manipulate values in arrays: by position or by using the positional operator (the "$" character).

Arrays use 0-based indexing, and elements can be selected as though their index were a document key. For example, suppose we have a document containing an array with a few embedded documents, such as a blog post with comments:

> db.blog.posts.findOne()
{
    "_id" : ObjectId("4b329a216cc613d5ee930192"),
    "content" : "...",
    "comments" : [
        {
            "comment" : "good post",
            "author" : "John",
            "votes" : 0
        },
        {
            "comment" : "i thought it was too short",
            "author" : "Claire",
            "votes" : 3
        },
        {
            "comment" : "free watches",
            "author" : "Alice",
            "votes" : -1
        }
    ]
}

If we want to increment the number of votes for the first comment, we can say the following:

> db.blog.update({"post" : post_id},
... {"$inc" : {"comments.0.votes" : 1}})

In many cases, though, we don’t know what index of the array to modify without querying for the document first and examining it. To get around this, MongoDB has a positional operator, "$", that figures out which element of the array the query document matched and updates that element. For example, if we have a user named John who updates his name to Jim, we can replace it in the comments by using the positional operator:

> db.blog.update({"comments.author" : "John"},
... {"$set" : {"comments.$.author" : "Jim"}})

The positional operator updates only the first match. Thus, if John had left more than one comment, his name would be changed only for the first comment he left.
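What the positional operator does can be sketched in plain Python: find the index of the first array element the query matched, and modify only that element. `set_first_match` is a hypothetical helper for illustration, not a real API:

```python
def set_first_match(doc, array_key, match_field, match_value,
                    set_field, new_value):
    """Sketch of the "$" positional operator: change only the first match."""
    for element in doc[array_key]:
        if element.get(match_field) == match_value:
            element[set_field] = new_value   # only the first match changes
            break
    return doc

post = {"comments": [
    {"author": "John", "votes": 0},
    {"author": "Claire", "votes": 3},
    {"author": "John", "votes": 1},
]}
set_first_match(post, "comments", "author", "John", "author", "Jim")
assert post["comments"][0]["author"] == "Jim"
assert post["comments"][2]["author"] == "John"   # later matches untouched
```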

Modifier speed

Some modifiers are faster than others. $inc modifies a document in place: it does not have to change the size of a document, only the value of a key, so it is very efficient. On the other hand, array modifiers might change the size of a document and can be slow. ("$set" can modify documents in place if the size isn’t changing but otherwise is subject to the same performance limitations as array operators.)

MongoDB leaves some padding around a document to allow for changes in size (and, in fact, figures out how much documents usually change in size and adjusts the amount of padding it leaves accordingly), but it will eventually have to allocate new space for a document if you make it much larger than it was originally. Compounding this slowdown, as arrays get longer, it takes MongoDB a longer amount of time to traverse the whole array, slowing down each array modification.

A simple program in Python can demonstrate the speed difference. This program inserts a single key and increments its value 100,000 times.

from pymongo import Connection

import time

db = Connection().performance_test
db.drop_collection("updates")
collection = db.updates

collection.insert({"x": 1})

# make sure the insert is complete before we start timing
collection.find_one()

start = time.time()

for i in range(100000):
    collection.update({}, {"$inc" : {"x" : 1}})

# make sure the updates are complete before we stop timing
collection.find_one()

print time.time() - start

On a MacBook Air this took 7.33 seconds. That’s more than 13,000 updates per second (which is pretty good for a fairly anemic machine). Now, let’s try it with a document with a single array key, pushing new values onto that array 100,000 times:

for i in range(100000):
    collection.update({}, {'$push' : {'x' : 1}})

This program took 67.58 seconds to run, which is less than 1,500 updates per second.

Using "$push" and other array modifiers is encouraged and often necessary, but it is good to keep in mind the trade-offs of such updates. If "$push" becomes a bottleneck, it may be worth pulling an embedded array out into a separate collection.

Upserts

An upsert is a special type of update. If no document is found that matches the update criteria, a new document will be created by combining the criteria and update documents. If a matching document is found, it will be updated normally. Upserts can be very handy because they eliminate the need to “seed” your collection: you can have the same code create and update documents.

Let’s go back to our example recording the number of views for each page of a website. Without an upsert, we might try to find the URL and increment the number of views or create a new document if the URL doesn’t exist. If we were to write this out as a JavaScript program (instead of a series of shell commands—scripts can be run with mongo scriptname.js), it might look something like the following:

// check if we have an entry for this page
blog = db.analytics.findOne({url : "/blog"})

// if we do, add one to the number of views and save
if (blog) {
  blog.pageviews++;
  db.analytics.save(blog);
}
// otherwise, create a new document for this page
else {
  db.analytics.save({url : "/blog", pageviews : 1})
}

This means we are making a round-trip to the database, plus sending an update or insert, every time someone visits a page. If we are running this code in multiple processes, we are also subject to a race condition where more than one document can be inserted for a given URL.

We can eliminate the race condition and cut down on the amount of code by just sending an upsert (the third parameter to update specifies that this should be an upsert):

db.analytics.update({"url" : "/blog"}, {"$inc" : {"pageviews" : 1}}, true)

This line does exactly what the previous code block does, except it’s faster and atomic! The new document is created using the criteria document as a base and applying any modifier documents to it. For example, if you do an upsert that matches a key and has an increment to the value of that key, the increment will be applied to the match:

> db.math.remove()
> db.math.update({"count" : 25}, {"$inc" : {"count" : 3}}, true)
> db.math.findOne()
{
    "_id" : ObjectId("4b3295f26cc613d5ee93018f"),
    "count" : 28
}

The remove empties the collection, so there are no documents. The upsert creates a new document with a "count" of 25 and then increments that by 3, giving us a document where "count" is 28. If the upsert option were not specified, {"count" : 25} would not match any documents, so nothing would happen.

If we run the upsert again (with the criteria {count : 25}), it will create another new document. This is because the criteria does not match the only document in the collection. (Its "count" is 28.)
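The way an upsert builds its new document can be sketched in plain Python: the equality conditions from the criteria become the base document, and the modifiers are then applied to it. `upsert_base` is a hypothetical helper that handles only equality criteria and "$inc", for illustration:

```python
def upsert_base(criteria, update):
    """Sketch of how an upsert constructs the document it inserts."""
    # equality conditions from the criteria seed the new document
    doc = {k: v for k, v in criteria.items() if not k.startswith("$")}
    # then the modifiers are applied to that base
    for key, amount in update.get("$inc", {}).items():
        doc[key] = doc.get(key, 0) + amount
    return doc

# mirrors: db.math.update({"count" : 25}, {"$inc" : {"count" : 3}}, true)
assert upsert_base({"count": 25}, {"$inc": {"count": 3}}) == {"count": 28}
```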

The save Shell Helper

save is a shell function that lets you insert a document if it doesn’t exist and update it if it does. It takes one argument: a document. If the document contains an "_id" key, save will do an upsert. Otherwise, it will do an insert. This is just a convenience function so that programmers can quickly modify documents in the shell:

> var x = db.foo.findOne()
> x.num = 42
42
> db.foo.save(x)

Without save, the last line would have been a more cumbersome db.foo.update({"_id" : x._id}, x).
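The decision logic inside save is small enough to sketch in plain Python, here against a minimal in-memory stand-in for a collection (FakeCollection is purely illustrative):

```python
class FakeCollection:
    """Tiny in-memory stand-in for a collection, for illustration only."""
    def __init__(self):
        self.docs = {}
        self.next_id = 1

    def insert(self, doc):
        doc = dict(doc)
        doc.setdefault("_id", self.next_id)
        self.next_id += 1
        self.docs[doc["_id"]] = doc

    def update(self, criteria, doc, upsert=False):
        _id = criteria["_id"]
        if _id in self.docs or upsert:
            self.docs[_id] = dict(doc, _id=_id)

def save(collection, doc):
    """Upsert by _id if the document has one; otherwise insert."""
    if "_id" in doc:
        collection.update({"_id": doc["_id"]}, doc, upsert=True)
    else:
        collection.insert(doc)

coll = FakeCollection()
save(coll, {"num": 1})                # no _id: plain insert
save(coll, {"_id": 1, "num": 42})     # _id present: upsert replaces it
assert coll.docs[1]["num"] == 42
```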

Updating Multiple Documents

Updates, by default, update only the first document found that matches the criteria. If there are more matching documents, they will remain unchanged. To modify all of the documents matching the criteria, you can pass true as the fourth parameter to update.

Tip

update’s behavior may be changed in the future (the server may update all matching documents by default and update one only if false is passed as the fourth parameter), so it is recommended that you always specify whether you want a multiple update.

Not only is it more obvious what the update should be doing, but your program won’t break if the default is ever changed.

Multiupdates are a great way of performing schema migrations or rolling out new features to certain users. Suppose, for example, we want to give a gift to every user who has a birthday on a certain day. We can use multiupdate to add a "gift" to their account:

> db.users.update({birthday : "10/13/1978"},
... {$set : {gift : "Happy Birthday!"}}, false, true)

This would add the "gift" key to all user documents with birthdays on October 13, 1978.
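The single-versus-multiple distinction can be sketched in plain Python: by default only the first matching document changes; with the multi flag, all of them do. `update_docs` is a hypothetical helper, not pymongo's API:

```python
def update_docs(docs, matches, set_fields, multi=False):
    """Sketch of update semantics: first match only, unless multi is True."""
    updated = 0
    for doc in docs:
        if all(doc.get(k) == v for k, v in matches.items()):
            doc.update(set_fields)
            updated += 1
            if not multi:
                break      # default behavior: stop after the first match
    return updated         # the number of documents changed

users = [{"birthday": "10/13/1978"} for _ in range(3)]
n = update_docs(users, {"birthday": "10/13/1978"},
                {"gift": "Happy Birthday!"}, multi=True)
assert n == 3
assert all("gift" in u for u in users)
```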

To see the number of documents updated by a multiple update, you can run the getLastError database command (which might be better named "getLastOpStatus"). The "n" key will contain the number of documents affected by the update:

> db.count.update({x : 1}, {$inc : {x : 1}}, false, true)
> db.runCommand({getLastError : 1})
{
    "err" : null,
    "updatedExisting" : true,
    "n" : 5,
    "ok" : true
}

"n" is 5, meaning that five documents were affected by the update. "updatedExisting" is true, meaning that the update modified existing document(s). For more on database commands and their responses, see Chapter 7.

Returning Updated Documents

You can get some limited information about what was updated by calling getLastError, but it does not actually return the updated document. For that, you’ll need the findAndModify command.

findAndModify is called differently than a normal update and is a bit slower, because it must wait for a database response. It is handy for manipulating queues and performing other operations that need get-and-set style atomicity.

Suppose we have a collection of processes run in a certain order. Each is represented with a document that has the following form:

{
    "_id" : ObjectId(),
    "status" : state,
    "priority" : N
}

"status" is a string that can be “READY,” “RUNNING,” or “DONE.” We need to find the job with the highest priority in the “READY” state, run the process function, and then update the status to “DONE.” We might try querying for the ready processes, sorting by priority, and updating the status of the highest-priority process to mark it as “RUNNING.” Once we have processed it, we update the status to “DONE.” This looks something like the following:

ps = db.processes.find({"status" : "READY"}).sort({"priority" : -1}).limit(1).next()
db.processes.update({"_id" : ps._id}, {"$set" : {"status" : "RUNNING"}})
do_something(ps);
db.processes.update({"_id" : ps._id}, {"$set" : {"status" : "DONE"}})

This algorithm isn’t great, because it is subject to a race condition. Suppose we have two threads running. If one thread (call it A) retrieved the document and another thread (call it B) retrieved the same document before A had updated its status to “RUNNING,” then both threads would be running the same process. We can avoid this by checking the status as part of the update query, but this becomes complex:

var cursor = db.processes.find({"status" : "READY"}).sort({"priority" : -1}).limit(1);
while ((ps = cursor.next()) != null) {
    db.processes.update({"_id" : ps._id, "status" : "READY"},
                        {"$set" : {"status" : "RUNNING"}});

    var lastOp = db.runCommand({getLastError : 1});
    if (lastOp.n == 1) {
        do_something(ps);
        db.processes.update({"_id" : ps._id}, {"$set" : {"status" : "DONE"}})
        break;
    }
    cursor = db.processes.find({"status" : "READY"}).sort({"priority" : -1}).limit(1);
}
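The reason the status check in the update criteria works can be seen in a minimal in-memory sketch. Here `claim` is a hypothetical stand-in for `update({"_id" : ..., "status" : "READY"}, {"$set" : {"status" : "RUNNING"}})`: because matching and modifying happen as one step, only one caller can move a given document from “READY” to “RUNNING.”

```javascript
// Sketch of the compare-and-set idea: the status check in the update
// criteria means only one caller can claim a READY document.
var processes = [{_id: 1, status: "READY", priority: 1}];

// Hypothetical stand-in for
// update({_id: id, status: "READY"}, {$set: {status: "RUNNING"}})
function claim(docs, id) {
    var n = 0;
    docs.forEach(function (doc) {
        if (doc._id === id && doc.status === "READY") {
            doc.status = "RUNNING";
            n++;
        }
    });
    return {n: n};  // like the "n" field of getLastError
}

var resultA = claim(processes, 1);  // thread A claims the job: n is 1
var resultB = claim(processes, 1);  // thread B finds it taken: n is 0
```

Thread B sees n of 0 and knows it must go back and look for another ready process, which is exactly the retry loop in the shell code above.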

Also, depending on timing, one thread may end up doing all the work while another thread is uselessly trailing it. Thread A could always grab the process, and then B would try to get the same process, fail, and leave A to do all the work. Situations like this are perfect for findAndModify. findAndModify can return the item and update it in a single operation. In this case, it looks like the following:

> ps = db.runCommand({"findAndModify" : "processes",
... "query" : {"status" : "READY"},
... "sort" : {"priority" : -1},
... "update" : {"$set" : {"status" : "RUNNING"}}})
{
    "ok" : 1,
    "value" : {
        "_id" : ObjectId("4b3e7a18005cab32be6291f7"),
        "priority" : 1,
        "status" : "READY"
    }
}

Notice that the status is still “READY” in the returned document. The document is returned before the modifier document is applied. If you do a find on the collection, though, you will see that the document’s "status" has been updated to “RUNNING”:

> db.processes.findOne({"_id" : ps.value._id})
{
    "_id" : ObjectId("4b3e7a18005cab32be6291f7"),
    "priority" : 1,
    "status" : "RUNNING"
}

Thus, the program becomes the following:

> ps = db.runCommand({"findAndModify" : "processes",
... "query" : {"status" : "READY"},
... "sort" : {"priority" : -1},
... "update" : {"$set" : {"status" : "RUNNING"}}}).value
> do_something(ps)
> db.processes.update({"_id" : ps._id}, {"$set" : {"status" : "DONE"}})

findAndModify can have either an "update" key or a "remove" key. A "remove" key indicates that the matching document should be removed from the collection. For instance, if we wanted to simply remove the job instead of updating its status, we could run the following:

> ps = db.runCommand({"findAndModify" : "processes",
... "query" : {"status" : "READY"},
... "sort" : {"priority" : -1},
... "remove" : true}).value
> do_something(ps)

The values for each key in the findAndModify command are as follows:

findAndModify

A string, the collection name.

query

A query document, the criteria with which to search for documents.

sort

Criteria by which to sort results.

update

A modifier document, the update to perform on the document found.

remove

Boolean specifying whether the document should be removed.

new

Boolean specifying whether the document returned should be the updated document or the preupdate document. Defaults to the preupdate document.

Either "update" or "remove" must be included, but not both. If no matching document is found, the command will return an error.

findAndModify has a few limitations. First, it can update or remove only one document at a time. There is also no way to use it for an upsert; it can update only existing documents.

The price of using findAndModify over a traditional update is speed: it is a bit slower. That said, it is no slower than one might expect: it takes roughly the same amount of time as a find, update, and getLastError command performed in serial.
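What findAndModify buys us can be sketched with an in-memory model: find the highest-priority “READY” document, flip it to “RUNNING,” and hand back the pre-update document, all as one step. This is a toy illustration of the semantics, not the server's implementation; the `findAndModify` function and its comparator argument are stand-ins invented for the sketch.

```javascript
// In-memory sketch of findAndModify semantics: sort, update, and return
// the pre-update document as a single operation.
var processes = [
    {_id: 1, status: "READY", priority: 1},
    {_id: 2, status: "READY", priority: 5}
];

// Hypothetical stand-in for the findAndModify command
function findAndModify(docs, query, cmp, setFields) {
    var candidates = docs.filter(function (d) {
        return d.status === query.status;
    }).sort(cmp);
    if (candidates.length === 0) {
        return {ok: 0};  // no match: the real command returns an error
    }
    var doc = candidates[0];
    var preUpdate = JSON.parse(JSON.stringify(doc));  // copy for the caller
    Object.keys(setFields).forEach(function (k) {
        doc[k] = setFields[k];
    });
    return {ok: 1, value: preUpdate};
}

var ps = findAndModify(processes,
                       {status: "READY"},
                       function (a, b) { return b.priority - a.priority; },
                       {status: "RUNNING"});
// ps.value is the pre-update document (status still "READY"),
// but the stored document is now "RUNNING"
```

As in the shell examples above, the returned value still says “READY” while the stored document already says “RUNNING,” which is why no second caller can claim the same job in between.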

The Fastest Write This Side of the Mississippi

The three operations that this chapter focused on (inserts, removes, and updates) seem instantaneous because none of them waits for a database response. They are not asynchronous; they can be thought of as “fire-and-forget” functions: the client sends the documents to the server and immediately continues. The client never receives an “OK, got that” or a “not OK, could you send that again?” response.

The benefit to this is that the speed at which you can perform these operations is terrific. You are often only limited by the speed at which your client can send them and the speed of your network. This works well most of the time; however, sometimes something goes wrong: a server crashes, a rat chews through a network cable, or a data center is in a flood zone. If the server disappears, the client will happily send some writes to a server that isn’t there, entirely unaware of its absence. For some applications, this is acceptable. Losing a couple of seconds of log messages, user clicks, or analytics in a hardware failure is not the end of the world. For others, this is not the behavior the programmer wants (payment-processing systems spring to mind).

Safe Operations

Suppose you are writing an ecommerce application. If someone orders something, the application should probably take a little extra time to make sure the order goes through. That is why you can do a “safe” version of these operations, where you check whether there was an error in execution and attempt to redo them.

Note

MongoDB developers made unchecked operations the default because of their experience with relational databases. Many applications written on top of relational databases do not care about or check the return codes, yet they incur the performance penalty of their application waiting for them to arrive. MongoDB pushes this option to the user. This way, programs that collect log messages or real-time analytics don’t have to wait for return codes that they don’t care about.

The safe version of these operations runs a getLastError command immediately following the operation to check whether it succeeded (see Database Commands for more on commands). The driver waits for the database response and then handles errors appropriately, throwing a catchable exception in most cases. This way, developers can catch and handle database errors in whatever way feels “natural” for their language. When an operation is successful, the getLastError response also contains some additional information (e.g., for an update or remove, it includes the number of documents affected).

Note

The same getLastError command that powers safe mode also contains functionality for checking that operations have been successfully replicated. For more on this feature, see Blocking for Replication.

The price of performing “safe” operations is performance: waiting for a database response takes an order of magnitude longer than sending the message, ignoring the client-side cost of handling exceptions. (This cost varies by language but is usually fairly heavyweight.) Thus, applications should weigh the importance of their data (and the consequences if some of it is lost) versus the speed needed.

Tip

When in doubt, use safe operations. If they aren’t fast enough, start making less important operations fire-and-forget.

More specifically:

  • If you live dangerously, use fire-and-forget operations exclusively.

  • If you want to live longer, save valuable user input (account sign-ups, credit card numbers, emails) with safe operations and do everything else with fire-and-forget operations.

  • If you are cautious, use safe operations exclusively. If your application is automatically generating hundreds of little pieces of information to save (e.g., page, user, or advertising statistics), these can still use the fire-and-forget operation.

Catching “Normal” Errors

Safe operations are also a good way to debug “strange” database behavior, not just for preventing the apocalyptic scenarios described earlier. Safe operations should be used extensively while developing, even if they are later removed before going into production. They can protect against many common database usage errors, most commonly duplicate key errors.

Duplicate key errors often occur when users try to insert a document with a duplicate value for the "_id" key. MongoDB does not allow multiple documents with the same "_id" in the same collection. If you do a safe insert and a duplicate key error occurs, the server error will be picked up by the safety check, and an exception will be thrown. In unsafe mode, there is no database response, and you might not be aware that the insert failed.

For example, using the shell, you can see that inserting two documents with the same "_id" will not work:

> db.foo.insert({"_id" : 123, "x" : 1})
> db.foo.insert({"_id" : 123, "x" : 2})
E11000 duplicate key error index: test.foo.$_id_  dup key: { : 123.0 }

If we examine the collection, we can see that only the first document was successfully inserted. Note that this error can occur with any unique index, not just the one on "_id". The shell always checks for errors; in the drivers it is optional.
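The difference between the two modes can be sketched in a few lines of plain JavaScript. This is an in-memory model, not the driver: `insert` mimics a fire-and-forget write whose error nobody reads, and `safeInsert` mimics safe mode by checking the error afterward, like running getLastError, and throwing a catchable exception.

```javascript
// Sketch of why safe inserts matter: a store that enforces a unique _id,
// as MongoDB does, plus a "safe" wrapper that surfaces the failure.
var docs = {};  // _id -> document

function insert(doc) {
    if (docs.hasOwnProperty(doc._id)) {
        // Fire-and-forget callers never look at this error
        return {err: "E11000 duplicate key error"};
    }
    docs[doc._id] = doc;
    return {err: null};
}

function safeInsert(doc) {
    var lastError = insert(doc);  // like running getLastError afterward
    if (lastError.err !== null) {
        throw new Error(lastError.err);
    }
}

safeInsert({_id: 123, x: 1});          // succeeds
var caught = null;
try {
    safeInsert({_id: 123, x: 2});      // duplicate _id: throws
} catch (e) {
    caught = e.message;
}
// caught now holds the duplicate key error; the first document is untouched
```

With the unsafe `insert` alone, the second write would fail silently; the safe wrapper is what turns the server-side error into an exception the application can handle.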

Requests and Connections

For each connection to a MongoDB server, the database creates a queue for that connection’s requests. When the client sends a request, it will be placed at the end of its connection’s queue. Any subsequent requests on the connection will occur after the enqueued operation is processed. Thus, a single connection has a consistent view of the database and can always read its own writes.

Note that this is a per-connection queue: if we open two shells, we will have two connections to the database. If we perform an insert in one shell, a subsequent query in the other shell might not return the inserted document. However, within a single shell, if we query for the document after inserting, the document will be returned. This behavior can be difficult to duplicate by hand, but on a busy server, interleaved inserts/queries are very likely to occur. Often developers run into this when they insert data in one thread and then check that it was successfully inserted in another. For a second or two, it looks like the data was not inserted, and then it suddenly appears.
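A toy model makes the per-connection ordering concrete. This is not how the server is implemented; it just shows one legal interleaving: each connection's requests run in order, but the server is free to interleave requests from different connections.

```javascript
// Toy model of per-connection request queues: requests on one connection
// run in order, but the server may interleave across connections.
var data = [];
var queues = {A: [], B: []};

function send(conn, op) { queues[conn].push(op); }

send("A", function () { data.push({x: 1}); });   // connection A: insert
send("B", function () { return data.length; });  // connection B: query

// One possible interleaving: the server runs B's query first...
var seenByB = queues.B.shift()();  // B sees 0 documents
queues.A.shift()();                // ...then processes A's insert

send("A", function () { return data.length; });
var seenByA = queues.A.shift()();  // A's own read runs after its insert: 1
```

Connection B's query was processed before connection A's insert, so B missed the document, while A, whose read was queued behind its own write, is guaranteed to see it.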

This behavior is especially worth keeping in mind when using the Ruby, Python, and Java drivers, because all three drivers use connection pooling. For efficiency, these drivers open multiple connections (a pool) to the server and distribute requests across them. They all, however, have mechanisms to guarantee that a series of requests is processed by a single connection. There is detailed documentation on connection pooling in various languages on the MongoDB wiki.
