Chapter 8. Application Design

This chapter covers designing applications to work effectively with MongoDB. It discusses:

  • Trade-offs when deciding whether to embed data or to reference it

  • Tips for optimizations

  • Consistency considerations

  • How to migrate schemas

  • When MongoDB isn’t a good choice of data store

Normalization versus Denormalization

There are many ways of representing data and one of the most important issues is how much you should normalize your data. Normalization is dividing up data into multiple collections with references between collections. Each piece of data lives in one collection although multiple documents may reference it. Thus, to change the data, only one document must be updated. However, MongoDB has no joining facilities, so gathering documents from multiple collections will require multiple queries.

Denormalization is the opposite of normalization: embedding all of the data in a single document. Instead of documents containing references to one definitive copy of the data, many documents may have copies of the data. This means that multiple documents need to be updated if the information changes but that all related data can be fetched with a single query.

Deciding when to normalize and when to denormalize can be difficult: typically, normalizing makes writes faster and denormalizing makes reads faster. Thus, you need to find what trade-offs make sense for your application.

Examples of Data Representations

Suppose we are storing information about students and the classes that they are taking. One way to represent this would be to have a students collection (each student is one document) and a classes collection (each class is one document). Then we could have a third collection (studentClasses) that contains references to the student and classes he is taking:

> db.studentClasses.findOne({"studentId" : id})
{
    "_id" : ObjectId("512512c1d86041c7dca81915"),
    "studentId" : ObjectId("512512a5d86041c7dca81914"),
    "classes" : [
        ObjectId("512512ced86041c7dca81916"),
        ObjectId("512512dcd86041c7dca81917"),
        ObjectId("512512e6d86041c7dca81918"),
        ObjectId("512512f0d86041c7dca81919")
    ]
}

If you are familiar with relational databases, you may have seen this type of join table before, although typically you’d have one student and one class per document (instead of a list of class "_id"s). It’s a bit more MongoDB-ish to put the classes in an array, but you usually wouldn’t want to store the data this way because it requires a lot of querying to get to the actual information.

Suppose we wanted to find the classes a student was taking. We’d query for the student in the students collection, query studentClasses for the course "_id"s, and then query the classes collection for the class information. Thus, finding this information would take three trips to the server. This is generally not the way you want to structure data in MongoDB, unless the classes and students are changing constantly and reading the data does not need to be done quickly.
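
For example, fetching one student's class information with this schema might look something like the following sketch (assuming the students collection has a "name" field, as in the documents shown later in this section):

> student = db.students.findOne({"name" : "John Doe"})
> studentClasses = db.studentClasses.findOne({"studentId" : student["_id"]})
> classes = db.classes.find(
...     {"_id" : {"$in" : studentClasses["classes"]}}).toArray()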

We can remove one of the dereferencing queries by embedding class references in the student’s document:

{
    "_id" : ObjectId("512512a5d86041c7dca81914"),
    "name" : "John Doe",
    "classes" : [
        ObjectId("512512ced86041c7dca81916"),
        ObjectId("512512dcd86041c7dca81917"),
        ObjectId("512512e6d86041c7dca81918"),
        ObjectId("512512f0d86041c7dca81919")
    ]
}

The "classes" field keeps an array of "_id"s of classes that John Doe is taking. When we want to find out information about those classes, we can query the classes collection with those "_id"s. This only takes two queries. This is fairly popular way to structure data that does not need to be instantly accessible and changes, but not constantly.

If we need to optimize reads further, we can get all of the information in a single query by fully denormalizing the data and storing each class as an embedded document in the "classes" field:

{
    "_id" : ObjectId("512512a5d86041c7dca81914"),
    "name" : "John Doe",
    "classes" : [
        {
            "class" : "Trigonometry",
            "credits" : 3,
            "room" : "204"
        },
        {
            "class" : "Physics",
            "credits" : 3,
            "room" : "159"
        },
        {
            "class" : "Women in Literature",
            "credits" : 3,
            "room" : "14b"
        },
        {
            "class" : "AP European History",
            "credits" : 4,
            "room" : "321"
        }
    ]
}

The upside of this is that it only takes one query to get the information. The downsides are that it takes up more space and is more difficult to keep in sync. For example, if it turns out that physics was supposed to be four credits (not three) every student in the physics class would need to have her document updated (instead of just updating a central “Physics” document).
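
A sketch of what that sync-up might look like as a multiupdate (assuming each student lists "Physics" at most once, since the positional $ operator updates only the first matching array element in each document):

> db.students.update({"classes.class" : "Physics"},
... {"$set" : {"classes.$.credits" : 4}},
... false, true)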

Finally, you can use a hybrid of embedding and referencing: create an array of subdocuments with the frequently used information, but with a reference to the actual document for more information:

{
    "_id" : ObjectId("512512a5d86041c7dca81914"),
    "name" : "John Doe",
    "classes" : [
        {
            "_id" : ObjectId("512512ced86041c7dca81916"),
            "class" : "Trigonometry"
        },
        {
            "_id" : ObjectId("512512dcd86041c7dca81917"),
            "class" : "Physics"
        },
        {
            "_id" : ObjectId("512512e6d86041c7dca81918"),
            "class" : "Women in Literature"
        },
        {
            "_id" : ObjectId("512512f0d86041c7dca81919"),
            "class" : "AP European History"
        }
    ]
}

This approach is also a nice option because the amount of information embedded can change over time as your requirements change: if you want to include more or less information on a page, you could embed more or less of it in the document.

Another important consideration is how often this information will change versus how often it’s read. If it will be updated regularly, then normalizing it is a good idea. However, if it changes infrequently, then there is little benefit to optimizing the update process at the expense of every read your application performs.

For example, a textbook normalization use case is to store a user and his address in separate collections. However, people almost never change their address, so you generally shouldn’t penalize every read on the off chance that someone’s moved. Your application should embed the address in the user document.

If you decide to use embedded documents and you need to update them, you should set up a cron job to ensure that any updates you do are successfully propagated to every document. For example, you might attempt to do a multiupdate but the server crashes before all of the documents have been updated. You need a way to detect this and retry the update.
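
For instance, a periodic job might look for documents that the earlier credits multiupdate missed and rerun it until none remain (a sketch, not a complete job):

> // rerun the multiupdate above until this count reaches 0
> db.students.count({"classes" : {"$elemMatch" : {"class" : "Physics",
...                                               "credits" : 3}}})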

To some extent, the more information you are generating the less of it you should embed. If the embedded fields or number of embedded fields is supposed to grow without bound then they should generally be referenced, not embedded. Things like comment trees or activity lists should be stored as their own documents, not embedded.

Finally, the fields you include should be integral to the data in the document. If a field is almost always excluded from your results when you query for this document, it’s a good sign that it may belong in another collection. These guidelines are summarized in Table 8-1.

Table 8-1. Comparison of embedding versus references
Embedding is better for...                                        References are better for...
Small subdocuments                                                Large subdocuments
Data that does not change regularly                               Volatile data
When eventual consistency is acceptable                           When immediate consistency is necessary
Documents that grow by a small amount                             Documents that grow a large amount
Data that you’ll often need to perform a second query to fetch    Data that you’ll often exclude from the results
Fast reads                                                        Fast writes

Suppose we had a users collection. Here are some example fields we might have and whether or not they should be embedded:

Account preferences

They are only relevant to this user document, and will probably be exposed with other user information in this document. Account preferences should generally be embedded.

Recent activity

This one depends on how much recent activity grows and changes. If it is a fixed-size field (last 10 things), it might be useful to embed.

Friends

Generally this should not be embedded, or at least not fully. See “Friends, Followers, and Other Inconveniences” below for advice on social networking data.

All of the content this user has produced

No.

Cardinality

Cardinality is how many references a collection has to another collection. Common relationships are one-to-one, one-to-many, or many-to-many. For example, suppose we had a blog application. Each post has a title, so that’s a one-to-one relationship. Each author has many posts, so that’s a one-to-many relationship. And posts have many tags and tags refer to many posts, so that’s a many-to-many relationship.

When using MongoDB, it can be conceptually useful to split “many” into subcategories: “many” and “few.” For example, you might have a one-to-few cardinality between authors and posts: each author only writes a few posts. You might have a many-to-few relationship between blog posts and tags: you probably have many more blog posts than you have tags. However, you’d have a one-to-many relationship between blog posts and comments: each post has many comments.

Determining which relationships are “few” and which are “many” can help you decide what to embed versus what to reference. Generally, “few” relationships will work better with embedding, and “many” relationships will work better as references.

Friends, Followers, and Other Inconveniences

Keep your friends close and your enemies embedded.

Many social applications need to link people, content, followers, friends, and so on. Figuring out how to balance embedding and referencing this highly connected information can be tricky. This section covers considerations for social graph data. Generally, though, following, friending, or favoriting can be simplified to a publish-subscribe system: one user is subscribing to notifications from another. Thus, there are two basic operations that need to be efficient: storing subscribers and notifying all interested parties of an event.

There are three ways people typically implement subscribing. The first option is that you can put the producer in the subscriber’s document, which looks something like this:

{
    "_id" : ObjectId("51250a5cd86041c7dca8190f"),
    "username" : "batman",
    "email" : "[email protected]"
    "following" : [
        ObjectId("51250a72d86041c7dca81910"), 
        ObjectId("51250a7ed86041c7dca81936")
    ]
}

Now, given a user’s document, you can query for something like db.activities.find({"user" : {"$in" : user["following"]}}) to find all of the activities that have been published that she’d be interested in. However, if you need to find everyone who is interested in a newly published activity, you’d have to query the "following" field across all users.

Alternatively, you could append the followers to the producer’s document, like so:

{
    "_id" : ObjectId("51250a7ed86041c7dca81936"),
    "username" : "joker",
    "email" : "[email protected]"
    "followers" : [
        ObjectId("512510e8d86041c7dca81912"),
        ObjectId("51250a5cd86041c7dca8190f"),
        ObjectId("512510ffd86041c7dca81910")
    ]
}

Whenever this user does something, all the users we need to notify are right there. The downside is that now you need to query the whole users collection to find everyone a user follows (the opposite limitation as above).
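
For example, finding everyone that a given user follows means scanning every producer’s "followers" array (userId here is a stand-in for the follower’s "_id"):

> db.users.find({"followers" : userId}, {"username" : 1})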

Either of these options comes with an additional downside: they make your user document larger and more volatile. The "following" (or "followers") field often won’t even need to be returned: how often do you want to list every follower? If users are frequently followed or unfollowed, this can result in a lot of fragmentation, as well. Thus, the final option neutralizes these downsides by normalizing even further and storing subscriptions in another collection. Normalizing this far is often overkill, but it can be useful for an extremely volatile field that often isn’t returned with the rest of the document. "followers" may be a sensible field to normalize this way.

Keep a collection that matches publishers to subscribers, with documents that look something like this:

{
    "_id" : ObjectId("51250a7ed86041c7dca81936"), // followee's "_id"
    "followers" : [
        ObjectId("512510e8d86041c7dca81912"),
        ObjectId("51250a5cd86041c7dca8190f"),
        ObjectId("512510ffd86041c7dca81910")
    ]
}

This keeps your user documents svelte but takes an extra query to get the followers. As "followers" arrays will generally change size a lot, this allows you to enable the usePowerOf2Sizes option on this collection while keeping the users collection as small as possible. If you put this followers collection in another database, you can also compact it without affecting the users collection too much.
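
For example, you might enable the option with collMod (assuming the collection is named followers):

> db.runCommand({"collMod" : "followers", "usePowerOf2Sizes" : true})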

Dealing with the Wil Wheaton effect

Regardless of which strategy you use, embedding only works with a limited number of subdocuments or references. If you have celebrity users, they may overflow any document that you’re storing followers in. The typical way of compensating for this is to have a “continuation” document, if necessary. For example, you might have:

> db.users.find({"username" : "wil"})
{
    "_id" : ObjectId("51252871d86041c7dca8191a"),
    "username" : "wil",
    "email" : "[email protected]",
    "tbc" : [
        ObjectId("512528ced86041c7dca8191e"),
        ObjectId("5126510dd86041c7dca81924")
    ],
    "followers" : [
        ObjectId("512528a0d86041c7dca8191b"),
        ObjectId("512528a2d86041c7dca8191c"),
        ObjectId("512528a3d86041c7dca8191d"),
        ...
    ]
}
{
    "_id" : ObjectId("512528ced86041c7dca8191e"),
    "followers" : [
        ObjectId("512528f1d86041c7dca8191f"),
        ObjectId("512528f6d86041c7dca81920"),
        ObjectId("512528f8d86041c7dca81921"),
        ...
    ]
}
{
    "_id" : ObjectId("5126510dd86041c7dca81924"),
    "followers" : [
        ObjectId("512673e1d86041c7dca81925"),
        ObjectId("512650efd86041c7dca81922"),
        ObjectId("512650fdd86041c7dca81923"),
        ...
    ]
}

Then add application logic to support fetching the documents in the “to be continued” ("tbc") array.
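
That logic might look something like the following sketch (assuming the continuation documents live in the users collection, as in the example above):

var user = db.users.findOne({"username" : "wil"})
var followers = user.followers

// append the followers from each "to be continued" document
user.tbc.forEach(function(contId) {
    var cont = db.users.findOne({"_id" : contId})
    followers = followers.concat(cont.followers)
})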

Optimizations for Data Manipulation

To optimize your application, you must first know what its bottleneck is by evaluating its read and write performance. Optimizing reads generally involves having the correct indexes and returning as much of the information as possible in a single document. Optimizing writes usually involves minimizing the number of indexes you have and making updates as efficient as possible.

There is often a trade-off between schemas that are optimized for writing quickly and those that are optimized for reading quickly, so you may have to decide which is more important for your application. Factor in not only the importance of reads versus writes, but also their proportions: if writes are more important but you’re doing a thousand reads to every write, you may still want to optimize reads first.

Optimizing for Document Growth

If you’re going to need to update data, determine whether or not your documents are going to grow and by how much. If it is by a predictable amount, manually padding your documents will prevent moves, making writes faster. Check your padding factor: if it is about 1.2 or greater, consider using manual padding.

When you manually pad a document, you create the document with a large field that will later be removed. This preallocates the space that the document will eventually need. For example, suppose you had a collection of restaurant reviews and your documents looked like this:

{
    "_id" : ObjectId(),
    "restaurant" : "Le Cirque",
    "review" : "Hamburgers were overpriced."
    "userId" : ObjectId(),
    "tags" : []
}

The "tags" field will grow as users add tags, so the application will often have to perform an update like this:

> db.reviews.update({"_id" : id}, 
... {"$push" : {"tags" : {"$each" : ["French", "fine dining", "hamburgers"]}}}})

If "tags" generally doesn’t grow to more than 100 bytes, you could manually pad the document to prevent any unwanted moves. If you leave the document without padding, moves will definitely occur as "tags" grows. To pad, add a final field to the document with whatever field name you’d like:

{
    "_id" : ObjectId(),
    "restaurant" : "Le Cirque",
    "review" : "Hamburgers were overpriced."
    "userId" : ObjectId(),
    "tags" : [],
    "garbage" : "........................................................"+
        "................................................................"+
        "................................................................"
}

You can either do this on insert or, if the document is created with an upsert, use "$setOnInsert" to create the field when the document is first inserted.
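
A sketch of the upsert form (the ~200 bytes of padding and MongoDB 2.4 or newer for "$setOnInsert" are assumptions):

> db.reviews.update({"_id" : id},
... {"$setOnInsert" : {"garbage" : Array(200).join(".")},
...  "$push" : {"tags" : {"$each" : ["French", "fine dining", "hamburgers"]}}},
... true)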

When you update the document, always "$unset" the "garbage" field:

> db.reviews.update({"_id" : id}, 
... {"$push" : {"tags" : {"$each" : ["French", "fine dining", "hamburgers"]}}},
...  "$unset" : {"garbage" : true}})

The "$unset" will remove the "garbage" field if it exists and be a no-op if it does not.

If your document has one field that grows, try to keep it as the last field in the document (but before "garbage"). It is slightly more efficient for MongoDB not to have to rewrite fields after "tags" if it grows.

Removing Old Data

Some data is only important for a brief time: after a few weeks or months it is just wasting storage space. There are three popular options for removing old data: capped collections, TTL collections, and dropping collections per time period.

The easiest option is to use a capped collection: set it to a large size and let old data “fall off” the end. However, capped collections pose certain limitations on the operations you can do and are vulnerable to spikes in traffic, temporarily lowering the length of time that they can hold data. See Capped Collections for more information.

The second option is TTL collections: this gives you finer-grained control over when documents are removed. However, it may not be fast enough for very high-write-volume collections: it removes documents by traversing the TTL index the same way a user-requested remove would. If TTL collections can keep up, though, they are probably the easiest solution. See Time-To-Live Indexes for more information about TTL indexes.
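
For example, to expire documents 24 hours after their "lastActivity" date (hypothetical collection and field names; the indexed field must hold a BSON date):

> db.sessions.ensureIndex({"lastActivity" : 1}, {"expireAfterSeconds" : 60*60*24})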

The final option is to use multiple collections: for example, one collection per month. Every time the month changes, your application starts using this month’s (empty) collection and searching for data in both the current and previous months’ collections. Once a collection is older than, say, six months, you can drop it. This can keep up with nearly any volume of traffic, but it is more complex to build an application around, since it has to use dynamic collection (or database) names and possibly query multiple databases.
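
A sketch of what the dynamic naming might look like (the "events.<year>-<month>" convention and the query are assumptions, not anything MongoDB requires):

// returns the collection for the month containing the given date
function monthlyCollection(date) {
    return db.getCollection("events." + date.getFullYear() + "-" +
                            (date.getMonth() + 1))
}

var now = new Date()
var lastMonth = new Date(now.getFullYear(), now.getMonth() - 1, 1)

// write to the current month's collection...
monthlyCollection(now).insert({"type" : "click", "at" : now})

// ...but search both the current and the previous month's collections
var query = {"type" : "click"}
var recent = monthlyCollection(now).find(query).toArray().concat(
    monthlyCollection(lastMonth).find(query).toArray())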

Planning Out Databases and Collections

Once you have sketched out what your documents look like, you must decide what collections or databases to put them in. This is often a fairly intuitive process, but there are some guidelines to keep in mind.

In general, documents with a similar schema should be kept in the same collection. MongoDB generally disallows combining data from multiple collections, so if there are documents that need to be queried or aggregated together, those are good candidates for putting in one big collection. For example, you might have documents that are fairly different “shapes,” but if you’re going to be aggregating them, they all need to live in the same collection.

For databases, the big issues to consider are locking (you get a read/write lock per database) and storage. Each database resides in its own files and often its own directory on disk, which means that you could mount different databases to different volumes. Thus, you may want all items within a database to be of similar “quality,” similar access pattern, or similar traffic levels.

For example, suppose we have an application with several components: a logging component that creates a huge amount of not-very-valuable data, a user collection, and a couple of collections for user-generated data. The user collections are high-value: it is important that user data is safe. There is also a high-traffic collection for social activities, which is of lower importance but not quite as unimportant as the logs. This collection is mainly used for user notifications, so it is almost an append-only collection.

Splitting these up by importance, we might end up with three databases: logs, activities, and users. The nice thing about this strategy is that you may find that your highest-value data is also your smallest (for instance, users probably don’t generate as much data as your logging does). You might not be able to afford an SSD for your entire data set, but you might be able to get one for your users. Or use RAID10 for users and RAID0 for logs and activities.

Be aware that there are some limitations when using multiple databases: MongoDB generally does not allow you to move data directly from one database to another. For example, you cannot store the results of a MapReduce in a different database than the one you ran the MapReduce on, and you cannot move a collection from one database to another with the renameCollection command (e.g., you can rename foo.bar to foo.baz, but not to foo2.baz).

Managing Consistency

You must figure out how consistent your application’s reads need to be. MongoDB supports a huge variety of consistency levels, from always reading your own writes to reading data of unknown staleness. If you’re reporting on the last year of activity, you might only need data that’s correct to the last couple of days. Conversely, if you’re doing real-time trading, you might need to immediately read the latest writes.

To understand how to achieve these varying levels of consistency, it is important to understand what MongoDB is doing under the hood. The server keeps a queue of requests for each connection. When the client sends a request, it will be placed at the end of its connection’s queue. Any subsequent requests on the connection will occur after the enqueued operation is processed. Thus, a single connection has a consistent view of the database and can always read its own writes.

Note that this is a per-connection queue: if we open two shells, we will have two connections to the database. If we perform an insert in one shell, a subsequent query in the other shell might not return the inserted document. However, within a single shell, if we query for the document after inserting, the document will be returned. This behavior can be difficult to duplicate by hand, but on a busy server interleaved inserts and queries are likely to occur. Often developers run into this when they insert data in one thread and then check that it was successfully inserted in another. For a moment or two, it looks like the data was not inserted, and then it suddenly appears.
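
To illustrate from the shell (the pageviews collection is just an example; the second connection is opened by hand):

> db.pageviews.insert({"url" : "example.com"})
> db.pageviews.findOne({"url" : "example.com"})      // same connection: sees the insert
> other = new Mongo(db.getMongo().host).getDB(db.getName())  // a separate connection
> other.pageviews.findOne({"url" : "example.com"})   // may briefly miss it on a busy server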

This behavior is especially worth keeping in mind when using the Ruby, Python, and Java drivers, because all three use connection pooling. For efficiency, these drivers open multiple connections (a pool) to the server and distribute requests across them. They all, however, have mechanisms to guarantee that a series of requests is processed by a single connection. There is detailed documentation on connection pooling in various languages on the MongoDB wiki.

When you send reads to a replica set secondary (see Chapter 11), this becomes an even larger issue. Secondaries may lag behind the primary, leading to reading data from seconds, minutes, or even hours ago. There are several ways of dealing with this, the easiest being to simply send all reads to the primary if you care about staleness. You could also set up an automatic script to detect lag on a secondary and put it into maintenance mode if it lags too far behind. If you have a small set, it might be worth using "w" : setSize as a write concern and sending subsequent reads to the primary if getLastError does not return successfully.
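
For example, with a three-member set you might follow each important write with a getLastError call like the following sketch (the orders collection is hypothetical):

> db.orders.insert({"sku" : "x123", "qty" : 1})
> db.runCommand({"getLastError" : 1, "w" : 3, "wtimeout" : 2000})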

Migrating Schemas

As your application grows and your needs change, your schema may have to grow and change as well. There are a couple of ways of accomplishing this, and regardless of the method you choose, you should carefully document each schema that your application has used.

The simplest method is to simply have your schema evolve as your application requires, making sure that your application supports all old versions of the schema (e.g., accepting the existence or non-existence of fields or dealing with multiple possible field types gracefully). This technique can become messy, particularly if you have conflicting versions. For instance, one version might require a "mobile" field and one version might require not having a "mobile" field but does require another field, and yet another version thinks that the "mobile" field is optional. Keeping track of these requirements can gradually turn code into spaghetti.

To handle changing requirements in a slightly more structured way you can include a "version" field (or just "v") in each document and use that to determine what your application will accept for document structure. This enforces your schema more rigorously: a document has to be valid for some version of the schema, if not the current one. However, it still requires supporting old versions.

The final option is to migrate all of your data when the schema changes. Generally this is not a good idea: MongoDB allows you to have a dynamic schema in order to avoid migrations, because they put a lot of pressure on your system. However, if you do decide to change every document, you will need to ensure that all documents were successfully updated. MongoDB does not support atomic multiupdates (updates that either all happen or all fail across multiple documents). If MongoDB crashes in the middle of a migration, you could end up with some updated and some non-updated documents.
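
A sketch of a migration plus verification, using the "version" field described above (field names are otherwise assumptions):

> // stamp every document that hasn't been migrated yet
> db.users.update({"version" : {"$exists" : false}},
... {"$set" : {"version" : 2}},
... false, true)
> // if this is nonzero (e.g., the server crashed mid-migration), rerun the update
> db.users.count({"version" : {"$exists" : false}})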

When Not to Use MongoDB

While MongoDB is a general-purpose database that works well for most applications, it isn’t good at everything. Here are some tasks that MongoDB is not designed to do:

  • MongoDB does not support transactions, so systems that require transactions should use another data store. There are a couple of ways to hack in simple transaction-like semantics, particularly on a single document, but there is no database enforcement. Thus, you can make all of your clients agree to obey whatever semantics you come up with (e.g., “Check the "locks" field before doing any operation”; a sketch of one such convention appears after this list), but there is nothing stopping an ignorant or malicious client from messing things up.

  • Joining many different types of data across many different dimensions is something relational databases are fantastic at. MongoDB isn’t supposed to do this well and most likely never will.

  • Finally, one of the big (if hopefully temporary) reasons to use a relational database over MongoDB is if you’re using tools that don’t support MongoDB. From SQLAlchemy to Wordpress, there are thousands of tools that just weren’t built to support MongoDB. The pool of tools that support MongoDB is growing but is hardly the size of relational databases’ ecosystem, yet.
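
As mentioned in the first item above, here is a minimal sketch of a client-enforced locking convention using findAndModify (the jobs collection, jobId, and "locked" field are hypothetical; nothing in MongoDB enforces this, so every client has to cooperate):

// atomically claim the document: succeeds only if no one else holds the lock
var doc = db.jobs.findAndModify({
    query : {"_id" : jobId, "locked" : false},
    update : {"$set" : {"locked" : true, "lockedBy" : "worker-1",
                        "lockedAt" : new Date()}}
})

if (doc !== null) {
    // ...do the work, then release the lock
    db.jobs.update({"_id" : jobId}, {"$set" : {"locked" : false}})
}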
