Chapter 9. Application Design

This chapter covers designing applications to work effectively with MongoDB. It discusses:

  • Schema design considerations

  • Trade-offs when deciding whether to embed data or to reference it

  • Tips for optimization

  • Consistency considerations

  • How to migrate schemas

  • How to manage schemas

  • When MongoDB isn’t a good choice of data store

Schema Design Considerations

A key aspect of data representation is the design of the schema, which is the way your data is represented in your documents. The best approach to this design is to represent the data the way your application wants to see it. Thus, unlike in relational databases, you first need to understand your queries and data access patterns before modeling your schema.

Here are the key aspects you need to consider when designing a schema:

Constraints

You need to understand any database or hardware limitations. You also need to consider a number of MongoDB-specific aspects, such as the maximum document size of 16 MB, the fact that full documents are read from and written to disk, that an update rewrites the whole document, and that atomic updates are applied at the document level.

Access patterns of your queries and of your writes

You will need to identify and quantify the workload of your application and of the wider system. The workload encompasses both the reads and the writes in your application. Once you know when queries are running and how frequently, you can identify the most common queries. These are the queries you need to design your schema to support. Once you have identified these queries, you should try to minimize the number of queries and ensure in your design that data that gets queried together is stored in the same document.

Data not used in these queries should be put into a different collection. Data that is infrequently used should also be moved to a different collection. It is worth considering if you can separate your dynamic (read/write) data and your static (mostly read) data. The best performance results occur when you prioritize your schema design for your most common queries.

Relation types

You should consider which data is related in terms of your application’s needs, as well as the relationships between documents. You can then determine the best approaches to embed or reference the data or documents. You will need to work out how you can reference documents without having to perform additional queries, and how many documents are updated when there is a relationship change. You must also consider if the data structure is easy to query, such as with nested arrays (arrays in arrays), which support modeling certain relationships.

Cardinality

Once you have determined how your documents and your data are related, you should consider the cardinality of these relationships. Specifically, is it one-to-one, one-to-many, many-to-many, one-to-millions, or many-to-billions? It is very important to establish the cardinality of the relationships to ensure you use the best format to model them in your MongoDB schema. You should also consider whether the object on the many/millions side is accessed separately or only in the context of the parent object, as well as the ratio of updates to reads for the data field in question. The answers to these questions will help you to determine whether you should embed documents or reference documents and if you should be denormalizing data across documents.

Schema Design Patterns

Schema design is important in MongoDB because it directly impacts application performance. Many common issues in schema design can be addressed through the use of known patterns, or “building blocks.” It is best practice to apply one or more of these patterns, often in combination, in your schema design.

Schema design patterns that might apply include:

Polymorphic pattern

This is suitable where all documents in a collection have a similar, but not identical, structure. It involves identifying the common fields across the documents that support the common queries that will be run by the application. Tracking specific fields in the documents or subdocuments will help identify the differences between the data and different code paths or classes/subclasses that can be coded in your application to manage these differences. This allows for the use of simple queries in a single collection of not-quite-identical documents to improve query performance.
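
For example, a hypothetical athletes collection (all names here are illustrative) could hold documents of slightly different shapes while still serving the application’s common queries from a single index:

// Hypothetical collection: both document shapes share "name" and "sport".
db.athletes.insertMany([
    {"name" : "Ada", "sport" : "tennis",
     "stats" : {"grandSlamTitles" : 3}},
    {"name" : "Bo", "sport" : "climbing",
     "stats" : {"hardestGrade" : "9a"}}
])

// Common queries touch only the shared fields, so one collection and
// one index serve both shapes; application code handles the differences.
db.athletes.createIndex({"sport" : 1})
db.athletes.find({"sport" : "tennis"})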

Attribute pattern

This is suitable when there is a subset of fields in a document that share common features on which you want to sort or query, or when the fields you need to sort on exist in only a subset of the documents, or when both of these conditions are true. It involves reshaping the data into an array of key/value pairs and creating an index on the elements in this array. Qualifiers can be added as additional fields to these key/value pairs. This pattern assists in targeting many similar fields per document so that fewer indexes are required and queries become simpler to write.
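
As a rough sketch (the collection and field names are hypothetical), per-country release dates that would otherwise each need their own index can be reshaped into a single indexed array of key/value pairs:

// Hypothetical films collection using key ("k") / value ("v") pairs.
db.films.insertOne({
    "title" : "Example Film",
    "releases" : [
        {"k" : "USA", "v" : ISODate("2019-05-01")},
        {"k" : "France", "v" : ISODate("2019-06-14")}
    ]
})

// One compound index covers every country, instead of one index
// per country-specific field.
db.films.createIndex({"releases.k" : 1, "releases.v" : 1})
db.films.find({"releases" : {"$elemMatch" : {"k" : "France"}}})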

Bucket pattern

This is suitable for time series data where the data is captured as a stream over a period of time. It is much more efficient in MongoDB to “bucket” this data into a set of documents each holding the data for a particular time range than it is to create a document per point in time/data point. For example, you might use a one-hour bucket and place all readings for that hour in an array in a single document. The document itself will have start and end times indicating the period this “bucket” covers.
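
A hypothetical one-hour bucket for a single sensor might look something like this, with per-minute readings appended to the array as they arrive:

{
    "sensorId" : 12345,
    "startTime" : ISODate("2019-01-31T10:00:00Z"),
    "endTime" : ISODate("2019-01-31T11:00:00Z"),
    "readings" : [
        {"timestamp" : ISODate("2019-01-31T10:00:00Z"), "value" : 40.3},
        {"timestamp" : ISODate("2019-01-31T10:01:00Z"), "value" : 40.1}
    ]
}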

Outlier pattern

This addresses the rare instances where a few documents or queries fall outside the normal pattern for the application. It is an advanced schema pattern designed for situations where popularity is a factor. This can be seen in social networks with major influencers, book sales, movie reviews, etc. It uses a flag to indicate that the document is an outlier and stores the additional overflow in one or more documents that refer back to the first document via its "_id". The flag is used by your application code to make the additional queries to retrieve the overflow document(s).
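
A rough sketch of what this might look like for a runaway bestseller (the field names are illustrative):

// The "main" document carries a flag telling the application that
// overflow documents exist.
{
    "_id" : 1017,
    "title" : "Popular Book",
    "purchasers" : [ ... ],
    "hasOverflow" : true
}

// Each overflow document refers back to the first document via its "_id".
{
    "bookId" : 1017,
    "purchasers" : [ ... ]
}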

Computed pattern

This is used when data needs to be computed frequently, and it can also be used when the data access pattern is read-intensive. This pattern recommends that the calculations be done in the background, with the main document being updated periodically. This provides a valid approximation of the computed fields or documents without having to continuously generate these for individual queries. This can significantly reduce the strain on the CPU by avoiding repetition of the same calculations, particularly in use cases where reads trigger the calculation and you have a high read-to-write ratio.
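
For instance, assuming MongoDB 4.2 or later (for $merge) and hypothetical screenings and movieTotals collections, a background job could refresh the precomputed totals periodically instead of recalculating them on every read:

// Run from a scheduled job, not on every read; names are hypothetical.
db.screenings.aggregate([
    {"$group" : {"_id" : "$movieId", "totalViewers" : {"$sum" : "$viewers"}}},
    {"$merge" : {"into" : "movieTotals", "whenMatched" : "replace"}}
])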

Subset pattern

This is used when you have a working set that exceeds the available RAM of the machine. This can be caused by large documents that contain a lot of information that isn’t being used by your application. This pattern suggests that you split frequently used data and infrequently used data into two separate collections. A typical example might be an ecommerce application keeping the 10 most recent reviews of a product in the “main” (frequently accessed) collection and moving all the older reviews into a second collection queried only if the application needs more than the last 10 reviews.
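
A sketch of that ecommerce example, with hypothetical collection and field names:

// products collection: only the most recent reviews are embedded.
{
    "_id" : 42,
    "name" : "Widget",
    "recentReviews" : [ ... ]    // the 10 most recent reviews
}

// reviews collection: queried only when more than 10 reviews are needed.
db.reviews.find({"productId" : 42}).sort({"date" : -1}).skip(10)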

Extended Reference pattern

This is used for scenarios where you have many different logical entities or “things,” each with their own collection, but you may want to gather these entities together for a specific function. A typical ecommerce schema might have separate collections for orders, customers, and inventory. This can have a negative performance impact when we want to collect together all the information for a single order from these separate collections. The solution is to identify the frequently accessed fields and duplicate these within the order document. In the case of an ecommerce order, this would be the name and address of the customer we are shipping the item to. This pattern trades off the duplication of data for a reduction in the number of queries necessary to collate the information together.

Approximation pattern

This is useful for situations where resource-expensive (time, memory, CPU cycles) calculations are needed but where exact precision is not absolutely required. An example of this is an image or post like/love counter or a page view counter, where knowing the exact count (e.g., whether it’s 999,535 or 1,000,000) isn’t necessary. In these situations, applying this pattern can greatly reduce the number of writes: for example, by only updating the counter after every 100 or more views instead of after every view.
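
A minimal sketch of the page view example (the names are hypothetical), where the application keeps a local tally and only writes once per 100 views:

// Application-side tally; the counter document is updated only once
// for every 100 views.
let pendingViews = 0;
function recordView(pageId) {
    pendingViews += 1;
    if (pendingViews >= 100) {
        db.pages.updateOne({"_id" : pageId},
                           {"$inc" : {"views" : pendingViews}});
        pendingViews = 0;
    }
}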

Tree pattern

This can be applied when you have a lot of queries and have data that is primarily hierarchical in structure. It follows the earlier concept of storing data together that is typically queried together. In MongoDB, you can easily store a hierarchy in an array within the same document. In the example of the ecommerce site, specifically its product catalog, there are often products that belong to multiple categories or to categories that are part of other categories. An example might be “Hard Drive,” which is itself a category but comes under the “Storage” category, which itself is under the “Computer Parts” category, which is part of the “Electronics” category. In this kind of scenario, we would have a field that would track the entire hierarchy and another field that would hold the immediate category (“Hard Drive”). The entire hierarchy field, kept in an array, provides the ability to use a multikey index on those values. This ensures all items related to categories in the hierarchy will be easily found. The immediate category field allows all items directly related to this category to be found.
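
The hard drive example might be modeled like this (the field names are illustrative), with a multikey index on the hierarchy array:

// Hypothetical product document: "category" holds the immediate category
// and "ancestorCategories" holds the entire hierarchy.
db.products.insertOne({
    "name" : "Example 2TB Hard Drive",
    "category" : "Hard Drive",
    "ancestorCategories" : ["Electronics", "Computer Parts",
                            "Storage", "Hard Drive"]
})

// The index on the array field is multikey, so any level of the
// hierarchy can be searched with a single query.
db.products.createIndex({"ancestorCategories" : 1})
db.products.find({"ancestorCategories" : "Storage"})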

Preallocation pattern

This was primarily used with the MMAP storage engine, but there are still uses for this pattern. The pattern recommends creating an initial empty structure that will be populated later. An example use could be for a reservation system that manages a resource on a day-by-day basis, keeping track of whether it is free or already booked/unavailable. A two-dimensional structure of resources (x) and days (y) makes it trivially easy to check availability and perform calculations.
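
A rough sketch of the reservation example (the names are hypothetical, and the shell is assumed to support ES6): preallocate a slot per day for each resource, after which booking is a simple positional update:

// Preallocate 366 day slots for one resource, all initially free.
db.availability.insertOne({
    "resourceId" : "room-101",
    "year" : 2020,
    "days" : Array.from({"length" : 366}, () => ({"booked" : false}))
})

// Booking day 42 is then a single in-place update.
db.availability.updateOne(
    {"resourceId" : "room-101", "year" : 2020},
    {"$set" : {"days.42.booked" : true}}
)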

Document Versioning pattern

This provides a mechanism to enable retention of older revisions of documents. It requires an extra field to be added to each document to track the document version in the “main” collection, and an additional collection that contains all the revisions of the documents. This pattern makes a few assumptions: specifically, that each document has a limited number of revisions, that there are not large numbers of documents that need to be versioned, and that the queries are primarily done on the current version of each document. In situations where these assumptions are not valid, you may need to modify the pattern or consider a different schema design pattern.
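
A minimal sketch with illustrative field names: the current document lives in the “main” collection and carries a revision number, while every revision is also copied into a revisions collection:

// "main" collection: the application queries the current revision only.
{"_id" : 12345, "revision" : 3, "limit" : 10000}

// revisions collection: one document per retained revision.
{"docId" : 12345, "revision" : 2, "limit" : 5000}
{"docId" : 12345, "revision" : 3, "limit" : 10000}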

MongoDB provides several useful resources online on patterns and schema design. MongoDB University offers a free course, M320 Data Modeling, as well as a “Building with Patterns” blog series.

Normalization Versus Denormalization

There are many ways to represent data, and one of the most important issues to consider is how much you should normalize your data. Normalization refers to dividing up data into multiple collections with references between collections. Each piece of data lives in one collection, although multiple documents may reference it. Thus, to change the data, only one document must be updated. The MongoDB Aggregation Framework offers joins with the $lookup stage, which performs a left outer join: for each document in the collection being aggregated, it adds a new array field containing the matching documents from the “joined” collection. These reshaped documents are then available in the next stage for further processing.
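
As a brief illustration (the collection and field names are hypothetical), joining each order with its customer looks like this:

db.orders.aggregate([
    {"$lookup" : {
        "from" : "customers",          // the "joined" collection
        "localField" : "customerId",   // field in the orders documents
        "foreignField" : "_id",        // field in the customers documents
        "as" : "customer"              // new array field added to each order
    }}
])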

Denormalization is the opposite of normalization: embedding all of the data in a single document. Instead of documents containing references to one definitive copy of the data, many documents may have copies of the data. This means that multiple documents need to be updated if the information changes, but enables all related data to be fetched with a single query.

Deciding when to normalize and when to denormalize can be difficult: typically, normalizing makes writes faster and denormalizing makes reads faster. Thus, you need to decide what trade-offs make sense for your application.

Examples of Data Representations

Suppose we are storing information about students and the classes that they are taking. One way to represent this would be to have a students collection (each student is one document) and a classes collection (each class is one document). Then we could have a third collection (studentClasses) that contains references to the students and the classes they are taking:

> db.studentClasses.findOne({"studentId" : id})
{
    "_id" : ObjectId("512512c1d86041c7dca81915"),
    "studentId" : ObjectId("512512a5d86041c7dca81914"),
    "classes" : [
        ObjectId("512512ced86041c7dca81916"),
        ObjectId("512512dcd86041c7dca81917"),
        ObjectId("512512e6d86041c7dca81918"),
        ObjectId("512512f0d86041c7dca81919")
    ]
}

If you are familiar with relational databases, you may have seen this type of join table before (although typically you’d have one student and one class per document, instead of a list of class "_id"s). It’s a bit more MongoDB-ish to put the classes in an array, but you usually wouldn’t want to store the data this way because it requires a lot of querying to get to the actual information.

Suppose we wanted to find the classes a student was taking. We’d query for the student in the students collection, query studentClasses for the course "_id"s, and then query the classes collection for the class information. Thus, finding this information would take three trips to the server. This is generally not the way you want to structure data in MongoDB, unless the classes and students are changing constantly and reading the data does not need to be done quickly.

We can remove one of the dereferencing queries by embedding class references in the student’s document:

{
    "_id" : ObjectId("512512a5d86041c7dca81914"),
    "name" : "John Doe",
    "classes" : [
        ObjectId("512512ced86041c7dca81916"),
        ObjectId("512512dcd86041c7dca81917"),
        ObjectId("512512e6d86041c7dca81918"),
        ObjectId("512512f0d86041c7dca81919")
    ]
}

The "classes" field keeps an array of "_id"s of classes that John Doe is taking. When we want to find out information about those classes, we can query the classes collection with those "_id"s. This only takes two queries. This is a fairly popular way to structure data that does not need to be instantly accessible and changes, but not constantly.

If we need to optimize reads further, we can get all of the information in a single query by fully denormalizing the data and storing each class as an embedded document in the "classes" field:

{
    "_id" : ObjectId("512512a5d86041c7dca81914"),
    "name" : "John Doe",
    "classes" : [
        {
            "class" : "Trigonometry",
            "credits" : 3,
            "room" : "204"
        },
        {
            "class" : "Physics",
            "credits" : 3,
            "room" : "159"
        },
        {
            "class" : "Women in Literature",
            "credits" : 3,
            "room" : "14b"
        },
        {
            "class" : "AP European History",
            "credits" : 4,
            "room" : "321"
        }
    ]
}

The upside of this is that it only takes one query to get the information. The downsides are that it takes up more space and is more difficult to keep in sync. For example, if it turns out that physics was supposed to be worth four credits (not three), every student in the physics class would need to have their document updated (instead of just updating a central “Physics” document).

Finally, you can use the Extended Reference pattern mentioned earlier, which is a hybrid of embedding and referencing—you create an array of subdocuments with the frequently used information, but with a reference to the actual document for more information:

{
    "_id" : ObjectId("512512a5d86041c7dca81914"),
    "name" : "John Doe",
    "classes" : [
        {
            "_id" : ObjectId("512512ced86041c7dca81916"),
            "class" : "Trigonometry"
        },
        {
            "_id" : ObjectId("512512dcd86041c7dca81917"),
            "class" : "Physics"
        },
        {
            "_id" : ObjectId("512512e6d86041c7dca81918"),
            "class" : "Women in Literature"
        },
        {
            "_id" : ObjectId("512512f0d86041c7dca81919"),
            "class" : "AP European History"
        }
    ]
}

This approach is also a nice option because the amount of information embedded can change over time as your requirements change: if you want to include more or less information on a page, you can embed more or less of it in the document.

Another important consideration is how often this information will change, versus how often it’s read. If it will be updated regularly, then normalizing it is a good idea. However, if it changes infrequently, then there is little benefit to optimizing the update process at the expense of every read your application performs.

For example, a textbook normalization use case is to store a user and their address in separate collections. However, people’s addresses rarely change, so you generally shouldn’t penalize every read on the off chance that someone’s moved. Your application should embed the address in the user document.

If you decide to use embedded documents and you need to update them, you should set up a cron job to ensure that any updates you do are successfully propagated to every document. For example, suppose you attempt to do a multi-update but the server crashes before all of the documents have been updated. You need a way to detect this and retry the update.

In terms of update operators, "$set" is idempotent but "$inc" is not. Idempotent operations will have the same outcome whether tried once or several times; in the case of a network error, retrying the operation will be sufficient for the update to occur. For operators that are not idempotent, the operation should be broken into two operations that are individually idempotent and safe to retry. This can be achieved by including a unique pending token in the first operation and having the second operation use both a unique key and the unique pending token. This approach makes the increment safe to retry because each individual updateOne operation is idempotent.
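
A minimal sketch of this two-step approach, using a hypothetical counters collection: the first update only records a pending token, and the second increments only while that token is still pending, so both operations are safe to retry:

var token = ObjectId();   // unique token for this logical increment

// Step 1: retrying only re-adds the same token, so this is idempotent.
db.counters.updateOne({"_id" : "pageviews"},
                      {"$addToSet" : {"pending" : token}},
                      {"upsert" : true})

// Step 2: the filter matches only while the token is still pending; once
// this succeeds, a retry matches nothing, so the count is incremented at
// most once.
db.counters.updateOne({"_id" : "pageviews", "pending" : token},
                      {"$inc" : {"count" : 1}, "$pull" : {"pending" : token}})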

To some extent, the more information you are generating, the less of it you should embed. If the content of the embedded fields or number of embedded fields is supposed to grow without bound then they should generally be referenced, not embedded. Things like comment trees or activity lists should be stored as their own documents, not embedded. It is also worth considering using the Subset pattern (described in “Schema Design Patterns”) to store the most recent items (or some other subset) in the document.

Finally, the fields that are included should be integral to the data in the document. If a field is almost always excluded from your results when you query for a document, it’s a good sign that it may belong in another collection. These guidelines are summarized in Table 9-1.

Table 9-1. Comparison of embedding versus references
Embedding is better for...                                     | References are better for...
Small subdocuments                                             | Large subdocuments
Data that does not change regularly                            | Volatile data
When eventual consistency is acceptable                        | When immediate consistency is necessary
Documents that grow by a small amount                          | Documents that grow by a large amount
Data that you’ll often need to perform a second query to fetch | Data that you’ll often exclude from the results
Fast reads                                                     | Fast writes

Suppose we had a users collection. Here are some example fields we might have in the user documents and an indication of whether or not they should be embedded:

Account preferences

These are only relevant to this user document, and will probably be exposed with other user information in the document. Account preferences should generally be embedded.

Recent activity

This depends on how much recent activity grows and changes. If it is a fixed-size field (say, the last 10 things), it might be useful to embed this information or to implement the Subset pattern.

Friends

Generally this information should not be embedded, or at least not fully. See “Friends, Followers, and Other Inconveniences”.

All of the content this user has produced

This should not be embedded.

Cardinality

Cardinality is an indication of how many references a collection has to another collection. Common relationships are one-to-one, one-to-many, or many-to-many. For example, suppose we had a blog application. Each post has a title, so that’s a one-to-one relationship. Each author has many posts, so that’s a one-to-many relationship. And posts have many tags and tags refer to many posts, so that’s a many-to-many relationship.

When using MongoDB, it can be conceptually useful to split “many” into subcategories: “many” and “few.” For example, you might have a one-to-few relationship between authors and posts: each author writes only a few posts. You might have a many-to-few relationship between blog posts and tags: you probably have many more blog posts than you have tags. However, you’d have a one-to-many relationship between blog posts and comments: each post has many comments.

Determining few versus many relations can help you decide what to embed versus what to reference. Generally, “few” relationships will work better with embedding, and “many” relationships will work better as references.

Friends, Followers, and Other Inconveniences

Keep your friends close and your enemies embedded.

This section covers considerations for social graph data. Many social applications need to link people, content, followers, friends, and so on. Figuring out how to balance embedding and referencing this highly connected information can be tricky, but generally following, friending, or favoriting can be simplified to a publication/subscription system: one user is subscribing to notifications from another. Thus, there are two basic operations that need to be efficient: storing subscribers and notifying all interested parties of an event.

There are three ways people typically implement subscribing. The first option is to put the producer in the subscriber’s document, which looks something like this:

{
    "_id" : ObjectId("51250a5cd86041c7dca8190f"),
    "username" : "batman",
    "email" : "[email protected]"
    "following" : [
        ObjectId("51250a72d86041c7dca81910"), 
        ObjectId("51250a7ed86041c7dca81936")
    ]
}

Now, given a user’s document, you can issue a query like the following to find all of the activities that have been published that they might be interested in:

db.activities.find({"user" : {"$in" :
      user["following"]}})

However, if you need to find everyone who is interested in a newly published activity, you’d have to query the "following" field across all users.

Alternatively, you could append the followers to the producer’s document, like so:

{
    "_id" : ObjectId("51250a7ed86041c7dca81936"),
    "username" : "joker",
    "email" : "[email protected]"
    "followers" : [
        ObjectId("512510e8d86041c7dca81912"),
        ObjectId("51250a5cd86041c7dca8190f"),
        ObjectId("512510ffd86041c7dca81910")
    ]
}

Whenever this user does something, all the users you need to notify are right there. The downside is that now you need to query the whole users collection to find everyone a user follows (the opposite limitation as in the previous case).

Either of these options comes with an additional downside: they make your user documents larger and more volatile. The "following" (or "followers") field often won’t even need to be returned: how often do you want to list every follower? Thus, the final option neutralizes these downsides by normalizing even further and storing subscriptions in another collection. Normalizing this far is often overkill, but it can be useful for an extremely volatile field that often isn’t returned with the rest of the document. "followers" may be a sensible field to normalize this way.

In this case you keep a collection that matches publishers to subscribers, with documents that look something like this:

{
    "_id" : ObjectId("51250a7ed86041c7dca81936"), // followee's "_id"
    "followers" : [
        ObjectId("512510e8d86041c7dca81912"),
        ObjectId("51250a5cd86041c7dca8190f"),
        ObjectId("512510ffd86041c7dca81910")
    ]
}

This keeps your user documents svelte but means an extra query is needed to get the followers.

Dealing with the Wil Wheaton effect

Regardless of which strategy you use, embedding only works with a limited number of subdocuments or references. If you have celebrity users, they may overflow any document that you’re storing followers in. The typical way of compensating for this is to use the Outlier pattern discussed in “Schema Design Patterns” and have a “continuation” document, if necessary. For example, you might have:

> db.users.find({"username" : "wil"})
{
    "_id" : ObjectId("51252871d86041c7dca8191a"),
    "username" : "wil",
    "email" : "[email protected]",
    "tbc" : [
        ObjectId("512528ced86041c7dca8191e"),
        ObjectId("5126510dd86041c7dca81924")
    ],
    "followers" : [
        ObjectId("512528a0d86041c7dca8191b"),
        ObjectId("512528a2d86041c7dca8191c"),
        ObjectId("512528a3d86041c7dca8191d"),
        ...
    ]
}
{
    "_id" : ObjectId("512528ced86041c7dca8191e"),
    "followers" : [
        ObjectId("512528f1d86041c7dca8191f"),
        ObjectId("512528f6d86041c7dca81920"),
        ObjectId("512528f8d86041c7dca81921"),
        ...
    ]
}
{
    "_id" : ObjectId("5126510dd86041c7dca81924"),
    "followers" : [
        ObjectId("512673e1d86041c7dca81925"),
        ObjectId("512650efd86041c7dca81922"),
        ObjectId("512650fdd86041c7dca81923"),
        ...
    ]
}

Then add application logic to support fetching the documents in the “to be continued” ("tbc") array.
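
For example, the application logic might look something like this sketch:

var user = db.users.findOne({"username" : "wil"});
var followers = user.followers;

// Follow the "to be continued" chain to pick up the overflow documents.
(user.tbc || []).forEach(function(contId) {
    followers = followers.concat(db.users.findOne({"_id" : contId}).followers);
});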

Optimizations for Data Manipulation

To optimize your application, you must first determine what its bottleneck is by evaluating its read and write performance. Optimizing reads generally involves having the correct indexes and returning as much of the information as possible in a single document. Optimizing writes usually involves minimizing the number of indexes you have and making updates as efficient as possible.

There is often a trade-off between schemas that are optimized for writing quickly and those that are optimized for reading quickly, so you may have to decide which is more important for your application. Factor in not only the importance of reads versus writes, but also their proportions: if writes are more important but you’re doing a thousand reads to every write, you may still want to optimize reads first.

Removing Old Data

Some data is only important for a brief time: after a few weeks or months it is just wasting storage space. There are three popular options for removing old data: using capped collections, using TTL collections, and dropping collections per time period.

The easiest option is to use a capped collection: set it to a large size and let old data “fall off” the end. However, capped collections impose certain limitations on the operations you can perform and are vulnerable to spikes in traffic, which temporarily reduce the length of time that they can hold data. See “Capped Collections” for more information.

The second option is to use TTL collections. This gives you finer-grained control over when documents are removed, but it may not be fast enough for collections with a very high write volume: it removes documents by traversing the TTL index the same way a user-requested remove would. If a TTL collection can keep up, though, this is probably the easiest solution to implement. See “Time-To-Live Indexes” for more information about TTL indexes.
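
For example, on a hypothetical events collection, a TTL index that removes documents roughly 30 days after their "createdAt" time looks like this:

db.events.createIndex({"createdAt" : 1},
                      {"expireAfterSeconds" : 60 * 60 * 24 * 30})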

The final option is to use multiple collections: for example, one collection per month. Every time the month changes, your application starts using this month’s (empty) collection and searching for data in both the current and previous months’ collections. Once a collection is older than, say, six months, you can drop it. This strategy can keep up with nearly any volume of traffic, but it’s more complex to build an application around because you have to use dynamic collection (or database) names and possibly query multiple databases.

Planning Out Databases and Collections

Once you have sketched out what your documents look like, you must decide what collections or databases to put them in. This is often a fairly intuitive process, but there are some guidelines to keep in mind.

In general, documents with a similar schema should be kept in the same collection. MongoDB generally disallows combining data from multiple collections, so if there are documents that need to be queried or aggregated together, those are good candidates for putting in one big collection. For example, you might have documents that are fairly different “shapes,” but if you’re going to be aggregating them, they should all live in the same collection (or you can use the $merge stage if they are in separate collections or databases).

When deciding where to put your collections, the big issues to consider are locking (you get a read/write lock per document) and storage. Generally, if you have a high-write workload you may need to consider using multiple physical volumes to reduce I/O bottlenecks. Each database can reside in its own directory when you use the --directoryperdb option, allowing you to mount different databases on different volumes. Thus, you may want all items within a database to be of similar “quality,” with a similar access pattern or similar traffic levels.

For example, suppose you have an application with several components: a logging component that creates a huge amount of not-very-valuable data, a user collection, and a couple of collections for user-generated data. These collections are high-value: it is important that user data is safe. There is also a high-traffic collection for social activities, which is of lower importance but not quite as unimportant as the logs. This collection is mainly used for user notifications, so it is almost an append-only collection.

Splitting these up by importance, you might end up with three databases: logs, activities, and users. The nice thing about this strategy is that you may find that your highest-value data is also what you have the least of (e.g., users probably don’t generate as much data as logging does). You might not be able to afford an SSD for your entire dataset, but you might be able to get one for your users, or you might use RAID10 for users and RAID0 for logs and activities.

Be aware that there are some limitations when using multiple databases prior to MongoDB 4.2, which introduced the $merge stage in the Aggregation Framework; $merge allows you to store the results of an aggregation run against one database in a different collection in a different database. An additional point to note is that the renameCollection command is slower when moving an existing collection from one database to another, as it must copy all the documents to the new database.
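
As a sketch (assuming MongoDB 4.2 or later; the database and collection names are hypothetical), a summary computed from one database can be written into another like this:

db.getSiblingDB("activities").social.aggregate([
    {"$group" : {"_id" : "$userId", "events" : {"$sum" : 1}}},
    {"$merge" : {"into" : {"db" : "reporting", "coll" : "activitySummary"}}}
])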

Managing Consistency

You must figure out how consistent your application’s reads need to be. MongoDB supports a huge variety of consistency levels, from always being able to read your own writes to reading data of unknown oldness. If you’re reporting on the last year of activity, you might only need data that’s correct to the last couple of days. Conversely, if you’re doing real-time trading, you might need to immediately read the latest writes.

To understand how to achieve these varying levels of consistency, it is important to understand what MongoDB is doing under the hood. The server keeps a queue of requests for each connection. When the client sends a request, it will be placed at the end of its connection’s queue. Any subsequent requests on the connection will occur after the previously enqueued operation is processed. Thus, a single connection has a consistent view of the database and can always read its own writes.

Note that this is a per-connection queue: if we open two shells, we will have two connections to the database. If we perform an insert in one shell, a subsequent query in the other shell might not return the inserted document. However, within a single shell, if we query for a document after inserting it, the document will be returned. This behavior can be difficult to duplicate by hand, but on a busy server interleaved inserts and queries are likely to occur. Often developers run into this when they insert data in one thread and then check that it was successfully inserted in another. For a moment or two, it looks like the data was not inserted, and then it suddenly appears.

This behavior is especially worth keeping in mind when using the Ruby, Python, and Java drivers, because all three use connection pooling. For efficiency, these drivers open multiple connections (a pool) to the server and distribute requests across them. They all, however, have mechanisms to guarantee that a series of requests is processed by a single connection. There is detailed documentation on connection pooling for the various languages in the MongoDB Drivers Connection Monitoring and Pooling specification.

When you send reads to a replica set secondary (see Chapter 12), this becomes an even larger issue. Secondaries may lag behind the primary, leading to reading data from seconds, minutes, or even hours ago. There are several ways to deal with this, the easiest being to simply send all reads to the primary if you care about staleness.

MongoDB offers the readConcern option to control the consistency and isolation properties of the data being read. It can be combined with writeConcern to control the consistency and availability guarantees made to your application. There are five levels: "local", "available", "majority", "linearizable", and "snapshot". Depending on the application, in cases where you want to avoid read staleness you could consider using "majority", which returns only durable data that has been acknowledged by the majority of the replica set members and will not be rolled back. "linearizable" may also be an option: it returns data that reflects all successful majority-acknowledged writes that have completed prior to the start of the read operation. MongoDB may wait for concurrently executing writes to finish before returning the results with the "linearizable" readConcern.
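
For example, on a hypothetical accounts collection, a query can request only majority-committed data like this:

db.accounts.find({"owner" : "alice"}).readConcern("majority")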

Three senior engineers from MongoDB published a paper called “Tunable Consistency in MongoDB” at the PVLDB conference in 2019.1 This paper outlines the different MongoDB consistency models used for replication and how application developers can utilize the various models.

Migrating Schemas

As your application grows and your needs change, your schema may have to grow and change as well. There are a couple of ways of accomplishing this, but regardless of the method you choose, you should carefully document each schema that your application has used. Ideally, you should consider if the Document Versioning pattern (see “Schema Design Patterns”) is applicable.

The simplest method is to simply have your schema evolve as your application requires, making sure that your application supports all old versions of the schema (e.g., accepting the existence or nonexistence of fields or dealing with multiple possible field types gracefully). But this technique can become messy, particularly if you have conflicting schema versions. For instance, one version might require a "mobile" field, another version might require not having a "mobile" field but instead require a different field, and yet another version might treat the "mobile" field as optional. Keeping track of these shifting requirements can gradually turn your code into spaghetti.

To handle changing requirements in a slightly more structured way, you can include a "version" field (or just "v") in each document and use that to determine what your application will accept for document structure. This enforces your schema more rigorously: a document has to be valid for some version of the schema, if not the current one. However, it still requires supporting old versions.

The final option is to migrate all of your data when the schema changes. Generally this is not a good idea: MongoDB allows you to have a dynamic schema in order to avoid migrations, because they put a lot of pressure on your system. However, if you do decide to change every document, you will need to ensure that all the documents were successfully updated. MongoDB supports transactions, which can be used for this type of migration. If MongoDB crashes in the middle of a transaction, the older schema will be retained.

Managing Schemas

MongoDB introduced schema validation in version 3.2, which allows for validation during updates and insertions. In version 3.6 it added JSON Schema validation via the $jsonSchema operator, which is now the recommended method for all schema validation in MongoDB. At the time of writing MongoDB supports draft 4 of JSON Schema, but please check the documentation for the most up-to-date information on this feature.

Validation does not check existing documents until they are modified, and it is configured per collection. To add validation to an existing collection, you use the collMod command with the validator option. You can add validation to a new collection by specifying the validator option when using db.createCollection(). MongoDB also provides two additional options, validationLevel and validationAction. validationLevel determines how strictly validation rules are applied to existing documents during an update, and validationAction determines whether invalid documents are rejected with an error or allowed with a warning.
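
For example, a new collection with a simple JSON Schema validator might be created like this (the collection name and required fields are illustrative):

db.createCollection("users", {
    "validator" : {
        "$jsonSchema" : {
            "bsonType" : "object",
            "required" : ["username", "email"],
            "properties" : {
                "username" : {"bsonType" : "string"},
                "email" : {"bsonType" : "string"}
            }
        }
    },
    "validationLevel" : "strict",  // apply the rules to inserts and all updates
    "validationAction" : "error"   // reject invalid documents rather than warn
})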

When Not to Use MongoDB

While MongoDB is a general-purpose database that works well for most applications, it isn’t good at everything. There are a few reasons you might need to avoid it:

  • Joining many different types of data across many different dimensions is something relational databases are fantastic at. MongoDB isn’t designed to do this well and most likely never will be.

  • One of the big (if, hopefully, temporary) reasons to use a relational database over MongoDB is if you’re using tools that don’t support it. From SQLAlchemy to WordPress, there are thousands of tools that just weren’t built to support MongoDB. The pool of tools that do support it is growing, but its ecosystem is hardly the size of relational databases’ yet.

1 The authors are William Schultz, senior software engineer for replication; Tess Avitabile, team lead of the replication team; and Alyson Cabral, product manager for Distributed Systems.
