Consolidating read querying

We should aim to have as few queries as possible. This can be achieved by embedding information into subdocuments instead of having separate entities. This may lead to an increased write load, as we have to keep the same data point in multiple documents and maintain their values everywhere when they change in one place.

The design consideration here is:

Read performance benefits from data duplication/denormalization
Data integrity benefits from data references (DBRef or in-application code using an attribute as a foreign key)

We should denormalize especially if our read/write ratio is high (our data rarely changes value but is read many times between changes), if our data can afford to be inconsistent for brief periods of time, and, most importantly, if we absolutely need our reads to be as fast as possible and are willing to pay the price in consistency and write performance.

The most obvious candidates for fields that we should denormalize (embed) are dependent fields. If we have an attribute or a document structure that we don't plan to query on its own but only as part of a containing attribute/document, then it makes sense to embed it rather than have it in a separate document/collection.

Using our Mongo books example, a book can have a related data structure named review that refers to a review from a reader of the book. If our most common use case is showing a book along with its associated reviews, then we can embed reviews into the book document.

The downside to this design is that finding all book reviews by a user becomes costly, as we have to iterate over all books to collect the associated reviews. Denormalizing users and embedding their reviews in the user documents as well can be a solution to this problem, at the cost of writing every review in two places.
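To make the trade-off concrete, here is a minimal sketch in plain JavaScript that models the collection as an in-memory array instead of talking to a live MongoDB server. The helper name `reviewsByUser`, the second book, and the lowercase field names are assumptions made for illustration only.

```javascript
// In-memory stand-in for the books collection (illustrative data only).
const books = [
  {
    isbn: '1001',
    title: 'Mastering MongoDB',
    reviews: [
      { user_id: 1, text: 'great book', rating: 5 },
      { user_id: 2, text: 'not so bad book', rating: 3 },
    ],
  },
  {
    isbn: '1002', // hypothetical second book
    title: 'Another book',
    reviews: [{ user_id: 1, text: 'useful reference', rating: 4 }],
  },
];

// Showing a book along with its reviews touches a single document.
const book = books.find((b) => b.isbn === '1001');

// Hypothetical helper: finding every review by one user has to visit every book.
function reviewsByUser(allBooks, userId) {
  const found = [];
  for (const b of allBooks) {
    for (const r of b.reviews) {
      if (r.user_id === userId) {
        found.push({ isbn: b.isbn, text: r.text, rating: r.rating });
      }
    }
  }
  return found;
}

console.log(book.reviews.length);
console.log(reviewsByUser(books, 1));
```

The first lookup stays cheap however many books exist, while the cost of `reviewsByUser` grows with the size of the whole collection.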

A counterexample is data that can grow unbounded. In our example here, embedding reviews along with heavy metadata can become a problem if a book document hits the 16 MB document size limit. A solution here is to distinguish between data structures that we expect to grow rapidly and those that we don't, and to keep an eye on their size through monitoring processes that query our live dataset at off-peak times and report on attributes that may pose a risk down the line.

Don't embed data that can grow unbounded.
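A monitoring check of this kind could look roughly like the following sketch. It runs on plain in-memory objects and uses the length of the JSON text as a rough proxy for document size, which is not the same as the BSON size that the limit applies to; against a real deployment the measurement would more likely come from the server (for instance, the `$bsonSize` aggregation operator in MongoDB 4.4 and later). The names `approxSizeBytes` and `findRiskyDocs`, the 50% threshold, and the sample data are all hypothetical.

```javascript
// MongoDB's maximum document size: 16 MB.
const MAX_DOC_BYTES = 16 * 1024 * 1024;

// Rough proxy only: the JSON text length is NOT the BSON size.
function approxSizeBytes(doc) {
  return Buffer.byteLength(JSON.stringify(doc), 'utf8');
}

// Hypothetical helper: report documents whose approximate size
// exceeds a given fraction of the limit.
function findRiskyDocs(docs, fraction = 0.5) {
  const threshold = MAX_DOC_BYTES * fraction;
  return docs
    .map((doc) => ({ isbn: doc.isbn, bytes: approxSizeBytes(doc) }))
    .filter((entry) => entry.bytes > threshold);
}

// Illustrative data: one small book and one with a very large reviews array.
const small = {
  isbn: '1001',
  reviews: [{ user_id: 1, text: 'great book', rating: 5 }],
};
const large = {
  isbn: '1002',
  reviews: Array.from({ length: 200000 }, (_, i) => ({
    user_id: i,
    text: 'x'.repeat(40),
    rating: 3,
  })),
};

console.log(findRiskyDocs([small, large]).map((e) => e.isbn));
```

Reporting at a fraction of the limit, rather than at the limit itself, leaves time to move the growing attribute into its own collection before writes start failing.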

When we embed attributes, we have to decide whether to use a subdocument or an enclosing array.

When we have a unique identifier to access the subdocument, then we should embed it as a subdocument. If we don't know exactly how to access it or we need the flexibility to be able to query for an attribute's values, then we should embed it in an array.

For example, with our books collection, if we decide to embed reviews into each book document we have the following two designs:

With array:

A book document:

{
  isbn: '1001',
  title: 'Mastering MongoDB',
  reviews: [
    { user_id: 1, text: 'great book', rating: 5 },
    { user_id: 2, text: 'not so bad book', rating: 3 }
  ]
}

With embedded document:

A book document:

{
  isbn: '1001',
  title: 'Mastering MongoDB',
  reviews: {
    // each review is stored under a unique key (here, the user_id)
    '1': { user_id: 1, text: 'great book', rating: 5 },
    '2': { user_id: 2, text: 'not so bad book', rating: 3 }
  }
}

The array structure has the advantage that we can directly query MongoDB for all reviews with rating > 4 through the embedded array reviews.

Using the embedded document structure on the other hand, we can retrieve all reviews the same way as we would do using the array, but if we want to filter on them it has to be done on the application side rather than on the database side.
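The difference between the two designs can be sketched in plain JavaScript as follows. The objects below are in-memory stand-ins, not documents fetched from a server; the lowercase field names and the choice of `user_id` as the subdocument key are assumptions, and the shell query in the comment is shown for orientation only and is not executed.

```javascript
// Array design. In the mongo shell, the server can match on array elements,
// for example (not executed here):
//   db.books.find({ 'reviews.rating': { $gt: 4 } })
// which returns the books that have at least one such review.
const bookWithArray = {
  isbn: '1001',
  title: 'Mastering MongoDB',
  reviews: [
    { user_id: 1, text: 'great book', rating: 5 },
    { user_id: 2, text: 'not so bad book', rating: 3 },
  ],
};

// Embedded-document design, assuming reviews are keyed by user_id.
const bookWithSubdocument = {
  isbn: '1001',
  title: 'Mastering MongoDB',
  reviews: {
    '1': { user_id: 1, text: 'great book', rating: 5 },
    '2': { user_id: 2, text: 'not so bad book', rating: 3 },
  },
};

// With a known key, a single review is reached directly.
const reviewOfUser2 = bookWithSubdocument.reviews['2'];

// Filtering on a value has to walk the subdocument in application code.
const topRated = Object.values(bookWithSubdocument.reviews).filter(
  (r) => r.rating > 4
);

// The same filter over the array design, for comparison.
const topRatedFromArray = bookWithArray.reviews.filter((r) => r.rating > 4);

console.log(reviewOfUser2.text);
console.log(topRated.length, topRatedFromArray.length);
```

Keyed access suits lookups where the identifier is known up front, while the array form keeps value-based filtering available on the database side.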
