© Subhashini Chellappan and Dharanitharan Ganesan 2020
S. Chellappan, D. GanesanMongoDB Recipeshttps://doi.org/10.1007/978-1-4842-4891-1_3

3. Data Modeling and Aggregation

Subhashini Chellappan1  and Dharanitharan Ganesan2
(1)
Bangalore, India
(2)
Krishnagiri, Tamil Nadu, India
 
In Chapter 2, we discussed MongoDB CRUD operations, embedded documents, and arrays. In this chapter, we cover the following topics.
  • Data models.

  • Data model relationship between documents.

  • Modeling tree structures.

  • Aggregation operations.

  • SQL aggregation terms and corresponding MongoDB aggregation operations.

Data Models

MongoDB provides two data model designs for data modeling:
  • Embedded data models.

  • Normalized data models.

Embedded Data Models

In MongoDB, you can embed related data in a single document. This schema design is known as denormalized models. Consider the example shown in Figure 3-1.
../images/475461_1_En_3_Chapter/475461_1_En_3_Fig1_HTML.png
Figure 3-1

A denormalized model

This embedded document model allows applications to store the related piece of information in the same records. As a result, the application requires only few queries and updates to complete common operations.

We can use embedded documents to represent both one-to-one relationships (a “contains” relationship between two entities) and one-to-many relationships (when many documents are viewed in the context of one parent).

Embedded documents provide better results in these cases:
  • For read operations.

  • When we need to retrieve related data in a single database operation.

Embedded data models update related data in a single atomic operation. Embedded document data can be accessed using dot notation.

Normalized Data Models

Normalized data models describe relationships using references, as illustrated in Figure 3-2.
../images/475461_1_En_3_Chapter/475461_1_En_3_Fig2_HTML.jpg
Figure 3-2

Normalized data model

Normalized data models can best be used in the following circumstances:
  • When embedding data model results duplication of data.

  • To represent complex many-to-many relationships.

  • To model large hierarchical data sets.

Normalized data models do not provide good read performance.

Data Model Relationship Between Documents

Let’s explore a data model that uses an embedded document and references.

Recipe 3-1. Data Model Using an Embedded Document

In this recipe, we are going to discuss a data model using an embedded document.

Problem

You want to create a data model for a one-to-one relationship.

Solution

Use an embedded document.

How It Works

Let’s follow the steps in this section to design a data model for a one-to-one relationship.

Step 1: One-to-One Relationships

Consider this example.
{
   _id: "James",
   name: "James William"
}
{
   student_id: "James",
   street: "123 Hill Street",
   city: "New York",
   state: "US",
 }
Here, we have student and address relationships in which an address belongs to the student. If we are going to retrieve address data with the name frequently, then referencing requires multiple queries to resolve references. In this scenario, we can embed address data with the student data to provide a better data model, as shown here.
{
   _id: "James",
   name: "James William",
   address: {
              street: "123 Hill Street",
              city: "New York",
              state: "US",
              }
}

With this data model, we can retrieve complete student information with one query.

Step 2: One-to-Many Relationships

Consider this example.
{
   _id: "James",
   name: "James William"
}
{
   student_id: "James",
   street: "123 Hill Street",
   city: "New York",
   state: "US",
 }
{
   student_id: "James",
   street: "234 Thomas Street",
   city: "New Jersey",
   state: "US",
 }
Here, we have a student and multiple address relationships (a student has multiple addresses). If we are going to retrieve address data with the name frequently, then referencing requires multiple queries to resolve references. In this scenario, the optimal way to design the schema is to embed address data with the student data as shown here.
{
   _id: "James",
   name: "James William",
   address: [{
              street: "123 Hill Street",
              city: "New York",
              state: "US",
              },
            {
              street: "234 Thomas Street",
              city: "New Jersey",
              state: "US",
             }]
 }

This data model allows us to retrieve complete student information with one query.

Recipe 3-2. Data Model Using Document References

In this recipe, we are going to discuss a data model using document references.

Problem

You want to create a data model for a one-to-many relationship.

Solution

Use a document reference.

How It Works

Let’s follow the steps in this section to design a data model for a one-to-many relationship.

Step 1: One-to-Many Relationships

Consider the following data model that maps a publisher and book relationship.
{
   title: "Practical Apache Spark",
   author: [ "Subhashini Chellappan", "Dharanitharan Ganesan" ],
   published_date: ISODate("2018-11-30"),
   pages: 300,
   language: "English",
   publisher: {
              name: "Apress",
              founded: 1999,
              location: "US"
            }
}
{
   title: "MongoDB Recipes",
   author: [ "Subhashini Chellappan"],
   published_date: ISODate("2018-11-30"),
   pages: 120,
   language: "English",
   publisher: {
              name: "Apress",
              founded: 1999,
              location: "US"
            }
}

Here, the publisher document is embedded inside the book document, which leads to repetition of the publisher data model.

In this scenario, we can document references to avoid repetition of data. In document references, the growth of relationships determines where to store the references. If the number of books per publisher is small, then we can store the book reference inside the publisher document as shown here.
{
   name: "Apress",
   founded: 1999,
   location: "US",
   books: [123456, 456789, ...]
}
{
    _id: 123456,
    title: "Practical Apache Spark",
    author: [ "Subhashini Chellappan", "Dharanitharan Ganesan" ],
    published_date: ISODate("2018-11-30"),
    pages: 300,
    language: "English"
}
{
   _id: 456789,
   title: "MongoDB Recipes",
   author: [ "Subhashini Chellappan"],
   published_date: ISODate("2018-11-30"),
   pages: 120,
   language: "English"
}
If the number of books per publisher is unbounded, this data model would lead to mutable, growing arrays. We can avoid this situation by storing the publisher reference inside the book document as shown here.
{
   _id:"Apress",
   name: "Apress",
   founded: 1999,
   location: "US"
}
{
   _id: 123456,
   title: "Practical Apache Spark",
   author: [ "Subhashini Chellappan", "Dharanitharan Ganesan" ],
   published_date: ISODate("2018-11-30"),
   pages: 300,
   language: "English",
   publisher_id: "Apress"
}
{
   _id: 456789,
   title: "MongoDB Recipes",
   author: [ "Subhashini Chellappan"],
   published_date: ISODate("2018-11-30"),
   pages: 120,
   language: "English",
   publisher_id: "Apress"
}

Step 2: Query Document References

Let’s discuss how to query document references. Consider the following collections.
db.publisher.insert({_id:"Apress",name: "Apress", founded: 1999,location:"US"})
db.authors.insertMany([{_id: 123456,title: "Practical Apache Spark",author:["Subhashini Chellappan", "Dharanitharan Ganesan" ], published_date: ISODate("2018-11-30"),pages: 300,language: "English",publisher_id: "Apress"},{_id: 456789,title: "MongoDB Recipes",  author: [ "Subhashini Chellappan"],published_date: ISODate("2018-11-30"),  pages: 120,language: "English",publisher_id: "Apress"}])
To perform a left outer join, use $lookup as shown here.
db.publisher.aggregate([{$lookup:{from:"authors",localField:"_id",
foreignField:"publisher_id",as:"authors_docs"}}])
Here is the output,
> db.publisher.aggregate([{$lookup:{from:"authors",localField:"_id",
... foreignField:"publisher_id",as:"authors_docs"}}])
{ "_id" : "Apress", "name" : "Apress", "founded" : 1999, "location" : "US", "authors_docs" : [ { "_id" : 123456, "title" : "Practical Apache Spark", "author" : [ "Subhashini Chellappan", "Dharanitharan Ganesan" ], "published_date" : ISODate("2018-11-30T00:00:00Z"), "pages" : 300, "language" : "English", "publisher_id" : "Apress" }, { "_id" : 456789, "title" : "MongoDB Recipes", "author" : [ "Subhashini Chellappan" ], "published_date" : ISODate("2018-11-30T00:00:00Z"), "pages" : 120, "language" : "English", "publisher_id" : "Apress" } ] }
>

Modeling Tree Structures

Let’s look at a data model that describes a tree-like structure.

Recipe 3-3. Model Tree Structure with Parent References

In this recipe, we are going to discuss a tree structure data model using parent references.

Problem

You want to create a data model for a tree structure with parent references.

Solution

Use the parent references pattern.

How It Works

Let’s follow the steps in this section to design a data model for a tree structure with parent references.

Step 1: Tree Structure with Parent References

The parent references pattern stores each tree node in a document; in addition to the tree node, the document stores the _id of the node’s parent.

Consider the following author tree model with parent references.
db.author.insert( { _id: "Practical Apache Spark", parent: "Books" } )
db.author.insert( { _id: "MongoDB Recipes", parent: "Books" } )
db.author.insert( { _id: "Books", parent: "Subhashini" } )
db.author.insert( { _id: "A Framework For Extracting Information From Web Using VTD-XML ' s XPath", parent: "Article" } )
db.author.insert( { _id: "Article", parent: "Subhashini" } )
db.author.insert( { _id: "Subhashini", parent: null } )
The tree structure of this author collection is shown in Figure 3-3.
../images/475461_1_En_3_Chapter/475461_1_En_3_Fig3_HTML.jpg
Figure 3-3

Tree structure for the author collection

The following command retrieves the parent of a node MongoDB Recipes.
db.author.findOne( { _id: "MongoDB Recipes" } ).parent
Here is the output,
> db.author.findOne( { _id: "MongoDB Recipes" } ).parent
Books
>
The next command retrieves the immediate children of the parent.
db.author.find( { parent: "Subhashini" } )
Here is the output,
> db.author.find( { parent: "Subhashini" } )
{ "_id" : "Books", "parent" : "Subhashini" }
{ "_id" : "Article", "parent" : "Subhashini" }
>

Step 2: Tree Structure with Child References

The child references pattern stores each tree node in a document; in addition to the tree node, the document stores in an array the _id value(s) of the node’s children.

Consider the following author tree model with child references.
db.author.insert( { _id: "Practical Apache Spark", children: [] } )
db.author.insert( { _id: "MongoDB", children: [] } )
db.author.insert( { _id: "Books", children: [ "Practical Apache Spark", "MongoDB Recipes" ] } )
db.author.insert( { _id: " A Framework For Extracting Information From Web Using VTD-XML ' s XPath ", children: [] } )
db.author.insert( { _id: "Article", children: [ " A Framework For Extracting Information From Web Using VTD-XML ' s XPath " ] } )
db.categories.insert( { _id: "Subhashini", children: [ "Books","Article" ] } )
The following command retrieves the immediate children of node Books.
db.author.findOne( { _id: "Books" } ).children
Here is the output,
> db.author.findOne( { _id: "Books" } ).children
[ "Practical Apache Spark", "MongoDB Recipes" ]
>
The next command retrieves the MongoDB Recipes parent node and its siblings.
db.author.find( { children: "MongoDB Recipes" } )
Here is the output,
> db.author.find( { children: "MongoDB Recipes" } )
{ "_id" : "Books", "children" : [ "Practical Apache Spark", "MongoDB Recipes" ] }
>

Child references are a good choice to work with tree storage when there are no subtree operations.

Step 3: Tree Structure with an Array of Ancestors

The array of ancestors pattern stores each tree node in a document; in addition to the tree node, the document stores in an array the _id value(s) of the node’s ancestors or path.

Consider this author tree model with an array of ancestors.
db.author.insert( { _id: "Practical Apache Spark", ancestors: [ "Subhashini", "Books" ], parent: "Books" } )
db.author.insert( { _id: "MongoDB Recipes", ancestors: [ "Subhashini", "Books" ], parent: "Books" } )
db.author.insert( { _id: "Books", ancestors: [ "Subhashini" ], parent: "Subhashini" } )
db.author.insert( { _id: " A Framework For Extracting Information From Web Using VTD-XML ", ancestors: [ "Subhashini", "Article" ], parent: "Article" } )
db.author.insert( { _id: "Article", ancestors: [ "Subhashini" ], parent: "Subhashini" } )
db.author.insert( { _id: "Subhashini", ancestors: [ ], parent: null } )

The array of ancestors field stores the ancestors field and reference to the immediate parent.

To retrieve the ancestors, use this command.
db.author.findOne( { _id: "MongoDB Recipes" } ).ancestors
Here is the output,
> db.author.findOne( { _id: "MongoDB Recipes" } ).ancestors
[ "Subhashini", "Books" ]
>
Use this command to find all its descendants.
db.author.find( { ancestors: "Subhashini" } )
Here is the output,
> db.author.find( { ancestors: "Subhashini" } )
{ "_id" : "Practical Apache Spark", "ancestors" : [ "Subhashini", "Books" ], "parent" : "Books" }
{ "_id" : "MongoDB Recipes", "ancestors" : [ "Subhashini", "Books" ], "parent" : "Books" }
{ "_id" : "Books", "ancestors" : [ "Subhashini" ], "parent" : "Subhashini" }
{ "_id" : " A Framework For Extracting Information From Web Using VTD-XML ", "ancestors" : [ "Subhashini", "Article" ], "parent" : "Article" }
{ "_id" : "Article", "ancestors" : [ "Subhashini" ], "parent" : "Subhashini" }
>

This pattern provides an efficient solution to find all descendants and the ancestors of a node. The array of ancestors pattern is a good choice for working with subtrees.

Aggregation

Aggregation operations group values from multiple documents and can perform variety of operations on the grouped values to return a single result. MongoDB provides following aggregation operations:
  • Aggregation pipeline.

  • Map-reduce function.

  • Single-purpose aggregation methods.

Aggregation Pipeline

The aggregation pipeline is a framework for data aggregation. It is modeled based on the concept of data processing pipelines. Pipelines execute an operation on some input and use that output as an input to the next operation. Documents enter a multistage pipeline that transforms them into an aggregated result.

Recipe 3-4. Aggregation Pipeline

In this recipe, we are going to discuss how the aggregation pipeline works.

Problem

You want to work with aggregation functions.

Solution

Use this method.
db.collection.aggregate()

How It Works

Let’s follow the steps in this section to work with the aggregation pipeline.

Step 1: Aggregation Pipeline

Execute the following orders collection to perform aggregation.
db.orders.insertMany([{custID:"10001",amount:500,status:"A"},{custID:"10001",amount:250,status:"A"},{custID:"10002",amount:200,status:"A"},{custID:"10001",amount: 300, status:"D"}]);
To project only customer IDs, use this syntax.
db.orders.aggregate( [ { $project : { custID : 1 , _id : 0 } } ] )
Here is the output,
> db.orders.aggregate( [ { $project : { custID : 1 , _id : 0 } } ] )
{ "custID" : "10001" }
{ "custID" : "10001" }
{ "custID" : "10002" }
{ "custID" : "10001" }
>
To group on custID and compute the sum of amount use the following command.
db.orders.aggregate({$group:{_id:"$custID",TotalAmount:{$sum:"$amount"}}});

In the preceding example, the value of the variable is accessed by using the $ sign.

Here is the output,
> db.orders.aggregate({$group:{_id:"$custID",TotalAmount:{$sum:"$amount"}}});
{ "_id" : "10002", "TotalAmount" : 200 }
{ "_id" : "10001", "TotalAmount" : 1050 }
>
To filter on status: "A" and then group it on "custID" and compute the sum of amount, use the following command.
db.orders.aggregate({$match:{status:"A"}},{$group:{_id:"$custID",TotalAmount:{ $sum:"$amount"}}});
Here is the ouput:
>db.orders.aggregate({$match:{status:"A"}},{$group:{_id:"$custID",TotalAmount:{ $sum:"$amount"}}});
{ "_id" : "10002", "TotalAmount" : 200 }
{ "_id" : "10001", "TotalAmount" : 750 }
>
To group on "custID" and compute the average of the amount for each group, use this command.
db.orders.aggregate({$group:{_id:"$custID",AverageAmount:{$avg:"$amount"}}});
Here is the output,
> db.orders.aggregate({$group:{_id:"$custID",AverageAmount:{$avg:"$amount"}}});
{ "_id" : "10002", "AverageAmount" : 200 }
{ "_id" : "10001", "AverageAmount" : 350 }
>

Map-Reduce

MongoDB also provides map-reduce to perform aggregation operations. There are two phases in map-reduce: a map stage that processes each document and outputs one or more objects and a reduce stage that combines the output of the map operation.

A custom JavaScript function is used to perform map and reduce operations. Map-reduce is less efficient and more complex compared to the aggregation pipeline.

Recipe 3-5. Map-Reduce

In this recipe, we are going to discuss how to perform aggregation operations using map-reduce.

Problem

You want to work with aggregation operations using map-reduce.

Solution

Use a customized JavaScript function.

How It Works

Let’s follow the steps in this section to work with map-reduce.

Step 1: Map-Reduce

Execute the following orders collection to perform aggregation operations.
db.orders.insertMany([{custID:"10001",amount:500,status:"A"},{custID:"10001",amount:250,status:"A"},{custID:"10002",amount:200,status:"A"},{custID:"10001",amount: 300, status:"D"}]);

To filter on status:"A" and then group it on custID and compute the sum of amount, use the following map-reduce function.

Map function:
var map = function(){
emit (this.custID, this.amount);}
Reduce function:
var reduce = function(key, values){ return Array.sum(values) ; }
To execute the query:
db.orders.mapReduce(map, reduce,{out: "order_totals",query:{status:"A"}});
db.order_totals.find()
Here is the output,
> var map = function(){
... emit (this.custID, this.amount);}
> var reduce = function(key, values){ return Array.sum(values) ; }
> db.orders.mapReduce(map, reduce,{out: "order_totals",query:{status:"A"}});
{
        "result" : "order_totals",
        "timeMillis" : 82,
        "counts" : {
                "input" : 3,
                "emit" : 3,
                "reduce" : 1,
                "output" : 2
        },
        "ok" : 1
}
> db.order_totals.find()
{ "_id" : "10001", "value" : 750 }
{ "_id" : "10002", "value" : 200 }
>

Single-Purpose Aggregation Operations

MongoDB also provides single-purpose aggregation operations such as db.collection.count() and db.collection.distinct(). These aggregate operations aggregate documents from a single collection. This functionality provides simple access to common aggregation processes.

Recipe 3-6. Single-Purpose Aggregation Operations

In this recipe, we are going to discuss how to use single-purpose aggregation operations.

Problem

You want to work with single-purpose aggregation operations.

Solution

Use these commands.
db.collection.count()
db.collection.distinct()

How It Works

Let’s follow the steps in this section to work with single-purpose aggregation operations.

Step 1: Single-Purpose Aggregation Operations

Execute the following orders collection to perform single-purpose aggregation operations.
db.orders.insertMany([{custID:"10001",amount:500,status:"A"},{custID:"10001",amount:250,status:"A"},{custID:"10002",amount:200,status:"A"},{custID:"10001",amount: 300, status:"D"}]);
Use the following syntax to find a distinct "custID".
db.orders.distinct("custID")
Here is the output,
> db.orders.distinct("custID")
[ "10001", "10002" ]
>
To count the number of documents, use this code.
db.orders.count()
Here is the output,
> db.orders.count()
4
>

SQL Aggregation Terms and Corresponding MongoDB Aggregation Operators

Table 3-1 shows SQL aggregation terms and their corresponding MongoDB aggregation operators.
Table 3-1

SQL Aggregation Terms and Corresponding MongoDB Operators

SQL Term

MongoDB Operator

WHERE

$match

GROUP BY

$group

HAVING

$match

SELECT

$project

ORDER BY

$sort

LIMIT

$limit

SUM

$sum

COUNT

$sum

JOIN

$lookup

Recipe 3-7. Matching SQL Aggregation to MongoDB Aggregation Operations

In this recipe, we are going to discuss examples of matching MongoDB operations to equivalent SQL aggregation terms.

Problem

You want to understand the equivalent of MongoDB queries for any SQL queries.

Solution

Refer to Table 3-1 and use the equivalent MongoDB operator for a respective SQL clause.

How It Works

Let’s follow the steps in this section to understand the MongoDB queries for certain SQL operations.

Step 1: Converting SQL Aggregation Operations to MongoDB

Execute the following query to check the details of the orders collection.
> db.orders.find()
Here is the output,
> db.orders.find()
{ "_id" : ObjectId("5d636112eea2dccfdeafa522"), "custID" : "10001", "amount" : 500, "status" : "A" }
{ "_id" : ObjectId("5d636112eea2dccfdeafa523"), "custID" : "10001", "amount" : 250, "status" : "A" }
{ "_id" : ObjectId("5d636112eea2dccfdeafa524"), "custID" : "10002", "amount" : 200, "status" : "A" }
{ "_id" : ObjectId("5d636112eea2dccfdeafa525"), "custID" : "10001", "amount" : 300, "status" : "D" }

Now, let’s find the count of records from the orders colletion.

Imagine this is the same as an orders table in any RDBMS and the SQL to get the count of records from the table is as follows:
SELECT COUNT(*) AS count FROM orders
Use the following query to get the count of documents from the collection in MongoDB.
> db.orders.aggregate( [ { $group: {  _id: null,  count: { $sum: 1 } } } ] )
Here is the output,
> db.orders.aggregate( [ { $group: {  _id: null,  count: { $sum: 1 } } } ] )
{ "_id" : null, "count" : 4 }
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.116.50.87