MongoDB is sometimes referred to as a “schemaless” database, meaning that it does not enforce a particular structure on documents in a collection. It is perfectly legal (though of questionable utility) to store every object in your application in the same collection, regardless of its structure. In a well-designed application, however, it is more frequently the case that a collection will contain documents of identical, or closely related, structure. When all the documents in a collection are similarly, but not identically, structured, we call this a polymorphic schema.
In this chapter, we’ll explore the various reasons for using a polymorphic schema, the types of data models that they can enable, and the methods of such modeling. You’ll learn how to use polymorphic schemas to build powerful and flexible data models.
In the world of object-oriented (OO) programming, developers have become accustomed to the ability to have different classes of objects that share data and behavior through inheritance. In particular, OO languages allow functions to manipulate child classes as though they were their parent classes, calling methods that are defined in the parent but may have been overridden with different implementations in the child. This feature of OO languages is referred to as polymorphism.
Relational databases, with their focus on tables with a fixed schema, don’t support this feature all that well. It would be useful in such cases if our relational database management system (RDBMS) allowed us to define a related set of schemas for a table so that we could store any object in our hierarchy in the same table (and retrieve it using the same mechanism).
For instance, consider a content management system that needs to store wiki pages and photos. Many of the fields stored for wiki pages and photos will be similar, including:
Some of the fields, of course, will be distinct. The photo doesn’t need to have a long markup field containing its text, and the wiki page doesn’t need to have a large binary field containing the photo data. In a relational database, there are several options for modeling such an inheritance hierarchy:
If we assume the polymorphic approach, we might end up with a schema similar to that shown in Table 2-1, Table 2-2, and Table 2-3.
Table 2-1. Nodes table

node_id | title      | url        | type
--------|------------|------------|------
1       | Welcome    | /          | page
2       | About      | /about     | page
3       | Cool Photo | /photo.jpg | photo
Table 2-2. Pages table

node_id | text
--------|-------------------------------------
1       | Welcome to my wonderful wiki.
2       | This is text that is about the wiki.

Table 2-3. Photos table

node_id | content
--------|------------
3       | Binary(...)
In MongoDB, on the other hand, we can store all of our content node types in the same collection, storing only relevant fields in each document:
// "Page" document (stored in "nodes" collection)
{
    _id: 1,
    title: "Welcome",
    url: "/",
    type: "page",
    text: "Welcome to my wonderful wiki."
}
...
// "Photo" document (also in "nodes" collection)
{
    _id: 3,
    title: "Cool Photo",
    url: "/photo.jpg",
    type: "photo",
    content: Binary(...)
}
If we use such a polymorphic schema in MongoDB, we can use the same collection to perform queries on common fields shared by all content nodes, as well as queries for only a particular node type. For instance, when deciding what to display for a given URL, the CMS needs to look up the node by URL and then perform type-specific formatting to display it. In a relational database, we might execute something like the following:
SELECT nodes.node_id, nodes.title, nodes.type,
       pages.text, photos.content
FROM nodes
    LEFT JOIN pages ON nodes.node_id = pages.node_id
    LEFT JOIN photos ON nodes.node_id = photos.node_id
WHERE url = :url;
Notice in particular that we are performing a three-way join, which will slow down the query substantially. Of course, we could have chosen the single-table model, in which case our query is quite simple:
SELECT * FROM nodes WHERE url = :url;
In the single-table inheritance model, however, we still retain the drawback of large amounts of wasted space in each row. If we had chosen the concrete-table inheritance model, we would actually have to perform a query for each type of content node:
SELECT * FROM pages WHERE url = :url;
SELECT * FROM photos WHERE url = :url;
In MongoDB, the query is as simple as the single-table model, with the efficiency of the concrete-table model:
db.nodes.find_one({'url': url})
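With the type field in hand, the type-specific formatting step reduces to a dictionary lookup in application code. The sketch below illustrates the idea; the render functions, the HTML they produce, and the RENDERERS table are hypothetical names invented for this example, not part of any CMS API:

```python
# Hypothetical per-type renderers; the HTML shown is illustrative only.
def render_page(node):
    return '<h1>%s</h1><div>%s</div>' % (node['title'], node['text'])

def render_photo(node):
    return '<h1>%s</h1><img src="%s">' % (node['title'], node['url'])

# Dispatch on the "type" field stored in every node document.
RENDERERS = {
    'page': render_page,
    'photo': render_photo,
}

def render_node(node):
    return RENDERERS[node['type']](node)
```

A single find_one by URL followed by this lookup replaces the three-way join.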
When developing a database-driven application, one concern that programmers must take into account is schema evolution. Typically, this is taken care of using a set of migration scripts that upgrade the database from one version of a schema to another. Before an application is actually deployed with “live” data, the “migrations” may consist of dropping the database and re-creating it with a new schema. Once your application is live and populated with customer data, however, schema changes require complex migration scripts to change the format of data while preserving its content.
Relational databases typically support migrations via the ALTER TABLE statement, which allows the developer to add or remove columns from a table. For instance, suppose we wanted to add a short_description field to our nodes table from Table 2-1. The SQL for this operation would be similar to the following:
ALTER TABLE nodes
    ADD COLUMN short_description varchar(255);
The main drawback to the ALTER TABLE statement is that it can be time-consuming to run on a table with a large number of rows, and it may require that your application experience some downtime while the migration executes, since the ALTER TABLE statement needs to hold a lock that your application requires to execute.
In MongoDB, we have the option of doing something similar by updating all documents in a collection to reflect a new field:
db.nodes.update(
    {},
    {$set: {short_description: ''}},
    false,  // upsert
    true    // multi
);
This approach, however, has the same drawbacks as an ALTER TABLE statement: it can be slow, and it can negatively impact the performance of your application. Another option for MongoDB users is to update your application to account for the absence of the new field. In Python, we might write the following code to handle retrieving both "old style" documents (without a short_description field) as well as "new style" documents (with a short_description field):
def get_node_by_url(url):
    node = db.nodes.find_one({'url': url})
    node.setdefault('short_description', '')
    return node
Once we have the code in place to handle documents with or without the short_description field, we might choose to gradually migrate the collection in the background while our application is running. For instance, we might migrate 100 documents at a time:
def add_short_descriptions():
    # Collect the _ids of up to 100 documents still missing the field;
    # passing the raw cursor to $in would not work, so we extract the ids.
    node_ids_to_migrate = [
        node['_id'] for node in db.nodes.find(
            {'short_description': {'$exists': False}},
            {'_id': 1}).limit(100)]
    db.nodes.update(
        {'_id': {'$in': node_ids_to_migrate}},
        {'$set': {'short_description': ''}},
        multi=True)
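The batching logic itself is independent of the driver, so we can check it end to end against an in-memory list of dicts standing in for the collection. The migrate_batch helper below is a stand-in written for this sketch, not a real driver call:

```python
# In-memory stand-in for the nodes collection; each dict is one document.
nodes = [{'_id': i, 'url': '/%d' % i} for i in range(250)]

def migrate_batch(docs, batch_size=100):
    """Give up to batch_size documents that still lack the field a
    default short_description; return the number migrated."""
    batch = [d for d in docs if 'short_description' not in d][:batch_size]
    for doc in batch:
        doc['short_description'] = ''
    return len(batch)

# Run repeatedly, as a background job would, until nothing is left.
while migrate_batch(nodes):
    pass
```

After three passes (100, 100, and 50 documents), every document carries the field and the loop terminates.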
Once the entire collection is migrated, we can update the application code that loads the node by URL to omit the default:

def get_node_by_url(url):
    node = db.nodes.find_one({'url': url})
    return node
There is one major drawback to MongoDB’s lack of schema enforcement, and that is storage efficiency. In a RDBMS, since all the column names and types are defined at the table level, this information does not need to be replicated in each row. MongoDB, by contrast, doesn’t know, at the collection level, what fields are present in each document, nor does it know their types, so this information must be stored on a per-document basis. In particular, if you are storing small values (integers, datetimes, or short strings) in your documents and are using long property names, then MongoDB will tend to use a much larger amount of storage than an RDBMS would for the same data. One approach to mitigating this in MongoDB is to use short field names in your documents, but this approach can make it more difficult to inspect the database directly from the shell.
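To see how much the field names themselves cost, we can estimate document sizes with a rough BSON calculation. The estimator below approximates the BSON encoding for flat documents of int32 and string values (one type byte per element, the field name as a NUL-terminated cstring, 4 + length + 1 bytes per string value); it is a simplification for illustration, not a driver's actual encoder:

```python
def bson_size_estimate(doc):
    """Approximate BSON size of a flat document of int and str values."""
    size = 4 + 1  # int32 length prefix plus trailing NUL byte
    for name, value in doc.items():
        size += 1 + len(name.encode('utf-8')) + 1  # type byte + cstring name
        if isinstance(value, int):
            size += 4                                   # int32 payload
        else:
            size += 4 + len(value.encode('utf-8')) + 1  # length-prefixed string
    return size

long_names = bson_size_estimate({'short_description': 'x', 'gb_capacity': 1})
short_names = bson_size_estimate({'sd': 'x', 'gb': 1})
```

The 24-byte difference between the two is pure field-name overhead, and it is paid again in every document in the collection.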
One approach that can help with storage efficiency and with migrations is the use of a MongoDB object-document mapper (ODM). There are several ODMs available for Python, including MongoEngine, MongoKit, and Ming. In Ming, for example, you might create a “Photo” model as follows:
class Photo(Document):
    ...
    short_description = Field('sd', str, if_missing='')
    ...
Using such a schema, Ming will lazily migrate documents as they are loaded from the database, as well as mapping the short_description property (in Python) to the sd field (in BSON).
In some applications, we may want to store semi-structured domain data. For instance, we may have a product table in a database where products may have various attributes, but not all products have all attributes. One approach to such modeling, of course, is to define all the product classes we're interested in storing and use the object-oriented mapping approach just described. There are, however, some pitfalls to avoid when this approach meets data in the real business world:
For instance, suppose we are storing a database of disk drives. Although all drives in our inventory specify capacity, some may also specify the cache size, while others omit it. In this case, we can use a generic properties subdocument containing the variable fields:
{
    _id: ObjectId(...),
    price: 499.99,
    title: 'Big and Fast Disk Drive',
    gb_capacity: 1000,
    properties: {
        'Seek Time': '5ms',
        'Rotational Speed': '15k RPM',
        'Transfer Rate': '...',
        ...
    }
}
The drawback to storing semi-structured data in this way is that it’s difficult to perform queries and indexing on fields that you wish your application to be ignorant of. Another approach you might use is to keep an array of property-value pairs:
{
    _id: ObjectId(...),
    price: 499.99,
    title: 'Big and Fast Disk Drive',
    gb_capacity: 1000,
    properties: [
        ['Seek Time', '5ms'],
        ['Rotational Speed', '15k RPM'],
        ['Transfer Rate', '...'],
        ...
    ]
}
If we use the array of properties approach, we can instruct MongoDB to index the properties field with the following command:
db.products.ensure_index('properties')
Once this field is indexed, our queries simply specify the property-value pairs we’re interested in:
db.products.find({'properties': ['Seek Time', '5ms']})
Doing the equivalent operation in a relational database requires more cumbersome approaches, such as entity-attribute-value schemas, covered in more detail in Entity attribute values.
The flexibility that MongoDB offers by not enforcing a particular schema for all documents in a collection provides several benefits to the application programmer over an RDBMS solution:
Effectively using MongoDB requires recognizing when a polymorphic schema may benefit your application and not over-normalizing your schema by replicating the same data layout you might use for a relational database system.