Chapter 2. Polymorphic Schemas

MongoDB is sometimes referred to as a “schemaless” database, meaning that it does not enforce a particular structure on documents in a collection. It is perfectly legal (though of questionable utility) to store every object in your application in the same collection, regardless of its structure. In a well-designed application, however, it is more frequently the case that a collection will contain documents of identical, or closely related, structure. When all the documents in a collection are similarly, but not identically, structured, we call this a polymorphic schema.

In this chapter, we’ll explore the various reasons for using a polymorphic schema, the kinds of data models it enables, and the methods for performing such modeling. You’ll learn how to use polymorphic schemas to build powerful and flexible data models.

Polymorphic Schemas to Support Object-Oriented Programming

In the world of object-oriented (OO) programming, developers have become accustomed to the ability to have different classes of objects that share data and behavior through inheritance. In particular, OO languages allow functions to manipulate child classes as though they were their parent classes, calling methods that are defined in the parent but may have been overridden with different implementations in the child. This feature of OO languages is referred to as polymorphism.
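For example, a function written against a parent class can call a method that each child overrides, without knowing which concrete class it receives (a minimal sketch in Python; the class and method names here are hypothetical):

class Node:
    def render(self):
        raise NotImplementedError

class Page(Node):
    def render(self):
        return '<article>wiki text ...</article>'

class Photo(Node):
    def render(self):
        return '<img src="/photo.jpg">'

def display(node):
    # Works for any Node subclass; the child's override is invoked.
    return node.render()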

Relational databases, with their focus on tables with a fixed schema, don’t support this feature all that well. It would be useful in such cases if our relational database management system (RDBMS) allowed us to define a related set of schemas for a table so that we could store any object in our hierarchy in the same table (and retrieve it using the same mechanism).

For instance, consider a content management system that needs to store wiki pages and photos. Many of the fields stored for wiki pages and photos will be similar, including:

  • The title of the object
  • A locator that places the object in the URL space of the CMS
  • Access controls for the object

Some of the fields, of course, will be distinct. The photo doesn’t need to have a long markup field containing its text, and the wiki page doesn’t need to have a large binary field containing the photo data. In a relational database, there are several options for modeling such an inheritance hierarchy:

  • We could create a single table containing a union of all the fields that any object in the hierarchy might contain, but this wastes space, since no row will populate all of its fields.
  • We could create a table for each concrete class (in this case, photo and wiki page), but this introduces redundancy into our schema (everything photos and wiki pages have in common) and complicates any query that needs to retrieve all content “nodes” (both photos and wiki pages).
  • We could create a common table for a base content “node” class that we join with the appropriate concrete table. This is referred to as polymorphic inheritance modeling, and it removes the redundancy of the concrete-table approach without wasting the space of the single-table approach.

If we assume the polymorphic approach, we might end up with a schema similar to that shown in Table 2-1, Table 2-2, and Table 2-3.

Table 2-1. “Nodes” table

node_id   title        url          type
1         Welcome      /            page
2         About        /about       page
3         Cool Photo   /photo.jpg   photo

Table 2-2. “Pages” table

node_id   text
1         Welcome to my wonderful wiki.
2         This is text that is about the wiki.

Table 2-3. “Photos” table

node_id   content
3         … binary data …

In MongoDB, on the other hand, we can store all of our content node types in the same collection, storing only relevant fields in each document:

// "Page" document (stored in "nodes" collection")
{
  _id: 1,
  title: "Welcome",
  url: "/",
  type: "page",
  text: "Welcome to my wonderful wiki."
}
...

// "Photo" document (also in "nodes" collection)
{
  _id: 3,
  title: "Cool Photo",
  url: "/photo.jpg",
  type: "photo",
  content: Binary(...)
}

If we use such a polymorphic schema in MongoDB, we can use the same collection to perform queries on common fields shared by all content nodes, as well as queries for only a particular node type. For instance, when deciding what to display for a given URL, the CMS needs to look up the node by URL and then perform type-specific formatting to display it. In a relational database, we might execute something like the following:

SELECT nodes.node_id, nodes.title, nodes.type,
       pages.text, photos.content
  FROM nodes
    LEFT JOIN pages ON nodes.node_id=pages.node_id
    LEFT JOIN photos ON nodes.node_id=photos.node_id
  WHERE url=:url;

Notice in particular that we are performing a three-way join, which will slow down the query substantially. Of course, we could have chosen the single-table model, in which case our query is quite simple:

SELECT * FROM nodes WHERE url=:url;

In the single-table inheritance model, however, we still retain the drawback of large amounts of wasted space in each row. If we had chosen the concrete-table inheritance model, we would actually have to perform a query for each type of content node:

SELECT * FROM pages WHERE url=:url;
SELECT * FROM photos WHERE url=:url;

In MongoDB, the query is as simple as the single-table model, with the efficiency of the concrete-table model:

db.nodes.find_one({'url': url})
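In application code, we can then dispatch on the type field after that single lookup (a sketch in Python; render_page and render_photo are hypothetical application helpers, not part of any driver API):

def display_node(url):
    # One query retrieves any node type; the 'type' field tells us
    # how to format it.
    node = db.nodes.find_one({'url': url})
    if node['type'] == 'page':
        return render_page(node['title'], node['text'])
    elif node['type'] == 'photo':
        return render_photo(node['title'], node['content'])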

Polymorphic Schemas Enable Schema Evolution

When developing a database-driven application, one concern that programmers must take into account is schema evolution. Typically, this is taken care of using a set of migration scripts that upgrade the database from one version of a schema to another. Before an application is actually deployed with “live” data, the “migrations” may consist of dropping the database and re-creating it with a new schema. Once your application is live and populated with customer data, however, schema changes require complex migration scripts to change the format of data while preserving its content.

Relational databases typically support migrations via the ALTER TABLE statement, which allows the developer to add or remove columns from a table. For instance, suppose we wanted to add a short description field to our nodes table from Table 2-1. The SQL for this operation would be similar to the following:

ALTER TABLE nodes
   ADD COLUMN short_description varchar(255);

The main drawbacks to the ALTER TABLE statement are that it can be time-consuming to run on a table with a large number of rows, and that it may require your application to experience some downtime while the migration executes, since the ALTER TABLE statement needs to hold a lock that your application requires to execute.

In MongoDB, we have the option of doing something similar by updating all documents in a collection to reflect a new field:

db.nodes.update(
    {},
    {$set: { short_description: '' } },
    false, // upsert
    true // multi
    );
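The equivalent in Python looks like the following (a sketch; update_many is the multi-document update method in current PyMongo releases, while older drivers spelled this update(..., multi=True)):

# Add the new field to any document that lacks it; filtering on
# $exists makes the operation safe to re-run.
db.nodes.update_many(
    {'short_description': {'$exists': False}},
    {'$set': {'short_description': ''}})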

This approach, however, has the same drawbacks as an ALTER TABLE statement: it can be slow, and it can impact the performance of your application negatively.

Another option for MongoDB users is to update your application to account for the absence of the new field. In Python, we might write the following code to handle retrieving both “old style” documents (without a short_description field) as well as “new style” documents (with a short_description field):

def get_node_by_url(url):
    node = db.nodes.find_one({'url': url})
    node.setdefault('short_description', '')
    return node

Once we have the code in place to handle documents with or without the short_description field, we might choose to gradually migrate the collection in the background, while our application is running. For instance, we might migrate 100 documents at a time:

def add_short_descriptions():
    # Collect the _ids of up to 100 unmigrated documents, then set
    # the default value on just that batch.
    node_ids_to_migrate = [
        node['_id'] for node in db.nodes.find(
            {'short_description': {'$exists': False}},
            {'_id': 1}).limit(100)]
    db.nodes.update(
        {'_id': {'$in': node_ids_to_migrate}},
        {'$set': {'short_description': ''}},
        multi=True)
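A small driver loop can then run this migration in the background until no unmigrated documents remain (a sketch; the sleep interval is an arbitrary choice to yield time to production traffic between batches):

import time

# Keep migrating batches of 100 until every document has the field.
while db.nodes.find_one({'short_description': {'$exists': False}}):
    add_short_descriptions()
    time.sleep(0.1)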

Once the entire collection has been migrated, we can update our application code that loads the node by URL to omit the default:

def get_node_by_url(url):
    node = db.nodes.find_one({'url': url})
    return node

Storage (In-)Efficiency of BSON

There is one major drawback to MongoDB’s lack of schema enforcement, and that is storage efficiency. In an RDBMS, since all the column names and types are defined at the table level, this information does not need to be replicated in each row. MongoDB, by contrast, doesn’t know, at the collection level, what fields are present in each document, nor does it know their types, so this information must be stored on a per-document basis. In particular, if you are storing small values (integers, datetimes, or short strings) in your documents and are using long property names, then MongoDB will tend to use a much larger amount of storage than an RDBMS would for the same data. One approach to mitigating this in MongoDB is to use short field names in your documents, but this approach can make it more difficult to inspect the database directly from the shell.
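You can see the cost of long field names directly by encoding the same values both ways (a quick sketch using the bson package that ships with PyMongo; recent driver versions also expose a top-level bson.encode function):

from bson import BSON

# The same values encoded under long and short field names; the
# size difference is the name bytes repeated in every document.
long_doc = {'short_description': '', 'rotational_speed': 15000}
short_doc = {'sd': '', 'rs': 15000}
print(len(BSON.encode(long_doc)))   # larger
print(len(BSON.encode(short_doc)))  # smaller, identical values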

Object-Document Mappers

One approach that can help with storage efficiency and with migrations is the use of a MongoDB object-document mapper (ODM). There are several ODMs available for Python, including MongoEngine, MongoKit, and Ming. In Ming, for example, you might create a “Photo” model as follows:

class Photo(Document):
    ...
    short_description = Field('sd', str, if_missing='')
    ...

Using such a schema, Ming will lazily migrate documents as they are loaded from the database, as well as renaming the short_description field (in Python) to the sd property (in BSON).
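To make the mechanics concrete, a hand-rolled version of that lazy migration in plain PyMongo might look like the following (a sketch; Ming performs this mapping and defaulting for you, and the "photos" collection name is an assumption):

# Map the short BSON name 'sd' to the long Python-side name,
# supplying the default for documents written before the migration.
def load_photo(photo_id):
    doc = db.photos.find_one({'_id': photo_id})
    doc['short_description'] = doc.pop('sd', '')
    return doc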

Polymorphic Schemas Support Semi-Structured Domain Data

In some applications, we may want to store semi-structured domain data. For instance, we may have a product table in a database where products may have various attributes, but not all products have all attributes. One approach to such modeling, of course, is to define all the product classes we’re interested in storing and use the object-oriented mapping approach just described. There are, however, some pitfalls to avoid when this approach meets data in the real business world:

  • Product hierarchies may change frequently as items are reclassified
  • Many products, even within the same class, may have incomplete data

For instance, suppose we are storing a database of disk drives. Although all drives in our inventory specify capacity, some may also specify the cache size, while others omit it. In this case, we can use a generic properties subdocument containing the variable fields:

{
  _id: ObjectId(...),
  price: 499.99,
  title: 'Big and Fast Disk Drive',
  gb_capacity: 1000,
  properties: {
    'Seek Time': '5ms',
    'Rotational Speed': '15k RPM',
    'Transfer Rate': '...'
    ... }
}
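Queries against these ad hoc properties must use dotted field names, and each distinct property needs its own index (a sketch, following the ensure_index style used later in this section):

db.products.find({'properties.Seek Time': '5ms'})
db.products.ensure_index('properties.Seek Time')  # one index per property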

The drawback to storing semi-structured data in this way is that it’s difficult to perform queries and indexing on fields that you wish your application to be ignorant of. Another approach you might use is to keep an array of property-value pairs:

{
  _id: ObjectId(...),
  price: 499.99,
  title: 'Big and Fast Disk Drive',
  gb_capacity: 1000,
  properties: [
    ['Seek Time', '5ms' ],
    ['Rotational Speed', '15k RPM'],
    ['Transfer Rate', '...'],
    ... ]
}

If we use the array of properties approach, we can instruct MongoDB to index the properties field with the following command:

db.products.ensure_index('properties')

Once this field is indexed, our queries simply specify the property-value pairs we’re interested in:

db.products.find({'properties': ['Seek Time', '5ms']})
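To match more than one property at a time, the standard $all operator works over these exact array elements (a sketch):

# Find products matching both properties at once.
db.products.find({'properties': {'$all': [
    ['Seek Time', '5ms'],
    ['Rotational Speed', '15k RPM']]}})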

Doing the equivalent operation in a relational database requires more cumbersome approaches, such as entity-attribute-value schemas, covered in more detail in “Entity attribute values”.

Conclusion

The flexibility that MongoDB offers by not enforcing a particular schema for all documents in a collection provides several benefits to the application programmer over an RDBMS solution:

  • Better mapping of object-oriented inheritance and polymorphism
  • Simpler migrations between schemas with less application downtime
  • Better support for semi-structured domain data

Effectively using MongoDB requires recognizing when a polymorphic schema may benefit your application and not over-normalizing your schema by replicating the same data layout you might use for a relational database system.
