Chapter 6. Managing the Data

Planning database operations is one of the most important phases of data model maintenance. In MongoDB, depending on the nature of the data, we can segregate an application's operations by functionality or by geographic groups.

In this chapter, we will revisit some concepts already introduced in Chapter 5, Optimizing Queries, such as read preferences and write concerns. This time, we will focus on how these features can help us distribute operations across a MongoDB deployment according to the application's characteristics, for instance, by separating read and write operations, or by ensuring consistency through write propagation among replica set nodes.

You will also see how it is possible to have collections that support a high read/write throughput, which is essential for some applications, by exploring special collection properties.

Therefore, in this chapter, you will learn about:

  • Operational segregation
  • Capped collections
  • Data self-expiration

Operational segregation

So far, we have seen how our application's queries can influence, in general, our decisions regarding document design. However, there is more to the read preference and write concern concepts than we have explored so far.

MongoDB offers a set of features that allow us to segregate application operations by functional or geographic groups. With functional segregation, we can direct an application responsible for report generation to use only a certain MongoDB deployment. With geographic segregation, we can target operations based on the geographic distance between the application and the members of a MongoDB deployment.

Giving priority to read operations

It is not hard to imagine that, once an application is built, the marketing or sales team will ask for a new report of the application's data, and, of course, it will be described as the essential report. We know how risky it is to plug such reporting tools directly into our main database. Besides competing with other applications for the same data, this type of workload can overload the database with complex queries that manipulate a huge amount of data.

This is why we should direct operations that handle huge volumes of data and demand heavier processing to dedicated MongoDB deployments. We make applications target the right deployments through read preferences, as you already saw in Chapter 5, Optimizing Queries.

By default, an application always reads from the primary node of the replica set. This behavior guarantees that the application reads the most recent data, which ensures consistency. However, if the intention is to reduce the load on the primary node, and we can accept eventual consistency, it is possible to redirect read operations to the secondary nodes of the replica set by enabling the secondary or secondaryPreferred mode.

Besides reducing the load on the primary node, giving preference to read operations on secondary nodes is crucial when we have applications distributed across multiple datacenters and, consequently, replica sets that are geographically distributed. By setting the nearest mode, we make it possible to execute the read operation on the node with the lowest network latency, whether it is a primary or a secondary.

Finally, we can substantially increase read availability by using the primaryPreferred mode, which reads from the primary node but falls back to a secondary node when the primary is unavailable.
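
To make these modes concrete, here is a minimal sketch in the mongo shell, using the customers collection that appears later in this chapter; the queries themselves are illustrative only:

// Read from a secondary member, accepting eventually consistent data
db.customers.find({username: "customer1"}).readPref("secondaryPreferred")

// Read from the member with the lowest network latency, primary or secondary
db.customers.find({username: "customer1"}).readPref("nearest")

// Read from the primary, but fall back to a secondary if it is unavailable;
// this preference applies to the whole connection
db.getMongo().setReadPref("primaryPreferred")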

But what if, in addition to specifying a read preference mode such as primary or secondary, we could choose exactly which instances an operation targets? For instance, think of a replica set distributed across two different locations, where each instance has a different type of physical storage. In addition to this, we want to ensure that a write operation is performed on at least one instance in each datacenter that has an SSD disk. Is this possible? The answer is yes!

This is possible due to tag sets. Tag sets are a configuration property that gives you control over write concerns and read preferences for a replica set. They consist of a document containing zero or more tags, stored in the replica set configuration document in the members[n].tags field.

In the case of read preferences, tag sets let you target read operations at specific members of a replica set. The tag set values are applied when the replica set member for the read operation is chosen.

Tag sets affect only the following read preference modes: primaryPreferred, secondary, secondaryPreferred, and nearest. They have no effect on the primary mode. In practice, this means that tag sets only influence the choice of a secondary member, unless they are used in combination with the nearest mode, where the node with the lowest latency can also be the primary.

Before we see how to do this configuration, you need to understand how the replica set member is chosen. The choice is made by the client driver that performs the operation or, in the case of a sharded cluster, by the mongos instance.

The selection process works in the following way:

  1. A list of the members, both primary and secondary, is created.
  2. If a tag set is specified, the members that do not match the specification are skipped.
  3. The member that is nearest to the application (the one with the lowest ping time) is determined.
  4. A list of eligible members is built, containing those whose latency falls within an acceptable window of the nearest member. This window can be defined through the secondaryAcceptableLatencyMS driver property or, in the case of a sharded cluster, through the --localThreshold or localPingThresholdMs options of mongos (see the configuration sketch after this list). If none of these configurations are set, the default value is 15 milliseconds.

    Tip

    You can find more about this configuration in the MongoDB manual reference at http://docs.mongodb.org/manual/reference/configuration-options/#replication.localPingThresholdMs.

  5. A member is randomly selected from this list, and the read operation is performed on it.
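
As referenced in step 4, the following is a minimal sketch of how this latency window could be adjusted for a mongos instance, either on the command line or in its configuration file; the 20-millisecond value and the config server address are only illustrative assumptions:

# Command-line option (cfg0.example.net:27019 is a placeholder config server)
mongos --configdb cfg0.example.net:27019 --localThreshold 20

# Equivalent setting in the mongos YAML configuration file
replication:
   localPingThresholdMs: 20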

The tag set configuration is as simple as any other MongoDB configuration. As always, we use documents to describe the configuration and, as stated before, tag sets are a field of the replica set configuration document. This document can be retrieved by running the rs.conf() method on a replica set member.

Tip

You can find out more about the conf() method in the MongoDB documentation at http://docs.mongodb.org/manual/reference/method/rs.conf/#rs.conf.

The following document shows the current configuration of our replica set, rs1, obtained by executing the rs.conf() command in the mongo shell connected to the primary node:

rs1:PRIMARY> rs.conf()
{ // This is the replica set configuration document

   "_id" : "rs1",
   "version" : 4,
   "members" : [
      {
         "_id" : 0,
         "host" : "172.17.0.2:27017"
      },
      {
         "_id" : 1,
         "host" : "172.17.0.3:27017"
      },
      {
         "_id" : 2,
         "host" : "172.17.0.4:27017"
      }
   ]
}

To create a tag set configuration for each node of the replica set, we run the following sequence of commands in the mongo shell connected to the primary:

First, we will get the replica set configuration document and store it in the cfg variable:

rs1:PRIMARY> cfg = rs.conf()
{
   "_id" : "rs1",
   "version" : 4,
   "members" : [
      {
         "_id" : 0,
         "host" : "172.17.0.7:27017"
      },
      {
         "_id" : 1,
         "host" : "172.17.0.5:27017"
      },
      {
         "_id" : 2,
         "host" : "172.17.0.6:27017"
      }
   ]
}

Then, using the cfg variable, we will assign a document to the members[n].tags field of each of our three replica set members:

rs1:PRIMARY> cfg.members[0].tags = {"media": "ssd", "application": "main"}
rs1:PRIMARY> cfg.members[1].tags = {"media": "ssd", "application": "main"}
rs1:PRIMARY> cfg.members[2].tags = {"media": "ssd", "application": "report"}

Finally, we call the reconfig() method, passing in our new configuration document stored in the cfg variable to reconfigure our replica set:

rs1:PRIMARY> rs.reconfig(cfg)

If everything is correct, we will see this output in the mongo shell:

{ "ok" : 1 }

To check the configuration, we can re-execute the command rs.conf(). This will return the following:

rs1:PRIMARY> rs.conf()
{
   "_id" : "rs1",
   "version" : 5,
   "members" : [
      {
         "_id" : 0,
         "host" : "172.17.0.7:27017",
         "tags" : {
            "application" : "main",
            "media" : "ssd"
         }
      },
      {
         "_id" : 1,
         "host" : "172.17.0.5:27017",
         "tags" : {
            "application" : "main",
            "media" : "ssd"
         }
      },
      {
         "_id" : 2,
         "host" : "172.17.0.6:27017",
         "tags" : {
            "application" : "report",
            "media" : "ssd"
         }
      }
   ]
}

Now, consider the following documents in a customers collection:

{
    "_id": ObjectId("54bf0d719a5bc523007bb78f"),
    "username": "customer1",
    "email": "[email protected]",
    "password": "1185031ff57bfdaae7812dd705383c74",
    "followedSellers": [
        "seller3",
        "seller1"
    ]
}
{
    "_id": ObjectId("54bf0d719a5bc523007bb790"),
    "username": "customer2",
    "email": "[email protected]",
    "password": "6362e1832398e7d8e83d3582a3b0c1ef",
    "followedSellers": [
        "seller2",
        "seller4"
    ]
}
{
    "_id": ObjectId("54bf0d719a5bc523007bb791"),
    "username": "customer3",
    "email": "[email protected]",
    "password": "f2394e387b49e2fdda1b4c8a6c58ae4b",
    "followedSellers": [
        "seller2",
        "seller4"
    ]
}
{
    "_id": ObjectId("54bf0d719a5bc523007bb792"),
    "username": "customer4",
    "email": "[email protected]",
    "password": "10619c6751a0169653355bb92119822a",
    "followedSellers": [
        "seller1",
        "seller2"
    ]
}
{
    "_id": ObjectId("54bf0d719a5bc523007bb793"),
    "username": "customer5",
    "email": "[email protected]",
    "password": "30c25cf1d31cbccbd2d7f2100ffbc6b5",
    "followedSellers": [
        "seller2",
        "seller4"
    ]
}

The following read operations will use the tags created on our replica set's instances, passing a read preference mode and a tag set to the readPref() cursor method:

db.customers.find(
   {username: "customer5"}
).readPref(
   "nearest",
   [{application: "report", media: "ssd"}]
)
db.customers.find(
   {username: "customer5"}
).readPref(
   "nearest",
   [{application: "main", media: "ssd"}]
)

The preceding configuration is an example of segregation by application operation. We created tag sets that mark the nature of the application and the type of media that will be read.
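
Applications usually set this kind of targeting at the connection level rather than on each query. As a sketch, assuming the replica set addresses shown earlier, a driver could express the same read preference and tag set through its connection string:

mongodb://172.17.0.7:27017,172.17.0.5:27017,172.17.0.6:27017/?replicaSet=rs1&readPreference=nearest&readPreferenceTags=application:report,media:ssd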

As mentioned before, tag sets are also very useful when we need to separate our application geographically. Suppose that we have applications and replica set instances in two different datacenters. Let's create tags that indicate in which datacenter each instance resides by running the following sequence in the mongo shell connected to the replica set's primary node. First, we get the replica set configuration document and store it in the cfg variable:

rs1:PRIMARY> cfg = rs.conf()

Then, using the cfg variable, we assign a new document to the members[n].tags field of each of our three replica set members:

rs1:PRIMARY> cfg.members[0].tags = {"media": "ssd", "application": "main", "datacenter": "A"}
rs1:PRIMARY> cfg.members[1].tags = {"media": "ssd", "application": "main", "datacenter": "B"}
rs1:PRIMARY> cfg.members[2].tags = {"media": "ssd", "application": "report", "datacenter": "A"}

Finally, we call the reconfig() method, passing our new configuration document stored in the cfg variable to reconfigure our replica set:

rs1:PRIMARY> rs.reconfig(cfg)

If everything is correct, we will see this output in the mongo shell:

{ "ok" : 1 }

The result of our configuration can be checked by executing the command rs.conf():

rs1:PRIMARY> rs.conf()
{
   "_id" : "rs1",
   "version" : 6,
   "members" : [
      {
         "_id" : 0,
         "host" : "172.17.0.7:27017",
         "tags" : {
            "application" : "main",
            "datacenter" : "A",
            "media" : "ssd"
         }
      },
      {
         "_id" : 1,
         "host" : "172.17.0.5:27017",
         "tags" : {
            "application" : "main",
            "datacenter" : "B",
            "media" : "ssd"
         }
      },
      {
         "_id" : 2,
         "host" : "172.17.0.6:27017",
         "tags" : {
            "application" : "report",
            "datacenter" : "A",
            "media" : "ssd"
         }
      }
   ]
}

In order to target a read operation at a given datacenter, we must include the datacenter tag in the tag set passed to the query. The following queries use these tags, and each one will be executed in its corresponding datacenter:

db.customers.find(
   {username: "customer5"}
).readPref(
   "nearest",
   [{application: "main", media: "ssd", datacenter: "A"}]
) // It will be executed on the replica set's instance 0
db.customers.find(
   {username: "customer5"}
).readPref(
   "nearest",
   [{application: "report", media: "ssd", datacenter: "A"}]
) // It will be executed on the replica set's instance 2
db.customers.find(
   {username: "customer5"}
).readPref(
   "nearest",
   [{application: "main", media: "ssd", datacenter: "B"}]
) // It will be executed on the replica set's instance 1

In write operations, tag sets are not used directly to choose which replica set member will receive the write. However, it is possible to use tag sets in write operations through the creation of custom write concerns.

Let's get back to the requirement raised at the beginning of this section: how can we ensure that a write operation reaches instances in each geographic area? By running the following sequence of commands in the mongo shell connected to the primary node, we will configure a replica set with five instances:

rs1:PRIMARY> cfg = rs.conf()
rs1:PRIMARY> cfg.members[0].tags = {"riodc": "rack1"}
rs1:PRIMARY> cfg.members[1].tags = {"riodc": "rack2"}
rs1:PRIMARY> cfg.members[2].tags = {"riodc": "rack3"}
rs1:PRIMARY> cfg.members[3].tags = {"spdc": "rack1"}
rs1:PRIMARY> cfg.members[4].tags = {"spdc": "rack2"}
rs1:PRIMARY> rs.reconfig(cfg)

The tags riodc and spdc indicate the locality in which each instance is physically present.

Now, let's create a custom write concern called MultipleDC using the getLastErrorModes property. This will ensure that the write operation is propagated to at least one member in each location.

To do this, we will execute the following sequence, setting a document that represents our custom write concern on the settings field of the replica set configuration document:

rs1:PRIMARY> cfg = rs.conf()
rs1:PRIMARY> cfg.settings = {getLastErrorModes: {MultipleDC: {"riodc": 1, "spdc":1}}}

The output of this command in the mongo shell should look like this:

{
   "getLastErrorModes" : {
      "MultipleDC" : {
         "riodc" : 1,
         "spdc" : 1
      }
   }
}

Then we call the reconfig() method, passing the new configuration:

rs1:PRIMARY> rs.reconfig(cfg)

If the execution was successful, the output in the mongo shell is a document like this:

{ "ok" : 1 }

From this moment on, we can use the MultipleDC write concern to ensure that the write operation is performed on at least one node of each datacenter, as follows:

db.customers.insert(
   {
      username: "customer6", 
      email: "[email protected]",
      password: "1185031ff57bfdaae7812dd705383c74", 
      followedSellers: ["seller1", "seller3"]
   }, 
   {
      writeConcern: {w: "MultipleDC"} 
   }
)

Back to our requirement, if we want the write operation to be performed in at least two instances of each datacenter, we must configure it in the following way:

rs1:PRIMARY> cfg = rs.conf()
rs1:PRIMARY> cfg.settings = {getLastErrorModes: {MultipleDC: {"riodc": 2, "spdc":2}}}
rs1:PRIMARY> rs.reconfig(cfg)

Finally, to fulfill our requirement, we can create another custom write concern called ssd. This will ensure that the write operation happens on at least one instance that has this type of disk:

rs1:PRIMARY> cfg = rs.conf()
rs1:PRIMARY> cfg.members[0].tags = {"riodc": "rack1", "ssd": "ok"}
rs1:PRIMARY> cfg.members[3].tags = {"spdc": "rack1", "ssd": "ok"}
rs1:PRIMARY> rs.reconfig(cfg)
rs1:PRIMARY> cfg.settings = {getLastErrorModes: {MultipleDC: {"riodc": 2, "spdc":2}, ssd: {"ssd": 1}}}
rs1:PRIMARY> rs.reconfig(cfg)

In the following insert operation, we see how using the ssd write concern requires the write to be acknowledged by at least one instance that has an SSD disk:

db.customers.insert(
   {
      username: "customer6", 
      email: "[email protected]", 
      password: "1185031ff57bfdaae7812dd705383c74", 
      followedSellers: ["seller1", "seller3"]
   }, 
   {
      writeConcern: {w: "ssd"} 
   }
)
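
One caveat worth noting: if no member tagged with ssd is reachable, an insert like the preceding one will wait for an acknowledgment that never arrives. A common precaution, sketched below with a hypothetical customer7 document and an illustrative 5000-millisecond limit, is to add a wtimeout value so that the operation returns a write concern error instead of waiting indefinitely:

db.customers.insert(
   {
      username: "customer7",
      email: "[email protected]",
      password: "1185031ff57bfdaae7812dd705383c74",
      followedSellers: ["seller2"]
   },
   {
      writeConcern: {w: "ssd", wtimeout: 5000}
   }
)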

Implementing operational segregation in our database is not a simple task. However, it is very useful for the database's management and maintenance. Implementing it early requires good knowledge of our data model, since details about the storage on which our database resides become highly important.

In the next section, we will see how to plan collections for applications that need high throughput and fast response times.

Tip

If you want to learn more about how to configure replica set tag sets, you can visit the MongoDB reference manual at http://docs.mongodb.org/manual/tutorial/configure-replica-set-tag-sets/#replica-set-configuration-tag-sets.
