Chapter 10. Setting Up a Replica Set

This chapter introduces MongoDB’s high-availability system: replica sets. It covers:

  • What replica sets are

  • How to set up a replica set

  • What configuration options are available for replica set members

Introduction to Replication

Since the first chapter, we’ve been using a standalone server, a single mongod server. It’s an easy way to get started but a dangerous way to run in production. What if your server crashes or becomes unavailable? Your database will be unavailable for at least a little while. If there are problems with the hardware, you might have to move your data to another machine. In the worst case, disk or network issues could leave you with corrupt or inaccessible data.

Replication is a way of keeping identical copies of your data on multiple servers and is recommended for all production deployments. Replication keeps your application running and your data safe, even if something happens to one or more of your servers.

With MongoDB, you set up replication by creating a replica set. A replica set is a group of servers with one primary, the server taking writes, and multiple secondaries, servers that keep copies of the primary’s data. If the primary crashes, the secondaries can elect a new primary from amongst themselves.

If you are using replication and a server goes down, you can still access your data from the other servers in the set. If the data on a server is damaged or inaccessible, you can make a new copy of the data from one of the other members of the set.

This chapter introduces replica sets and covers how to set up replication on your system. If you are less interested in replication mechanics and simply want to create a replica set for testing/development or production, use MongoDB’s cloud solution, MongoDB Atlas. It’s easy to use and provides a free-tier option for experimentation. Alternatively, to manage MongoDB clusters in your own infrastructure, you can use Ops Manager.

Setting Up a Replica Set, Part 1

In this chapter, we’ll show you how to set up a three-node replica set on a single machine so you can start experimenting with replica set mechanics. This is the type of setup that you might script just to get a replica set up and running and then poke at it with administrative commands in the mongo shell or simulate network partitions or server failures to better understand how MongoDB handles high availability and disaster recovery. In production, you should always use a replica set and allocate a dedicated host to each member to avoid resource contention and provide isolation against server failure. To provide further resilience, you should also use the DNS Seedlist Connection format to specify how your applications connect to your replica set. The advantage to using DNS is that servers hosting your MongoDB replica set members can be changed in rotation without needing to reconfigure the clients (specifically, their connection strings).
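For example, with a DNS SRV record published for the set, an application can connect using a seedlist URI like the one below. This is a minimal sketch: rs.example.net is a placeholder, and you would substitute the DNS entry that fronts your own deployment (note that the mongodb+srv scheme enables TLS by default):

$ mongo "mongodb+srv://rs.example.net/test"

The driver resolves the SRV (and TXT) records for this name to discover the current replica set members, so hosts can be rotated without touching the client’s connection string.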

Given the variety of virtualization and cloud options available, it is nearly as easy to bring up a test replica set with each member on a dedicated host. We’ve provided a Vagrant script to allow you to experiment with this option.1

To get started with our test replica set, let’s first create separate data directories for each node. On Linux or macOS, run the following command in the terminal to create the three directories:

$ mkdir -p ~/data/rs{1,2,3}

This will create the directories ~/data/rs1, ~/data/rs2, and ~/data/rs3 (~ identifies your home directory).

On Windows, to create these directories, run the following in the Command Prompt (cmd) or PowerShell:

> md c:\data\rs1 c:\data\rs2 c:\data\rs3

Then, on Linux or macOS, run each of the following commands in a separate terminal:

$ mongod --replSet mdbDefGuide --dbpath ~/data/rs1 --port 27017 
    --smallfiles --oplogSize 200
$ mongod --replSet mdbDefGuide --dbpath ~/data/rs2 --port 27018 
    --smallfiles --oplogSize 200
$ mongod --replSet mdbDefGuide --dbpath ~/data/rs3 --port 27019 
    --smallfiles --oplogSize 200

On Windows, run each of the following commands in its own Command Prompt or PowerShell window:

> mongod --replSet mdbDefGuide --dbpath c:\data\rs1 --port 27017 
    --smallfiles --oplogSize 200
> mongod --replSet mdbDefGuide --dbpath c:\data\rs2 --port 27018 
    --smallfiles --oplogSize 200
> mongod --replSet mdbDefGuide --dbpath c:\data\rs3 --port 27019 
    --smallfiles --oplogSize 200

Once you’ve started them, you should have three separate mongod processes running.

Note

In general, the principles we will walk through in the rest of this chapter apply to replica sets used in production deployments where each mongod has a dedicated host. However, there are additional details pertaining to securing replica sets that we address in Chapter 19; we’ll touch on those just briefly here as a preview.

Networking Considerations

Every member of a set must be able to make connections to every other member of the set (including itself). If you get errors about members not being able to reach other members that you know are running, you may have to change your network configuration to allow connections between them.

The processes you’ve launched can just as easily be running on separate servers. However, with the release of MongoDB 3.6, mongod binds to localhost (127.0.0.1) only by default. In order for each member of a replica set to communicate with the others, you must also bind to an IP address that is reachable by the other members. If we were running a mongod instance on a server with a network interface having an IP address of 198.51.100.1 and we wanted to run it as a member of a replica set with each member on a different server, we could specify the command-line parameter --bind_ip or use bind_ip in the configuration file for this instance:

$ mongod --bind_ip localhost,198.51.100.1 --replSet mdbDefGuide 
    --dbpath ~/data/rs1 --port 27017 --smallfiles --oplogSize 200

We would make similar modifications to launch the other mongods as well in this case, regardless of whether we’re running on Linux, macOS, or Windows.
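The same options can also be kept in a configuration file and passed to mongod with --config. The following is a minimal sketch of such a file for the first member; the file path, data directory, and IP address are placeholders for this example:

# /etc/mongod-rs1.conf (hypothetical path for this example)
net:
  bindIp: localhost,198.51.100.1
  port: 27017
replication:
  replSetName: mdbDefGuide
  oplogSizeMB: 200
storage:
  dbPath: /home/youruser/data/rs1

$ mongod --config /etc/mongod-rs1.conf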

Security Considerations

Before you bind to IP addresses other than localhost, when configuring a replica set, you should enable authorization controls and specify an authentication mechanism. In addition, it is a good idea to encrypt data on disk and communication among replica set members and between the set and clients. We’ll go into more detail on securing replica sets in Chapter 19.
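As a brief preview of Chapter 19, replica set members commonly authenticate to one another with a shared keyfile, which also enables client access control for the deployment. The following is only a sketch: the keyfile path is a placeholder, and production deployments should use the more robust mechanisms discussed in Chapter 19:

$ openssl rand -base64 756 > /path/to/keyfile
$ chmod 400 /path/to/keyfile
$ mongod --replSet mdbDefGuide --dbpath ~/data/rs1 --port 27017 --keyFile /path/to/keyfile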

Setting Up a Replica Set, Part 2

Returning to our example, with the work we’ve done so far, each mongod does not yet know that the others exist. To tell them about one another, we need to create a configuration that lists each of the members and send this configuration to one of our mongod processes. It will take care of propagating the configuration to the other members.

In a fourth terminal, Windows Command Prompt, or PowerShell window, launch a mongo shell that connects to one of the running mongod instances. You can do this by typing the following command. With this command, we’ll connect to the mongod running on port 27017:

$ mongo --port 27017

Then, in the mongo shell, create a configuration document and pass this to the rs.initiate() helper to initiate a replica set. This will initiate a replica set containing three members and propagate the configuration to the rest of the mongods so that a replica set is formed:

> rsconf = {
    _id: "mdbDefGuide",
    members: [
      {_id: 0, host: "localhost:27017"},
      {_id: 1, host: "localhost:27018"},
      {_id: 2, host: "localhost:27019"} 
    ]
  }
> rs.initiate(rsconf)
{ "ok" : 1, "operationTime" : Timestamp(1501186502, 1) }

There are several important parts of a replica set configuration document. The config’s "_id" is the name of the replica set that you passed in on the command line (in this example, "mdbDefGuide"). Make sure that this name matches exactly.

The next part of the document is an array of members of the set. Each of these needs two fields: an "_id" that is an integer and unique among the replica set members, and a hostname.

Note that we are using localhost as a hostname for the members in this set. This is for example purposes only. In later chapters where we discuss securing replica sets, we’ll look at configurations that are more appropriate for production deployments. MongoDB allows all-localhost replica sets for testing locally but will protest if you try to mix localhost and non-localhost servers in a config.

This config document is your replica set configuration. The member running on localhost:27017 will parse the configuration and send messages to the other members, alerting them of the new configuration. Once they have all loaded the configuration, they will elect a primary and start handling reads and writes.

Tip

Unfortunately, you cannot convert a standalone server to a replica set without some downtime for restarting it and initializing the set. Thus, even if you only have one server to start out with, you may want to configure it as a one-member replica set. That way, if you want to add more members later, you can do so without downtime.
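For example, a standalone server started with --replSet can be turned into a one-member replica set with a minimal initiate call like this (a sketch using the same set name as our running example):

> rs.initiate({
    _id: "mdbDefGuide",
    members: [{_id: 0, host: "localhost:27017"}]
  })

Additional members can then be added later with rs.add() without taking the set down.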

If you are starting a brand-new set, you can send the configuration to any member in the set. If you are starting with data on one of the members, you must send the configuration to the member with data. You cannot initiate a replica set with data on more than one member.

Once initiated, you should have a fully functional replica set. The replica set should elect a primary. You can view the status of a replica set using rs.status(). The output from rs.status() tells you quite a bit about the replica set, including a number of things we’ve not yet covered, but don’t worry, we’ll get there! For now, take a look at the members array. Note that all three of our mongod instances are listed in this array and that one of them, in this case the mongod running on port 27017, has been elected primary. The other two are secondaries. If you try this for yourself you will certainly have different values for "date" and the several Timestamp values in this output, but you might also find that a different mongod was elected primary (that’s totally fine):

> rs.status()
{
    "set" : "mdbDefGuide",
    "date" : ISODate("2017-07-27T20:23:31.457Z"),
    "myState" : 1,
    "term" : NumberLong(1),
    "heartbeatIntervalMillis" : NumberLong(2000),
    "optimes" : {
        "lastCommittedOpTime" : {
            "ts" : Timestamp(1501187006, 1),
            "t" : NumberLong(1)
        },
        "appliedOpTime" : {
            "ts" : Timestamp(1501187006, 1),
            "t" : NumberLong(1)
        },
        "durableOpTime" : {
            "ts" : Timestamp(1501187006, 1),
            "t" : NumberLong(1)
        }
    },
    "members" : [
        {
            "_id" : 0,
            "name" : "localhost:27017",
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "uptime" : 688,
            "optime" : {
                "ts" : Timestamp(1501187006, 1),
                "t" : NumberLong(1)
            },
            "optimeDate" : ISODate("2017-07-27T20:23:26Z"),
            "electionTime" : Timestamp(1501186514, 1),
            "electionDate" : ISODate("2017-07-27T20:15:14Z"),
            "configVersion" : 1,
            "self" : true
        },
        {
            "_id" : 1,
            "name" : "localhost:27018",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 508,
            "optime" : {
                "ts" : Timestamp(1501187006, 1),
                "t" : NumberLong(1)
            },
            "optimeDurable" : {
                "ts" : Timestamp(1501187006, 1),
                "t" : NumberLong(1)
            },
            "optimeDate" : ISODate("2017-07-27T20:23:26Z"),
            "optimeDurableDate" : ISODate("2017-07-27T20:23:26Z"),
            "lastHeartbeat" : ISODate("2017-07-27T20:23:30.818Z"),
            "lastHeartbeatRecv" : ISODate("2017-07-27T20:23:30.113Z"),
            "pingMs" : NumberLong(0),
            "syncingTo" : "localhost:27017",
            "configVersion" : 1
        },
        {
            "_id" : 2,
            "name" : "localhost:27019",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 508,
            "optime" : {
                "ts" : Timestamp(1501187006, 1),
                "t" : NumberLong(1)
            },
            "optimeDurable" : {
                "ts" : Timestamp(1501187006, 1),
                "t" : NumberLong(1)
            },
            "optimeDate" : ISODate("2017-07-27T20:23:26Z"),
            "optimeDurableDate" : ISODate("2017-07-27T20:23:26Z"),
            "lastHeartbeat" : ISODate("2017-07-27T20:23:30.818Z"),
            "lastHeartbeatRecv" : ISODate("2017-07-27T20:23:30.113Z"),
            "pingMs" : NumberLong(0),
            "syncingTo" : "localhost:27017",
            "configVersion" : 1
        }
    ],
    "ok" : 1,
    "operationTime" : Timestamp(1501187006, 1)
}

Observing Replication

If your replica set elected the mongod on port 27017 as primary, then the mongo shell used to initiate the replica set is currently connected to the primary. You should see the prompt change to something like the following:

mdbDefGuide:PRIMARY>

This indicates that we are connected to the primary of the replica set having the "_id" "mdbDefGuide". For simplicity and clarity, we’ll abbreviate the mongo shell prompt to just > throughout the replication examples.

If your replica set elected a different node primary, quit the shell and connect to the primary by specifying the correct port number in the command line, as we did when launching the mongo shell earlier. For example, if your set’s primary is on port 27018, connect using the following command:

$ mongo --port 27018

Now that you’re connected to the primary, try doing some writes and see what happens. First, insert 1,000 documents:

> use test
> for (i=0; i<1000; i++) {db.coll.insert({count: i})}
>
> // make sure the docs are there
> db.coll.count()
1000

Now check one of the secondaries and verify that it has a copy of all of these documents. You could do this by quitting the shell and connecting using the port number of one of the secondaries, but it’s easy to acquire a connection to one of the secondaries by instantiating a connection object using the Mongo constructor within the shell you’re already running.

First, use your connection to the test database on the primary to run the isMaster command. This will show you the status of the replica set, in a much more concise form than rs.status(). It is also a convenient means of determining which member is primary when writing application code or scripting:

> db.isMaster()
{
    "hosts" : [
        "localhost:27017",
        "localhost:27018",
        "localhost:27019"
    ],
    "setName" : "mdbDefGuide",
    "setVersion" : 1,
    "ismaster" : true,
    "secondary" : false,
    "primary" : "localhost:27017",
    "me" : "localhost:27017",
    "electionId" : ObjectId("7fffffff0000000000000004"),
    "lastWrite" : {
        "opTime" : {
            "ts" : Timestamp(1501198208, 1),
            "t" : NumberLong(4)
        },
        "lastWriteDate" : ISODate("2017-07-27T23:30:08Z")
    },
    "maxBsonObjectSize" : 16777216,
    "maxMessageSizeBytes" : 48000000,
    "maxWriteBatchSize" : 1000,
    "localTime" : ISODate("2017-07-27T23:30:08.722Z"),
    "maxWireVersion" : 6,
    "minWireVersion" : 0,
    "readOnly" : false,
    "compression" : [
        "snappy"
    ],
    "ok" : 1,
    "operationTime" : Timestamp(1501198208, 1)
}

If at any point an election is called and the mongod you’re connected to becomes a secondary, you can use the isMaster command to determine which member has become primary. The output here tells us that localhost:27018 and localhost:27019 are both secondaries, so we can use either for our purposes. Let’s instantiate a connection to localhost:27019:

> secondaryConn = new Mongo("localhost:27019")
connection to localhost:27019
>
> secondaryDB = secondaryConn.getDB("test")
test

Now, if we attempt to do a read on the collection that has been replicated to the secondary, we’ll get an error. Let’s attempt to do a find on this collection and then review the error and why we get it:

> secondaryDB.coll.find()
Error: error: {
    "operationTime" : Timestamp(1501200089, 1),
    "ok" : 0,
    "errmsg" : "not master and slaveOk=false",
    "code" : 13435,
    "codeName" : "NotMasterNoSlaveOk"
}

Secondaries may fall behind the primary (or lag) and not have the most current writes, so by default secondaries refuse read requests to protect your application from accidentally reading stale data. If you attempt to query a secondary, you’ll therefore get an error stating that it’s not the primary. To allow queries on the secondary, we can set an “I’m okay with reading from secondaries” flag, like so:

> secondaryConn.setSlaveOk()

Note that slaveOk is set on the connection (secondaryConn), not the database (secondaryDB).

Now you’re all set to read from this member. Query it normally:

> secondaryDB.coll.find()
{ "_id" : ObjectId("597a750696fd35621b4b85db"), "count" : 0 }
{ "_id" : ObjectId("597a750696fd35621b4b85dc"), "count" : 1 }
{ "_id" : ObjectId("597a750696fd35621b4b85dd"), "count" : 2 }
{ "_id" : ObjectId("597a750696fd35621b4b85de"), "count" : 3 }
{ "_id" : ObjectId("597a750696fd35621b4b85df"), "count" : 4 }
{ "_id" : ObjectId("597a750696fd35621b4b85e0"), "count" : 5 }
{ "_id" : ObjectId("597a750696fd35621b4b85e1"), "count" : 6 }
{ "_id" : ObjectId("597a750696fd35621b4b85e2"), "count" : 7 }
{ "_id" : ObjectId("597a750696fd35621b4b85e3"), "count" : 8 }
{ "_id" : ObjectId("597a750696fd35621b4b85e4"), "count" : 9 }
{ "_id" : ObjectId("597a750696fd35621b4b85e5"), "count" : 10 }
{ "_id" : ObjectId("597a750696fd35621b4b85e6"), "count" : 11 }
{ "_id" : ObjectId("597a750696fd35621b4b85e7"), "count" : 12 }
{ "_id" : ObjectId("597a750696fd35621b4b85e8"), "count" : 13 }
{ "_id" : ObjectId("597a750696fd35621b4b85e9"), "count" : 14 }
{ "_id" : ObjectId("597a750696fd35621b4b85ea"), "count" : 15 }
{ "_id" : ObjectId("597a750696fd35621b4b85eb"), "count" : 16 }
{ "_id" : ObjectId("597a750696fd35621b4b85ec"), "count" : 17 }
{ "_id" : ObjectId("597a750696fd35621b4b85ed"), "count" : 18 }
{ "_id" : ObjectId("597a750696fd35621b4b85ee"), "count" : 19 }
Type "it" for more

You can see that all of our documents are there.

Now, try to write to a secondary:

> secondaryDB.coll.insert({"count" : 1001})
WriteResult({ "writeError" : { "code" : 10107, "errmsg" : "not master" } })
> secondaryDB.coll.count()
1000

You can see that the secondary does not accept the write. A secondary will only perform writes that it gets through replication, not from clients.

There is one other interesting feature that you should try out: automatic failover. If the primary goes down, one of the secondaries will automatically be elected primary. To test this, stop the primary:

> db.adminCommand({"shutdown" : 1})

You’ll see some error messages generated when you run this command because the mongod running on port 27017 (the member we’re connected to) will terminate and the shell we’re using will lose its connection:

2017-07-27T20:10:50.612-0400 E QUERY    [thread1] Error: error doing query: 
 failed: network error while attempting to run command 'shutdown' on host 
 '127.0.0.1:27017'  :
DB.prototype.runCommand@src/mongo/shell/db.js:163:1
DB.prototype.adminCommand@src/mongo/shell/db.js:179:16
@(shell):1:1
2017-07-27T20:10:50.614-0400 I NETWORK  [thread1] trying reconnect to 
 127.0.0.1:27017 (127.0.0.1) failed
2017-07-27T20:10:50.615-0400 I NETWORK  [thread1] reconnect 
 127.0.0.1:27017 (127.0.0.1) ok
MongoDB Enterprise mdbDefGuide:SECONDARY> 
2017-07-27T20:10:56.051-0400 I NETWORK  [thread1] trying reconnect to 
 127.0.0.1:27017 (127.0.0.1) failed
2017-07-27T20:10:56.051-0400 W NETWORK  [thread1] Failed to connect to 
 127.0.0.1:27017, in(checking socket for error after poll), reason: 
 Connection refused
2017-07-27T20:10:56.051-0400 I NETWORK  [thread1] reconnect 
 127.0.0.1:27017 (127.0.0.1) failed failed 
MongoDB Enterprise > 
MongoDB Enterprise > secondaryConn.isMaster()
2017-07-27T20:11:15.422-0400 E QUERY    [thread1] TypeError: 
 secondaryConn.isMaster is not a function :
@(shell):1:1

This isn’t a problem. It won’t cause the shell to crash. Go ahead and run isMaster on the secondary to see who has become the new primary:

> secondaryDB.isMaster()

The output from isMaster should look something like this:

{
    "hosts" : [
        "localhost:27017",
        "localhost:27018",
        "localhost:27019"
    ],
    "setName" : "mdbDefGuide",
    "setVersion" : 1,
    "ismaster" : true,
    "secondary" : false,
    "primary" : "localhost:27018",
    "me" : "localhost:27019",
    "electionId" : ObjectId("7fffffff0000000000000005"),
    "lastWrite" : {
        "opTime" : {
            "ts" : Timestamp(1501200681, 1),
            "t" : NumberLong(5)
        },
        "lastWriteDate" : ISODate("2017-07-28T00:11:21Z")
    },
    "maxBsonObjectSize" : 16777216,
    "maxMessageSizeBytes" : 48000000,
    "maxWriteBatchSize" : 1000,
    "localTime" : ISODate("2017-07-28T00:11:28.115Z"),
    "maxWireVersion" : 6,
    "minWireVersion" : 0,
    "readOnly" : false,
    "compression" : [
        "snappy"
    ],
    "ok" : 1,
    "operationTime" : Timestamp(1501200681, 1)
}

Note that the primary has switched to 27018. Your primary may be the other server; whichever secondary noticed that the primary was down first will be elected. Now you can send writes to the new primary.

Tip

isMaster is a very old command, predating replica sets to when MongoDB only supported master/slave replication. Thus, it does not use the replica set terminology consistently: it still calls the primary a “master.” You can generally think of “master” as equivalent to “primary” and “slave” as equivalent to “secondary.”

Go ahead and bring back up the server we had running at localhost:27017. You simply need to find the command-line interface from which you launched it. You’ll see some messages indicating that it terminated. Just run it again using the same command you used to launch it originally.

Congratulations! You just set up, used, and even poked a little at a replica set to force a shutdown and an election for a new primary.

There are a few key concepts to remember:

  • Clients can send a primary all the same operations they could send a standalone server (reads, writes, commands, index builds, etc.).

  • Clients cannot write to secondaries.

  • Clients, by default, cannot read from secondaries. You can enable this by explicitly setting an “I know I’m reading from a secondary” setting on the connection.

Changing Your Replica Set Configuration

Replica set configurations can be changed at any time: members can be added, removed, or modified. There are shell helpers for some common operations. For example, to add a new member to the set, you can use rs.add:

> rs.add("localhost:27020")

Similarly, you can remove members:

> rs.remove("localhost:27017")
{ "ok" : 1, "operationTime" : Timestamp(1501202441, 2) }

You can check that a reconfiguration succeeded by running rs.config() in the shell. It will print the current configuration:

> rs.config()
{
    "_id" : "mdbDefGuide",
    "version" : 3,
    "protocolVersion" : NumberLong(1),
    "members" : [
        {
            "_id" : 1,
            "host" : "localhost:27018",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 1,
            "tags" : {
                
            },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        },
        {
            "_id" : 2,
            "host" : "localhost:27019",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 1,
            "tags" : {
                
            },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        },
        {
            "_id" : 3,
            "host" : "localhost:27020",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 1,
            "tags" : {
                
            },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        }
    ],
    "settings" : {
        "chainingAllowed" : true,
        "heartbeatIntervalMillis" : 2000,
        "heartbeatTimeoutSecs" : 10,
        "electionTimeoutMillis" : 10000,
        "catchUpTimeoutMillis" : -1,
        "getLastErrorModes" : {
            
        },
        "getLastErrorDefaults" : {
            "w" : 1,
            "wtimeout" : 0
        },
        "replicaSetId" : ObjectId("597a49c67e297327b1e5b116")
    }
}

Each time you change the configuration, the "version" field will increase. It starts at version 1.

You can also modify existing members, not just add and remove them. To make modifications, create the configuration document that you want in the shell and call rs.reconfig(). For example, suppose we have a configuration such as the one shown here:

> rs.config()
{
    "_id" : "testReplSet",
    "version" : 2,
    "members" : [
        {
            "_id" : 0,
            "host" : "198.51.100.1:27017"
        },
        {
            "_id" : 1,
            "host" : "localhost:27018"
        },
        {
            "_id" : 2,
            "host" : "localhost:27019"
        }
    ]
}

Someone accidentally added member 0 by IP address, instead of its hostname. To change that, first we load the current configuration in the shell and then we change the relevant fields:

> var config = rs.config()
> config.members[0].host = "localhost:27017"

Now that the config document is correct, we need to send it to the database using the rs.reconfig() helper:

> rs.reconfig(config)

rs.reconfig() is often more useful than rs.add() and rs.remove() for complex operations, such as modifying members’ configurations or adding/removing multiple members at once. You can use it to make any legal configuration change you need: simply create the config document that represents your desired configuration and pass it to rs.reconfig().
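As an illustration, a single reconfiguration can raise one member’s priority and drop another member at the same time. This is only a sketch; the member indexes refer to the example configuration shown above:

> var config = rs.config()
> config.members[0].priority = 2    // prefer this member as primary
> config.members.splice(2, 1)       // remove the third member from the array
> rs.reconfig(config)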

How to Design a Set

To plan out your set, there are certain concepts that you must be familiar with. The next chapter goes into more detail about these, but the most important is that replica sets are all about majorities: you need a majority of members to elect a primary, a primary can only stay primary as long as it can reach a majority, and a write is safe when it’s been replicated to a majority. This majority is defined to be “more than half of all members in the set,” as shown in Table 10-1.

Table 10-1. What is a majority?
Number of members in the set    Majority of the set
1                               1
2                               2
3                               2
4                               3
5                               3
6                               4
7                               4

Note that it doesn’t matter how many members are down or unavailable; majority is based on the set’s configuration.
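Put another way, the required majority is floor(n/2) + 1, where n is the number of members in the configuration. A quick check in the shell reproduces Table 10-1:

> function majority(n) { return Math.floor(n / 2) + 1 }
> [1, 2, 3, 4, 5, 6, 7].map(majority)
[ 1, 2, 2, 3, 3, 4, 4 ]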

For example, suppose that we have a five-member set and three members go down, as shown in Figure 10-1. There are still two members up. These two members cannot reach a majority of the set (at least three members), so they cannot elect a primary. If one of them were primary, it would step down as soon as it noticed that it could not reach a majority. After a few seconds, your set would consist of two secondaries and three unreachable members.

Figure 10-1. With a minority of the set available, all members will be secondaries

Many users find this frustrating: why can’t the two remaining members elect a primary? The problem is that it’s possible that the other three members didn’t actually go down, and that it was instead the network that went down, as shown in Figure 10-2. In this case, the three members on the left will elect a primary, since they can reach a majority of the set (three members out of five). In the case of a network partition, we do not want both sides of the partition to elect a primary, because then the set would have two primaries. Both primaries would be writing to the database, and the datasets would diverge. Requiring a majority to elect or stay a primary is a neat way of avoiding ending up with more than one primary.

Figure 10-2. For the members, a network partition looks identical to servers on the other side of the partition going down

It is important to configure your set in such a way that you’ll usually be able to have one primary. For example, in the five-member set described here, if members 1, 2, and 3 are in one data center and members 4 and 5 are in another, there should almost always be a majority available in the first data center (it’s more likely to have a network break between data centers than within them).

There are a couple of common configurations that are recommended:

  • A majority of the set in one data center, as in Figure 10-2. This is a good design if you have a primary data center where you always want your replica set’s primary to be located. So long as your primary data center is healthy, you will have a primary. However, if that data center becomes unavailable, your secondary data center will not be able to elect a new primary.

  • An equal number of servers in each data center, plus a tie-breaking server in a third location. This is a good design if your data centers are “equal” in preference, since generally servers from either data center will be able to see a majority of the set. However, it involves having three separate locations for servers.

More complex requirements might require different configurations, but you should keep in mind how your set will acquire a majority under adverse conditions.

All of these complexities would disappear if MongoDB supported having more than one primary. However, this would bring its own host of complexities. With two primaries, you would have to handle conflicting writes (e.g., if someone updates a document on one primary and someone deletes it on another primary). There are two popular ways of handling conflicts in systems that support multiple writers: manual reconciliation or having the system arbitrarily pick a “winner.” Neither of these options is a very easy model for developers to code against, seeing as you can’t be sure that the data you’ve written won’t change out from under you. Thus, MongoDB chose to only support having a single primary. This makes development easier but can result in periods when the replica set is read-only.

How Elections Work

When a secondary cannot reach a primary, it will contact all the other members and request that it be elected primary. These other members do several sanity checks: Can they reach a primary that the member seeking election cannot? Is the member seeking election up to date with replication? Is there any member with a higher priority available that should be elected instead?

In version 3.2, MongoDB introduced version 1 of the replication protocol. Protocol version 1 is based on the RAFT consensus protocol developed by Diego Ongaro and John Ousterhout at Stanford University. It is best described as RAFT-like and is tailored to include a number of replication concepts that are specific to MongoDB, such as arbiters, priority, nonvoting members, write concern, and so on. Protocol version 1 provided the foundation for new features such as a shorter failover time, and it greatly reduced the time to detect false primary situations. It also prevents double voting through the use of term IDs.

Note

RAFT is a consensus algorithm that is broken into relatively independent subproblems. Consensus is the process through which multiple servers or processes agree on values. RAFT ensures consensus such that the same series of commands produces the same series of results and arrives at the same series of states across the members of a deployment.

Replica set members send heartbeats (pings) to each other every two seconds. If a heartbeat does not return from a member within 10 seconds, the other members mark the delinquent member as inaccessible. The election algorithm will make a “best-effort” attempt to have the secondary with the highest priority available call an election. Member priority affects both the timing and the outcome of elections; secondaries with higher priority call elections relatively sooner than secondaries with lower priority, and are also more likely to win. However, a lower-priority instance can be elected as primary for brief periods, even if a higher-priority secondary is available. Replica set members continue to call elections until the highest-priority member available becomes primary.

To be elected primary, a member must be up to date with replication, as far as the members it can reach know. All replicated operations are strictly ordered by an ascending identifier, so the candidate must have operations later than or equal to those of any member it can reach.

Member Configuration Options

The replica sets we have set up so far have been fairly uniform in that every member has the same configuration as every other member. However, there are many situations when you don’t want members to be identical: you might want one member to preferentially be primary or make a member invisible to clients so that no read requests can be routed to it. These and many other configuration options can be specified in the member subdocuments of the replica set configuration. This section outlines the member options that you can set.

Priority

Priority is an indication of how strongly this member “wants” to become primary. Its value can range from 0 to 1000, and the default is 1. Setting "priority" to 0 has a special meaning: members with a priority of 0 can never become primary. These are called passive members.

The highest-priority member will always be elected primary (so long as it can reach a majority of the set and has the most up-to-date data). For example, suppose you add a member with a priority of 1.5 to the set, like so:

> rs.add({"host" : "server-4:27017", "priority" : 1.5})

Assuming the other members of the set have priority 1, once server-4 caught up with the rest of the set, the current primary would automatically step down and server-4 would be elected in its place. If server-4 were, for some reason, unable to catch up, the current primary would stay primary. Setting priorities will never cause your set to go primary-less. It will also never cause a member that is behind to become primary (until it has caught up).

The absolute value of "priority" only matters in relation to whether it is greater or less than the other priorities in the set: members with priorities of 100, 1, and 1 will behave the same way as members of another set with priorities 2, 1, and 1.
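For example, to make an existing member passive so that it can never become primary, you could set its priority to 0 with a reconfiguration (the member index here is illustrative):

> var config = rs.config()
> config.members[2].priority = 0
> rs.reconfig(config)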

Hidden Members

Clients do not route requests to hidden members, and hidden members are not preferred as replication sources (although they will be used if more desirable sources are not available). Thus, many people will hide less powerful or backup servers.

For example, suppose you had a set that looked like this:

> rs.isMaster()
{
    ...
    "hosts" : [
        "server-1:27107",
        "server-2:27017",
        "server-3:27017"
    ],
    ...
}

To hide server-3, you could add the hidden: true field to its configuration. A member must have a priority of 0 to be hidden (you can’t have a hidden primary):

> var config = rs.config()
> config.members[2].hidden = true
0
> config.members[2].priority = 0
0
> rs.reconfig(config)

Now running isMaster will show:

> rs.isMaster()
{
    ...
    "hosts" : [
        "server-1:27107",
        "server-2:27017"
    ],
    ...
}

rs.status() and rs.config() will still show the member; it only disappears from isMaster. When clients connect to a replica set, they call isMaster to determine the members of the set. Thus, hidden members will never be used for read requests.

To unhide a member, change the hidden option to false or remove the option entirely.
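For example, to unhide the member we just hid and make it electable again, you could reconfigure it along these lines:

> var config = rs.config()
> config.members[2].hidden = false
> config.members[2].priority = 1
> rs.reconfig(config)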

Election Arbiters

A two-member set has clear disadvantages for majority requirements. However, many people with small deployments do not want to keep three copies of their data, feeling that two is enough and that keeping a third copy is not worth the administrative, operational, and financial costs.

For these deployments, MongoDB supports a special type of member called an arbiter, whose only purpose is to participate in elections. Arbiters hold no data and aren’t used by clients: they just provide a majority for two-member sets. In general, deployments without arbiters are preferable.

As arbiters don’t have any of the traditional responsibilities of a mongod server, you can run an arbiter as a lightweight process on a wimpier server than you’d generally use for MongoDB. It’s often a good idea, if possible, to run an arbiter in a separate failure domain from the other members, so that it has an “outside perspective” on the set, as described in the deployment recommendations in “How to Design a Set”.

You start up an arbiter in the same way that you start a normal mongod, using the --replSet name option and an empty data directory. You can add it to the set using the rs.addArb() helper:

> rs.addArb("server-5:27017")

Equivalently, you can specify the "arbiterOnly" option in the member configuration:

> rs.add({"_id" : 4, "host" : "server-5:27017", "arbiterOnly" : true})

An arbiter, once added to the set, is an arbiter forever: you cannot reconfigure an arbiter to become a nonarbiter, or vice versa.

One other thing that arbiters are good for is breaking ties in larger clusters. If you have an even number of nodes, you may have half the nodes vote for one member and half for another. An arbiter can cast the deciding vote. There are a few things to keep in mind when using arbiters, though; we’ll look at these next.

Use at most one arbiter

Note that, in both of the use cases just described, you need at most one arbiter. You do not need an arbiter if you have an odd number of nodes. A common misconception seems to be that you should add extra arbiters “just in case.” However, it doesn’t help elections go any faster or provide any additional data safety to add extra arbiters.

Suppose you have a three-member set. Two members are required to elect a primary. If you add an arbiter, you’ll have a four-member set, so three members will be required to choose a primary. Thus, your set is potentially less stable: instead of requiring 67% of your set to be up, you’re now requiring 75%.

Having extra members can also make elections take longer. If you have an even number of nodes because you added an arbiter, your arbiters can cause ties, not prevent them.

The downside to using an arbiter

If you have a choice between a data node and an arbiter, choose a data node. Using an arbiter instead of a data node in a small set can make some operational tasks more difficult. For example, suppose you are running a replica set with two “normal” members and one arbiter, and one of the data-holding members goes down. If that member is well and truly dead (the data is unrecoverable), you will have to get a copy of the data from the current primary to the new server you’ll be using as a secondary. Copying data can put a lot of stress on a server, and thus slow down your application. (Generally, copying a few gigabytes to a new server is trivial but more than a hundred starts becoming impractical.)

Conversely, if you have three data-holding members, there’s more “breathing room” if a server completely dies. You can use the remaining secondary to bootstrap a new server instead of depending on your primary.

In the two-member-plus-arbiter scenario, the primary is the last remaining good copy of your data and the one trying to handle load from your application while you’re trying to get another copy of your data online.

Thus, if possible, use an odd number of “normal” members instead of an arbiter.

Warning

In three-member replica sets with a primary-secondary-arbiter (PSA) architecture, or in sharded clusters with three-member PSA shards, there is a known issue with cache pressure increasing if either of the two data-bearing nodes is down and the "majority" read concern is enabled. Ideally, you should replace the arbiter with a data-bearing member for these deployments. Alternatively, to prevent storage cache pressure, the "majority" read concern can be disabled on each of the mongod instances in the deployment or shards.
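As a hedged sketch, on MongoDB versions that support the option, "majority" read concern can be disabled at startup with the --enableMajorityReadConcern flag (or the equivalent replication.enableMajorityReadConcern configuration file setting):

$ mongod --replSet mdbDefGuide --dbpath ~/data/rs1 --port 27017 --enableMajorityReadConcern false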

Building Indexes

Sometimes a secondary does not need to have the same (or any) indexes that exist on the primary. If you are using a secondary only for backup data or offline batch jobs, you might want to specify "buildIndexes" : false in the member’s configuration. This option prevents the secondary from building any indexes.

This is a permanent setting: members that have "buildIndexes" : false specified can never be reconfigured to be “normal” index-building members again. If you want to change a non-index-building member to an index-building one, you must remove it from the set, delete all of its data, add it to the set again, and allow it to resync from scratch.

As with hidden members, this option requires the member’s priority to be 0.
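For example, to add a dedicated backup member that builds no indexes, you might use something like the following (the hostname is hypothetical; the member is also hidden and passive, since it should neither serve clients nor become primary):

> rs.add({"_id" : 4, "host" : "server-5:27017", "priority" : 0,
          "hidden" : true, "buildIndexes" : false})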
