Starting the nodes

Starting a MariaDB Galera Cluster node means starting a MariaDB server, so we can use the mysqld binary or the mysqld_safe script.

Before starting a node for the first time, we should prepare the configuration file. The following example shows a minimal configuration:

wsrep_provider=/usr/lib/galera/libgalera_smm.so
default_storage_engine=InnoDB
binlog_format=ROW
innodb_autoinc_lock_mode=2
innodb_doublewrite=0
innodb_support_xa=0
query_cache_size=0

Here's an explanation of the options we set:

  • The wsrep_provider option is the most important. It tells Galera where the wsrep library is located. Its path depends on your system.
  • Since storage engines other than InnoDB should not be used, it is important to set default_storage_engine correctly.
  • The binlog_format variable is set to the only allowed value, ROW.
  • The innodb_autoinc_lock_mode variable must be set to 2.
  • The InnoDB doublewrite buffer and XA transactions are not supported in Galera.
  • The query cache is not supported.

When starting the first node, we must specify the --wsrep-new-cluster option. So we can start the cluster this way:

mysqld --wsrep-new-cluster

If this node crashes, it must not be restarted using the previous command, because it would start a new cluster instead of rejoining the existing one. For this reason, we should not use mysqld_safe to bootstrap a cluster: after a crash, mysqld_safe would automatically restart the server with the same options. Instead, the node must be added to the cluster again, as if it were a new node.
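
For example, assuming that another node of the cluster is running at the address 214.140.10.6 (a hypothetical address, used only for this sketch), the restarted server would be pointed to it rather than bootstrapped again:

mysqld --wsrep_cluster_address=gcomm://214.140.10.6

The wsrep_cluster_address option and the URL format are described in the following paragraphs.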

After a node is started, some permissions need to be set. Each node in the cluster must allow the other nodes to connect as root, so that a copy of the databases can be created when needed. This mechanism is called node provisioning, and it will be described in the Node provisioning section of this chapter. So, we will execute statements like the following on each node:

GRANT ALL ON *.* TO 'root'@'node_hostname';

The node_hostname placeholder must be replaced with the new node's hostname or IP address.

Every node in the cluster is identified by a unique URL. To add a new node to the cluster, we need to specify the URL of at least one cluster node that is currently running. While one node is normally sufficient, a more robust practice is to provide the addresses of multiple nodes, possibly all of them. Consider the following example:

mysqld_safe --wsrep_cluster_address=gcomm://214.140.10.5

No other information is needed. The specified node will communicate the URLs of the other existing nodes to the new node. Then, the new node will inform the existing nodes about its presence.

After a node has been set up, we might want to check that it is working. Galera provides some status variables for diagnostic purposes. All Galera-related status variables have names starting with wsrep_, shown as follows:

MariaDB [(none)]> SHOW STATUS LIKE 'wsrep%';
+--------------------------+----------------------+
| Variable_name            | Value                |
+--------------------------+----------------------+
| wsrep_cluster_conf_id    | 18446744073709551615 |
| wsrep_cluster_size       | 0                    |
| wsrep_cluster_state_uuid |                      |
| wsrep_cluster_status     | Disconnected         |
| wsrep_connected          | OFF                  |
| wsrep_local_bf_aborts    | 0                    |
| wsrep_local_index        | 18446744073709551615 |
| wsrep_provider_name      |                      |
| wsrep_provider_vendor    |                      |
| wsrep_provider_version   |                      |
| wsrep_ready              | ON                   |
+--------------------------+----------------------+
11 rows in set (0.04 sec)

Note that this query has been executed on a node that has been started but is not connected to the cluster.

The variables we want to check, in this case, are the following (an example query follows this list):

  • wsrep_ready: This states whether the node is connected to the cluster and waiting to receive replication events. ON is the value we want to see; the only other possible value is OFF.
  • wsrep_connected: This states whether the node is connected to a wsrep provider.
  • wsrep_cluster_status: This states whether the node is connected to the cluster. If no other nodes are connected, the value is Disconnected. Otherwise, we will see Primary or Non Primary.
  • wsrep_cluster_size: This states the number of nodes in the cluster.
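
If we only want to check these variables, we can filter them explicitly instead of listing the whole wsrep_ group. The following query is just a convenience sketch; on a healthy, connected node we expect wsrep_ready and wsrep_connected to be ON, wsrep_cluster_status to be Primary, and wsrep_cluster_size to match the number of nodes:

MariaDB [(none)]> SHOW STATUS WHERE Variable_name IN
    -> ('wsrep_ready', 'wsrep_connected', 'wsrep_cluster_status', 'wsrep_cluster_size');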

Before the new node can start replicating the events it receives from other nodes, the current data must be copied into the new server. This operation is called node provisioning or state transfer, and it is described in the Node provisioning section later in this chapter.

Determining a node URL

As explained previously, to start a new node or restart a node after a crash, the URL of another node must be specified. Thus, the DBA needs to know how to determine a node's URL.

The formal syntax of a Galera URL is as follows:

<schema>://<address>[:<port>][?option=value[&option=value ...]]

Two schemas are supported:

  • gcomm: This is the schema used for a fully working Galera Cluster. It must always be used in production.
  • dummy: This is the schema used to test the Galera configuration. If it is used, the data is not replicated.

The address is an IP address or a hostname, optionally followed by a colon and a port number, for example, 214.140.10.5:9999. The default port is 4567.

It is possible to list multiple addresses separated by commas, for example, 214.140.10.5,214.140.10.6. It is also possible to use a multicast address to identify all the hosts in a subnet.

A set of options can be specified in the query string, separated by the & character. These options are the same as those contained in wsrep_provider_options, which will be described later. For these settings to persist after a node restart, they must be specified in a configuration file, not in the URL.
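
For example, an option can be appended to the URL in the query string. This is only a sketch, using the evs.suspect_timeout parameter that is discussed later in this chapter:

gcomm://214.140.10.5,214.140.10.6?evs.suspect_timeout=PT10S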

Some examples of URLs are as follows:

  • gcomm://server_name
  • gcomm://214.140.10.5,214.140.10.6
  • gcomm://server_name:9999,214.140.10.5,214.140.10.6

Node provisioning

Node provisioning, or state transfer, consists of copying a full backup of the data from one node to another. The backup is often referred to as a snapshot or state, to highlight that it is a consistent version of the data at a precise point in time. The node that sends its state is called the donor, and the node that receives it is called the joiner, because this operation occurs when a node joins the cluster. This may happen because a new machine has been added or because an existing node has crashed and, after restart, needs to receive the latest data changes.

There are two main node provisioning methods:

  • State Snapshot Transfer (SST) consists of transmitting a full snapshot
  • Incremental State Transfer (IST) consists of transmitting the modifications

In practice, these methods consist of using a full backup or an incremental backup.

State Snapshot Transfer

This node provisioning method is used when a new node joins the cluster because it contains no data. There are two ways to execute an SST:

  • mysqldump: This method uses the mysqldump tool to generate the SQL statements needed to recreate the databases on another node. This method is slower, because it usually requires a huge amount of network traffic. The donor is made read-only via a global lock for the whole duration of the state transfer. Also, this method requires that the joiner node is already running. The use of mysqldump is necessary if the nodes use different MariaDB versions or a different data directory layout.
  • rsync, rsync_wan, and xtrabackup: These tools are used to copy the data files from the donor to the joiner. This method is much faster. The files can be copied using rsync, which only copies the files that have been modified; the rsync_wan method uses rsync with the delta transfer algorithm, which should be used to copy data over a Wide Area Network (WAN) or a slow network, but is slower in any other situation. Percona XtraBackup makes a copy of the tables without locking the server. The rsync method and XtraBackup have been discussed in Chapter 8, Backup and Disaster Recovery. The rsync method is faster than XtraBackup, but it is a blocking method. These methods require that the settings that affect the way files are stored, such as @@innodb_file_per_table or @@innodb_file_format, have the same values on both nodes. Note that if one of these methods is used, the joiner node must not be initialized before the transfer.

To choose the method to be used for SST, the wsrep_sst_method option can be set in the joiner's configuration file. Consider the following example:

wsrep_sst_method=xtrabackup
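
Depending on the method, a few related options are usually set in the same file. The following lines are only a sketch with hypothetical values: wsrep_sst_auth provides the authentication credentials used by SST methods such as xtrabackup and mysqldump, and wsrep_sst_donor optionally names a preferred donor node:

wsrep_sst_auth=backup_user:backup_password
wsrep_sst_donor=node1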

Note

The SST method supports a scriptable interface. This feature allows us to write scripts that customize the data transfer operations, adapting them to our specific use case. This is a very powerful feature, but it is beyond the scope of this chapter. The Galera documentation contains more detailed information on this topic.

Incremental State Transfer

All the write sets committed by a node are written to a special cache called the Galera Cache (GCache). Its purpose is to speed up data I/O operations. When a node crashes and is restarted, it is possible that the write sets it missed are still completely stored in the GCache of at least one node. In this case, the Incremental State Transfer method is used to bring the restarted node up to date. This method has two important advantages: data provisioning is much faster than with SST, because only the recently modified data is sent to the joiner; and there is no need to lock the donor, which can continue to replicate the events it receives during the state transfer.
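
The amount of history available for IST depends on the size of the GCache: if the missing write sets are no longer present in the GCache of any node, a full SST is performed instead. The cache size can be tuned through the gcache.size provider option; the value below is only an illustrative sketch, not a recommendation:

wsrep_provider_options="gcache.size=1G"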

The split brain problem

Clusters of any type must be prepared to deal with a problem called split brain. To understand this problem through a possible real-world case, imagine that a cluster of database servers is spread over two data centers, and that one of the data centers loses its Internet connection. The cluster is now split into two partitions. However, the nodes of the disconnected data center still work, and local clients are able to connect to them and send queries. If the cluster is not prepared to deal with a situation like this, both data centers will probably continue to modify the same set of data. When the Internet connection of the disconnected data center is repaired, there will be many conflicts in the database. A cluster may still be able to automatically solve these conflicts; this is called an optimistic approach to the split brain problem. If the cluster cannot handle the conflicts, we can say that a disaster has happened.

Galera adopts a pessimistic approach to split brain. The technique it uses is called weighted quorum, and it is a variation of the quorum-consensus algorithm described in the book Distributed Systems: Concepts and Design by George Coulouris, Jean Dollimore, Tim Kindberg, and Gordon Blair, published by Pearson. Let's see how it works.

All the nodes in a cluster keep a count of the cluster size, which is the number of nodes in the cluster. This count is constantly updated. If a new node is added, the cluster size is increased. If a node gracefully shuts down, it communicates to the other nodes that it is leaving the cluster. However, if a node crashes or a permanent network failure occurs, it cannot communicate anything; it simply becomes unreachable.

If a node is unreachable for a given amount of time, the other nodes assume that it has left the cluster. The timeout is 5 seconds by default, but it can be customized via the evs.suspect_timeout option in the wsrep_provider_options server variable, which will be discussed later in this chapter.
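
For example, to make the cluster more tolerant of short network glitches, this timeout could be raised in the configuration file. The value below is arbitrary and only meant as a sketch; the syntax of wsrep_provider_options is covered in the Setting the wsrep parameters section:

wsrep_provider_options="evs.suspect_timeout=PT10S"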

When some nodes detect that another node is not reachable, the quorum algorithm is used. If more than half of the nodes in the cluster are still reachable, the current partition of the cluster remains a primary cluster; this is the normal state of the whole cluster even before any crash occurs. This means that the current partition can continue to perform all of its normal operations.

If only half of the original number of nodes, or fewer, are reachable, the current partition becomes a non-primary cluster. The nodes will still be able to accept connections and run the queries sent by the clients, but their databases are made read-only. As mentioned earlier, we can check whether the current cluster partition is a primary cluster or not by querying the wsrep_cluster_status status variable.
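
For example, the state of the current partition can be checked on any node with the following statement; on a node that belongs to the primary partition, the returned value is Primary:

MariaDB [(none)]> SHOW STATUS LIKE 'wsrep_cluster_status';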

Note

If a cluster only has two nodes and the connection between them is lost, neither partition will be bigger than the other. The same problem occurs if one of the nodes crashes. In both cases, neither of the resulting partitions will be primary, so neither node will be able to write data. This explains why a Galera Cluster should consist of at least three nodes.

This algorithm guarantees that, if a cluster is split into two or more partitions, only one of them will be primary. In no situation can more than one partition modify data.

However, we already mentioned that Galera uses a variation of the quorum algorithm called weighted quorum. This means that nodes can be assigned different weights. When a node is unreachable, Galera does not really count the size of the current partition; instead, it calculates the weight of the partition, which is the sum of the individual nodes' weights. By default, each node's weight is 1; thus, the weight of a partition is identical to its size.

A different weight can be assigned using the pc.weight option. The allowed range is from 0 to 255. If the weight of a node is 0, the node does not affect the result of the quorum formula. If the weight of a group of nodes is increased and those nodes lose their Internet connection, that group will probably become the primary cluster. This makes sense, for example, if a data center has fewer database servers than the others, but it is vital that it keeps modifying its data.
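
As a sketch of how the weights interact, suppose one data center contains two nodes whose weight is raised to 2, while another contains three nodes with the default weight of 1. If the link between the data centers breaks, the first partition has a weight of 4 out of a total of 7, so the smaller but heavier group remains the primary cluster. A weight could be assigned in a node's configuration file as follows (the exact ways to set wsrep parameters are described later in this chapter):

wsrep_provider_options="pc.weight=2"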

Even if the weight of the nodes is explicitly set, it is always guaranteed that no more than one partition can become a primary cluster.

Note

The weighted quorum formula used by Galera is more complex than this description suggests. The algorithm has been simplified here for the sake of clarity.

The weighted quorum algorithm can be disabled by enabling the pc.ignore_quorum option in the wsrep_provider_options server variable. See the Setting the wsrep parameters section in this chapter for details about how to do this. However, note that if the pc.ignore_quorum option is enabled, a split brain problem can occur. In this case, we need to know how to solve conflicts without the help of Galera; for example, in some cases it could be acceptable to overwrite the changes performed by a partition.
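
If we accept this risk, the option can be enabled like any other provider option; again, this is only a sketch:

wsrep_provider_options="pc.ignore_quorum=true"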

The Galera arbitrator

An arbitrator is a special type of node designed to help solve the split brain problem.

It communicates with the rest of the cluster as if it were a normal node, but it does not replicate any data. Its only purpose is to increase the size of the partition it can communicate with, possibly making it the primary cluster.

For example, suppose we only have two nodes. As explained previously, if one of them crashes or if the connection between them is lost, the resulting partitions will be non-primary. However, if we also have an arbitrator, the node that did not crash or lose its connection will still be able to communicate with the arbitrator. Technically, the two of them will form a partition of two members out of three, which will be the primary cluster.

A more complex example is when two data centers form a Galera Cluster and each data center contains the same number of nodes. This example, while involving a higher number of servers, is almost identical to the previous one: if the connection between the data centers is lost, neither of them will be able to become a primary cluster. The problem can be solved without adding another database server. We can set up an arbitrator that is not located in either of the two data centers. If one of them loses its Internet connection, the other one will still be able to communicate with the arbitrator and become a primary cluster.

To start an arbitrator, we need to call the Galera Arbitrator Daemon (garbd) binary. Its system variables and wsrep parameters are the same as the options of regular nodes, except that the wsrep parameters in the repl group are missing in the arbitrator, because they do not make sense for a node that does not replicate anything.

To use a configuration file with garbd, we can start it this way:

garbd --cfg /path/to/garb.cnf --daemon
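
The content of the configuration file depends on the cluster. The following is only a minimal sketch, assuming a hypothetical cluster name and the node addresses used earlier in this chapter; group must match the name of the cluster, address points to one or more running nodes, and log names the arbitrator's log file:

# Hypothetical garbd configuration file
group = my_cluster
address = gcomm://214.140.10.5,214.140.10.6
log = /var/log/garbd.log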

To stop an arbitrator, we can kill the garbd process. This is safe, since the arbitrator does not write any data.
