The two main components of the software we use in this chapter are Corosync and Pacemaker. Each of these is composed of, or depends on, several other elements and prerequisites. For now, we'll simply refer to the entire suite as Pacemaker, as it makes up the bulk of how we will control the failover system.
This recipe should be relatively short, as we will only discuss installation of Corosync and Pacemaker, not their configuration.
Red-Hat-based systems such as Fedora, CentOS, and Scientific Linux will already have Pacemaker in their repositories. Debian and its derivatives such as Ubuntu also include Pacemaker as an optional install from standard repositories. Red Hat Enterprise Linux (RHEL) itself, however, only offers the software as a paid add-on, available at http://www.redhat.com/products/enterprise-linux-add-ons/high-availability/.
Whatever choice you make, it shouldn't be necessary to compile Pacemaker from source on most Linux distributions.
Follow these quick steps to install Pacemaker and Corosync on all PostgreSQL server pairs running a Debian-based distribution:
sudo apt-get install corosync pacemaker
sudo update-rc.d corosync disable
For those running a Red-Hat-based operating system, follow these steps to install and prepare Pacemaker:
sudo yum install corosync pacemaker
sudo chkconfig corosync off
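As a convenience when preparing several servers, the two variants above can be wrapped in a small dry-run helper. This is only a sketch: the function names `recipe_steps` and `detect_family` are hypothetical, and the detection simply assumes that the presence of `apt-get` or `yum` on the PATH identifies the distribution family. It prints the commands for review rather than running them.

```shell
#!/bin/sh
# Hypothetical helper: print the two recipe commands for a distribution
# family without executing them.
recipe_steps() {
    case "$1" in
        debian)
            echo "sudo apt-get install corosync pacemaker"
            echo "sudo update-rc.d corosync disable"
            ;;
        redhat)
            echo "sudo yum install corosync pacemaker"
            echo "sudo chkconfig corosync off"
            ;;
        *)
            echo "unknown distribution family: $1" >&2
            return 1
            ;;
    esac
}

# Assumption: the package manager found on the PATH identifies the family.
detect_family() {
    if command -v apt-get >/dev/null 2>&1; then
        echo debian
    elif command -v yum >/dev/null 2>&1; then
        echo redhat
    else
        echo unknown
    fi
}
```

Running `recipe_steps "$(detect_family)"` on a server prints the matching pair of commands, which can then be reviewed and executed by hand.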
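Each of these short recipes consists of two steps: installing the Corosync and Pacemaker packages, and then preventing Corosync from starting automatically at boot.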
While the first step makes sense, why do we need the second? When running a highly available cluster, caution is a beneficial attribute. A server may reboot for any number of reasons, and many of those include crashes that require further investigation.
Were Pacemaker to start immediately following a server reboot, we could potentially lose valuable diagnostic information. More importantly, a rebooted server should be considered in an unknown or potentially damaged state until it is examined by an experienced system administrator. We don't want a misbehaving server as part of our critical infrastructure.
Corosync is the communication layer between Pacemaker nodes. It also launches the Pacemaker management system. This means that we can prevent all node management simply by keeping Corosync from starting at boot.
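To confirm that Corosync will stay down after a reboot, we can look for its start links. The following is a minimal sketch that assumes a SysV-style init layout, where a service that starts at boot has a link named `S<NN><service>` in a runlevel directory such as `/etc/rc2.d`. The function name `starts_at_boot` is hypothetical, and the check does not apply to hosts where another init system manages Corosync.

```shell
#!/bin/sh
# Hypothetical helper: exit 0 if a SysV start link for the service exists
# in any of runlevels 2-5 under the given rc root, and exit 1 otherwise.
# usage: starts_at_boot SERVICE [RC_ROOT]   (RC_ROOT defaults to /etc)
starts_at_boot() {
    service_name=$1
    rc_root=${2:-/etc}
    for link in "$rc_root"/rc[2-5].d/S[0-9][0-9]"$service_name"; do
        # An unmatched glob is left as a literal string, so check that
        # the path really exists (as a file or as a symlink).
        if [ -e "$link" ] || [ -L "$link" ]; then
            return 0
        fi
    done
    return 1
}
```

After the second step of the recipe, `starts_at_boot corosync` should return a non-zero status on such a system.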
If you believe we are being too wary, simply skip the second step in our recipe. However, it's important to remember that services are easy to start on Linux servers. This command, for instance, will start Corosync normally:
sudo service corosync start
If the server was rebooted as the result of maintenance, the preceding command will return the system to normal operation. Otherwise, a few cursory checks through server logs may determine that the cause of the system crash does not adversely affect PostgreSQL data. If so, once again, it is easy to start Corosync and re-establish the dual-node cluster.
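The cursory log check can be partly scripted. The sketch below is illustrative only: the function name `log_looks_clean` is hypothetical, and the three search patterns are assumptions about what a worrying kernel message might look like, not an exhaustive list. A crash may also leave no trace in the log at all, so a clean result does not replace a review by an administrator.

```shell
#!/bin/sh
# Hypothetical helper: scan a log file for a few illustrative trouble signs.
# Returns 0 if none are found, 1 if any are found, 2 if the log is unreadable.
log_looks_clean() {
    log_file=$1
    if [ ! -r "$log_file" ]; then
        echo "cannot read log file: $log_file" >&2
        return 2
    fi
    # Assumed patterns; adjust them to the messages relevant to your servers.
    if grep -E -i -q 'kernel panic|out of memory|i/o error' "$log_file"; then
        return 1
    fi
    return 0
}
```

A cautious administrator might then run something like `log_looks_clean /var/log/syslog && sudo service corosync start`, substituting `/var/log/messages` on Red-Hat-based systems.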
What we have done here is a very rudimentary form of STONITH, which stands for Shoot The Other Node In The Head. Dedicated STONITH hardware may power a server off completely or remove it from the network, making it inaccessible through anything other than console emulation or direct access. Truly high-availability systems cannot afford to introduce unknown entities into a carefully crafted and manicured architecture. Doing so invites undefined behavior across the spectrum of database services, which could lead to outages or data loss.
If we want to claim that our data is important and our uptime is essential, we need to adopt a similar stance toward crashed or damaged servers. We haven't gone so far as to completely disable the server in this recipe; we only prevent it from rejoining a functioning Pacemaker pair. In a true STONITH-enabled organization, our measures would be much more drastic.