Recovering from a quorum failure

There are various reasons why a Proxmox cluster can lose a quorum. For the cluster to operate correctly, a quorum must exist among the nodes. A quorum is established when a majority of the nodes, that is, more than half of the votes, are online. If half or more of the nodes go offline for whatever reason, the quorum is lost, resulting in a cluster error. Proxmox cluster communication relies on multicast, so if multicast gets disabled on the switch, the cluster can also lose a quorum. A manual misconfiguration in the cluster configuration file can also cause loss of a quorum. When a quorum is lost, error messages such as the following appear in the log files under /var/log/corosync:
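The majority rule can be made concrete with a little arithmetic. The following is a minimal sketch, assuming every node carries exactly one vote; the function names quorum_needed and has_quorum are hypothetical helpers for illustration, not Proxmox commands:

```shell
#!/bin/sh
# Votes needed for a quorum: more than half of the expected votes.
quorum_needed() {
    total=$1
    echo $(( total / 2 + 1 ))
}

# Succeeds (exit 0) when the online votes reach the quorum threshold.
has_quorum() {
    online=$1
    total=$2
    [ "$online" -ge "$(quorum_needed "$total")" ]
}

# Example: a 4-node cluster with only 2 nodes online needs 3 votes.
if has_quorum 2 4; then
    echo "quorate"
else
    echo "quorum lost"
fi
```

With these numbers the example prints "quorum lost", which shows why an even split of a four-node cluster leaves neither half quorate.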

......................
corosync[9999]: [QUORUM] Quorum provider: corosync_votequorum failed to initialize.
corosync[9999]: [SERV ] Service engine 'corosync_quorum' failed to load for reason 'configuration error: nodelist or quorum.expected_votes must be configured!'
......................
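To find such messages quickly, the logs can be searched for the quorum-related tags seen above. This is only a sketch: quorum_errors is a hypothetical helper, the search pattern is taken from the two sample lines, and the exact log file names under /var/log/corosync depend on the logging configuration of the node.

```shell
# Print quorum-related corosync messages from every file under a log directory.
# Usage: quorum_errors [log_dir]   (defaults to /var/log/corosync)
quorum_errors() {
    log_dir=${1:-/var/log/corosync}
    # -r: recurse into the directory, -h: omit file names, -E: extended regex
    grep -rhE '\[QUORUM\]|corosync_quorum' "$log_dir"
}
```

Running quorum_errors with no argument searches /var/log/corosync; pass another directory to search logs copied from a different node.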

The previous error may occur because the hostname of a node could not be resolved. Adding all the nodes' hostnames and IP addresses to /etc/hosts may help establish a quorum. The following is the hosts file content of our example node:
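For illustration, hosts file entries take the form shown in the heredoc of the sketch below. The hostnames and addresses there are hypothetical and are not the actual entries of the example node. The helper missing_hosts, also hypothetical, reports node names that have no entry in a given hosts file:

```shell
# Print each given node name that has no entry in the hosts file.
# Usage: missing_hosts <hosts_file> <name>...
missing_hosts() {
    hosts_file=$1
    shift
    for name in "$@"; do
        # Skip comment lines; match the name in any column after the address.
        awk -v n="$name" '
            /^[[:space:]]*#/ { next }
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit found ? 0 : 1 }
        ' "$hosts_file" || echo "$name"
    done
}

# Hypothetical hosts file in which the third node was forgotten.
sample=$(mktemp)
cat > "$sample" <<'EOF'
127.0.0.1     localhost
192.168.1.11  pve1.example.local pve1
192.168.1.12  pve2.example.local pve2
EOF

missing_hosts "$sample" pve1 pve2 pve3   # prints: pve3
rm -f "$sample"
```

On a real node the first argument would be /etc/hosts, followed by the names of all cluster members.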

 

If the quorum is lost due to manual editing of the cluster configuration file, then we need to reverse the change by re-editing the /etc/pve/corosync.conf file or restoring it from a recent backup. Note that after a quorum is lost, pmxcfs (the Proxmox cluster filesystem mounted at /etc/pve) becomes read-only, and so do all the files in it, including corosync.conf. To be able to edit the file, we can run the following command to temporarily establish a quorum:

# pvecm expected 1

The previous command sets the expected vote count to 1, which lets the node establish a quorum on its own. Always make sure you edit the local copy of the cluster file and that the content of this configuration is the same on all nodes. Only then can a quorum be established. A configuration that differs between nodes can cause split-brain, resulting in the full loss of a quorum.
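Before touching the file, it helps to keep a copy and, afterwards, to confirm that two nodes hold identical configurations. The following is a minimal sketch under stated assumptions: backup_conf and same_conf are hypothetical helpers, and the backup directory, the node name node2, and the use of scp in the commented example are illustrative choices rather than a prescribed procedure.

```shell
# Copy a config file to a timestamped backup and print the backup path.
# Usage: backup_conf <file> <backup_dir>
backup_conf() {
    src=$1
    dest_dir=$2
    dest="$dest_dir/$(basename "$src").$(date +%Y%m%d-%H%M%S)"
    cp -p "$src" "$dest" && echo "$dest"
}

# Succeed when two copies of the configuration are byte-for-byte identical.
same_conf() {
    cmp -s "$1" "$2"
}

# Example (not run here): back up before editing, then compare the local
# copy with one fetched from another node. Paths under /root and /tmp and
# the host node2 are hypothetical.
#   backup_conf /etc/pve/corosync.conf /root/corosync-backups
#   scp root@node2:/etc/corosync/corosync.conf /tmp/node2-corosync.conf
#   same_conf /etc/corosync/corosync.conf /tmp/node2-corosync.conf || echo "configs differ"
```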

It is of utmost importance to avoid manual editing of the corosync.conf file wherever possible. If manual editing becomes necessary, commit changes only if you fully understand their effect. If unsure of how the corosync.conf file works, it is best to avoid editing it yourself and to seek help from the Proxmox forum or paid support.

After restoring the content of corosync.conf with a working configuration, restart the cluster using the following commands:

# systemctl restart pve-cluster
# systemctl restart corosync
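The restart can be wrapped in a small function that also checks the result. This is a sketch only: restart_cluster and the RUN dry-run variable are hypothetical conveniences, while pvecm status is used to display the quorum state once the services are back.

```shell
# Restart the cluster services in the order given above, then show the
# cluster status. Set RUN=echo to print the commands instead of running them.
restart_cluster() {
    ${RUN:-} systemctl restart pve-cluster &&
    ${RUN:-} systemctl restart corosync &&
    ${RUN:-} pvecm status
}
```

Calling it as RUN=echo restart_cluster prints the three commands without executing anything, which is a safe way to review the sequence first.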

If the quorum is lost or cannot be established due to a multicast error, the first step is to check whether multicast is properly configured and working on the network. We can use the following command format to check multicast between nodes. Run it on each listed node at about the same time, since the omping instances answer one another:

# omping -c 10000 -i 0.001 -F -q <node1_ip> <node2_ip>

If the previous test fails, multicast is not working between the nodes, which is the likely reason the quorum cannot be established.
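The summary that omping prints can also be checked by a script. The following sketch assumes that each summary line has the form "address : multicast, xmt/rcv/%loss = sent/received/loss%, ..."; if your omping version formats its output differently, the pattern needs adjusting. The helpers multicast_loss and multicast_ok are hypothetical.

```shell
# Read omping summary output on stdin and print "<address> <loss%>" for each
# multicast line. ASSUMPTION: summary lines look like
#   10.0.0.2 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = ...
multicast_loss() {
    sed -n 's|^\([^ ]*\) *: *multicast, xmt/rcv/%loss = [0-9]*/[0-9]*/\([0-9]*\)%.*|\1 \2|p'
}

# Succeed only if at least one multicast line was seen and none reported loss.
multicast_ok() {
    multicast_loss | awk '{ seen = 1; if ($2 != 0) bad = 1 } END { exit (seen && !bad) ? 0 : 1 }'
}
```

A possible use is to pipe the test into the check, for example omping -c 10000 -i 0.001 -F -q <node1_ip> <node2_ip> | multicast_ok || echo "multicast problem". Treating any nonzero loss as a failure is a deliberately strict choice for this sketch.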
