One looming danger when running any replication system is that of node status conflicts. This happens when more than one node has been primary, and we want to reestablish the previous mirror state. This can happen in many ways, but a common scenario can occur if the existing primary node experiences a sudden failure and the remaining secondary node is promoted to primary status.
In the case where we repair the old primary node, we can't simply reattach it to the DRBD network and expect successful synchronization. In cases where the last status for each node is that of a primary, DRBD will not resolve this conflict automatically. It is our job to manually choose the best primary node from our available choices, and reattach the other node.
In this recipe, we'll explore the steps necessary to reattach a malfunctioning node to an existing DRBD architecture. We can't have a highly available PostgreSQL cluster with only one functional node.
Since we're working with DRBD and need a fully established mirror, please follow the steps in all the recipes up to Adding block-level replication before continuing. In addition, we need to simulate a split brain. A very easy way to do this is to put both nodes in the primary state while they are disconnected from each other.
Assuming that we have nodes pg1 and pg2, where pg1 is the current primary node, follow these instructions as the root user to cause a split brain:

1. Disconnect the nodes from each other by executing this command on both pg1 and pg2:

   drbdadm disconnect pg

2. On pg2, execute this command to force it into primary status:

   drbdadm primary --force pg
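At this point, both nodes believe they are primary while disconnected from each other. One way to confirm each node's view is to inspect its local DRBD role; here is a minimal sketch that parses captured samples of `drbdadm role pg` output rather than querying a live cluster (the role strings are illustrative):

```shell
#!/bin/sh
# Sample output captured from `drbdadm role pg` on each node after the
# forced promotion (format: local role / peer role). These strings are
# illustrative, not taken from a live cluster.
ROLE_PG1="Primary/Unknown"
ROLE_PG2="Primary/Unknown"

# Extract the local role: the text before the slash.
local_role() {
    printf '%s\n' "$1" | cut -d/ -f1
}

echo "pg1 local role: $(local_role "$ROLE_PG1")"
echo "pg2 local role: $(local_role "$ROLE_PG2")"

# Both nodes reporting Primary while disconnected is the split-brain setup.
if [ "$(local_role "$ROLE_PG1")" = Primary ] && \
   [ "$(local_role "$ROLE_PG2")" = Primary ]; then
    echo "both nodes claim Primary: split brain staged"
fi
```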
If we were to use drbdadm to attempt to connect the nodes now, we would see the following message in the system logs:

Split-Brain detected but unresolved, dropping connection!
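This message appears in the kernel log. A hedged sketch of scanning a captured log excerpt for it (the sample lines are illustrative of the 8.x message format, not output from a real node):

```shell
#!/bin/sh
# Illustrative kernel-log excerpt; on a real node these messages appear
# in dmesg or syslog.
LOG='kernel: block drbd0: Split-Brain detected but unresolved, dropping connection!
kernel: block drbd0: conn( WFReportParams -> Disconnecting )'

# Flag the unresolved split-brain message if present.
if printf '%s\n' "$LOG" | grep -q 'Split-Brain detected but unresolved'; then
    echo "split brain detected in log"
fi
```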
Follow these instructions as the root user to repair a split-brain scenario:

1. Determine which node should act as the primary; for this example, pg2 should be the new primary node.

2. Disconnect both nodes from each other by executing this command on pg1 and pg2:

   drbdadm disconnect pg

3. Deactivate the VG_POSTGRES volume with vgchange on pg1:

   vgchange -a n VG_POSTGRES

4. Use drbdadm to downgrade pg1 to secondary status:

   drbdadm secondary pg

5. Tell pg1 to connect while discarding metadata:

   drbdadm connect --discard-my-data pg

6. Finally, tell pg2 to connect to DRBD:

   drbdadm connect pg
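The repair sequence above can be sketched as a single script. This is a dry run that only prints each command with its target node; the resource name pg, the nodes pg1 and pg2, and VG_POSTGRES follow this recipe, while the run_on helper is hypothetical glue (replace it with ssh or direct local execution):

```shell
#!/bin/sh
# Dry-run sketch of the split-brain repair sequence. Each command is
# echoed rather than executed, so nothing here touches a real cluster.
run_on() {
    node="$1"; shift
    echo "[$node] $*"
}

# pg2 keeps its data and becomes the surviving primary.
# Disconnect both nodes so neither retries the connection mid-repair.
run_on pg1 drbdadm disconnect pg
run_on pg2 drbdadm disconnect pg
# Deactivate LVM on the demoted node so the DRBD device is unused.
run_on pg1 vgchange -a n VG_POSTGRES
# Downgrade pg1 to secondary status.
run_on pg1 drbdadm secondary pg
# Reconnect pg1, discarding its change map in favor of pg2's.
run_on pg1 drbdadm connect --discard-my-data pg
# Finally, reconnect the surviving primary.
run_on pg2 drbdadm connect pg
```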
The first step is clearly the most critical. We need to determine which node has the most recent valid data. In almost all cases, there should be sufficient logs to make this determination. However, in some network disruption scenarios coupled with automated failover solutions, this may not be obvious. Unfortunately, resolving this step is too varied to adequately express in a simple guide.
For our example, we manually promoted the pg2 node, so it should be the new primary. With that in mind, there are many states DRBD could have right now, and we want one in particular: StandAlone. By disconnecting both nodes, we don't have to worry about aborted or premature connection attempts disrupting our progress. We want both nodes to report StandAlone in /proc/drbd as the connection state (cs), as shown in this screenshot:
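Since /proc/drbd is plain text, the connection state is also easy to check mechanically. A minimal sketch that parses a captured sample rather than the live file (the excerpt follows the 8.x /proc/drbd field layout and is illustrative only):

```shell
#!/bin/sh
# Illustrative /proc/drbd status line for a disconnected node; on a
# real node you would read /proc/drbd directly.
PROC_DRBD=' 0: cs:StandAlone ro:Secondary/Unknown ds:UpToDate/DUnknown   r-----'

# Pull out the value of the cs: field.
cstate=$(printf '%s\n' "$PROC_DRBD" | sed -n 's/.*cs:\([A-Za-z]*\).*/\1/p')
echo "connection state: $cstate"

if [ "$cstate" = StandAlone ]; then
    echo "safe to proceed with repair"
fi
```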
Our next step is actually related to LVM. If DRBD is primary on a node, the second LVM layer is probably active as well. Since LVM uses the underlying DRBD device, we can't demote this node to secondary status until we use vgchange to set the active (-a) state of VG_POSTGRES to no (n).
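The demotion fails while anything still holds the DRBD device open, so it can be useful to script a check before running drbdadm secondary. The sketch below simulates the idea with a temporary directory standing in for /dev/VG_POSTGRES (a real check would look at the actual device nodes, or at /sys/block/drbd0/holders):

```shell
#!/bin/sh
# Simulated device tree: an active VG_POSTGRES exposes LV nodes under
# /dev/VG_POSTGRES; after `vgchange -a n VG_POSTGRES` those nodes go away.
DEV=$(mktemp -d)
mkdir -p "$DEV/VG_POSTGRES"
touch "$DEV/VG_POSTGRES/pgdata"     # stand-in for an active LV node

vg_active() {
    # Active if the VG directory still contains any LV nodes.
    [ -n "$(ls -A "$DEV/VG_POSTGRES" 2>/dev/null)" ]
}

if vg_active; then
    echo "VG_POSTGRES still active: demotion would fail"
fi

rm "$DEV/VG_POSTGRES/pgdata"        # simulates vgchange -a n VG_POSTGRES
if ! vg_active; then
    echo "VG_POSTGRES inactive: safe to run drbdadm secondary"
fi
rm -rf "$DEV"
```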
Given that there are no other elements connected to /dev/drbd0, we can set its status to secondary with drbdadm. While in the secondary state, we can attempt to connect to the DRBD network with drbdadm connect. However, since both nodes were primary at one point, each was maintaining a different map of modified blocks, and these maps will not match. When that happens, DRBD refuses to connect to the network and reverts to the StandAlone status.
To prevent that, we add --discard-my-data to the connect operation. This option acknowledges the situation and tells the secondary node to ignore its own change map in favor of whatever the primary node contains. If the secondary node is too out-of-date for the update map, DRBD will simply resynchronize all data on the device.
Of course, none of this will happen until we invoke drbdadm connect from the new primary node. We do this last because, until the primary reconnects, we can still change our minds and abort the process. Once the primary connects, the secondary's previously existing storage map is discarded and resynchronization begins.