DB Failover Correction

Overview

After an automated failover in which RepMgr has promoted the secondary database node to primary, several steps are required to restore the replica set to a fully functioning state.

Until those steps are completed, the replica set runs in a degraded state: the promoted node serves as the primary, and the failed node is no longer part of replication.

Procedure

After a failover in which a secondary node was promoted to primary, the actions below are needed to restore the previous status quo.

Once the original primary node is back up and running, check its cluster state via:

repmgr -f /etc/repmgr/13/repmgr.conf cluster show

The original primary node will still be flagged as a primary node, but it is no longer the elected primary across the cluster. The output will instead show that the original secondary node is now acting as the primary.

To bring the original primary node back into replication, it must be rebased against the original secondary node, which is now the current primary. This also demotes the original primary to a secondary node.

This is done with the command below, executed on the original primary node.

Note: The original primary database must be shut down before the command is run. It is started again automatically once the data files are synced.

repmgr -f /etc/repmgr/13/repmgr.conf node rejoin -d 'host=sms-02 user=repmgr dbname=repmgr connect_timeout=2' --force-rewind

Once the command has completed, the original primary becomes the new secondary node and the old secondary is officially the cluster primary.
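The rejoin step above can be wrapped in a small helper that refuses to run while the local database is still up and performs a dry run before the real rejoin. This is only a sketch: the function name `rejoin_as_standby` and the `REPMGR_CONF` variable are illustrative, and it assumes `pg_isready` is on the PATH and checks the local instance on its default port.

```shell
#!/bin/sh
# Sketch only: run on the node that is to become the secondary.
# REPMGR_CONF and rejoin_as_standby are illustrative names, not part of repmgr.
REPMGR_CONF="${REPMGR_CONF:-/etc/repmgr/13/repmgr.conf}"

rejoin_as_standby() {
    # $1: host name of the current primary, e.g. sms-02
    conninfo="host=$1 user=repmgr dbname=repmgr connect_timeout=2"

    # Refuse to continue if the local server still accepts connections,
    # because the database must be shut down before the rejoin.
    if pg_isready -q; then
        echo "local PostgreSQL is still running; shut it down first" >&2
        return 1
    fi

    # Check prerequisites first; --dry-run makes no changes.
    repmgr -f "$REPMGR_CONF" node rejoin -d "$conninfo" --force-rewind --dry-run || return 1

    # Perform the actual rejoin.
    repmgr -f "$REPMGR_CONF" node rejoin -d "$conninfo" --force-rewind
}
```

With the function loaded, `rejoin_as_standby sms-02` issues the same command as above, preceded by the checks.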

RepMgr supports a standby switchover command to swap the primary role between the two servers; however, this requires passwordless SSH access. Due to security concerns, this guide does not assume passwordless SSH access between nodes for the postgres user.

If the original cluster roles need to be restored, the primary and secondary nodes can be switched by triggering a failover in reverse: shut down the new primary node, wait for the old primary to be re-elected, and wait for the application nodes to follow the elected primary.
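Waiting for the old primary to be re-elected can be scripted by polling the node until it leaves recovery. The sketch below is one possible approach, not part of RepMgr: the function name `wait_for_promotion` is illustrative, and it assumes the repmgr user can connect to the repmgr database through the local socket without a password prompt.

```shell
#!/bin/sh
# Sketch only: run on the old primary after the new primary has been shut down.
wait_for_promotion() {
    # $1: maximum number of attempts, $2: seconds to sleep between attempts
    attempts="${1:-30}"
    delay="${2:-2}"
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # pg_is_in_recovery() returns 'f' on a primary and 't' on a standby.
        state="$(psql -U repmgr -d repmgr -Atc 'SELECT pg_is_in_recovery()' 2>/dev/null)"
        if [ "$state" = "f" ]; then
            echo "local node is now primary"
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "local node was not promoted within the allotted attempts" >&2
    return 1
}
```

For example, `wait_for_promotion 30 2` polls for roughly one minute before giving up.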

Once the failover has completed, the old secondary needs to be rebased against the primary by running the above command in reverse, that is, on the old secondary with the old primary as the target host.

Once all failover steps have been completed, the cluster status can be checked with:

repmgr -f /etc/repmgr/13/repmgr.conf cluster show
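For a scripted version of this final check, the `--csv` option of `cluster show` prints one line per node as node ID, availability (0 = available, -1 = unavailable) and recovery state (0 = not in recovery, 1 = in recovery, -1 = unknown). The helper below is a minimal sketch that reads such output on standard input; the name `check_cluster_csv` and the rule it applies (exactly one primary, at least one standby, no unavailable nodes) are assumptions about what counts as healthy for this setup.

```shell
#!/bin/sh
# Sketch only: summarise "repmgr cluster show --csv" output read from stdin.
check_cluster_csv() {
    awk -F, '
        NF < 3             { next }           # skip blank or malformed lines
        $2 != 0            { unavailable++ }  # node could not be reached
        $2 == 0 && $3 == 0 { primaries++ }    # available, not in recovery
        $2 == 0 && $3 == 1 { standbys++ }     # available, in recovery
        END {
            printf "primaries=%d standbys=%d unavailable=%d\n", primaries, standbys, unavailable
            # Exit 0 only when the cluster matches the assumed healthy shape.
            exit !(primaries == 1 && standbys >= 1 && unavailable == 0)
        }'
}
```

Usage: `repmgr -f /etc/repmgr/13/repmgr.conf cluster show --csv | check_cluster_csv`. A non-zero exit status means the cluster does not match the assumed healthy shape.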