Fixing common replication issues

Given the goals set for this chapter, the last topic to be done is the customer's complaint about replication, which tends to break from time to time. In PostgreSQL, replication never breaks for technical reasons; only misconfiguration can lead to unexpected behavior. Here are the most common issues:

  • Replication stops and falls behind
  • Queries fail on the slave for some reason

Let's attack one problem at a time.

Fixing stopped replication

One of the most common problems is that at some point, replication just stops. The slave will start to fall behind, and the result coming back from a slave will be old and outdated. There are a couple of reasons for this kind of behavior:

  • Constant lack of bandwidth
  • Replication conflicts

In many cases, bandwidth is a serious issue. If two servers are connected using a 100 MB interconnect, it can easily happen that the slave falls behind and eventually dies if the write load is constantly higher than 10 MB per second. Keep in mind that the master only keeps a couple of xlog files around, which are needed to recover in case of a crash of the master. In the default configuration, it does not queue transaction logs forever just to handle problems on the slaves. If the load is too high on a systematic basis, this would not help anyway.

In PostgreSQL, the slave requests the transaction log it needs from the master. If the master does not have the desired xlog file any more, the slave cannot continue. To many users, this looks like a replication failure, which it really is not. It is clearly desired behavior. To reduce the problem straight away, two options are available:

  • Set wal_keep_segments
  • Use replication slots (a crash-safe feature introduced in PostgreSQL 9.4 that can be used to avoid premature removal of xlog)

In general, it is always recommended to use wal_keep_segments, unless replication slots are used. Otherwise, it is too risky and a slave is likely to fall much behind its master. When configuring wal_keep_segments, be gracious and don't use settings that are too low. A nice xlog retention pool can really pay off.

Note

Keep in mind that if you are constantly using more bandwidth than is available, wal_keep_segments won't help.

Replication slots are a nice alternative to wal_keep_segments. In the case of a normal setup using wal_keep_segments, a master will keep as much xlog as configured by the administrator. But how can the administrator know this? Usually, there is no way to know.

A replication slot can help because it keeps xlog around as long as it is needed. Here is how it works:

postgres=# SELECT * FROM   pg_create_physical_replication_slot('rep_slot'),
  slot_name  | xlog_position
-------------+---------------
 rep_slot    |

It is easy to check for existing replication slots:

postgres=# SELECT slot_name, slot_type FROM pg_replication_slots;
  slot_name  | slot_type 
-------------+-----------
 rep_slot    | physical  
(1 row)

In the replica, the name of the replication slot has to be added to recovery.conf:

primary_slot_name = 'rep_slot'

Note

Keep in mind that if the slave goes down, the master might fill up. Therefore, it is essential to monitor this kind of setup.

Fixing failed queries

Another important issue is related to failed queries. It can happen that queries are cancelled on the slave due to replication conflicts. If you are working on a system that constantly kicks queries, it can be very helpful to check out hot_standby_feedback.

The idea is to let the slave report its oldest transaction ID from time to time. The master can then react and ensure that conflicts are prevented by delaying VACUUM on itself. In reality, setting hot_standby_feedback on the slave leaves us with the same behavior as if the query had actually run on the master (not performance-wise, of course).

If the people working on those slave systems can be trusted, it usually makes sense to turn hot_standby_feedback on. If users on the slave are not considered to be competent enough, it is safer to stick to the default value—off.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.143.239.44