Tuning checkpoints and the XLOG

Up to now, this chapter has hopefully provided some insight into how PostgreSQL writes data and what the XLOG is used for in general. Given this knowledge, we can now move on and learn what we can do to make our databases work even more efficiently, both in the case of replication and when running just a single server.

Understanding the checkpoints

In this chapter, we have seen that data has to be written to the XLOG before it may go anywhere. The thing is, if the XLOG were never deleted, we clearly could not keep writing to it forever without filling up the disk at some point.

To solve the problem, the XLOG has to be deleted at some point. This process is called checkpointing.

The main question arising from this issue is: When can the XLOG be truncated up to a certain point? The answer is: when PostgreSQL has put everything that is already in the XLOG into the storage files. If all the changes recorded in the XLOG have also been made to the data files, the XLOG can be truncated.

Tip

Keep in mind that simply writing the data is not enough; we also have to flush the data to the data files.

In a way, the XLOG can be seen as the repairman for the data files in case something happens. If everything is fully repaired, the repair instructions can be removed safely; this is exactly what happens during a checkpoint.
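Normally the server decides on its own when to checkpoint, based on the parameters described in the next section, but a checkpoint can also be triggered by hand. As a small sketch (it requires superuser privileges):

CHECKPOINT;    -- forces an immediate checkpoint; dirty buffers are flushed to the data files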

Configuring checkpoints

Checkpoints are highly important for consistency, but they are also highly relevant to performance. If checkpoints are configured poorly, you might face serious performance degradation.

When it comes to configuring checkpoints, the following parameters are relevant. Note that all of these parameters can be changed in postgresql.conf:

checkpoint_segments = 3
checkpoint_timeout = 5min
checkpoint_completion_target = 0.5 
checkpoint_warning = 30s
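If you want to check which values are currently in effect on a running server, one simple way (just a sketch) is to query the pg_settings view:

SELECT name, setting, unit FROM pg_settings WHERE name LIKE 'checkpoint%';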

In the following sections, we will take a look at each of these variables:

About segments and timeouts

checkpoint_segments and checkpoint_timeout define the distance between two checkpoints. A checkpoint happens either when we run out of segments or when the timeout expires.

Remember, a segment is usually 16 MB, so three segments mean that we will do a checkpoint every 48 MB. On modern hardware, this is not nearly enough. On a typical production system, a checkpoint_segments setting of 256 or even higher is perfectly feasible.

However, when setting checkpoint_segments, one thing has to be kept in the back of your mind: In case of a crash, PostgreSQL has to replay all the changes since the last checkpoint. If the distance between two checkpoints is unusually large, you might notice that your failed database instance takes too long to start up again. This should be avoided for the sake of availability.

Tip

There will always be a trade-off between performance and recovery times after a crash; you have to balance your configuration accordingly.

checkpoint_timeout is also highly relevant. It is the upper limit of the time allowed between two checkpoints. There is no point in increasing checkpoint_segments infinitely while leaving the time as it is. On large systems, increasing checkpoint_timeout has proven to make sense for many people.
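To make this more tangible, here is a sketch of what a write-heavy production box might use in postgresql.conf; the exact numbers are assumptions and have to be balanced against your recovery-time requirements:

checkpoint_segments = 256        # 256 * 16 MB = 4 GB of XLOG between checkpoints
checkpoint_timeout = 15min       # upper time limit between two checkpoints
checkpoint_warning = 30s         # complain in the log if checkpoints happen more often than this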

Note

In PostgreSQL, you will notice that there is a constant number of transaction log files around. Unlike in other database systems, the number of XLOG files has nothing to do with the maximum size of a transaction; a transaction can easily be much larger than the distance between two checkpoints.

To write or not to write?

We have learned in this chapter that at COMMIT time, we cannot be sure whether the data is already in the data files or not.

So, if the data files don't have to be consistent anyway, why not vary the point in time at which the data is written? This is exactly what we can do with checkpoint_completion_target. The idea is to have a setting that specifies the target of checkpoint completion as a fraction of the total time between two checkpoints.

Let us now discuss three scenarios to illustrate the purpose of the checkpoint_completion_target:

Scenario 1 – Storing stock-market data

In this scenario, we want to store the most recent stock quotes of all stocks in the Dow Jones Industrial Average (DJIA). We don't want to store the history of all stock prices but just the most recent, current price.

Given the type of data we are dealing with, we can assume we will have a workload that is dictated by UPDATE statements.

What will happen? PostgreSQL has to update the very same data over and over again. Given the fact that the DJIA consists of only 30 different stocks, the amount of data is very limited and our table will be really small. In addition to that, the price might be updated every second or even more often.
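A minimal sketch of such a workload might look like this (the table and column names are purely hypothetical):

CREATE TABLE t_quote (
    symbol   text PRIMARY KEY,
    price    numeric NOT NULL
);

-- fired every second (or more often) for each of the 30 DJIA stocks
UPDATE t_quote SET price = 134.27 WHERE symbol = 'IBM';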

Internally, the situation is like this: When the first UPDATE comes along, PostgreSQL will grab a block, put it into memory and modify it. Every subsequent UPDATE will most likely change the very same block. Logically, all writes have to go to the transaction log but what happens with the cached blocks in shared buffer?

The general rule is: If there are many UPDATEs (that is, many changes made to the same block), it is wise to keep blocks in memory as long as possible; this will greatly increase the odds of avoiding I/O by writing multiple changes in one go.

Tip

If you want to increase the odds of having many changes in one disk I/O, consider decreasing checkpoint_completion_target. Blocks will stay in memory longer and therefore many changes might go into the same blocks before a write happens.

In the scenario just outlined, a checkpoint_completion_target of 0.05 (or 5 percent) might be reasonable.
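In postgresql.conf, this could look as follows; treat the number as a starting point rather than a definitive recommendation:

checkpoint_completion_target = 0.05    # write dirty blocks late; hot blocks stay in memory longer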

Scenario 2 – Bulk loading

In our second scenario, we will load 1 TB of data into an empty table. If you are loading so much data at a time, what are the odds of hitting a block again that you already hit 10 minutes ago? The odds are basically zero. There is no point in buffering writes in this case; we would simply waste disk capacity by letting the device sit idle instead of writing.

During a bulk load, we want to use all the I/O capacity we have all the time. To make sure PostgreSQL writes data out instantly, we have to increase the checkpoint_completion_target to a value close to 1.
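A sketch of the opposite setting, assuming a dedicated load window:

checkpoint_completion_target = 0.9     # close to 1: write data out continuously instead of holding it back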

Scenario 3 – I/O spikes and throughput considerations

Sharp spikes can kill you; at the very least, they can do serious harm, which should be avoided. What is true in the real world around you is just as true in the database world.

In this scenario, we want to assume an application storing so-called Call Detail Records (CDRs) for a phone company. You can imagine that a lot of writing will happen, since people are placing phone calls all day long. Of course, there will be people placing a phone call that is instantly followed by the next one, but we will also witness a great number of people placing just one call a week or so.

Technically, this means that there is a good chance that a block in shared memory which has recently been changed will face a second or third change soon, but we will also have a great deal of changes made to blocks that will never be visited again.

How shall we handle this? Well, it is a good idea to write out data late so that as many changes as possible go to pages that have been modified before. But what will happen during a checkpoint? If changes (in this case, dirty pages) have been held back for too long, the checkpoint itself will be intense, and many blocks must be written within a fairly short period of time. This can lead to a so-called I/O spike. During an I/O spike, your I/O system is busy; it might show poor response times, and those poor response times can be felt by your end users.

This adds a new dimension to the problem: predictable response times.

Let us put it this way: Assume you have used internet banking successfully for quite a while and you are happy. Now, some guy at your bank has found a tweak that makes the database behind the internet banking 50 percent faster, but this gain comes with a downside: for two hours a day, the system will not be reachable. Clearly, from a performance point of view, the throughput will be better:

24 hours * 1x < 22 hours * 1.5x (that is, 24x < 33x)

But are you, the customer, going to be happy? Clearly, you would not be. This is a typical use case where optimizing for maximum throughput does no good. If you can meet your performance requirements, it might be wiser to have evenly distributed response times at the cost of a small performance penalty. In our banking example, this would mean that your system is up 24x7 instead of just 22 hours a day.

Tip

Would you pay your mortgage more frequently if your internet banking was 10 times faster? Clearly, you would not. Sometimes, it is not about optimizing for many transactions per second but about handling a predefined amount of load in the most reasonable way.

The same concept applies to the phone application we have outlined. We cannot hold all changes back and write them during the checkpoint, because this might cause latency issues while the checkpoint is running. It is also no good to write changes to the data files more or less instantly (meaning a high checkpoint_completion_target), because we would write too much, too often.

This is a classical example where you have got to compromise. A checkpoint_completion_target of 0.5 might be the best idea in this case.

Conclusion

The conclusion to be drawn from these three examples is that no single configuration fits all purposes. You really have to think about the type of data you are dealing with in order to come up with a good and feasible configuration. For many applications, a value of 0.5 has proven to be just fine.

Tweaking WAL buffers

In this chapter, we have already adjusted some major parameters, such as shared_buffers, fsync, and so on. There is one more parameter, however, that can have a dramatic impact on performance. The wal_buffers parameter has been designed to tell PostgreSQL how much memory to keep around for XLOG that has not yet been written to disk. So, if somebody pumps in a large transaction, PostgreSQL does not have to write every little change to the XLOG on disk right away. Remember, if a non-committed transaction is lost during a crash, we won't care about it anyway, because COMMIT is the only thing that really counts in everyday life. It therefore makes perfect sense to write XLOG in larger chunks before a COMMIT happens. This is exactly what wal_buffers does: unless changed manually in postgresql.conf, it is an auto-tuned parameter (represented by -1) that makes PostgreSQL use about 3 percent of shared_buffers, but no more than 16 MB, to keep XLOG around before writing it to disk.

Tip

In older versions of PostgreSQL, this parameter was at 64 KB. This was unreasonably low for modern machines. If you are running an old version, consider increasing wal_buffers to 16 MB. This is usually a good value for reasonably sized database instances.
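As a sketch, the setting in postgresql.conf might look like this; after a restart, SHOW wal_buffers; will report the value actually in use:

wal_buffers = 16MB        # -1 (the default in recent versions) means auto-tuned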
