Chapter 9. Handling Hardware and Software Disasters

In this chapter, you will be introduced to issues related to hardware and software failure. Even expensive hardware is far from perfect, and it may fail from time to time. Trouble can come in the form of failing memory, broken hard drives, damaged filesystems, kernel-related issues, and so on. This chapter covers some of the most common problems.

Here are some of the important topics we will cover:

  • Checksums – preventing silent corruption
  • Zeroing out damaged pages
  • Dealing with index corruption
  • Dumping single pages
  • Resetting the transaction log
  • Power outage-related issues

Checksums – preventing silent corruption

When reading books about high availability and fault tolerance, I often get the impression that most systems are designed on the assumption that crashes are atomic. What do I mean by that? Imagine two servers: one crashes cleanly and the other takes over. Unfortunately, real-world crashes are often far from atomic. Trouble is likely to build up gradually until things go really wrong. Consider memory corruption: when memory goes bad, it might not lead to an instant crash, and even when it does, the system might restart without problems before the trouble returns. In many cases, a problem that accumulates silently is far more dangerous than a simple crash, because the damage can spread gradually through the system without anybody noticing.

To prevent problems related to silent corruption caused by faulty memory or storage, PostgreSQL provides block-level checksums. By default, data is written as it is, one block at a time. With block checksums enabled, PostgreSQL computes a checksum for each 8 kB block when the block is written and verifies it when the block is read back, so a damaged block is reported instead of being used silently.
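The idea behind block-level checksums can be sketched in a few lines of shell. This is an illustration only: it uses the standard cksum utility and a separate checksum list, whereas PostgreSQL uses its own algorithm and keeps the checksum inside each page. The function names block_checksums and changed_blocks are hypothetical helpers, not PostgreSQL tools.

```shell
# Illustration only: per-block checksums with cksum(1), not PostgreSQL's
# own page checksum algorithm or on-disk layout.

BLOCK_SIZE=8192   # same size as PostgreSQL's default 8 kB block

# Print "<block number> <crc>" for every block of the file given as $1.
block_checksums() {
  file=$1
  size=$(wc -c < "$file" | tr -d ' ')
  blocks=$(( (size + BLOCK_SIZE - 1) / BLOCK_SIZE ))
  i=0
  while [ "$i" -lt "$blocks" ]; do
    crc=$(dd if="$file" bs="$BLOCK_SIZE" skip="$i" count=1 2>/dev/null \
          | cksum | cut -d' ' -f1)
    echo "$i $crc"
    i=$((i + 1))
  done
}

# Print the numbers of the blocks whose checksum no longer matches.
# $1 = saved output of block_checksums, $2 = file to verify.
changed_blocks() {
  block_checksums "$2" | diff "$1" - | sed -n 's/^> \([0-9][0-9]*\) .*/\1/p'
}

# Usage (file names are placeholders):
#   block_checksums some_file > some_file.sums
#   changed_blocks some_file.sums some_file   # prints nothing if all blocks match
```

If a single byte inside one block changes, only that block's number is printed. This is the property that makes per-block checksums useful: the damage is both detected and localized.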

To enable data checksums, run initdb with the -k option (the long form is --data-checksums). The setting is fixed when the instance is created and applies to the entire instance.
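A minimal sketch of creating an instance with checksums enabled might look as follows. The wrapper function init_cluster_with_checksums and its safety check are assumptions added for illustration; only initdb, -k, and -D come from PostgreSQL itself, and /data is the example directory used in this chapter.

```shell
# Hypothetical wrapper: create a new instance with data checksums enabled.
# $1 is the target data directory, for example /data.
init_cluster_with_checksums() {
  datadir=$1
  # A PG_VERSION file means the directory already holds an instance.
  if [ -e "$datadir/PG_VERSION" ]; then
    echo "refusing to run: $datadir already contains an instance" >&2
    return 1
  fi
  # -k turns on data checksums, -D names the data directory.
  initdb -k -D "$datadir"
}

# Usage:
#   init_cluster_with_checksums /data
```

The guard is only a convenience for the reader; initdb itself also refuses to initialize a directory that is not empty.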

To check whether a database instance has data checksums enabled, you can run the following command (assuming that the database instance is living in /data):

$ pg_controldata /data | grep -i checks

The command prints the "Data page checksum version" line of the control file. A value of 0 means that checksums are turned off, and a value of 1 means that they are turned on.
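For scripts, it can be handy to turn that line into a simple word. The following sketch assumes the output format described above; checksum_state is a hypothetical helper, not part of PostgreSQL.

```shell
# Hypothetical helper: read pg_controldata output on stdin and print
# "enabled", "disabled", or "unknown" (if the line is missing).
checksum_state() {
  awk -F': *' '
    /^Data page checksum version/ {
      state = ($2 + 0 > 0) ? "enabled" : "disabled"
      print state
      found = 1
    }
    END { if (!found) print "unknown" }
  '
}

# Usage (assuming the instance lives in /data):
#   pg_controldata /data | checksum_state
```

Treating any value greater than zero as "enabled" is a design choice of this sketch, so that it keeps working if the version number ever changes.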

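In the PostgreSQL releases this chapter was written for, checksums cannot be turned on after initdb, so the setting cannot be changed on an existing production instance. PostgreSQL 12 and later ship the pg_checksums utility, which can enable or disable checksums on an existing instance, but only while the server is cleanly shut down.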
