Determining acceptable losses

We know that the PostgreSQL database will be offline at some point in the future. Maybe we need an upgrade to remove a critical security vulnerability or address a potential data corruption issue. Perhaps a RAM module is producing errors and needs immediate replacement. Maybe the primary data center was struck by lightning.

No matter the reason, we need to make decisions quickly. A helpful way is to ensure that the decision-making process is basing the answers on what the user expects for various levels of liability and on the context of the user. The QA department will not require the same response level as 10,000 shoppers who can't make a holiday purchase during a critical sale.

System outage and response escalation expectations are generally codified in a Service Level Agreement (SLA). How long should the maintenance last? How often should planned outages occur? When should users be informed and to what extent? Who is included in the set of potential database users? All of these things, and more, should be defined before a production system is released. Otherwise, we risk alienating clients with unexpected and arbitrary downtime or outages that persist for hours.

Clients who have their trust broken may leave and never return. So, let's teach them when to expect short amounts of unavailability and set their minds at ease with prompt contact and status management.

Getting ready

Much of our work depends on knowing how much downtime the business is willing to tolerate and who uses the database and when. We also need to know how long the application can obscure a PostgreSQL outage through caches, queues, and connection management. Try to get a complete picture of the database's role before continuing.

How to do it...

Try to answer all of these questions:

  • Who uses the database? For each type of user, answer these questions:
    • When does this user access the database?
    • What is the maximum query timeout they will tolerate?
    • Will the user lose money during an outage?
    • Is the user likely to return later?
    • Should this user be included in maintenance notifications?
    • Should this user be included in emergency notifications?
  • Can we get the user to agree to or even sign the SLA?
  • What uptime percentage is expected? 99 percent? 99.9 percent? 99.99 percent? More?
  • What are the company's official business hours?
  • When should notifications be sent?
  • How long can the platform operate without the database?
  • How long should regular maintenance windows be?
  • How often can maintenance occur?
  • Which weekdays can we consider for maintenance?
  • What is an emergency?
  • What situations require the activation of disaster recovery nodes?
  • Can we get a lawyer to write all of these into a contract?

How it works...

That is a lot of questions, and the list probably isn't even complete. It is, however, a very good start. Notice how we want to know who (or what) is using the database on a regular basis. This is not the same as a user who connects to the database. In this context, we want to know the type of user. Is it the business, another department, a critical application component, or even just a regular website user? Each of these will have different expectations, reactions, usage times, and impact.

The next question we need to answer is how uptime is defined. One frequently quoted value is the number of nines, referring to a percentage approaching 100 percent. Three nines for example, would be 99.9 percent of a year, which is almost nine hours. Four nines is only about 50 minutes. Keep in mind that the SLA can be written to include or exclude planned maintenance, depending on the audience. Unplanned outages definitely count, and remember that this is the total cumulative time for the entire year.

The next important aspect is the latest time a business is officially available. Maintenance should begin after this time and no sooner. Critical PostgreSQL nodes should not be taken offline if more than 5 percent of active users are utilizing the platform and database. It is not uncommon for regular maintenance windows to appear very late at night. Disaster recovery systems, standby nodes, and stage or development copies are all excellent candidates for updates following official business hours. We still want these systems available for developers and QA staff or in case of an unexpected production-level outage, so it pays to be a little more cautious.

The rest are a mix of important questions that need answers, the last of which implies the involvement of a lawyer. If possible, have the SLA in a contract form for all applicable clients and users. A signed agreement acts as a barrier to litigation and liability and sets very definite boundaries to user expectations early in the process.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.138.134.114