Understanding the levels of IT service

Availability is directly tied to the service level and is normally expressed as a percentage: the percentage of uptime over a defined period. The availability that you can guarantee is your service level. The following table shows what this means in practice by listing the maximum allowed downtime for some frequently used availability percentages:

Availability percentage   Max downtime per year   Max downtime per month   Max downtime per week
90% (one nine)            36.5 days               72 hours                 16.8 hours
95%                       18.25 days              36 hours                 8.4 hours
99% (two nines)           3.65 days               7.20 hours               1.68 hours
99.5%                     1.83 days               3.60 hours               50.4 minutes
99.9% (three nines)       8.76 hours              43.2 minutes             10.1 minutes
99.95%                    4.38 hours              21.6 minutes             5.04 minutes
99.99% (four nines)       52.56 minutes           4.32 minutes             1.01 minutes
99.999% (five nines)      5.26 minutes            25.9 seconds             6.05 seconds
99.9999% (six nines)      31.5 seconds            2.59 seconds             0.605 seconds
99.99999% (seven nines)   3.15 seconds            0.259 seconds            0.0605 seconds

Monthly figures assume a 30-day month.
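
The downtime figures above follow from one multiplication: the unavailable fraction times the length of the period. The following is a minimal sketch of that arithmetic, not a tool from the book; the function name `max_downtime_seconds` and the `PERIOD_DAYS` mapping are illustrative, and the 365-day year and 30-day month are assumptions.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Period lengths in days. The 365-day year and 30-day month are assumptions.
PERIOD_DAYS = {"year": 365, "month": 30, "week": 7}


def max_downtime_seconds(availability_percent: float, period: str) -> float:
    """Return the maximum allowed downtime, in seconds, for one period."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * PERIOD_DAYS[period] * SECONDS_PER_DAY


if __name__ == "__main__":
    for percent in (99.0, 99.9, 99.999):
        budget = {p: max_downtime_seconds(percent, p) for p in PERIOD_DAYS}
        print(percent, {p: round(s, 2) for p, s in budget.items()})
```

Working in seconds keeps the function simple; convert to minutes, hours, or days only when presenting the result.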

Note

Uptime is not a synonym for availability. A system can be up and running but not available: for instance, during a network fault all the systems are up and running, yet the service is not available.

Availability must be calculated end to end: every component required to run the service must be available. This may seem a paradox, but the more hardware you add, the more failure points you need to consider and the harder it becomes to implement an efficient solution. Another important point is how easy it will be to patch and maintain your HA system. A truly highly available system does not depend on human intervention. For example, if you agree to a five nines service level, your system administrator has only about one second of downtime per day to work with, so the system must respond to the issue automatically. If you instead agree to a two nines service level agreement (SLA), the allowed downtime is about 15 minutes per day; here human intervention is realistic, but unfortunately such an SLA is not common. Finally, when agreeing to an SLA, the mean time to recovery is an important factor to consider.
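
To make the end-to-end point concrete, the sketch below multiplies the availabilities of a chain of required components and converts a service level into a daily downtime budget. It assumes that component failures are independent, which real systems only approximate; the three-component chain and both function names are hypothetical.

```python
from math import prod


def end_to_end_availability(component_availabilities):
    """Availability of a chain in which every component is required.

    Assumes failures are independent, so the fractions multiply.
    """
    return prod(component_availabilities)


def daily_downtime_seconds(availability):
    """Allowed downtime per day, in seconds, for an availability fraction."""
    return (1 - availability) * 24 * 60 * 60


if __name__ == "__main__":
    # Hypothetical chain: network, web server, and database, each at 99.9%.
    chain = [0.999, 0.999, 0.999]
    total = end_to_end_availability(chain)
    print(f"end-to-end availability: {total:.4%}")
    print(f"five nines budget: {daily_downtime_seconds(0.99999):.2f} s/day")
    print(f"two nines budget: {daily_downtime_seconds(0.99) / 60:.1f} min/day")
```

Under this model, three required components at 99.9% each give roughly 99.7% end to end, which is lower than any single component: every added dependency lowers the ceiling.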

Note

Mean Time To Recovery (MTTR) is the mean time that a device will take to recover from a failure.
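
One common way to relate MTTR to an availability target is the steady-state formula availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures. MTBF is not discussed in this section and is introduced here as an assumption; the sketch below, with hypothetical function names, only illustrates how a recovery time translates into an achievable service level.

```python
def availability_from_mtbf_mttr(mtbf_hours, mttr_hours):
    """Steady-state availability as a fraction between 0 and 1."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def max_mttr_hours(target_availability, mtbf_hours):
    """Largest MTTR that still meets the target for a given MTBF.

    Rearranged from A = MTBF / (MTBF + MTTR).
    """
    if not 0 < target_availability <= 1:
        raise ValueError("target availability must be in (0, 1]")
    return mtbf_hours * (1 - target_availability) / target_availability


if __name__ == "__main__":
    # Hypothetical device that fails on average once every 999 hours.
    print(availability_from_mtbf_mttr(mtbf_hours=999, mttr_hours=1))
    print(max_mttr_hours(target_availability=0.999, mtbf_hours=999))
```

Under this model, a device that fails on average every 999 hours must recover within about one hour to sustain three nines.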

The first thing to do is to keep the architecture as simple as possible and reduce the number of actors in play to a minimum. The simpler the architecture, the less effort is required to maintain, administer, and monitor it. All that an HA architecture needs is to avoid single points of failure while staying as simple as possible. For this reason, the solution presented here is easy to understand, tested in production environments, and quite easy to implement and maintain.

Note

Complexity is the first enemy of high availability.
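
As a rough illustration of why removing a single point of failure helps, the sketch below estimates the availability of a group of redundant nodes that stays up as long as one node is up. It assumes independent node failures and instantaneous, perfect failover, neither of which holds exactly in practice; the function name and the 99% figure are hypothetical.

```python
def redundant_availability(node_availability, nodes=2):
    """Availability of a group that is up while at least one node is up.

    Assumes independent node failures and instantaneous, perfect failover.
    """
    if nodes < 1:
        raise ValueError("at least one node is required")
    return 1 - (1 - node_availability) ** nodes


if __name__ == "__main__":
    single = 0.99  # hypothetical availability of one server
    pair = redundant_availability(single, nodes=2)
    print(f"single node: {single:.4%}, redundant pair: {pair:.4%}")
```

Under these idealized assumptions, two 99% nodes reach about 99.99%. In a real deployment the failover mechanism is itself a component that can fail, which is one more reason to keep the design simple.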

Unfortunately, a highly available infrastructure is not designed to achieve the highest possible performance or throughput, because keeping two servers updated normally introduces some overhead. There are, however, implementations that use the standby server as a read-only server to reduce the load on the primary node, putting an otherwise unused, inactive server to work.

Note

A highly available infrastructure is not designed to achieve maximum performance or throughput.
