SLA

Although the term SLA (service level agreement) has already been mentioned, the first question is: what is an SLA? There are several answers (any infrastructure architect among you can confirm that), but I would like to confine myself to the following simple answer:

An SLA is a contract between you and your cloud service provider that guarantees you a certain degree of availability for the services you have booked. That sounds very good, but in real life it just means that you get your money back if the guarantee is not met.

Nevertheless, we should now pursue the question: how do I determine a degree of availability? To do so, we first need the desired uptime, which can vary according to your needs. For example, if you operate a webshop, the targeted uptime is 24 hours x 7 days, that is, 168 hours per week or 8,760 hours per year. But if you only run an application for the daily work in your company, the desired uptime can be limited to a core time, for example, 100 hours per week or 5,200 hours per year.

Let's now proceed to the calculation: take the desired uptime value, subtract the downtime value, and divide the result by the desired uptime value. Then multiply the value by 100 and you have the degree of availability as a percentage.

Here is an example calculation:
(8,760 hours uptime – 1 hour downtime) / 8,760 hours x 100 ≈ 99.99. The degree of availability is about 99.99 %.
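The calculation can also be written as a small Python function. This is only a sketch of the formula described above; the function name `availability_percent` is chosen for illustration.

```python
def availability_percent(desired_uptime_hours: float, downtime_hours: float) -> float:
    """Return the degree of availability as a percentage."""
    if desired_uptime_hours <= 0:
        raise ValueError("desired uptime must be positive")
    # (desired uptime - downtime) / desired uptime x 100
    return (desired_uptime_hours - downtime_hours) / desired_uptime_hours * 100


# Webshop example: 24 x 7 operation with one hour of downtime per year.
print(round(availability_percent(8760, 1), 2))  # 99.99
```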

As I wrote earlier, the SLA includes a guaranteed degree of availability. For Microsoft Azure, this is usually between 99.95 % and 99.99 %. Based on 8,760 hours per year, this means a possible downtime of between roughly 53 minutes (at 99.99 %) and 4 hours, 23 minutes (at 99.95 %) per year.
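To see what a guaranteed percentage means in practice, you can turn the formula around and compute the downtime it permits. The following sketch assumes a desired uptime of 8,760 hours per year; the function name `allowed_downtime_minutes` is hypothetical.

```python
def allowed_downtime_minutes(availability_percent: float,
                             desired_uptime_hours: float = 8760) -> float:
    """Return the downtime (in minutes) that a given availability still permits."""
    unavailable_fraction = 1 - availability_percent / 100
    return desired_uptime_hours * 60 * unavailable_fraction


for sla in (99.95, 99.99):
    minutes = allowed_downtime_minutes(sla)
    print(f"{sla} % allows about {minutes:.0f} minutes of downtime per year")
```

With these assumptions, 99.99 % corresponds to about 53 minutes per year and 99.95 % to about 263 minutes (roughly 4 hours, 23 minutes).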

A service with an availability degree of at least 99% is referred to as high availability.

Also, it is important to know that Microsoft grants an SLA only if you have at least two instances of your service.

You can find a list of available SLAs here: https://azure.microsoft.com/en-us/support/legal/sla/.

Some extra information about SLAs: in addition to the availability level, the following key metrics are often specified:

Mean time to recovery (MTTR): This is the average duration of recovery after a failure
Mean time between failures (MTBF): This is the average operating time between two consecutive failures, not counting repair time
Mean time to failure (MTTF): This is the average operating time until a failure; it is used when components are not repaired but replaced
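These metrics are related to the degree of availability. A commonly used relationship, which is not part of the SLA definition above and is shown here only as an illustration, is availability = MTBF / (MTBF + MTTR). The sketch below uses made-up values.

```python
def availability_from_mtbf(mtbf_hours: float, mttr_hours: float) -> float:
    """Estimate availability (in percent) from MTBF and MTTR.

    Assumes MTBF counts operating time only, without repair time.
    """
    return mtbf_hours / (mtbf_hours + mttr_hours) * 100


# Hypothetical service: fails every 999 operating hours on average
# and needs 1 hour to recover.
print(round(availability_from_mtbf(999, 1), 2))  # 99.9
```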

That is enough theory. Let's now turn to practical work with an SLA.

I have two concepts for this. The first is the so-called SLA management pattern, which, at least for the Azure platform, is of a rather theoretical nature, since there is no example of a practical implementation.

What is it about? Quite simply, every action that a consumer takes against a cloud service is registered by an SLA Monitor Agent and recorded as a log entry in a Quality of Service (QoS) repository (a data store). The repository then transmits the data to an SLA management service, which acts as a reporting tool for the cloud admin, as shown here:

What is the issue with this solution? There are no ready-made offers available on the Azure platform. So, you have to develop them yourself or buy third-party components.
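If you do develop it yourself, the data flow of the pattern could look roughly like the following in-memory sketch. All class and method names are hypothetical, the repository is a plain list instead of a real data store, and the report is a simple success rate of logged calls rather than a time-based availability figure.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class QoSRecord:
    operation: str
    succeeded: bool
    duration_seconds: float


@dataclass
class QoSRepository:
    """In-memory stand-in for the QoS data store."""
    records: List[QoSRecord] = field(default_factory=list)

    def add(self, record: QoSRecord) -> None:
        self.records.append(record)


class SlaMonitorAgent:
    """Wraps every consumer call and logs its outcome to the repository."""

    def __init__(self, repository: QoSRepository) -> None:
        self.repository = repository

    def call(self, operation: str, action: Callable[[], object]) -> object:
        started = time.monotonic()
        try:
            result = action()
        except Exception:
            # Log the failed call, then let the consumer see the error.
            self.repository.add(
                QoSRecord(operation, False, time.monotonic() - started))
            raise
        self.repository.add(
            QoSRecord(operation, True, time.monotonic() - started))
        return result


class SlaManagementService:
    """Turns the logged records into a simple report for the cloud admin."""

    def __init__(self, repository: QoSRepository) -> None:
        self.repository = repository

    def success_rate_percent(self) -> float:
        records = self.repository.records
        if not records:
            return 100.0
        succeeded = sum(1 for record in records if record.succeeded)
        return succeeded / len(records) * 100


repository = QoSRepository()
agent = SlaMonitorAgent(repository)
agent.call("get_order", lambda: "ok")
print(SlaManagementService(repository).success_rate_percent())
```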

Is this solution too complex for you? Then I would like to introduce you to a more practical way. The procedure is known by software architects as the Health Endpoint Monitoring pattern.

What is it about? To verify the availability of your application, it is advisable to carry out an access test at regular intervals. This should be done through an endpoint that the application exposes (you should create an additional endpoint for the tests and not use the default endpoint) or, in the case of a web application, by requesting a test page.

Let's take a closer look:

As shown in the preceding diagram, the test tool sends a request to the application at regular intervals. If the application is available, it responds with an HTTP 200 code (OK). If the application is not available, you get an error instead, for example, an HTTP 404 code (not found) or an HTTP 503 code (service unavailable), or no response at all.

In the same way, you can also check the availability of downstream services (for example, Azure SQL Database or Azure Storage).
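On the side of the test tool, a single check could be sketched as follows. This is a minimal example that uses only the Python standard library and treats anything other than an HTTP 200 response as "not available"; the function name and the URL in the comment are placeholders.

```python
import http.client
import urllib.error
import urllib.request


def is_healthy(url: str, timeout_seconds: float = 5.0) -> bool:
    """Return True only if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # HTTP error codes (such as 404 or 503), refused connections,
        # and timeouts all count as "not available".
        return False


# Example call with a placeholder URL:
# is_healthy("https://example.com/health")
```

A scheduler or monitoring service would call such a function at regular intervals and raise an alert after a number of consecutive failures.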

An example of the Health Endpoint Monitoring pattern in code form can be found here: https://github.com/mspnp/cloud-design-patterns/tree/master/health-endpoint-monitoring.

What's the issue with this solution? Depending on the complexity of your Azure solutions, you need to extend the example with your own developments (for example, a TCP check for TCP/IP endpoints).
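Such a TCP check could, for example, be sketched like this. It only tests whether a TCP connection can be established within a timeout; the function name is hypothetical, and the host and port are whatever your endpoint uses.

```python
import socket


def tcp_port_is_open(host: str, port: int, timeout_seconds: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Refused connections, unreachable hosts, and timeouts end up here.
        return False
```

Note that an open port only shows that something is listening; it says nothing about whether the service behind it works correctly.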
