Which metrics to capture

A question that comes up frequently when talking about metrics is which ones to emit and monitor. There are many possible metrics and even more opinions on this subject. As a good starting point, the following metrics are often gathered for web applications:

Requests per minute, transactions per minute, or something similar: This is a metric that is intended to capture the current load or throughput on a web application.
The average response time: This metric captures the average response time across all requests within a time window.
The error rate: This metric captures the percentage of all requests that result in an error. As a guideline, all HTTP response codes of 400 and up are often counted as errors.
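
As an illustration, the three metrics above could be computed from a batch of request records along the following lines. This is only a minimal sketch: the Request record, its field names, and the summarize_window function are hypothetical names introduced for this example, and it assumes that all records passed in fall within the same time window.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical record of one handled request (names are assumptions).
    timestamp: float    # seconds since an arbitrary starting point
    duration_ms: float  # response time in milliseconds
    status: int         # HTTP response code


def summarize_window(requests, window_seconds=60.0):
    """Return (requests_per_minute, avg_response_ms, error_rate_percent)."""
    count = len(requests)
    if count == 0:
        return 0.0, 0.0, 0.0
    # Throughput, normalized to one minute.
    requests_per_minute = count * 60.0 / window_seconds
    # Average response time across all requests in the window.
    avg_response_ms = sum(r.duration_ms for r in requests) / count
    # Error rate, following the guideline of counting codes of 400 and up.
    errors = sum(1 for r in requests if r.status >= 400)
    error_rate_percent = 100.0 * errors / count
    return requests_per_minute, avg_response_ms, error_rate_percent
```

In practice, a monitoring platform usually performs this aggregation, so application code often only needs to emit the raw request data.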

When these three metrics are captured and plotted together in one graph, they provide the first step toward understanding the behavior of an application. Let's explore a few examples:

When the average response time goes up but the throughput (requests per minute) stays the same, this might be an indication that the infrastructure that hosts the application is having issues.
When both the throughput and the average response times go up, this might be an indication that traffic is increasing and that the current infrastructure is not capable of sustaining that throughput at the same response times.
When the error rate goes up but the other metrics stay the same, this might be an indication that a deployment has gone wrong or that a specific code path is starting to generate (more) errors.
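
The three scenarios above can also be expressed as a simple rule check that compares two consecutive time windows. The following is a rough sketch only: the diagnose function, its 10% tolerance, and the returned texts are assumptions made for this example, and real alerting rules would need tuning for each application.

```python
def diagnose(previous, current, tolerance=0.10):
    """Compare two windows of (requests_per_minute, avg_response_ms,
    error_rate_percent) and return a hint that matches the scenarios.

    The tolerance (10% by default) is an assumed threshold for deciding
    whether a value "went up" or "stayed the same".
    """
    def rose(before, after):
        return after > before * (1 + tolerance)

    def steady(before, after):
        return abs(after - before) <= before * tolerance

    throughput_up = rose(previous[0], current[0])
    throughput_steady = steady(previous[0], current[0])
    latency_up = rose(previous[1], current[1])
    latency_steady = steady(previous[1], current[1])
    errors_up = rose(previous[2], current[2])

    if errors_up and throughput_steady and latency_steady:
        return "possible failed deployment or failing code path"
    if latency_up and throughput_up:
        return "possible capacity limit under increasing traffic"
    if latency_up and throughput_steady:
        return "possible issue with the hosting infrastructure"
    return "no known pattern matched"
```

A check like this only yields hints. As the scenarios themselves state, each pattern might have other causes.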

Of course, these are just examples and there are many more possible scenarios. Other metrics can help to rule out a specific scenario or to avoid it altogether. For example, monitoring the database load as a percentage as well can help detect a specific instance of the second scenario. If the database load gets close to 100%, it might be time to scale the database up to a higher performance tier to sustain the higher throughput at the same response times as before.
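
A check on the database load could look like the following sketch. The should_scale_up function, the 90% threshold, and the requirement of three consecutive samples are all assumptions chosen for illustration; the text only states that a load close to 100% might be a reason to scale up.

```python
def should_scale_up(load_samples, threshold_percent=90.0, sustained_samples=3):
    """Return True when the most recent readings are all at or above the
    threshold. Requiring several consecutive samples (an assumed policy)
    avoids reacting to a single short spike.
    """
    if len(load_samples) < sustained_samples:
        return False
    recent = load_samples[-sustained_samples:]
    return all(sample >= threshold_percent for sample in recent)
```

For example, the readings 50, 92, 95, 97 would trigger the check, while 95, 97, 60 would not, because the most recent reading is below the threshold.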

To conclude this section, there is one final recommendation: when starting with monitoring, there is often a tendency to focus on the systems that host the application. In addition, consider monitoring metrics that have a direct business impact or metrics that indicate user satisfaction with the usability of an application. This comes much closer to measuring business value than watching only the systems.

Some examples of this are as follows:

In an online shop, the number of books sold per minute can be a very valuable business metric. Just imagine the impact it can have on a business if this metric is available in near real time using Azure Monitor and custom metrics from the application code.
For an online reading platform, the number of virtual page turns can be a valuable metric that signals whether users are happily working with the service. If this number drops sharply or increases rapidly, this might be an indication that something is going wrong.
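
To make this more concrete, a business metric such as books sold or page turns per minute could be counted in application code roughly as follows. This is an in-memory sketch only: PerMinuteCounter and changed_sharply are hypothetical names, the factor of 2 is an assumed threshold, and sending the values to a monitoring service such as Azure Monitor is deliberately left out.

```python
from collections import Counter


class PerMinuteCounter:
    """Counts business events (for example, books sold) per minute."""

    def __init__(self):
        self._buckets = Counter()

    def record(self, timestamp_seconds, amount=1):
        # Group events by the minute in which they occurred.
        minute = int(timestamp_seconds // 60)
        self._buckets[minute] += amount

    def value_for(self, timestamp_seconds):
        # Minutes without any recorded events count as zero.
        return self._buckets[int(timestamp_seconds // 60)]


def changed_sharply(previous, current, factor=2.0):
    """Return True when the value at least doubled or halved (assumed rule)."""
    if previous == 0:
        return current > 0
    ratio = current / previous
    return ratio >= factor or ratio <= 1 / factor
```

Comparing the value of the current minute with the previous one using changed_sharply is one simple way to flag the sharp drops or rapid increases mentioned above.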

To find out which metrics make sense in a given scenario, it might help to talk to the business or subject matter experts.
