In this chapter, we will learn how to use Prometheus and Grafana to collect, monitor, and alert about performance metrics. As we mentioned in Chapter 1, Introduction to Microservices, in a production environment it is crucial to be able to collect metrics for application performance and hardware resource usage. Monitoring these metrics is required to avoid long response times or outages for API requests and other processes.
To be able to monitor a system landscape of microservices in a cost-efficient and proactive way, we must also be able to define alarms that are triggered automatically if the metrics exceed the configured limits.
In this chapter, we will cover the following topics:
For instructions on how to install the tools used in this book and how to access the source code for this book, see:
The code examples in this chapter all come from the source code in $BOOK_HOME/Chapter20.
If you want to view the changes applied to the source code in this chapter so that you can use Prometheus and Grafana to monitor and alert on performance metrics, you can compare it with the source code for Chapter 19, Centralized Logging with the EFK Stack. You can use your favorite diff tool and compare the two folders, $BOOK_HOME/Chapter19 and $BOOK_HOME/Chapter20.
In this chapter, we will reuse the deployment of Prometheus and Grafana that we created in Chapter 18, Using a Service Mesh to Improve Observability and Management, in the Deploying Istio in a Kubernetes cluster section. Also in that chapter, we were briefly introduced to Prometheus, a popular open source database for collecting and storing time series data such as performance metrics. We learned about Grafana, an open source tool for visualizing performance metrics. With the Grafana deployment comes a set of Istio-specific dashboards. Kiali can also render some performance-related graphs without the use of Grafana. In this chapter, we will get some hands-on experience with these tools.
The Istio configuration we deployed in Chapter 18 includes a configuration of Prometheus, which automatically collects metrics from Pods in Kubernetes. All we need to do is set up an endpoint in our microservice that produces metrics in a format Prometheus can consume. We also need to add annotations to the Kubernetes Pods so that Prometheus can find the address of these endpoints. See the Changes in source code for collecting application metrics section of this chapter for details on how to set this up. To demonstrate Grafana's capabilities to raise alerts, we will also deploy a local mail server.
The following diagram illustrates the relationship between the runtime components we just discussed:
Figure 20.1: Adding Prometheus and Grafana to the system landscape
Here, we can see how Prometheus uses the annotations in the definitions of the Kubernetes Pods to be able to collect metrics from our microservices. It then stores these metrics in its database. A user can access the web UIs of Kiali and Grafana to monitor these metrics in a Web browser. The Web browser uses the minikube tunnel that was introduced in Chapter 18, in the Setting up access to Istio services section, to access Kiali, Grafana, and also a web page from the mail server to see alerts sent out by Grafana.
Please remember that the configuration that was used for deploying Istio from Chapter 18 is only intended for development and test, not production. For example, performance metrics stored in the Prometheus database will not survive the Prometheus Pod being restarted!
In the next section, we will look at what changes have been applied to the source code to make the microservices produce performance metrics that Prometheus can collect.
Spring Boot 2 supports producing performance metrics in a Prometheus format using the Micrometer library (https://micrometer.io). There's only one change we need to make to the source code of the microservices: we need to add a dependency on the Micrometer library, micrometer-registry-prometheus, in the Gradle build files, build.gradle. The dependency looks like this:
implementation 'io.micrometer:micrometer-registry-prometheus'
This will make the microservices produce Prometheus metrics on port 4004 using the /actuator/prometheus URI.
In Chapter 18, we separated the management port, used by the actuator, from the port serving requests to APIs exposed by a microservice. See the Observing the service mesh section for a recap, if required.
To let Prometheus know about these endpoints, each microservice's Pod is annotated with the following code:
annotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "4004"
  prometheus.io/scheme: http
  prometheus.io/path: "/actuator/prometheus"
This is added to the values.yaml file of each component's Helm chart; see kubernetes/helm/components.
To make it easier to identify the source of the metrics once they have been collected by Prometheus, they are tagged with the name of the microservice that produced the metric. This is achieved by adding the following configuration to the common configuration file, config-repo/application.yml:
management.metrics.tags.application: ${spring.application.name}
This will result in each metric that's produced having an extra label named application. It will contain the value of the standard Spring property for the name of a microservice, spring.application.name.
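Once the system landscape has been deployed later in this chapter, the effect of these changes can be verified by calling the metrics endpoint of one of the microservices. The following is a minimal sketch, assuming that the product microservice's Deployment is named product and that the hands-on Namespace is the current one; the port and URI are the ones described above:
# In one terminal window, forward the management port of the product microservice
kubectl port-forward deployment/product 4004:4004
# In another terminal window, verify that metrics are exposed and carry the application label
curl -s http://localhost:4004/actuator/prometheus | grep 'application="product"' | head -5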
These are all the changes that are required to prepare the microservices to produce performance metrics and to make Prometheus aware of what endpoints to use to start collecting them. In the next section, we will build and deploy the microservices.
Building, deploying, and verifying the deployment using the test-em-all.bash test script is done in the same way as in Chapter 19, Centralized Logging with the EFK Stack, in the Building and deploying the microservices section. Run the following commands:
cd $BOOK_HOME/Chapter20
eval $(minikube docker-env)
./gradlew build && docker-compose build
Recreate the hands-on Namespace and set it as the default Namespace:
kubectl delete namespace hands-on
kubectl apply -f kubernetes/hands-on-namespace.yml
kubectl config set-context $(kubectl config current-context) --namespace=hands-on
First, we update the dependencies in the components folder:
for f in kubernetes/helm/components/*; do helm dep up $f; done
Next, we update the dependencies in the environments folder:
for f in kubernetes/helm/environments/*; do helm dep up $f; done
Deploy the system landscape using Helm and wait for all of its Deployments to be ready:
helm install hands-on-dev-env kubernetes/helm/environments/dev-env -n hands-on --wait
Start the Minikube tunnel in a separate terminal window, if it is not already running:
minikube tunnel
Remember that this command requires that your user has sudo privileges and that you enter your password during startup and shutdown. It takes a couple of seconds before the command asks for the password, so it is easy to miss!
Verify the deployment by running the test script:
./test-em-all.bash
Expect the output to be similar to what we've seen in the previous chapters:
Figure 20.2: All tests OK
With the microservices deployed, we can move on and start monitoring our microservices using Grafana!
As we already mentioned in the introduction, Kiali provides some very useful dashboards out of the box. In general, they are focused on application-level performance metrics such as requests per second, response times, and fault percentages for processing requests. As we will see shortly, they are very useful on an application level. But if we want to understand the usage of the underlying hardware resources, we need more detailed metrics, for example, Java VM-related metrics.
Grafana has an active community that, among other things, shares reusable dashboards. We will try out a dashboard from the community that's tailored for getting a lot of valuable Java VM-related metrics from a Spring Boot 2 application such as our microservices. Finally, we will see how we can build our own dashboards in Grafana. But let's start by exploring the dashboards that come out of the box in Kiali and Grafana.
Before we do that, we need to make two preparations: installing a local mail server for tests and starting up the load test tool.
In this section, we will set up a local test mail server and configure Grafana to send alert emails to the mail server.
Grafana can send emails to any SMTP mail server but, to keep the tests local, we will deploy a test mail server named maildev. Go through the following steps:
kubectl -n istio-system create deployment mail-server --image maildev/maildev:1.1.0
kubectl -n istio-system expose deployment mail-server --port=80,25 --type=ClusterIP
kubectl -n istio-system wait --timeout=60s --for=condition=ready pod -l app=mail-server
To expose the mail server through the Istio ingress gateway, Gateway, VirtualService, and DestinationRule manifest files have been added for the mail server in Istio's Helm chart. See the template kubernetes/helm/environments/istio-system/templates/expose-mail.yml. Run a helm upgrade command to apply the new manifest files:
helm upgrade istio-hands-on-addons kubernetes/helm/environments/istio-system -n istio-system
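Before configuring Grafana, it can be useful to verify that the mail server is up and reachable. The Pod check below uses the label set by the deployment command above; the host name in the curl command is only an assumption about what expose-mail.yml maps the web UI to, so adjust it to the host name used in your copy of the template:
# Verify that the mail server Pod is up and running
kubectl -n istio-system get pod -l app=mail-server
# Verify that its web UI is reachable through the Istio ingress gateway (host name assumed)
curl -ksI https://mail.minikube.me | head -1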
Figure 20.3: Mail server web page
kubectl -n istio-system set env deployment/grafana \
  GF_SMTP_ENABLED=true \
  GF_SMTP_SKIP_VERIFY=true \
  GF_SMTP_HOST=mail-server:25 \
  GF_SMTP_FROM_ADDRESS=[email protected]
kubectl -n istio-system wait --timeout=60s --for=condition=ready pod -l app=grafana
The ENABLED variable is used to allow Grafana to send emails. The SKIP_VERIFY variable is used to tell Grafana to skip SSL checks with the test mail server. The HOST variable points to our mail server, and the FROM_ADDRESS variable specifies what "from" address to use in the mail.
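If you want to double-check that the variables were applied, you can list the environment variables that are now set on the Grafana Deployment, for example like this:
# List the environment variables configured on the Grafana Deployment
kubectl -n istio-system set env deployment/grafana --list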
For more information on the mail server, see https://hub.docker.com/r/maildev/maildev.
Now, we have a test mail server up and running and Grafana has been configured to send emails to it. In the next section, we will start the load test tool.
To have something to monitor, let's start up the load test using Siege, which we used in previous chapters. Run the following commands to get an access token and then start up the load test, using the access token for authorization:
ACCESS_TOKEN=$(curl -k https://writer:[email protected]/oauth2/token -d grant_type=client_credentials -s | jq .access_token -r)
echo ACCESS_TOKEN=$ACCESS_TOKEN
siege https://minikube.me/product-composite/1 -H "Authorization: Bearer $ACCESS_TOKEN" -c1 -d1 -v
Remember that an access token is only valid for 1 hour – after that, you need to get a new one.
Now, we are ready to learn about the dashboards in Kiali and Grafana and explore the Grafana dashboards that come with Istio.
In Chapter 18, we learned about Kiali, but we skipped the part where Kiali shows performance metrics. Now, it's time to get back to that subject!
Execute the following steps to learn about Kiali's built-in dashboards:
Log in with admin/admin, if required.
Figure 20.4: Kiali outbound metrics
Kiali will visualize some overall performance graphs that are of great value, and there are more graphs to explore. Feel free to try them out on your own!
Figure 20.5: Grafana showing Istio Mesh Dashboard
This dashboard gives a very good overview of metrics for the microservices involved in the service mesh, such as request rates, response times, and success rates.
The web page should look like the following screenshot:
Figure 20.6: Grafana with a lot of metrics for a microservice
Expand the two remaining rows to see more detailed metrics regarding the selected service. Feel free to look around!
As we've already mentioned, the Istio dashboards give a very good overview at an application level. But there is also a need for monitoring the metrics for hardware usage per microservice. In the next section, we will learn about how existing dashboards can be imported – specifically, a dashboard showing Java VM metrics for a Spring Boot 2-based application.
As we've already mentioned, Grafana has an active community that shares reusable dashboards. They can be explored at https://grafana.com/grafana/dashboards. We will try out a dashboard called JVM (Micrometer) - Kubernetes - Prometheus by Istio that's tailored for getting a lot of valuable JVM-related metrics from Spring Boot 2 applications in a Kubernetes environment. The link to the dashboard is https://grafana.com/grafana/dashboards/11955. Perform the following steps to import this dashboard:
Enter the dashboard ID 11955 into the Import via grafana.com field and click on the Load button next to it.
Figure 20.7: Grafana showing Java VM metrics
In this dashboard, we can find all types of Java VM relevant metrics for, among other things, CPU, memory, heap, and I/O usage, as well as HTTP-related metrics such as requests/second, average duration, and error rates. Feel free to explore these metrics on your own!
Being able to import existing dashboards is of great value when we want to get started quickly. However, what's even more important is to know how to create our own dashboard. We will learn about this in the next section.
Getting started with developing Grafana dashboards is straightforward. The important thing for us to understand is what metrics Prometheus makes available for us.
In this section, we will learn how to examine the available metrics. Based on these, we will create a dashboard that can be used to monitor some of the more interesting metrics.
In the Changes in source code for collecting application metrics section earlier, we configured Prometheus to collect metrics from our microservices. We can make a call to the same endpoint and see what metrics Prometheus collects. Run the following command:
curl https://health.minikube.me/actuator/prometheus -ks
Expect a lot of output from the command, as in the following example:
Figure 20.8: Prometheus metrics
Among all of the metrics that are reported, there are two very interesting ones:
resilience4j_circuitbreaker_state: Resilience4j reports on the state of the circuit breaker.
resilience4j_retry_calls: Resilience4j reports on how the retry mechanism operates. It reports four different values for successful and failed requests, combined with and without retries.
Note that the metrics have a label named application, which contains the name of the microservice. This field comes from the configuration of the management.metrics.tags.application property, which we did in the Changes in source code for collecting application metrics section.
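If you want to look at just these metrics, the output from the endpoint can be filtered, for example by grepping for the metric name prefix:
# Show only the Resilience4j metrics
curl -ks https://health.minikube.me/actuator/prometheus | grep "^resilience4j_"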
These metrics seem interesting to monitor. None of the dashboards we have used so far use metrics from Resilience4j. In the next section, we will create a dashboard for these metrics.
In this section, we will learn how to create a dashboard that visualizes the Resilience4j metrics we described in the previous section.
We will set up the dashboard in the following stages:
Perform the following steps to create an empty dashboard:
Figure 20.9: Creating a new dashboard in Grafana
Name the new dashboard Hands-on Dashboard.
Perform the following steps to create a new panel for the circuit breaker metric:
A page will be displayed where the new panel can be configured.
Specify Circuit Breaker as the Panel title.
In the query field, specify resilience4j_circuitbreaker_state{state="closed"}.
In the Legend field, specify {{state}}. This will create a legend in the panel where each time series is labeled with the circuit breaker state it reports on.
The filled-in values should look as follows:
Figure 20.10: Specifying circuit breaker metrics in Grafana
Add a second query with the query field set to resilience4j_circuitbreaker_state{state="open"} and the Legend field to {{state}}.
Add a third query with the query field set to resilience4j_circuitbreaker_state{state="half_open"} and the Legend field to {{state}}.
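If you are unsure what a query will return before putting it into a panel, it can also be tested directly against Prometheus's HTTP API. The following is a minimal sketch, assuming that the Prometheus Service deployed with Istio is named prometheus and listens on port 9090 in the istio-system Namespace:
# Forward the Prometheus port to localhost as a background job
kubectl -n istio-system port-forward svc/prometheus 9090:9090 &
# Run the same query as in the panel and pretty-print the result
curl -s http://localhost:9090/api/v1/query --data-urlencode 'query=resilience4j_circuitbreaker_state{state="closed"}' | jq .
# Stop the port forwarding again
kill %1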
Here, we will repeat the same procedure that we went through for adding a panel for the preceding circuit breaker metric, but instead, we will specify the values for the retry metrics:
Specify Retry as the Panel title.
In the query field, specify rate(resilience4j_retry_calls_total[30s]).
Since the retry metric is a counter, its value will only go up. An ever-increasing metric is rather uninteresting to monitor. The rate function is used to convert the retry metric into a rate per second metric. The time window specified, that is, 30s, is used by the rate function to calculate the average value of the rate.
In the Legend field, specify {{kind}}.
Just like the output from the Prometheus endpoint we looked at earlier, we will get four metrics for the retry mechanism. To separate them in the legend, the kind attribute needs to be added.
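To make the effect of the rate function a bit more concrete, here is a small worked example with made-up numbers:
# Illustrative arithmetic only: if resilience4j_retry_calls_total grew from 100 to 115
# during the last 30 seconds, then
#   rate(resilience4j_retry_calls_total[30s]) = (115 - 100) / 30 = 0.5
# that is, on average 0.5 retry-related calls per second over that time window.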
Perform the following steps to arrange the panels on the dashboard:
The following is an example layout of the two panels:
Figure 20.11: Moving and resizing a panel in Grafana
Since this screenshot was taken with Siege running in the background, the Retry panel reports successful_without_retry metrics, and the Circuit Breaker panel reports closed=1 with open and half_open at 0, meaning that the circuit is closed and operating normally (something that is about to change in the next section).
With the dashboard created, we are ready to try it out. In the next section, we will try out both metrics.
Before we start testing the new dashboard, we must stop the load test tool, Siege. For this, go to the command window where Siege is running and press Ctrl + C to stop it.
Let's start by testing how to monitor the circuit breaker. Afterward, we will try out the retry metrics.
If we force the circuit breaker to open up, its state will change from closed to open, and then eventually to the half-open state. This should be reported in the circuit breaker panel.
Open the circuit, just like we did in Chapter 13, Improving Resilience Using Resilience4j, in the Trying out the circuit breaker and retry mechanism section; that is, make some requests to the API in a row, all of which will fail. Run the following commands:
ACCESS_TOKEN=$(curl -k https://writer:[email protected]/oauth2/token -d grant_type=client_credentials -s | jq .access_token -r)
echo ACCESS_TOKEN=$ACCESS_TOKEN
for ((n=0; n<4; n++)); do curl -o /dev/null -skL -w "%{http_code}\n" https://minikube.me/product-composite/1?delay=3 -H "Authorization: Bearer $ACCESS_TOKEN" -s; done
We can expect three 500 responses and a final 200 response, indicating three errors in a row, which is what it takes to open the circuit breaker. The last 200 response indicates a fail-fast response from the product-composite microservice when it detects that the circuit is open.
On some rare occasions, I have noticed that the circuit breaker metrics are not reported in Grafana directly after the dashboard is created. If they don't show up after a minute, simply rerun the preceding command to open the circuit breaker again.
Expect the value for the closed state to drop to 0 and the open state to take the value 1, meaning that the circuit is now open. After 10s, the circuit will turn to the half-open state, indicated by the half-open metrics having the value 1 and open being set to 0. This means that the circuit breaker is ready to test some requests to see if the problem that opened the circuit is gone.
Close the circuit breaker again by issuing three successful requests to the API with the following command:
for ((n=0; n<4; n++)); do curl -o /dev/null -skL -w "%{http_code}\n" https://minikube.me/product-composite/1?delay=0 -H "Authorization: Bearer $ACCESS_TOKEN" -s; done
We will get only 200 responses. Note that the circuit breaker metric goes back to normal again, meaning that the closed metric goes back to 1.
After this test, the Grafana dashboard should look as follows:
Figure 20.12: Retries and Circuit Breaker in action as viewed in Grafana
From the preceding screenshot, we can see that the retry mechanism also reports metrics that succeeded and failed. When the circuit was opened, all requests failed without retries. When the circuit was closed, all requests were successful without any retries. This is as expected.
Now that we have seen the circuit breaker metrics in action, let's see the retry metrics in action!
If you want to check the state of the circuit breaker, you can do it with the following command:
curl -ks https://health.minikube.me/actuator/health | jq -r .components.circuitBreakers.details.product.details.state
It should report CLOSED, OPEN, or HALF_OPEN, depending on its state.
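To follow the state changes while running the tests, the same command can be wrapped in a simple loop, for example:
# Print the circuit breaker state every other second; stop with Ctrl + C
while true; do curl -ks https://health.minikube.me/actuator/health | jq -r .components.circuitBreakers.details.product.details.state; sleep 2; done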
To trigger the retry mechanism, we will use the faultPercent parameter we used in previous chapters. To avoid triggering the circuit breaker, we need to use relatively low values for the parameter. Run the following command:
while true; do curl -o /dev/null -s -L -w "%{http_code}\n" -H "Authorization: Bearer $ACCESS_TOKEN" -k https://minikube.me/product-composite/1?faultPercent=10; sleep 3; done
This command will call the API once every three seconds. It specifies that 10% of the requests should fail so that the retry mechanism will kick in and retry the failed requests.
After a few minutes, the dashboard should report metrics such as the following:
Figure 20.13: Result of retry tests viewed in Grafana
In the preceding screenshot, we can see that most of the requests have been executed successfully, without any retries. Approximately 10% of the requests have been retried by the retry mechanism and successfully executed after the retry.
Before we leave the section on creating dashboards, we will learn how we can export and import dashboards.
Once a dashboard has been created, we typically want to take two actions:
To perform these actions, we can use Grafana's API for exporting and importing dashboards. Since we only have one Grafana instance, we will perform the following steps:
Before we perform these steps, we need to understand the two different types of IDs that a dashboard has:
id, an auto-incremented identifier that is unique only within a Grafana instance.
uid, a unique identifier that can be used in multiple Grafana instances. It is part of the URL when accessing dashboards, meaning that links to a dashboard will stay the same as long as the uid of a dashboard remains the same. When a dashboard is created, a random uid is created by Grafana.
When we import a dashboard, Grafana will try to update a dashboard if the id field is set. To be able to test importing a dashboard in a Grafana instance that doesn't have the dashboard already installed, we need to set the id field to null.
Perform the following actions to export and then import your dashboard:
Identify the uid of your dashboard. The uid value can be found in the URL in the web browser where the dashboard is shown. It will look like this:
https://grafana.minikube.me/d/YMcDoBg7k/hands-on-dashboard
The uid in the URL above is YMcDoBg7k. In a terminal window, create a variable with its value. In my case, it will be:
ID=YMcDoBg7k
curl -sk https://grafana.minikube.me/api/dashboards/uid/$ID | jq '.dashboard.id=null' > "Hands-on-Dashboard.json"
The curl command exports the dashboard to JSON format. The jq statement sets the id field to null, and the output from the jq command is written to a file named Hands-on-Dashboard.json.
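A quick sanity check of the exported file can be done with jq, for example by verifying that the id field is null and that the uid and title fields contain the expected values:
# Inspect the exported dashboard definition
jq '.dashboard | {id, uid, title}' Hands-on-Dashboard.json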
In the web browser, select Dashboards and Manage in the menu to the left. Identify the Hands-on Dashboard in the list of dashboards and select it by clicking in the checkbox in front of it. A red Delete button will be shown; click on it, and click on the new Delete button that is shown in the confirm dialog box that pops up.
Re-create the dashboard by importing the exported file with the following command:
curl -i -XPOST -H 'Accept: application/json' -H 'Content-Type: application/json' -k \
  'https://grafana.minikube.me/api/dashboards/db' \
  -d @Hands-on-Dashboard.json
Note that the URL used to access the dashboard is still valid, in my case https://grafana.minikube.me/d/YMcDoBg7k/hands-on-dashboard.
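If you prefer to verify the import from the command line as well, the same API that was used for the export can be used to read the dashboard back, for example:
# Read back the imported dashboard and print its title
curl -sk https://grafana.minikube.me/api/dashboards/uid/$ID | jq -r .dashboard.title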
For more information regarding Grafana's APIs, see https://grafana.com/docs/grafana/v7.2/http_api/dashboard/#get-dashboard-by-uid.
Before proceeding to the next section, remember to stop the request loop that we started for the retry test by pressing Ctrl + C in the terminal window where the request loop executes!
In the next section, we will learn how to set up alarms in Grafana, based on these metrics.
Being able to monitor the circuit breaker and retry metrics is of great value, but even more important is the capability to define automated alarms on these metrics. Automated alarms relieve us from monitoring the metrics manually.
Grafana comes with built-in support for defining alarms and sending notifications to a number of channels. In this section, we will define alerts on the circuit breaker and configure Grafana to send emails to the test mail server when alerts are raised. The local test mail server was installed in the earlier section Installing a local mail server for tests.
For other types of channels supported by the version of Grafana used in this chapter, see https://grafana.com/docs/grafana/v7.2/alerting/notifications/#list-of-supported-notifiers.
In the next section, we will define a mail-based notification channel that will be used by the alert in the section after this.
To configure a mail-based notification channel in Grafana, perform the following steps:
Name the notification channel mail.
The configuration of the notification channel should look as follows:
Figure 20.14: Setting up an email-based notification channel
Figure 20.15: Verifying the test mail on the mail server's web page
With a notification channel in place, we are ready to define an alert on the circuit breaker.
To create an alarm on the circuit breaker, we need to create the alert and then add an alert list to the dashboard, where we can see what alert events have occurred over time.
Perform the following steps to create an alert for the circuit breaker:
In the Rule section, set the Evaluate every field to 10s and the For field to 0m.
In the Conditions section, specify WHEN max() OF query(A, 10s, now) IS BELOW 0.5.
These settings will result in an alert being raised if the Closed state (related to the A variable) goes below 0.5 during the last 10 seconds. When the circuit breaker is closed, this variable has the value 1, and 0 otherwise. So this means that an alert is raised when the circuit breaker is no longer closed.
Figure 20.16: Setting up an alarm in Grafana
Save the dashboard, add a note such as "Added an alarm", and then click on the Save button.
Then, we need to perform the following steps to create an alarm list:
Specify Circuit Breaker Alerts as the Panel title.
Set the maximum number of items to show to 10, and enable the option Alerts from this dashboard.
The settings should look as follows (some irrelevant information has been removed):
Figure 20.17: Setting up an alarm in Grafana, part 2
Save the dashboard again, with a note such as "Added an alert list".
Here is a sample layout with the alarm list added:
Figure 20.18: Setting up a layout in Grafana with Retry, Circuit Breaker, and alert panels
We can see that the circuit breaker reports the metrics as healthy (with a green heart) and the alert list is currently empty.
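Alert rules and their current state can also be inspected from the command line using Grafana's HTTP API, which can be handy when verifying alarms from scripts. The following is a sketch based on the /api/alerts endpoint in the Grafana version used in this chapter:
# List the alert rules and their current state, for example ok or alerting
curl -sk https://grafana.minikube.me/api/alerts | jq '.[] | {name, state}'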
Now, it's time to try out the alarm!
Here, we will repeat the tests from the Testing the circuit breaker metrics section, but this time, we expect alarms to be raised and emails to be sent as well! Let's get started:
ACCESS_TOKEN=$(curl -k https://writer:[email protected]/oauth2/token -d grant_type=client_credentials -s | jq .access_token -r)
echo ACCESS_TOKEN=$ACCESS_TOKEN
for ((n=0; n<4; n++)); do curl -o /dev/null -skL -w "%{http_code}\n" https://minikube.me/product-composite/1?delay=3 -H "Authorization: Bearer $ACCESS_TOKEN" -s; done
The dashboard should report the circuit as open as it did previously. After a few seconds, an alarm should be raised, and an email is also sent. Expect the dashboard to look like the following screenshot:
Figure 20.19: Alarm raised in Grafana
Take note of the alarm icon in the header of the circuit breaker panel (a red broken heart). The red line marks the time of the alert event, and an alert has been added to the alert list.
Figure 20.20: Alarm email
for ((n=0; n<4; n++)); do curl -o /dev/null -skL -w "%{http_code}\n" https://minikube.me/product-composite/1?delay=0 -H "Authorization: Bearer $ACCESS_TOKEN" -s; done
The closed metric should go back to normal, that is 1, and the alert should turn green again.
Expect the dashboard to look like the following screenshot:
Figure 20.21: Error resolved as reported in Grafana
Note that the alarm icon in the header of the circuit breaker panel is green again; the green line marks the time of the OK event, and an OK event has been added to the alert list.
Figure 20.22: Error resolved as reported in an email
That completes how to monitor microservices using Prometheus and Grafana.
In this chapter, we have learned how to use Prometheus and Grafana to collect, monitor, and alert on performance metrics.
We saw that, for collecting performance metrics, we can use Prometheus in a Kubernetes environment. We then learned how Prometheus can automatically collect metrics from a Pod when a few Prometheus annotations are added to the Pod's definition. To produce metrics in our microservices, we used Micrometer.
Then, we saw how we can monitor the collected metrics using dashboards in both Kiali and Grafana, which come with the installation of Istio. We also saw how to use dashboards shared by the Grafana community, and learned how to develop our own dashboards, using metrics from Resilience4j to monitor the usage of its circuit breaker and retry mechanisms. Using the Grafana API, we can export created dashboards and import them into other Grafana instances.
Finally, we learned how to define alerts on metrics in Grafana and how to use Grafana to send out alert notifications. We used a local test mail server to receive alert notifications from Grafana as emails.
The next two chapters should already be familiar to you, covering the installation of tools on a Mac or Windows PC. Instead, head over to the last chapter in this book, which will introduce how we can compile our Java-based microservices into binary executable files using the Spring Native project, which was still in beta at the time of writing. This will enable the microservices to start up in a fraction of a second, but it comes with increased complexity and longer build times.
What is the management.metrics.tags.application config parameter used for?
Figure 20.23: What is going on here?
If you are reading this with screenshots rendered in grayscale, it might be hard to figure out what the metrics say. So, here's some help: