Measuring containers' memory and CPU usage

If you are familiar with Kubernetes, you understand the importance of defining resource requests and limits. Since we already explored the kubectl top pods command, you might have set the requested resources to match the current usage, and you might have defined the limits above the requests. That approach might work on the first day. But, with time, those numbers will change, and we will not be able to get the full picture through kubectl top pods alone. We need to know how much memory and CPU containers use at their peak load, and how much when they are under less stress. We should observe those metrics over time and adjust periodically.
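As a reminder, the command that follows is a sketch of how we might check the current usage (it assumes that the Metrics Server is available in the cluster; the --containers flag breaks the numbers down per container instead of per Pod).

kubectl top pods \
    --all-namespaces \
    --containers

That gives us a single point-in-time snapshot, which is precisely why it is not enough on its own.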

Even if we do somehow manage to guess how much memory and CPU a container needs, those numbers might change from one release to another. Maybe we introduced a feature that requires more memory or CPU?

What we need is to observe resource usage over time and to make sure that it does not change with new releases or with an increased (or decreased) number of users. For now, we'll focus on the former case and explore how to see how much memory and CPU our containers used over time.

As usual, we'll start by opening Prometheus' graph screen.

open "http://$PROM_ADDR/graph"

We can retrieve container memory usage through the container_memory_usage_bytes metric.

Please type the expression that follows, press the Execute button, and switch to the Graph screen.

container_memory_usage_bytes

If you take a closer look at the entries with the highest values, you'll probably end up confused. It seems that some containers are using way more than the expected amount of memory.

The truth is that some of the container_memory_usage_bytes records contain cumulative values, and we should exclude them so that only the memory usage of individual containers is retrieved. We can do that by retrieving only the records that have a value in the container_name label.

Please type the expression that follows, and press the Execute button.

container_memory_usage_bytes{
  container_name!=""
}

Now the result makes much more sense. It reflects memory usage of the containers running inside our cluster.

We'll get to alerts based on container resources a bit later. For now, we'll imagine that we'd like to check memory usage of a specific container (for example, prometheus-server). Since we already know that one of the available labels is container_name, retrieving the data we need should be straightforward.

Please type the expression that follows, and press the Execute button.

container_memory_usage_bytes{
  container_name="prometheus-server"
}

We can see the oscillations in the container's memory usage over the last hour. Normally, we'd be interested in a longer period, like a day or a week. We can accomplish that by clicking the - and + buttons above the graph, or by typing the value directly into the field between them (for example, 1w). However, changing the duration might not help much since we haven't been running the cluster for long. We're unlikely to squeeze more than a few hours of data out of it unless you are a slow reader.

Figure 3-40: Prometheus' graph screen with container memory usage limited to prometheus-server
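Once the cluster has accumulated more history, a query like the sketch that follows (assuming at least a day of data is available) would show the peak memory usage of the container over the past day. A peak observed over a longer period is usually a better basis for the limits than a single point-in-time reading.

max_over_time(
  container_memory_usage_bytes{
    container_name="prometheus-server"
  }[1d]
)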

Similarly, we should be able to retrieve the CPU usage of a container as well. In that case, the metric we're looking for is container_cpu_usage_seconds_total. However, unlike container_memory_usage_bytes, which is a gauge, container_cpu_usage_seconds_total is a counter, and we'll have to combine sum and rate to get the changes in values over time.

Please type the expression that follows, and press the Execute button.

sum(rate(
  container_cpu_usage_seconds_total{
    container_name="prometheus-server"
  }[5m]
))
by (pod_name)

The query shows the per-second rate of CPU usage, calculated over five-minute windows. We added by (pod_name) to the mix so that we can distinguish between different Pods and see when one was created and another destroyed.

Figure 3-41: Prometheus' graph screen with the rate of container CPU usage limited to prometheus-server
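There is nothing special about prometheus-server in that expression. As a sketch of how the same pattern scales to the whole cluster, the query that follows drops the specific container_name and groups the rates by both Pod and container.

sum(rate(
  container_cpu_usage_seconds_total{
    container_name!=""
  }[5m]
))
by (pod_name, container_name)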

If this were a "real world" situation, our next step would be to compare the actual resource usage with what we defined as Prometheus' resources. If the two differ considerably, we should probably update our Pod definition (the resources section).
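As an illustration, a rough comparison could look like the sketch that follows. It assumes that kube-state-metrics is running in the cluster and that it exposes requested memory as kube_pod_container_resource_requests_memory_bytes with a container label (the exact metric and label names vary between kube-state-metrics versions).

sum(
  container_memory_usage_bytes{
    container_name="prometheus-server"
  }
)
/
sum(
  kube_pod_container_resource_requests_memory_bytes{
    container="prometheus-server"
  }
)

A result well below 1 would mean that we requested much more memory than the container actually uses, while a result close to (or above) 1 would suggest that the requests are too low.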

The problem is that using "real" resource usage to fine-tune Kubernetes resource definitions provides valid values only temporarily. Over time, our resource usage will change. The load might increase, new features might be more resource-hungry, and so on. No matter the reason, the critical thing to note is that everything is dynamic, and there is no reason to think otherwise for resources. In that spirit, our next challenge is to figure out how to get a notification when the actual resource usage differs too much from what we defined in container resources.
