Upgrading old Pods

Our primary goal should be to prevent issues by being proactive. When we cannot predict that a problem is about to materialize, we must, at least, be quick with the reactive actions that mitigate issues after they occur. Still, there is a third category that can only loosely be characterized as proactive: keeping our system clean and up-to-date.

Among the many things we could do to keep the system up-to-date is to make sure that our software is relatively recent (patched, updated, and so on). A reasonable rule could be to renew software after ninety days, if not sooner. That does not mean that everything we run in our cluster must be newer than ninety days, but it is a good starting point. Later on, we might create finer policies that allow some kinds of applications (usually third-party) to live up to, let's say, half a year without being upgraded. Others, especially software we're actively developing, will probably be upgraded much more frequently. Nevertheless, our starting point is to detect all the applications that have not been upgraded in ninety days or more.

Just as in almost all other exercises in this chapter, we'll start by opening Prometheus' graph screen, and explore the metrics that might help us reach our goal.

 1  open "http://$PROM_ADDR/graph"

If we inspect the available metrics, we'll see that there is kube_pod_start_time. Its name clearly indicates its purpose: it provides, in the form of a Gauge, the Unix timestamp that represents the start time of each Pod. Let's see it in action.

Please type the expression that follows and click the Execute button.

 1  kube_pod_start_time

Those values alone are of no use, and there's no point in teaching you how to calculate a human-readable date from them. What matters is the difference between now and those timestamps.

Figure 3-37: Prometheus' console view with the start time of the Pods

We can use Prometheus' time() function to return the number of seconds since January 1, 1970 UTC (the Unix timestamp).

Please type the expression that follows and click the Execute button.

 1  time()

Just as with kube_pod_start_time, we got a long number representing seconds since 1970. The only noticeable difference, besides the value, is that there is only one entry, while kube_pod_start_time produced a result for each Pod in the cluster.
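
As a sanity check, time() returns the current Unix timestamp, which is the same value (give or take the moment of evaluation) that the standard date command prints on any Unix-like system:

```shell
# Print the current Unix timestamp; this should match the value
# Prometheus' time() function returned a moment ago.
date +%s
```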

Now, let's combine the two metrics in an attempt to retrieve the age of each of the Pods.

Please type the expression that follows and click the Execute button.

 1  time() -
 2  kube_pod_start_time

This time, the results are much smaller numbers representing the seconds between now and the creation of each of the Pods. In my case (screenshot following), the first Pod (one of the go-demo-5 replicas) is slightly over six thousand seconds old. That would be around a hundred minutes (6096 / 60), or less than two hours (100 min / 60 min ≈ 1.67 h).

Figure 3-38: Prometheus' console view with the time passed since the creation of the Pods
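
The conversion above can be verified in a terminal. The 6096 figure is the example value from my cluster; yours will differ:

```shell
# Convert the example Pod age of 6096 seconds into minutes and hours
secs=6096                                    # example value from the screenshot
echo "$((secs / 60)) minutes"                # integer division: prints 101 minutes
awk -v s="$secs" 'BEGIN {printf "%.2f hours\n", s / 3600}'
```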

Since there are probably no Pods older than our target of ninety days, we'll lower it temporarily to a minute (sixty seconds).

Please type the expression that follows and click the Execute button.

 1  (
 2    time() -
 3    kube_pod_start_time{
 4      namespace!="kube-system"
 5    }
 6  ) > 60

In my case, all the Pods are older than a minute (as yours probably are as well). We confirmed that the query works, so we can increase the threshold to ninety days. To get to ninety days, we should multiply sixty seconds by sixty to get an hour, by twenty-four to get a day, and, finally, by ninety to get the full period. The formula is 60 * 60 * 24 * 90. We could use the final value of 7776000, but that would make the query harder to decipher. I prefer using the formula instead.
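
We can confirm that arithmetic in a terminal:

```shell
# Ninety days expressed in seconds: 60 s * 60 min * 24 h * 90 days
echo $((60 * 60 * 24 * 90))   # prints 7776000
```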

Please type the expression that follows and click the Execute button.

 1  (
 2    time() -
 3    kube_pod_start_time{
 4      namespace!="kube-system"
 5    }
 6  ) >
 7  (60 * 60 * 24 * 90)

It should come as no surprise that there are (probably) no results. If you created a new cluster for this chapter, you'd need to be the slowest reader on earth if it took you ninety days to get here. This might be the longest chapter I've written so far, but it's still not worth ninety days of reading.

Now that we know which expression to use, we can add one more alert to our setup.

 1  diff mon/prom-values-phase.yml \
 2      mon/prom-values-old-pods.yml

The output is as follows.

146a147,154
> - alert: OldPods
>   expr: (time() - kube_pod_start_time{namespace!="kube-system"}) > 60
>   labels:
>     severity: notify
>     frequency: low
>   annotations:
>     summary: Old Pods
>     description: At least one Pod has not been updated to more than 90 days

We can see that the difference between the old and the new values is in the OldPods alert. It contains the same expression we used a few moments ago.

We kept the low threshold of 60 seconds so that we can see the alert in action. Later on, we'll increase that value to ninety days.

There was no need to specify a for duration. The alert should fire the moment the age of one of the Pods reaches three months (give or take).
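
For reference, once we're done testing, the ninety-day version of the alert in the values file might look as follows (a sketch based on the diff above; only the threshold in the expression changes):

```yaml
- alert: OldPods
  expr: (time() - kube_pod_start_time{namespace!="kube-system"}) > (60 * 60 * 24 * 90)
  labels:
    severity: notify
    frequency: low
  annotations:
    summary: Old Pods
    description: At least one Pod has not been updated to more than 90 days
```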

Let's upgrade our Prometheus' Chart with the updated values and open the Slack channel where we should see the new message.

 1  helm upgrade -i prometheus \
 2    stable/prometheus \
 3    --namespace metrics \
 4    --version 7.1.3 \
 5    --set server.ingress.hosts={$PROM_ADDR} \
 6    --set alertmanager.ingress.hosts={$AM_ADDR} \
 7    -f mon/prom-values-old-pods.yml
 8
 9  open "https://devops20.slack.com/messages/CD8QJA8DS/"

All that's left is to wait for a few moments until the new message arrives. It should contain the title Old Pods and the text stating that At least one Pod has not been updated to more than 90 days.

Figure 3-39: Slack with multiple fired and resolved alert messages

Such a generic alert might not work for all your use cases. But I'm sure you'll be able to split it into multiple alerts based on Namespaces, names, or something similar.

Now that we have a mechanism to receive notifications when our Pods are too old and might require upgrades, we'll jump into the next topic and explore how to retrieve memory and CPU used by our containers.
