Monitoring

Any service that you rely on in your service delivery should ideally have a way to notify you when something has gone wrong with it, and I do not mean user feedback here. Most service development nowadays moves at incredible speed, and monitoring, like backups, is one of those things that most developers do not think about until something catastrophic happens, so it is worth covering at least briefly. The big question that should determine how you approach this topic is whether your users can tolerate the downtime that you will not see without monitoring.

Most tiny services might be OK with some outages, but for everything else, an unnoticed outage means at a bare minimum a couple of angry emails from users and at worst your company losing a huge percentage of its users, so monitoring is strongly encouraged at every scale.

While monitoring may be considered one of the more boring pieces of your infrastructure to implement, having insight into what your cloud is doing at all times is an absolutely essential part of managing the multitude of disparate systems and services. By monitoring your Key Performance Indicators (KPIs), you can ensure that your system as a whole is performing as expected, and by adding triggers to your critical monitoring targets, you can be alerted immediately to any activity that could impact your users. Having this type of insight into the infrastructure can both help reduce user turnover and drive better business decisions.
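
To make the idea of triggers on critical monitoring targets a little more concrete, here is a minimal sketch in Python. Everything in it is hypothetical and chosen only for illustration: the `Trigger` class, the `evaluate_triggers` function, and the metric names are assumptions, not part of any particular monitoring product. It assumes that some collector has already gathered the latest metric values into a dictionary.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Trigger:
    """A hypothetical alert rule: fire when a metric rises above a threshold."""
    metric: str
    threshold: float
    message: str


def evaluate_triggers(samples: Dict[str, float], triggers: List[Trigger]) -> List[str]:
    """Return one alert string for every trigger that should fire."""
    alerts = []
    for trigger in triggers:
        value = samples.get(trigger.metric)
        if value is None:
            # A metric that stopped reporting is suspicious too, so surface it.
            alerts.append(f"{trigger.metric}: no data received")
        elif value > trigger.threshold:
            alerts.append(
                f"{trigger.metric}={value} exceeds {trigger.threshold}: {trigger.message}"
            )
    return alerts


if __name__ == "__main__":
    # Illustrative metric names and thresholds only.
    rules = [
        Trigger("node_ram_percent", 90.0, "node is close to swapping"),
        Trigger("queue_size", 1000, "consumers are falling behind"),
    ]
    for alert in evaluate_triggers({"node_ram_percent": 95.0, "queue_size": 12}, rules):
        print(alert)
```

In a real deployment, the returned alerts would be handed to whatever notification channel you already use; the point of the sketch is only that a trigger is a simple rule evaluated against fresh data, and that missing data deserves an alert of its own.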

As we worked through our examples, you may have already come up with ideas of what you would monitor, but here are some common ones that consistently pop up as the most useful ones:

Node RAM utilization: If you notice that your nodes aren't using all the RAM allocated, you can move to smaller ones, and vice versa. This generally gets less useful if you use memory-constrained Docker containers, but it is still a good metric to keep, as you want to make sure you never hit system-level maximum memory utilization on a node; otherwise, your containers will fall back to much slower swap.
Node CPU utilization: You can see from this metric if your service density is too low or too high or if there are spikes in service demands.
Node unexpected terminations: This one is good to track to ensure that your CI/CD pipeline is not creating bad images, that your configuration services are online, and a multitude of other issues that could take down your services.
Service unexpected terminations: Finding out why a service is unexpectedly terminating is critical to ironing bugs out of any system. An increase or a decrease in this value can be a good indicator of codebase quality, though it can also indicate a multitude of other problems, both internal and external to your infrastructure.
Messaging queue sizes: We covered this in a bit of detail before, but ballooning queue sizes indicate that your infrastructure is unable to process data as quickly as it is generated, so this metric is always good to have.
Connection throughputs: Knowing exactly how much data you are dealing with can be a good indicator of service load. Comparing this to other collected stats can also tell you if the problems you are seeing are internally or externally caused.
Service latencies: Just because there are no failures does not mean that the service is usable. By tracking latencies, you can see in detail what could use improving or what is not performing to your expectations.
Kernel panics: Rare but extremely deadly, kernel panics can be a really disruptive force on your deployed services. Even though it is pretty tricky to monitor these, keeping track of kernel panics will alert you if there is an underlying kernel or hardware problem that you will need to start addressing.
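
Two of the metrics above, messaging queue sizes and service latencies, can be turned into useful signals with very little code. The following Python sketch is only an illustration under stated assumptions: the function names `percentile` and `queue_is_ballooning` are hypothetical, the percentile uses the simple nearest-rank method, and "ballooning" is defined here, somewhat arbitrarily, as the queue growing at every one of the last few samples.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of the samples, for pct between 0 and 100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # Nearest-rank: the smallest sample with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def queue_is_ballooning(sizes: Sequence[int], min_samples: int = 3) -> bool:
    """True when the queue grew between every pair of the most recent samples."""
    if len(sizes) < min_samples:
        return False  # Not enough history to call it a trend.
    recent = sizes[-min_samples:]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


if __name__ == "__main__":
    # Made-up sample data: request latencies in milliseconds and queue depths.
    latencies_ms = [10, 20, 30, 40, 50, 60, 70, 80, 90, 1000]
    print("median latency:", percentile(latencies_ms, 50))
    print("95th percentile latency:", percentile(latencies_ms, 95))
    print("queue ballooning:", queue_is_ballooning([5, 4, 10, 20, 40]))
```

High percentiles are tracked alongside the median because a single slow outlier, like the last sample in the made-up data, barely moves the median yet is exactly what some of your users experience.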

This obviously is not an exhaustive list, but it covers some of the more useful metrics. As you develop your infrastructure, you will find that adding monitoring everywhere leads to faster turnaround on issues and earlier discovery of scalability bugs in your services. So once you have monitoring added to your infrastructure, don't be afraid to plug it into as many pieces of your system as you can. At the end of the day, by gaining visibility into your whole infrastructure through monitoring, you can make wiser decisions and build better services, which is exactly what we want.
