Chapter 15. High-Level Business Service Monitoring

Monitoring IT systems usually involves poking at lots of small details: CPU, disk, and memory statistics, process states, and myriad other parameters. All of these are important, and every detail should be available to technical people. In the end, though, the goal of these systems is not to have enough disk space; the goal is to serve a specific need. If one looks only at the low-level detail, it can be very hard to figure out what impact a current problem might have on users. Zabbix offers a higher-level view, called IT services. Relationships between individual systems can be configured to show how they build up to deliver services, and Service Level Agreement (SLA) calculation can be enabled for any part of the resulting tree.

Deciding on the service tree

Before configuring things, it is useful to think through the setup, and doubly so with IT services. A large service tree might look impressive, but it might not represent the actual functionality well, and might even obscure the real system state. Low disk space is important, but it does not by itself bring the system down, so it does not affect the SLA. The best approach would likely be to include only specific checks that show whether a service is available or operating in an acceptable manner; for example, the SLA might require some performance level to be maintained. Unless we want a large, complicated IT service tree, we should identify the key factors in delivering the service and monitor those.
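To make the idea of a small tree built only from key checks more concrete, here is a minimal Python sketch. It is not how Zabbix stores or calculates IT services; the Service class, the check names, and the rule that a parent has a problem as soon as any child has one are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Service:
    """One node in a hypothetical service tree (illustration only)."""
    name: str
    ok: bool = True  # state of the underlying check; used only for leaf nodes
    children: List["Service"] = field(default_factory=list)

    def is_ok(self) -> bool:
        # A leaf reports its own check; a parent is OK only if every child is OK.
        if not self.children:
            return self.ok
        return all(child.is_ok() for child in self.children)


# Only checks that show the service is usable are included as leaves.
# Low-level details such as disk space are deliberately left out of the tree.
login = Service("Web scenario: log in and perform an operation")
database = Service("Frontend can connect to the backend database")
webshop = Service("Webshop", children=[login, database])

print(webshop.is_ok())  # all key checks pass
database.ok = False
print(webshop.is_ok())  # one key check failed, so the service has a problem
```

With only two leaves, a failure at the top of the tree immediately points at a user-visible problem, which is the property we want from a high-level view.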

What are the key factors? If the service is simple enough and can be tested easily, we could have a direct test. Maybe the SLA requires that a website is available—in that case, a simple web.page.get item would suffice. If it is a web page-based system, we might want to check the page itself, log in, and perform some operation as a logged-in user—this is possible with web scenarios.
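As a rough illustration of what such a direct test decides, the following Python sketch fetches a page and treats it as available only if the response status is in the 2xx range and the body contains an expected string. This is not the implementation of web.page.get or of web scenarios; the function names, the required text, and the timeout are hypothetical.

```python
import urllib.error
import urllib.request


def page_looks_healthy(status: int, body: str, required_text: str) -> bool:
    """Return True for a 2xx status whose body contains required_text."""
    return 200 <= status < 300 and required_text in body


def check_page(url: str, required_text: str, timeout: float = 5.0) -> bool:
    """Fetch url and judge the response; any fetch error counts as unavailable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return page_looks_healthy(response.status, body, required_text)
    except (urllib.error.URLError, OSError, ValueError):
        # HTTP errors, connection failures, timeouts, and malformed URLs
        # all mean the page could not be served to a user.
        return False
```

Checking for expected text as well as the status code catches the case where the server answers successfully but returns an error page.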

Tip

We discussed web monitoring in more detail in Chapter 13, Monitoring Web Pages.

Sometimes, it might not be possible to use the interface directly: maybe we cannot have a special user for monitoring purposes, or we are not allowed to connect to the actual interface. In that case, we should use lower-level monitoring, concentrating on the main pieces of the system that must be available. We should still attempt to have the highest-level checks possible. For example, we could check whether the web server software is running, whether we can connect to a TCP port, and whether we can connect to the backend database from the frontend system. Memory or disk usage on the database system, and low-level database health, do not matter from the high-level monitoring point of view. It should all be monitored, of course, but a delete query rate that is too high usually does not affect the top-level service. On the other hand, if a service goes down, we might be unable to see, in the same tree, that it happened because a disk filled up. That is an operational failure, though, and we can expect that the responsible personnel are using such low-level triggers with proper dependencies to resolve the issue.
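The TCP port check mentioned above can be sketched in a few lines of Python. This is only an illustration of what such a check tests, not what Zabbix runs internally; the function name and the default timeout are assumptions.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the host and completes the TCP handshake;
        # the with block closes the socket again straight away.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused connections, timeouts, and resolution failures all land here.
        return False
```

A successful connection only shows that something is listening on the port. It does not prove that the application behind it works, which is why higher-level checks are preferable when they are possible.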
