Now that you have a good knowledge of all the components of a Zabbix infrastructure, it is time to implement a highly available Zabbix installation. In a large environment, especially if you need to guarantee that all your servers are up and running, it is of vital importance to have a reliable Zabbix infrastructure. The monitoring system and Zabbix infrastructure should survive any possible disaster and guarantee business continuity.
High availability is one of the solutions that guarantee business continuity and provides a disaster recovery implementation; this kind of setup cannot be missed in this book.
This chapter begins with the definition of high availability, and it further describes how to implement an HA solution.
In particular, this chapter considers the three-tier setup that we described earlier:
We have described how to set up and configure each one of the components on high availability. All the procedures presented in this chapter have been implemented and tested in a real environment.
In this chapter, we will cover the following topics:
High availability is an architectural design approach and associated service implementation that is used to guarantee the reliability of a service. Availability is directly associated with the uptime and usability of a service. This means that the downtime should be reduced to achieve an agreement on that service.
We can distinguish between two kinds of downtimes:
To distinguish between scheduled downtimes, we can include:
Unfortunately, all these downtimes will interrupt your service, but you have to agree that they can be planned into a maintenance window that is agreed upon.
The unexpected downtime normally arises from a failure, and it can be caused by one of the following reasons:
Unscheduled downtimes also include power outages and high-temperature shutdown, and all these are not planned; however, they cause an outage. Hardware and software failure are quite easy to understand, whereas a physical event is an external event that produces an outage on our infrastructure. A practical example can be an outage that can be caused by lightning or a flood that leads to the breakdown of the electrical line with consequences on our infrastructure. The availability of a service is considered from the service user's point of view; for example, if we are monitoring a web application, we need to consider this application from the web user's point of view. This means that if all your servers are up and running, but a firewall is cutting connections and the service is not accessible, this service cannot be considered available.
3.15.41.27