The importance of DevOps-friendly apps

We may look at almost any software from three distinct angles. Each angle represents the needs of a different target audience that works with the system, namely the following:

  • Business users are interested in the business functions that the system provides
  • Developers want the system to be development-friendly
  • Operational teams want the system to be DevOps-friendly

Now let's explore the operational aspect of a software system. From the standpoint of a DevOps team member, a software system is DevOps-friendly when it is easy to support in production. This means that the system exposes proper health checks and metrics, and makes it possible to measure its performance and update its components smoothly. Furthermore, now that microservice architecture has become the default approach to software development, proper monitoring capabilities are mandatory even just to deploy to the cloud. Without a proper monitoring infrastructure, our software would not survive more than a few days in production.

The appearance of cloud environments for application deployment has simplified and democratized software delivery and operational processes by providing the proper infrastructure for even the most demanding software designs. IaaS, PaaS, and container management systems such as Kubernetes (https://kubernetes.io) or Apache Mesos (http://mesos.apache.org) have removed many headaches related to OS and network configuration, file backups, metrics gathering, automatic service scaling, and much more. However, such external services and techniques still cannot determine on their own whether our business application delivers the appropriate quality of service. Besides, a cloud provider cannot judge whether the underlying resources are used efficiently for the tasks our system performs. Such responsibility still rests on the shoulders of software developers and DevOps.

To operate software efficiently, especially when it consists of dozens or sometimes even thousands of services, we need ways to do the following:

  • Identify services
  • Check the service's health status
  • Monitor operational metrics
  • Look through the logs and dynamically change log levels
  • Trace requests or data flows

Let's go through each of these concerns one by one. Service identification is a must in a microservice architecture. This is because, in most cases, an orchestration system (such as Kubernetes) spawns a multitude of service instances on different nodes, shuffling them and creating and destroying nodes as client demand grows and shrinks. Even though containers or runnable JAR files usually have meaningful names, it is essential to be able to identify the service name, type, version, build time, and even the commit revision of the source code at runtime. This makes it possible to spot an incorrect or buggy service version in production, trace the changes that introduced a regression (if any), and automatically track the performance characteristics of different versions of the same service.
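To make this concrete, here is a minimal sketch of how such identification data might be exposed through Spring Boot Actuator's info endpoint. It assumes the spring-boot-starter-actuator dependency is on the classpath; the ServiceIdentityContributor class and the hard-coded values are hypothetical, invented for this illustration:

    import org.springframework.boot.actuate.info.Info;
    import org.springframework.boot.actuate.info.InfoContributor;
    import org.springframework.stereotype.Component;

    // A hypothetical contributor that enriches the /actuator/info endpoint
    // with identification details for the running service instance.
    @Component
    public class ServiceIdentityContributor implements InfoContributor {

        @Override
        public void contribute(Info.Builder builder) {
            // In a real build, these values would be injected from build
            // metadata rather than hard-coded as they are here.
            builder.withDetail("name", "payment-service")        // logical service name
                   .withDetail("version", "1.4.2")               // service version
                   .withDetail("buildTime", "2018-10-01T12:00:00Z")
                   .withDetail("commit", "f3a9c1d");             // source revision
        }
    }

In a real project, the Spring Boot build plugins can generate such metadata automatically (for example, with the spring-boot:build-info Maven goal), so that the info endpoint reports the actual build rather than hard-coded values.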

Beyond being able to distinguish between services at runtime, we want to know whether all services are healthy. If not, we want to know whether the failure is critical, depending on the role of the service. Service health check endpoints are often used by the orchestration system itself to identify and restart a failing service. It's not mandatory to have only two states here: healthy and unhealthy. Usually, a health status provides a whole set of essential checks, some of which are critical and some of which are not. For example, we may calculate a service's health level based on processing queue size, error rate, available disk space, and free memory. It is worth considering only essential metrics when calculating the overall health; otherwise, we risk building a service that can hardly ever be considered healthy. In general, the ability to respond to a health status request means that the service is at least capable of serving requests. This characteristic is often used by container management systems to check service availability and decide whether to restart a service.
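As a rough illustration, the following sketch implements a custom health check with Spring Boot Actuator's HealthIndicator interface. The ProcessingQueue collaborator and the queue-size threshold are hypothetical, invented for this example:

    import org.springframework.boot.actuate.health.Health;
    import org.springframework.boot.actuate.health.HealthIndicator;
    import org.springframework.stereotype.Component;

    // A hypothetical abstraction over the service's internal work queue.
    interface ProcessingQueue {
        int size();
    }

    // A hypothetical health check that reports the service as unhealthy
    // when the processing queue grows beyond a threshold.
    @Component
    public class QueueHealthIndicator implements HealthIndicator {

        private static final int MAX_QUEUE_SIZE = 10_000; // hypothetical limit

        private final ProcessingQueue queue;

        public QueueHealthIndicator(ProcessingQueue queue) {
            this.queue = queue;
        }

        @Override
        public Health health() {
            int size = queue.size();
            if (size > MAX_QUEUE_SIZE) {
                // Mark the service as down and attach a diagnostic detail.
                return Health.down().withDetail("queueSize", size).build();
            }
            return Health.up().withDetail("queueSize", size).build();
        }
    }

Spring Boot Actuator aggregates all registered HealthIndicator beans into the overall status exposed by the health endpoint, which is exactly what an orchestration system polls before deciding to restart a service.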

Even when a service operates correctly and has a healthy status, we often want a deeper insight into its operational details. A successful system not only consists of healthy components but also behaves in a predictable and acceptable way for the end user. The critical metrics for the system could include average response time, error rate, and, depending on a request's complexity, the time it takes to process a request. Understanding how our system behaves under load not only allows us to scale it adequately but also allows us to plan infrastructure expenses. It also enables spotting hot code paths, inefficient algorithms, and factors that limit scalability. Operational metrics give a snapshot of the current state of the system and bring a lot of value, but metrics are much more informative when gathered continuously. Gathered this way, they reveal trends and may deliver insights into correlated characteristics, such as service memory usage with regard to uptime. Wisely implemented metric reporters do not consume a lot of server resources, but keeping a metrics history over time requires additional infrastructure, usually a time-series database such as Graphite (https://graphiteapp.org), InfluxDB (https://www.influxdata.com), or Prometheus (https://prometheus.io). To visualize time series on meaningful dashboards and to set up alerts in response to critical situations, we often need additional monitoring software such as Grafana (https://grafana.com) or Zabbix (https://www.zabbix.com). Cloud platforms often provide such software to their clients in the form of extra services.
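The sketch below shows how such metrics might be recorded with Micrometer, the metrics facade used by Spring Boot Actuator. The service class, metric names, and processing logic are hypothetical examples:

    import io.micrometer.core.instrument.Counter;
    import io.micrometer.core.instrument.MeterRegistry;
    import io.micrometer.core.instrument.Timer;
    import org.springframework.stereotype.Service;

    // A hypothetical service that records a response-time distribution and
    // an error counter. A concrete backend (Prometheus, Graphite, and so on)
    // is plugged in through the matching micrometer-registry-* dependency.
    @Service
    public class OrderService {

        private final Timer processingTimer;
        private final Counter errorCounter;

        public OrderService(MeterRegistry registry) {
            this.processingTimer = Timer.builder("orders.processing.time")
                    .description("Time taken to process an order")
                    .register(registry);
            this.errorCounter = registry.counter("orders.processing.errors");
        }

        public void processOrder(String orderId) {
            // Time the whole processing step and count failures separately.
            processingTimer.record(() -> {
                try {
                    // ... actual order processing would happen here ...
                } catch (RuntimeException e) {
                    errorCounter.increment();
                    throw e;
                }
            });
        }
    }

A metrics reporter then ships these measurements to a time-series database at a configurable interval, so dashboards can plot trends and alerts can fire on anomalies.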

When monitoring a service's operational characteristics and investigating incidents, a DevOps team often reads logs. Nowadays, all application logs should ideally be stored, or at least analyzed, in one centralized place. For that purpose, in the Java ecosystem, the ELK stack (https://www.elastic.co/elk-stack), consisting of Elasticsearch, Logstash, and Kibana, is often used. Although such a software stack is excellent and makes it possible to treat dozens of services as one system, it is very inefficient to transfer logs for all logging levels over the network and to store them. Usually, it is enough to save INFO messages and turn on the DEBUG or TRACE levels only to investigate repeatable anomalies or errors. For such dynamic log level management, we need some hassle-free interfaces.
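Spring Boot Actuator provides such an interface out of the box in the form of its loggers endpoint. To show the underlying mechanism, here is a minimal sketch based on Spring Boot's LoggingSystem abstraction; the controller class and its URL path are hypothetical, and in practice the built-in endpoint already covers this use case:

    import org.springframework.boot.logging.LogLevel;
    import org.springframework.boot.logging.LoggingSystem;
    import org.springframework.web.bind.annotation.PathVariable;
    import org.springframework.web.bind.annotation.PostMapping;
    import org.springframework.web.bind.annotation.RequestParam;
    import org.springframework.web.bind.annotation.RestController;

    // A hypothetical endpoint that changes a logger's level at runtime,
    // for example: POST /admin/loggers/com.example.service?level=DEBUG
    @RestController
    public class LogLevelController {

        private final LoggingSystem loggingSystem;

        public LogLevelController(LoggingSystem loggingSystem) {
            this.loggingSystem = loggingSystem;
        }

        @PostMapping("/admin/loggers/{loggerName}")
        public void changeLogLevel(@PathVariable String loggerName,
                                   @RequestParam String level) {
            // Delegates to the underlying framework (Logback, Log4j2, and so on).
            loggingSystem.setLogLevel(loggerName, LogLevel.valueOf(level.toUpperCase()));
        }
    }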

When logs do not reveal enough to paint the whole picture, our last resort before laborious debugging is the ability to trace processes inside our software. Tracing could represent a detailed log of recent server requests, or it could describe the complete topology of subsequent requests, including queueing time, DB requests with timing, external calls with correlation IDs, and so on. Tracing is very helpful for visualizing request processing in real time, and it is indispensable when improving software performance. Distributed tracing does the same in a distributed system, making it possible to track all requests, messages, network delays, errors, and so on. Later in this chapter, we will describe how to enable distributed tracing with Spring Cloud Sleuth and Zipkin (https://zipkin.io).
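To give a flavor of the underlying idea before we get there, the following sketch manually propagates a correlation ID through SLF4J's MDC (Mapped Diagnostic Context), so that all log lines produced for one request can be stitched together. Spring Cloud Sleuth automates this kind of context propagation, with proper trace and span IDs; the filter class and header name here are hypothetical:

    import java.io.IOException;
    import java.util.UUID;

    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;
    import javax.servlet.http.HttpServletRequest;

    import org.slf4j.MDC;
    import org.springframework.stereotype.Component;

    // A hypothetical servlet filter that attaches a correlation ID to every
    // incoming request so its log lines can be grouped during analysis.
    @Component
    public class CorrelationIdFilter implements Filter {

        private static final String HEADER = "X-Correlation-Id"; // hypothetical header

        @Override
        public void doFilter(ServletRequest request, ServletResponse response,
                             FilterChain chain) throws IOException, ServletException {
            // Reuse the caller's ID if present; otherwise generate a fresh one.
            String correlationId = ((HttpServletRequest) request).getHeader(HEADER);
            if (correlationId == null) {
                correlationId = UUID.randomUUID().toString();
            }
            MDC.put("correlationId", correlationId);
            try {
                chain.doFilter(request, response);
            } finally {
                MDC.remove("correlationId"); // do not leak into other requests
            }
        }
    }

With a log pattern that includes %X{correlationId}, every log line of a request then carries the same ID, which is the basic building block that distributed tracing systems generalize across service boundaries.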

Adrian Cole, the head of the Zipkin project, gave a great talk about the importance of different perspectives in application monitoring: https://www.dotconferences.com/2017/04/adrian-cole-observability-3-ways-logging-metrics-tracing.

Most importantly, for the successful operation of a software system, all of the techniques mentioned are required, or at least desired. Fortunately, in the Spring ecosystem, we have Spring Boot Actuator.

All previously mentioned operational techniques are well understood and well described for an ordinary Servlet-based application. However, since reactive programming on the Java platform is still a novelty, achieving similar objectives for a reactive application may require code modifications or even entirely different implementation approaches.

Nevertheless, from an operational standpoint, a service implemented with reactive programming should not differ from an ordinary, well-implemented synchronous service. It should follow the Twelve-Factor App guidelines (https://12factor.net), it should be DevOps-friendly, and it should be easy to operate and evolve.
