IT reliability challenges and solution approaches

A resilient application keeps processing data and executing transactions even when one or more of its components fail for internal or external reasons. That is, when a software system comes under stress or attack, it has to find a way to survive or to return quickly to its original state. It is therefore imperative that every complex, mission-critical system is designed, developed, and deployed with the most applicable resiliency properties; a system that does not treat resiliency as a core feature is bound to fail at some point. For the forthcoming IoT, blockchain, and AI-enabled era of digital transformation and intelligence, resilient IT infrastructures, platforms, and software applications are the most important ingredients for attaining the envisaged success. Let's digress a bit and discuss why this is so.

Highly distributed environments pose new challenges. Centralized computing is paving the way for decentralized and distributed computing to store and analyze big data and to meet large-scale processing needs. Single servers make way for clustered, grid, and cloud servers, and even appliances and Hyper-Converged Infrastructures (HCI) are clubbed together to handle big-data capture, storage, and processing. But distributed computing brings several challenges of its own. Chief among them are the intricate dependencies it creates: relying on third-party and other external services can wreak havoc, because there is no guarantee that remotely hosted servers and services are available and responsive at all times. We have to be mindful that each and every integration point poses one or more risks. Integrations may fail at some point, and we have to be technologically prepared for any unexpected and unwanted failure. The lesson is that we cannot put too much faith in an external source: every call to a third-party service or a backend database is liable to break at some point. If we fail to protect ourselves from these breakages, our systems may ultimately become unusable, and we may not be able to honour the service level agreement (SLA) and operational level agreement (OLA) qualities agreed with service consumers and clients.

Performance bottlenecks surface as many disparate and distributed systems start to interact with one another to fulfil business goals. Meeting the various non-functional requirements (NFRs), that is, the Quality of Service (QoS) attributes, remains a core challenge in distributed environments. Completely avoiding failure in any complicated and sophisticated system, especially an IT system, is next to impossible. Therefore, we have to deftly design and empower systems to cope with failures and to surmount them automatically. The key to resiliency is designing in such a way that a failure does not propagate to other components and layers and bring down the whole system: any fault in one component has to be proactively and pre-emptively captured and contained so that the other components can continue with their tasks.

Generally speaking, there are two prominent failure modes: an external resource might respond with an error immediately, or it might respond slowly. Receiving an error straightaway is always preferable to a slow response. It is therefore recommended to explicitly specify a time-out when calling an external resource, which ensures that you get a reply quickly; otherwise, the request ties up the execution threads. If no time-out is specified, then in the worst case all the threads become blocked, resulting in disaster. Again, the point is that a failure in one part of the system must not cascade to other parts and bring down the entire system.

Employing circuit breakers is an important resiliency pattern here. In the normal closed state, the circuit breaker executes operations as usual, and calls are forwarded to the external resource. When a call fails to get a proper response, the circuit breaker makes a note of the failure. If the number of failures exceeds a certain limit, the circuit breaker trips and opens the circuit; thereafter, no call is allowed, and calls fail immediately without any attempt to execute the real, requested operation. After a while, the circuit breaker switches into a half-open state, in which the next call is allowed to execute the operation. Depending on the outcome of that call, the circuit breaker switches back to either the closed or the open state.
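A minimal sketch of such a circuit breaker is shown below. The class name, the failure threshold, and the open interval are illustrative choices, not a prescribed implementation:

```java
import java.time.Duration;
import java.time.Instant;

// Minimal circuit breaker sketch; threshold and open interval are illustrative values.
public class SimpleCircuitBreaker {

    private enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final Duration openInterval;

    private State state = State.CLOSED;
    private int failureCount = 0;
    private Instant openedAt;

    public SimpleCircuitBreaker(int failureThreshold, Duration openInterval) {
        this.failureThreshold = failureThreshold;
        this.openInterval = openInterval;
    }

    // Wraps a call to an external resource (passed in as a Supplier).
    public synchronized <T> T call(java.util.function.Supplier<T> remoteCall) {
        if (state == State.OPEN) {
            if (Duration.between(openedAt, Instant.now()).compareTo(openInterval) > 0) {
                state = State.HALF_OPEN;   // allow a single trial call after the wait
            } else {
                throw new IllegalStateException("Circuit open: failing fast");
            }
        }
        try {
            T result = remoteCall.get();
            reset();                       // a successful call closes the circuit again
            return result;
        } catch (RuntimeException e) {
            recordFailure();
            throw e;
        }
    }

    private void reset() {
        failureCount = 0;
        state = State.CLOSED;
    }

    private void recordFailure() {
        failureCount++;
        if (state == State.HALF_OPEN || failureCount >= failureThreshold) {
            state = State.OPEN;
            openedAt = Instant.now();
        }
    }
}
```

A caller would wrap each request to the external resource in call(), typically combined with an explicit time-out on the underlying HTTP client so that a hung connection also counts as a failure rather than blocking a thread.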

The following diagram illustrates these circuit breaker state transitions:

Hystrix is an open source framework developed by Netflix. It provides an implementation of the well-known circuit breaker pattern and helps in building resilient services. The key advantages of the circuit breaker pattern are as follows:

  • Fail fast and rapid recovery
  • Prevent cascading failure
  • Fall back and gracefully degrade when possible

The Hystrix library provides the following features:

  • Implements the circuit breaker pattern
  • Provides near real-time monitoring via the Hystrix stream and the Hystrix dashboard
  • Generates monitoring events that can be published to external systems such as Graphite
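A minimal Hystrix command, sketched below, illustrates the fail-fast and fall-back behaviour listed above. The names ProductLookupCommand and callRemoteCatalog are hypothetical placeholders for a real downstream call:

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

// Illustrative Hystrix command wrapping a call to a hypothetical product catalog service.
public class ProductLookupCommand extends HystrixCommand<String> {

    private final String productId;

    public ProductLookupCommand(String productId) {
        super(HystrixCommandGroupKey.Factory.asKey("ProductCatalog"));
        this.productId = productId;
    }

    @Override
    protected String run() throws Exception {
        // Call the remote service; failures and time-outs feed the circuit breaker.
        return callRemoteCatalog(productId);
    }

    @Override
    protected String getFallback() {
        // Graceful degradation: a default response when the call fails or the circuit is open.
        return "unknown-product";
    }

    private String callRemoteCatalog(String id) {
        // Placeholder for a real HTTP call to the downstream service.
        throw new RuntimeException("downstream service unavailable");
    }
}
```

Executing the command with new ProductLookupCommand("42").execute() returns the fallback value whenever the downstream call fails, and once enough failures accumulate, Hystrix opens the circuit and short-circuits subsequent calls straight to the fallback.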

As services are built and released by different teams, it is not prudent to expect each service to have the same performance and reliability characteristics. Because of its dependency on downstream services, the performance and reliability of any calling service may vary considerably. A front-facing service may be held hostage if one or more of the services it depends on are unable to meet their SLAs, which in turn impacts the SLA of the calling service and ultimately results in a bad experience for service users. This situation is pictorially represented here:

The key parameters affecting the SLAs of a microservices architecture (MSA) are as follows (a small time-out configuration sketch follows this list):

  1. Connection timeouts: These occur when the client is unable to connect to a service within a given time frame. This may be caused by a slow or unresponsive service.
  2. Read timeouts: These occur when the client is unable to read the results from the service within a given time frame. The service may be doing a lot of computation or using inefficient ways to prepare the data to be returned.
  3. Exceptions: These are caused by reasons such as the following:
    • The client sends bad data to the requested service
    • The service is down, or there is an issue with the service
    • The client faces issues while parsing the response
    • Changes may have been made to the service that the client is unaware of
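As an illustration of the first two parameters, the standard java.net.http client (Java 11+) lets a caller bound both the connection phase and the overall wait for a response. The URL and durations below are placeholders, not recommended values:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class TimeoutAwareClient {

    public static void main(String[] args) throws Exception {
        // Connection timeout: how long to wait for the TCP connection to be established.
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))
                .build();

        // Request timeout: how long to wait for the response to arrive (covers the read).
        HttpRequest request = HttpRequest.newBuilder(URI.create("http://example.org/api/orders"))
                .timeout(Duration.ofSeconds(5))
                .build();

        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
    }
}
```

Bounding both phases ensures that a slow or unresponsive downstream service produces a prompt, catchable exception instead of tying up the calling thread indefinitely.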

Thus, implementing resilient microservices, and thereby realizing reliable software applications, is beset with a number of issues. Hence the insistence on enabling patterns, easy-to-understand best practices, integrated platforms, evaluation metrics, and viable solution approaches.
