Infrastructure as code 

The most widely quoted benefits of infrastructure as code (IaC) are repeatability and reproducibility. A data center contains a number of components (servers, networks, security, storage, and so on) that need to be configured before applications can be deployed, and in cloud environments there are thousands of such components. Doing all of this manually takes an enormous amount of time and is error-prone: configuration differences creep in and environments drift apart. Humans are not great at carrying out repetitive, manual tasks with 100% accuracy, whereas machines excel at repetitive, routine work at scale and speed. If we produce a template and feed it to a machine, the machine can execute that template a thousand times without error. This template-centric approach to infrastructure provisioning, configuration, and application deployment is attracting wider attention these days. Infrastructure optimization and management are elegantly simplified through well-designed templates. As the concept of IaC picks up steadily, infrastructure setup and maintenance are being streamlined with ease. With enhanced visibility, controllability, and observability, IT infrastructures can be manipulated just as we program software applications, and infrastructure life cycle management activities are being automated through a host of advancements in the IaC space. 

Immutable infrastructure: Instead of being updated, immutable components are replaced for every deployment. That is, no updates are performed on live systems; a new instance of the resource is provisioned instead. Containers are the best example of an immutable infrastructure resource. Similarly, fresh instances of the various AWS images are created and deployed instead of updating the existing instances.

To support application deployment on immutable infrastructure, canary deployment is recommended. Canary deployment reduces the risk of failure when new versions of applications go into production environments. It gradually rolls the new version out to a small set of users and then expands it until it is available to everyone. This is illustrated in the following diagram:

The real benefit of canary deployment is that the new version can be rolled back if any issues surface. Thus, faster yet safer deployment of applications with real production data is facilitated through the canary deployment model.
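
As a minimal, platform-agnostic illustration, the following Python sketch routes a small, configurable fraction of requests to a hypothetical canary handler and the rest to the stable version; the handler names and the 5% weight are assumptions made for this example, and rolling back amounts to setting the canary weight back to zero.

import random

CANARY_WEIGHT = 0.05  # route roughly 5% of traffic to the canary release

def handle_with_stable(request):
    # placeholder for the current production version
    return f"stable handled {request}"

def handle_with_canary(request):
    # placeholder for the new version under evaluation
    return f"canary handled {request}"

def route(request):
    # weighted routing: a small, adjustable slice of traffic exercises the canary
    if random.random() < CANARY_WEIGHT:
        return handle_with_canary(request)
    return handle_with_stable(request)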

Stateless applications: As mentioned previously, applications have to be stateless to enable auto-scaling, and statelessness is equally important for immutable infrastructures. That is, any request can be handled by any resource. Stateless applications respond to every client request independently of prior requests or sessions, so there is no need for the application to store any information in memory or on local disks. Keeping state information in the application server can degrade performance when there is a huge number of requests from different users. Generally, state information is shared across the resources within an auto-scaling group through in-memory databases (IMDBs) and in-memory data grids (IMDGs); popular products include Redis, Memcached, and Apache Ignite. Thus, to build reliable infrastructure and applications, we need IaC, stateless applications, immutable infrastructure, automation through DevOps tools, and so on.
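
As a minimal sketch of externalizing state, the following snippet keeps session data in Redis through the redis-py client instead of in the application server's memory; the host name, key-naming scheme, and one-hour expiry are illustrative assumptions.

import json
import redis

# any instance in the auto-scaling group can reach the same shared store
store = redis.Redis(host="sessions.example.internal", port=6379)

def save_session(session_id, data, ttl_seconds=3600):
    # serialize the session and let Redis expire it after the TTL
    store.setex(f"session:{session_id}", ttl_seconds, json.dumps(data))

def load_session(session_id):
    # any stateless instance can serve the next request for this session
    raw = store.get(f"session:{session_id}")
    return json.loads(raw) if raw else None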

Avoiding cascading failures: Generally, an error or misbehavior in one component of a system can quickly propagate across the system and bring the whole system down. Thus, it is mandatory to find and use competent techniques that intrinsically help to avoid such cascading failures. A classic example of a cascading failure is overload: when one component is stressed by heavy load and in utter distress, all the other components that depend on it may be made to wait excessively. Precious and expensive resources are exhausted and, as a result, the whole system may grind to a halt. Thus, identifying and isolating faults preemptively is vital for the intended success of any complicated and sophisticated system. There are a few widely accentuated approaches and algorithms for avoiding cascading failures.

Back-off algorithms: Various business evolutions are pushing us towards distributed computing. Firstly, due to the varying size, speed, structure, and scope of the business, social, and device data being generated and collected, we need highly optimized and organized IT infrastructures and integrated platforms for efficient data virtualization, cleansing, storage, and processing. We aspire to have groundbreaking platforms and infrastructures for performing big, real-time, and streaming data analytics. Secondly, we have plenty of highly integrated and insights-driven applications. That is, we are destined to have both data- and process-intensive applications, and distributed computing is the way forward. Large-scale, complex applications are meticulously partitioned into a number of easily producible and manageable application components/services, and these modules are distributed and decentralized.

The key IT components for distributed applications include web/application servers, load balancers, firewalls, sharded and dedicated databases, DNS servers, and so on. There are a few crucial challenges associated with distributed systems: security, service discovery, service integration, service availability and reliability, network latency, and so on are the widely discussed issues of distributed computing. Experts and evangelists have studied these problems thoroughly and have recommended a series of best practices, evaluation metrics, architecture and design patterns, and so on. In Chapter 3, Microservice Resiliency Patterns, we detailed a set of resiliency patterns for coming up with reliable systems. Besides, there are resiliency-enablement frameworks, platform solutions, programming models, and so on to sufficiently enhance system and service reliability.

Retry: The standard technique for tackling the well-known issues of distributed computing is the proven retry method: service requesters attempt to redo a failed operation as soon as an error occurs. The issue is that when there is a large number of requesters, the network can start to feel the stress. That is, the network bandwidth gets drained and the system is bound to collapse. To avoid such scenarios, back-off algorithms such as the common exponential back-off are recommended. Exponential back-off algorithms gradually increase the waiting time between successive retries, thereby reducing the rate at which retries are performed. This way, network congestion can be largely avoided.

In its simplest form, a pseudo exponential back-off algorithm looks as follows:
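
(The Python rendering below is a minimal sketch of the idea; the retry limit, base delay, delay cap, and jitter are illustrative assumptions rather than prescribed values.)

import random
import time

def call_with_backoff(operation, max_retries=5, base_delay=0.5, max_delay=30.0):
    """Retry an operation, doubling the wait time after every failure."""
    for attempt in range(max_retries):
        try:
            return operation()
        except Exception:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            # exponential back-off with a cap, plus jitter to spread out retries
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))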

Timeouts: This is another resiliency-guaranteeing method. Suppose there is steady baseline traffic and, all of a sudden, the database slows down and INSERT queries take more time to respond. The baseline traffic has not changed, so the reason for the sudden slowdown is that more request threads are holding on to database connections. As a result, the pool of database connections shrinks significantly, no connections are left in the pool to serve any other API, and the other APIs start to fail. This is a classic example of cascading failure. If the API had timed out instead of clinging on to the database, the service performance would merely have degraded instead of ending in a complete failure. Thus, a timeout mechanism has to be in place to achieve service resiliency. 
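
As a hedged sketch of the idea, the snippet below assumes a PostgreSQL database accessed through the psycopg2 driver and asks the server to cancel any statement that runs longer than two seconds, so a slow INSERT releases its connection instead of starving the pool; the connection string, table, and limits are illustrative.

import psycopg2

# connect_timeout bounds connection setup; statement_timeout (in milliseconds)
# asks the server to cancel any query that runs longer than two seconds
conn = psycopg2.connect(
    "dbname=orders user=app host=db.example.internal",
    connect_timeout=5,
    options="-c statement_timeout=2000",
)

def insert_order(order_id, payload):
    try:
        with conn, conn.cursor() as cur:
            cur.execute(
                "INSERT INTO orders (id, payload) VALUES (%s, %s)",
                (order_id, payload),
            )
        return True
    except psycopg2.Error:
        # the statement timed out or failed; the transaction is rolled back and the
        # connection is freed for other requests, and the caller can degrade or retry
        return False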

Idempotent operations: This is an important facet of ensuring data consistency and integrity. Suppose a client request is sent as a message over HTTP to an application and, due to a transient error, the application replies with a timeout. The request message may nevertheless have been received and processed by the application, but because of the timeout response, the user goes for the retry option.

Suppose the request is an INSERT into the backend database. When the retry option is applied, there is a possibility of a repeat insertion of the same data. Such errors can be avoided if the application implements idempotent operations. An idempotent operation is one that can be repeated any number of times without affecting the application in any way; importantly, the same result is delivered however many times it is tried.
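
One common way to make such an INSERT idempotent is to attach a unique idempotency key to each logical client request and ignore duplicates on the server side. The following sketch uses SQLite and its ON CONFLICT clause purely for illustration; the table, key, and values are assumptions made for the example.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments (idempotency_key TEXT PRIMARY KEY, amount REAL)"
)

def record_payment(idempotency_key, amount):
    # replaying the same request is harmless: the duplicate key turns the
    # second and later inserts into no-ops, so retries never duplicate data
    with conn:
        conn.execute(
            "INSERT INTO payments (idempotency_key, amount) VALUES (?, ?) "
            "ON CONFLICT (idempotency_key) DO NOTHING",
            (idempotency_key, amount),
        )

# the client sends the same key with every retry of the same logical request
record_payment("req-42", 99.0)
record_payment("req-42", 99.0)  # retried after a timeout; no second row is created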

Service degradation and fallbacks: Instead of shutting down completely, an application can be allowed to degrade and provide a lower-quality service. That is, the application response may be a bit slower, or the throughput of the application may be on the lower side. This is a trade-off to be made instead of letting the application fail, and one fallback option or another has to be employed. 

As we all know, enterprise-scale and mission-critical applications are being built out of distributed and decentralized microservices. With the adoption of containerized microservices, the availability of microservices is significantly increased by deploying multiple instances of each service. That is, multiple instances of the same microservice are deployed through containers, which have emerged as the most optimal runtime/execution environment for microservices. When one service instance experiences difficulties, another instance can be asked to deliver the expected functionality. The API gateway works as the mediator and coordinator of microservices, and location intelligence, along with network latency, plays a vital role in deciding which instance serves in place of the troubled one.

Let's see what happens if the failing component is a database. If INSERT queries become slow, it is prudent to time out and then fall back to read-only mode on the database until the issue with the INSERTs gets sorted out. If the application does not support a read-only mode, then cached content can be returned to the user.
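
The following sketch shows one way such a fallback could look, assuming a hypothetical fetch_from_database function and a simple in-process cache; when the query fails or times out, the last known good copy is served instead of an error.

cache = {}  # last known good responses, keyed by product id

def fetch_from_database(product_id):
    # placeholder for the real query; assumed to raise on timeout or failure
    raise TimeoutError("database is overloaded")

def get_product(product_id):
    try:
        product = fetch_from_database(product_id)
        cache[product_id] = product          # refresh the fallback copy
        return product
    except Exception:
        # degrade gracefully: serve possibly stale data rather than failing outright
        if product_id in cache:
            return cache[product_id]
        raise  # nothing cached; surface the error to the caller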

Resilience against intermittent and transient errors: With cloud environments emerging as the one-stop IT solution for automating, augmenting, and accelerating business processes and operations, enterprise-class applications are being meticulously modernized and migrated to clouds to reap all the widely articulated benefits of the cloud idea. Building and deploying application components in a distributed manner with centralized governance is becoming the new norm as hybrid IT gains a grip on enterprises. Several problems crop up with large-scale distributed systems in cloud environments: due to the exorbitant rise in the size and complexity (induced by heterogeneity and multiplicity) of the systems and architectures used, the occurrence of intermittent errors cannot be ruled out. Well-known intermittent errors include transient loss of network connectivity, request timeouts, failed I/O operations, and dependencies on external services that become overloaded.

Hence, there is a clarion call for producing resilient systems that can intelligently return to their previous and preferred state when attacked or affected. The best practice is to design systems for failure rather than trying to make them fail-proof. The complications of modern IT systems demand that we design, develop, and deploy resilient and versatile applications; applications have to be extremely fault-tolerant to continuously deliver their ordained functionality. One idea is to collect statistical data about the various intermittent errors and, based on that information, define a threshold that can trigger the correct reaction to errors.

Circuit breaking: This is a widely implemented resiliency technique used by various web-scale service providers. It is all about applying circuit breakers to failing method calls to avoid any kind of catastrophic and cascading failure. As we stated earlier, timeouts, back-off, allowing service degradation, fallbacks, and intermittent error handling are the key methods for preventing cascading failures.

A circuit breaker monitors the number of consecutive failures between a consumer and a producer. If that number crosses a failure threshold, the circuit breaker object trips, and all further attempts by the consumer to invoke the producer fail immediately or return a defined fallback. After a waiting period, the circuit breaker allows a few requests to pass through. If those requests pass a success threshold, the circuit breaker resumes its normal state; otherwise, it stays open and continues monitoring the producer. The following is a diagram that shows a circuit breaker with timeouts:

Circuit breaking is presented as one of the key resiliency methods, and such resiliency-enabling mechanisms inspire software architects and engineers to incorporate resiliency measures while designing and building software systems. There are open source implementations such as Netflix Hystrix, and they can be incorporated into the source code to arrive at resilient applications.
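
As a minimal sketch (not the Hystrix implementation itself), the following circuit breaker trips after a configurable number of consecutive failures and lets a trial request through once a waiting period has elapsed; the thresholds and timeout are illustrative assumptions.

import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation, fallback=None):
        # while open, fail fast (or fall back) until the waiting period expires
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                if fallback is not None:
                    return fallback()
                raise RuntimeError("circuit is open; failing fast")
            self.opened_at = None  # half-open: allow a trial request through
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()  # trip the breaker
            raise
        self.failures = 0  # success closes the circuit again
        return result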

Load balancing: This is another widely used resiliency technique in enterprise and cloud IT environments. With more services and service instances being packed into any IT environment, there is a growing need for load balancers that distribute application and device requests across the multiple instances of one or more application components by understanding the latest load on each of them. A load balancer takes into account how busy every instance is when routing client requests, so that the load stays balanced. This clearly helps in maintaining the system's resiliency.

There are hardware and software load balancers on the market. Consistent and continuous health checks of microservices are vital for their continued service delivery, and load balancers are capable of performing them, as depicted in the following diagram. Load balancers continuously probe their service instances, and if an instance is not available to fulfill service requests, the load balancer redirects those requests to one of the functioning instances. As mentioned previously, there can be several reasons for services to go down or become unable to fulfill their obligations in time: the service database may not be responding, the service may be overwhelmed with requests, or the network may be transiently unavailable.
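
To make the probing and rerouting behavior concrete, here is a hedged Python sketch of a round-robin balancer that skips instances failing their health check; the instance addresses and the is_healthy probe are placeholders rather than a real product's API.

import itertools

instances = ["10.0.0.11:8080", "10.0.0.12:8080", "10.0.0.13:8080"]
rotation = itertools.cycle(instances)

def is_healthy(instance):
    # placeholder for a real probe, for example an HTTP GET against /health
    return True

def pick_instance():
    # try each instance at most once per request, skipping unhealthy ones
    for _ in range(len(instances)):
        candidate = next(rotation)
        if is_healthy(candidate):
            return candidate
    raise RuntimeError("no healthy instances available")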

With the multiplicity of microservices, both the services and their communications have to be made resilient. There are several straightforward ways and means of ensuring service resiliency, as discussed in the previous sections. Once we have resilient services, composing them through the various composition methods leads to reliable systems.
