Chapter 5. Chaos Testing

A relatively famous OSS project called Chaos Monkey came from the developer team at Netflix, and its unveiling to the IT world was quite disruptive. The concept that Netflix had built code that randomly kills various services in its production environment blew people's minds. At a time when many teams struggle to maintain their uptime requirements, promoting self-sabotage and attacking one's own system seemed absolutely crazy. Yet from the moment Chaos Monkey was born, a new movement arose: chaos engineering.

According to the Principles of Chaos Engineering website, “Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” (You can read more at http://principlesofchaos.org/.)

In complex systems (software systems or ecological systems), things do and will fail, but the ultimate goal is to stop catastrophic failure of the overall system. So how do you verify that your overall system, your network of microservices, is in fact resilient? You inject a little chaos. With Istio, this is a relatively simple matter: because the istio-proxy intercepts all network traffic, it can alter responses, including the time it takes to respond. Two interesting faults that Istio makes easy to inject are HTTP error codes and network delays.

HTTP Errors

This simple concept allows you to explore your overall system's behavior when random faults pop up within the system. Throwing in some HTTP errors is actually very simple when using Istio's RouteRule construct. Based on exercises earlier in this book, recommendation v1 and v2 are both deployed and being randomly load balanced, because that is the default behavior in Kubernetes/OpenShift. Make sure to comment out the "timeout" line if you used it in a previous exercise. Now you will inject errors and timeouts via Istio instead of using Java code:

oc get pods -l app=recommendation -n tutorial
NAME                                 READY   STATUS   RESTARTS   AGE
recommendation-v1-3719512284-7mlzw   2/2     Running  6         18h
recommendation-v2-2815683430-vn77w   2/2     Running  0         3h

We use the Istio RouteRule to inject a percentage of faults, in this case returning an HTTP 503 on 50% of requests:

apiVersion: config.istio.io/v1alpha2
kind: RouteRule
metadata:
  name: recommendation-503
spec:
  destination:
    namespace: tutorial
    name: recommendation
  precedence: 2
  route:
  - labels:
      app: recommendation
  httpFault:
    abort:
      percent: 50       # inject the fault on roughly half of requests
      httpStatus: 503   # return HTTP 503 Service Unavailable

And you apply the RouteRule with the istioctl command-line tool:

istioctl create -f istiofiles/route-rule-recommendation-503.yml -n tutorial
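Before testing, you can double-check that the rule was registered by listing the route rules in the namespace. The get subcommand shown here is an assumption based on the istioctl releases contemporary with the RouteRule API, so verify it against your installed version:

istioctl get routerules -n tutorial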

Testing the change is as simple as issuing a few curl commands at the customer endpoint. Make sure to test it several times, looking for the resulting 503 approximately 50% of the time.

curl customer-tutorial.$(minishift ip).nip.io
customer => preference => recommendation v1 from '99634814-sf4cl': 88

curl customer-tutorial.$(minishift ip).nip.io
customer => 503 preference => 503 fault filter abort
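If you would rather measure the failure rate than eyeball individual responses, a short loop like the following will do. This is only a sketch: it greps each response body for the "fault filter abort" text shown in the output above, so adjust the match to whatever your services actually return:

#!/bin/bash
# Send 20 requests and count how many hit the injected fault.
total=20
errors=0
for i in $(seq 1 $total)
do
  if curl -s customer-tutorial.$(minishift ip).nip.io | grep -q "fault filter abort"
  then
    errors=$((errors + 1))
  fi
done
echo "$errors of $total responses hit the injected fault"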

Clean up:

istioctl delete -f istiofiles/route-rule-recommendation-503.yml -n tutorial

Delays

The most insidious of possible distributed computing faults is not a "dead" service but a service that is responding slowly, potentially causing a cascading failure in your network of services. More importantly, if your service has a specific Service-Level Agreement (SLA) it must meet, how do you verify that slowness in your dependencies does not cause you to fail to deliver to your awaiting customers? Injecting network delays lets you see how the system behaves when a critical service (or three) adds notable extra time to a percentage of its responses.

Much like HTTP fault injection, injecting network delays uses the RouteRule kind as well. The following YAML injects seven seconds of delay into 50% of the responses from the recommendation service:

apiVersion: config.istio.io/v1alpha2
kind: RouteRule
metadata:
  name: recommendation-delay
spec:
  destination:
    namespace: tutorial
    name: recommendation
  precedence: 2
  route:
  - labels:
      app: recommendation
  httpFault:
    delay:
      percent: 50        # delay roughly half of requests
      fixedDelay: 7s     # each delayed request waits seven seconds

Use the istioctl create command to apply the new RouteRule:

istioctl create -f istiofiles/route-rule-recommendation-delay.yml -n tutorial

Then, send a few requests at the customer endpoint. Note the time command at the front of each curl; it outputs the elapsed time of each response, allowing you to see that seven-second delay.

#!/bin/bash
while true
do
  time curl customer-tutorial.$(minishift ip).nip.io
  sleep .1
done

Notice that many requests to the customer endpoint now have a delay. If you are monitoring the logs for recommendation v1 and v2, you will also see that the delay happens before the recommendation service is actually called. The delay is in the Istio proxy (Envoy), not in the actual endpoint.

stern recommendation -n tutorial
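If your service has an SLA to meet, as discussed earlier, you can turn this observation into a pass/fail check by giving curl a hard time budget with its --max-time flag. This is a sketch assuming a one-second budget; with the seven-second delay rule applied, roughly half of these requests should fail:

# Abort any request that takes longer than 1 second end to end;
# curl exits with a nonzero status when the budget is exceeded.
curl --max-time 1 customer-tutorial.$(minishift ip).nip.io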

Clean up:

istioctl delete -f istiofiles/route-rule-recommendation-delay.yml -n tutorial