A relatively famous OSS project called Chaos Monkey came from the developer team at Netflix, and its unveiling to the IT world was quite disruptive. The concept that Netflix had built code that randomly kills various services in their production environment blew people's minds. When many teams struggle to maintain their uptime requirements, deliberate self-sabotage and attacking oneself seemed absolutely crazy. Yet from the moment Chaos Monkey was born, a new movement arose: chaos engineering.
According to the Principles of Chaos Engineering website, “Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” (You can read more at http://principlesofchaos.org/.)
In complex systems (software systems or ecological systems), things can and will fail, but the ultimate goal is to stop catastrophic failure of the overall system. So how do you verify that your overall system, your network of microservices, is in fact resilient? You inject a little chaos. With Istio, this is a relatively simple matter: because the istio-proxy is intercepting all network traffic, it can alter the responses, including the time it takes to respond. Two interesting faults that Istio makes easy to inject are HTTP error codes and network delays.
This simple concept allows you to explore your overall system's behavior when random faults pop up within the system. Throwing in some HTTP errors is actually very simple when using Istio's RouteRule construct. Based on the exercises earlier in this book, recommendation v1 and v2 are both deployed and being randomly load balanced because that is the default behavior in Kubernetes/OpenShift. Make sure to comment out the "timeout" line if that was used in a previous exercise. Now you will be injecting errors and timeouts via Istio instead of using Java code:
oc get pods -l app=recommendation -n tutorial
NAME                                 READY   STATUS    RESTARTS   AGE
recommendation-v1-3719512284-7mlzw   2/2     Running   6          18h
recommendation-v2-2815683430-vn77w   2/2     Running   0          3h
We use the Istio RouteRule to inject faults into a percentage of requests, in this case returning an HTTP 503 for 50% of them:
apiVersion: config.istio.io/v1alpha2
kind: RouteRule
metadata:
  name: recommendation-503
spec:
  destination:
    namespace: tutorial
    name: recommendation
  precedence: 2
  route:
  - labels:
      app: recommendation
  httpFault:
    abort:
      percent: 50
      httpStatus: 503
And you apply the RouteRule with the istioctl command-line tool:
istioctl create -f istiofiles/route-rule-recommendation-503.yml -n tutorial
Testing the change is as simple as issuing a few curl commands at the customer endpoint. Make sure to test it several times, looking for the resulting 503 approximately 50% of the time.
curl customer-tutorial.$(minishift ip).nip.io
customer => preference => recommendation v1 from '99634814-sf4cl': 88

curl customer-tutorial.$(minishift ip).nip.io
customer => 503 preference => 503 fault filter abort
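Rather than eyeballing individual curl responses, you can measure the observed fault rate over a batch of requests. This is a minimal sketch, not part of the tutorial's istiofiles; it assumes the same customer-tutorial endpoint from the earlier exercises and uses curl's -w '%{http_code}' to capture only the status code:

```shell
#!/bin/bash
# Sketch: issue a batch of requests and count how many hit the injected 503.
# Assumes the customer-tutorial route from the earlier exercises is reachable.
total=20
failures=0
for i in $(seq 1 $total); do
  # -s silences progress output, -o /dev/null discards the body,
  # -w '%{http_code}' prints only the HTTP status code
  code=$(curl -s -o /dev/null -w '%{http_code}' \
    customer-tutorial.$(minishift ip).nip.io)
  if [ "$code" = "503" ]; then
    failures=$((failures + 1))
  fi
done
echo "$failures of $total requests returned 503"
```

With the rule above set to percent: 50, you should see roughly half the requests fail, though small batches will vary around that number.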
Clean up:
istioctl delete -f istiofiles/route-rule-recommendation-503.yml -n tutorial
The most insidious of possible distributed computing faults is not a "dead" service but a service that is responding slowly, potentially causing a cascading failure in your network of services. More importantly, if your service has a specific Service-Level Agreement (SLA) it must meet, how do you verify that slowness in your dependencies does not cause you to fail in delivering to your awaiting customer? Injecting network delays allows you to see how the system behaves when a critical service (or three) adds notable extra time to a percentage of responses.
Much like HTTP fault injection, network delays use the RouteRule kind as well. The following YAML injects a seven-second delay into 50% of the responses from the recommendation service:
apiVersion: config.istio.io/v1alpha2
kind: RouteRule
metadata:
  name: recommendation-delay
spec:
  destination:
    namespace: tutorial
    name: recommendation
  precedence: 2
  route:
  - labels:
      app: recommendation
  httpFault:
    delay:
      percent: 50
      fixedDelay: 7s
Use the istioctl create command to apply the new RouteRule:
istioctl create -f istiofiles/route-rule-recommendation-delay.yml -n tutorial
Then send a few requests to the customer endpoint and notice the time command at the front. It outputs the elapsed time for each response to the curl command, allowing you to see that seven-second delay.
#!/bin/bash
while true
do
  time curl customer-tutorial.$(minishift ip).nip.io
  sleep .1
done
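To connect this back to the SLA question, you can also observe the delay the way an SLA-bound caller would: by giving each request a hard client-side deadline. This is a sketch, not part of the tutorial's scripts; it assumes the same customer-tutorial endpoint, picks an arbitrary 2-second budget, and relies on the fact that curl exits with status 28 when its --max-time deadline is exceeded:

```shell
#!/bin/bash
# Sketch: impose a 2-second client-side budget on each request.
# curl exits with status 28 when --max-time is exceeded, which is how a
# deadline-bound caller experiences the injected 7-second delay.
if curl -s --max-time 2 customer-tutorial.$(minishift ip).nip.io; then
  echo "within SLA"
else
  # $? here still holds the exit status of the curl in the condition
  if [ $? -eq 28 ]; then
    echo "SLA blown: request timed out"
  fi
fi
```

With the delay rule set to percent: 50, roughly half the runs should report a blown SLA, which is exactly the kind of behavior you want to surface before a real dependency slows down.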
Notice that many requests to the customer endpoint now have a delay. If you are monitoring the logs for recommendation v1 and v2, you will also see that the delay occurs before the recommendation service is actually called. The delay is in the Istio proxy (Envoy), not in the actual endpoint.
stern recommendation -n tutorial
Clean up:
istioctl delete -f istiofiles/route-rule-recommendation-delay.yml -n tutorial