Randomly breaking your app can lead to better code

"The best way to avoid failure is to fail constantly." (Netflix, http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html)

How to do it...

Netflix has built a tool they call a Chaos Monkey. Its job is to randomly kill instances and services. This forces the developers to make sure their system can fail smoothly and safely.

To build our own version of this, some of the things we would need it to do include:

  • Randomly kill processes
  • Inject faulty data at interface points
  • Shut down network interfaces between distributed systems
  • Issue shutdown commands to subsystems
  • Create denial-of-service attacks by overloading interface points with too much data

This is a starting point. The idea is to inject errors wherever you can imagine them happening. This may require writing scripts, cron jobs, or any means necessary to cause these errors to happen.
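As a minimal sketch of the first item in the list, the following script picks one running process at random and kills it. The helper names (kill_random_process, start_dummy_worker) are hypothetical, and the "workers" are just sleeping child processes standing in for real services; a real version would need to find the actual processes of the system under test.

```python
import random
import subprocess
import sys


def kill_random_process(processes, rng=random):
    """Kill one randomly chosen, still-running process and return it.

    Returns None if no process in the list is still running.
    """
    running = [p for p in processes if p.poll() is None]
    if not running:
        return None
    victim = rng.choice(running)
    victim.kill()
    victim.wait()  # reap the process so its return code is recorded
    return victim


def start_dummy_worker():
    """Start a stand-in worker: a child Python process that just sleeps."""
    return subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )


if __name__ == "__main__":
    workers = [start_dummy_worker() for _ in range(3)]
    victim = kill_random_process(workers)
    print("Killed worker with pid", victim.pid)
    # Clean up the remaining stand-in workers.
    for worker in workers:
        if worker.poll() is None:
            worker.kill()
            worker.wait()
```

Run from a cron job or a loop with a random delay, a script along these lines gives the rest of the system a chance to show whether it notices and recovers from the lost process.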

How it works...

Given that a remote system may be unavailable in production, we should introduce ways for this to happen in our development environment. This will encourage us to build more fault tolerance into our system.
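One way to make a remote system "unavailable" in development is to wrap the call that reaches it so that it fails some of the time. The sketch below assumes the remote call is an ordinary Python function; the decorator name unreliable and the helper fetch_with_fallback are hypothetical, and ConnectionError is used as a stand-in for whatever error the real client library raises.

```python
import functools
import random


def unreliable(failure_rate, rng=random):
    """Make the decorated function raise ConnectionError some of the time.

    failure_rate is the probability (0.0 to 1.0) of an injected failure.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if rng.random() < failure_rate:
                raise ConnectionError("injected fault: remote system unavailable")
            return func(*args, **kwargs)
        return wrapper
    return decorator


def fetch_with_fallback(fetch, default):
    """Call fetch(); fall back to a default value if the remote is down."""
    try:
        return fetch()
    except ConnectionError:
        return default


@unreliable(failure_rate=0.3)
def fetch_price():
    """Hypothetical remote call; here it just returns a fixed value."""
    return 42


if __name__ == "__main__":
    # Roughly 30% of these calls fall back to the default value.
    for _ in range(5):
        print(fetch_with_fallback(fetch_price, default=None))
```

Code that calls fetch_price directly, without any handling for ConnectionError, will start failing in development long before it has a chance to fail in production.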

Before we introduce a randomly running "Chaos Monkey" like the one Netflix has, we need to verify manually that our system can handle these situations. For example, if our system includes communication between two servers, a fair test is unplugging the network cable from one box to simulate a network failure. Once we have verified that our system continues to work acceptably, we can add scripts to do this automatically and, eventually, randomly.

Audit logs are valuable tools for verifying that our system is handling these random events. If we can read a log entry showing a forced network shutdown and then find the log entries with nearby timestamps, we can easily evaluate whether or not the system handled the situation.
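As a rough sketch of that kind of log check, the function below pulls out the entries stamped shortly after an injected fault. It assumes the log has already been parsed into (timestamp, message) pairs; the function name entries_near, the five-second window, and the sample messages are all made up for illustration.

```python
from datetime import datetime, timedelta


def entries_near(entries, fault_time, window=timedelta(seconds=5)):
    """Return log entries stamped within `window` after the injected fault."""
    return [
        (stamp, message)
        for stamp, message in entries
        if fault_time <= stamp <= fault_time + window
    ]


# Hypothetical, already-parsed audit log entries.
LOG = [
    (datetime(2011, 1, 1, 12, 0, 0), "chaos: forced network shutdown"),
    (datetime(2011, 1, 1, 12, 0, 2), "connection lost, switching to backup"),
    (datetime(2011, 1, 1, 12, 0, 3), "backup connection established"),
    (datetime(2011, 1, 1, 12, 5, 0), "nightly report started"),
]

if __name__ == "__main__":
    fault_time = LOG[0][0]
    for stamp, message in entries_near(LOG, fault_time):
        print(stamp.isoformat(), message)
```

Reading the entries that follow the forced shutdown shows whether the system reacted (here, by switching to a backup) or carried on as if nothing had happened.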

After building that in, we can work on the next error to randomly introduce into the system. By following this cycle, we can build up the robustness of our system.

There's more...

This doesn't exactly fit into the realm of automated testing. This is also very high level. It's hard to go into much more detail because the type of faulty data to inject requires an intimate understanding of the actual system.

How does this compare to fuzz testing?

Fuzz testing is a style of testing where invalid, unexpected, and random data is injected into the input points of our software (http://en.wikipedia.org/wiki/Fuzz_testing). If the application crashes, the test has failed. If it doesn't, the test has passed. Fuzz testing goes in a similar direction, but the blog article written by Netflix appears to go much further than simply injecting different data. They speak about killing instances and interrupting distributed communications. Basically, we should try to replicate in a test bed anything we can think of that could happen in production.
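To make the idea of fuzz testing concrete, here is a small hand-rolled sketch rather than a real fuzzing tool. It feeds random ASCII strings to an input point and records any exception other than the one the function is documented to raise. Both parse_age and fuzz are hypothetical names invented for this example.

```python
import random


def parse_age(text):
    """Hypothetical input point: parse an age, rejecting bad input with ValueError."""
    value = int(text)
    if not 0 <= value <= 150:
        raise ValueError("age out of range")
    return value


def fuzz(target, runs=1000, seed=0):
    """Feed random strings to target; return inputs that caused unexpected errors."""
    rng = random.Random(seed)  # seeded so a failing run can be reproduced
    crashes = []
    for _ in range(runs):
        length = rng.randint(0, 20)
        data = "".join(chr(rng.randint(0, 0x7F)) for _ in range(length))
        try:
            target(data)
        except ValueError:
            pass  # the documented way of rejecting bad input
        except Exception as exc:
            crashes.append((data, exc))
    return crashes


if __name__ == "__main__":
    crashes = fuzz(parse_age)
    print("unexpected failures:", len(crashes))
```

Any entry in the returned list is an input that made the function fail in a way its callers were not told to expect, which is exactly the kind of defect fuzz testing is meant to expose.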

Fusil (https://bitbucket.org/haypo/fusil) is a Python tool that aims to provide fuzz testing. You may want to investigate if it is useful for your project needs.

Are there any tools to help with this?

Jester (for Java), Pester (for Python), and Nester (for C#) are used to conduct mutation testing (http://jester.sourceforge.net/). These tools find out what code is not covered by test cases by altering the source code and re-running the test suites. Finally, they give a report on what was changed, what passed, and what didn't pass. They can illuminate what is and is not covered by our test suites in ways coverage tools can't.
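The core idea of mutation testing can be illustrated with Python's ast module. This is a hand-rolled sketch and not how Jester or Pester are implemented: it applies a single mutation (every + becomes -) to a tiny function and checks whether a stand-in test suite notices. All names here (SwapAddForSub, load_total, suite_passes, mutant_survives) are hypothetical.

```python
import ast


class SwapAddForSub(ast.NodeTransformer):
    """Mutation operator: replace every `+` with `-`."""

    def visit_BinOp(self, node):
        self.generic_visit(node)  # mutate nested expressions first
        if isinstance(node.op, ast.Add):
            node.op = ast.Sub()
        return node


SOURCE = "def total(a, b):\n    return a + b\n"


def load_total(tree):
    """Compile a module AST and return its `total` function."""
    namespace = {}
    exec(compile(tree, "<mutant>", "exec"), namespace)
    return namespace["total"]


def suite_passes(total):
    """Stand-in test suite; returns True if every check passes."""
    return total(2, 3) == 5


def mutant_survives(source):
    """Return True if the test suite fails to notice the mutation."""
    tree = SwapAddForSub().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return suite_passes(load_total(tree))


if __name__ == "__main__":
    print("mutant survived:", mutant_survives(SOURCE))
```

If the only check in the suite were total(0, 0) == 0, the mutated function would still pass, and the surviving mutant would point to a gap that a line-coverage report would not reveal.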

Mutation testing isn't a complete "Chaos Monkey", but it provides one area of assistance in trying to "break the system" and forces us to improve our test regimen. A full-blown Chaos Monkey probably wouldn't fit inside a test project, because it requires writing custom scripts based on the environment it's meant to run in.
