Testing Resiliency

Let’s start this section with a bold statement: we are not fans of testing process crashes. We’ll get philosophical. Why does a process crash? Most of the time, because of an unexpected error. In some cases, the best thing to do might be to raise an exception or exit from a process for an error condition that we know can happen (an expected error). However, we find that this tends to be the exception rather than the rule: if an error is expected, you likely want to handle it gracefully (think of TCP connections dropping or user input errors). In the cases where you’re raising or exiting on purpose, it might make sense to test that behavior.

Regardless of that, one of the most powerful features of the OTP architecture is that if a process runs into an unexpected error and crashes, there will likely be a supervisor to bring it back up. That’s not behavior we want to test: supervisors work, and they have been tested both automatically and in the field for decades now. At the same time, we don’t really want to test that processes can crash. But if they crash because of an unexpected error, how do we trigger that error in a test? If you can trigger an error in a test, we believe that error is by definition not unexpected. There’s your philosophy right there.

So, we don’t want to test that processes are restarted if they crash, and we don’t want to test that processes can crash because of unexpected errors. So what do we test? Well, one interesting thing to test about processes crashing is the aftermath of a crash: the crash recovery, so to speak. Most of OTP is designed so that resources are automatically cleaned up when a process crashes, through mechanisms like links between processes and ports. For example, if a GenServer starts a TCP socket using :gen_tcp, that socket will be linked to the GenServer. If the GenServer were to crash, the socket would be closed thanks to that link.
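To make this concrete, here’s a minimal sketch of that pattern (the TCPClient module and its :host and :port options are made up for illustration): the GenServer opens the socket in init/1, and since the socket is owned by the process that opened it, the runtime closes it automatically if the process crashes.

defmodule TCPClient do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts) do
    host = opts |> Keyword.fetch!(:host) |> to_charlist()
    port = Keyword.fetch!(opts, :port)

    # The socket is owned by this process, so if this process crashes,
    # the socket is closed without any explicit cleanup code.
    {:ok, socket} = :gen_tcp.connect(host, port, [:binary, active: false])
    {:ok, socket}
  end
end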

However, there are cases where resources that your processes start or open aren’t automatically stopped or closed when their “parent” process dies. A straightforward example is a GenServer process that creates and opens a file when starting and uses it to dump some internal state during its life cycle. You’d likely want this file to be deleted if the GenServer were to crash unexpectedly. This is something that we believe is worth testing, so we’ll explore it a bit later in this section.

When it comes to supervision trees, we believe the thing that might be worth testing is that your supervision tree is laid out correctly. However, we’ll be frank here: that’s hard and messy to test. We’ll explore some solutions and alternatives toward the end of this section, but we want to set the expectations pretty low.

Testing Cleanup After Crashes

If you want to perform some cleanup after a process crashes, to ensure that the crash doesn’t leave anything behind, your best option is often a separate and very straightforward process that monitors the first process and whose only job is to perform the necessary cleanup when the first process crashes. Let’s go back to the example we mentioned earlier: a GenServer that dumps some terms to a file during its life cycle to avoid keeping that state in memory:

defmodule GenServerThatUsesFile do
  use GenServer

  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  def store(pid, term) do
    GenServer.cast(pid, {:store, term})
  end

  @impl true
  def init(opts) do
    path = Keyword.fetch!(opts, :path)

    File.touch!(path)

    pid = self()
    ref = make_ref()
    spawn_link(fn -> monitor_for_cleanup(pid, ref, path) end)

    # Wait for the cleanup process to be ready, so that if this process
    # crashes before the cleanup process is trapping exits then we don't
    # leave a zombie process.
    receive do
      {^ref, :ready} -> :ok
    end

    {:ok, path}
  end

  @impl true
  def handle_cast({:store, term}, path) do
    new_content = " " <> :erlang.term_to_binary(term)
    File.write!(path, new_content, [:binary, :append])
    {:noreply, path}
  end

  defp monitor_for_cleanup(pid, ref, path) do
    Process.flag(:trap_exit, true)
    send(pid, {ref, :ready})

    receive do
      {:EXIT, ^pid, _reason} ->
        File.rm_rf!(path)
    end
  end
end

This GenServer doesn’t do anything useful, but you can imagine how it could have an API to retrieve particular terms that it adds to the file, for example. Let’s keep this possible API in our imagination for the sake of writing less code.

There’s no reliable way to make sure that if this GenServer crashes it’ll clean up the file. So, what we do is write a little “cleanup process.” This process could also crash, yes, but it’s less likely to do so given how simple its code is. We spawn this process directly from the GenServer’s init/1 callback. The code isn’t the most straightforward, but it’s just taking care of possible race conditions and ensuring the following:

  • The GenServer process dies if—because of some freak accident—the cleanup process dies, and

  • The cleanup process only removes the file when the GenServer process dies and then peacefully terminates.

Now that we have this process in place, testing the aftermath of it crashing is straightforward. We can just kill the GenServer process and make sure that the file isn’t there anymore:

 test ​"​​no file is left behind if the GenServer process crashes"​ ​do
  path =
  Path.join(
  System.tmp_dir!(),
  Integer.to_string(System.unique_integer([​:positive​]))
  )
 
  pid = start_supervised!({GenServerThatUsesFile, ​path:​ path})
  assert File.exists?(path)
 
  Process.​exit​(pid, ​:kill​)
 
  wait_for_passing(_2_seconds = 2000, ​fn​ ->
  refute File.exists?(path)
 end​)
 end

The test generates a random path in the system’s temporary directory (using System.tmp_dir!/0), then starts the GenServer and asserts that the file is there, kills the GenServer brutally (with Process.exit/2 and reason :kill), and finally asserts that the file isn’t there anymore. You’ll notice the use of a function called wait_for_passing/2. This is a little function we find ourselves writing pretty often when working on Elixir code. Its purpose is to avoid fancy message-passing in order to know when we can run an assertion.

wait_for_passing/2’s job is to run an assertion repeatedly for a maximum given interval of time (two seconds in this case), allowing it to fail during that interval. After the interval, the assertion is run one last time without rescuing any exceptions, so if it still fails, the test fails. We need wait_for_passing/2 in this test because if we asserted the non-existence of the file right after killing the GenServer, we’d have a race condition: the file might not have been deleted yet when the assertion runs. By retrying the assertion for a couple of seconds, we’re giving the cleanup process what’s likely more than enough time to delete the file. If the file is still there after two seconds, we probably have a problem. Note that we could bump the interval to ten or twenty seconds or even more if we didn’t feel comfortable: wait_for_passing/2 returns as soon as the assertion passes, so our tests remain fast unless the assertion fails, which shouldn’t be the norm since we’d hopefully fix the bug and make it pass again.

Let’s look at the code for this little helper function:

defp wait_for_passing(timeout, fun) when timeout > 0 do
  fun.()
rescue
  _ ->
    Process.sleep(100)
    wait_for_passing(timeout - 100, fun)
end

defp wait_for_passing(_timeout, fun), do: fun.()

The implementation is straightforward and uses recursion. It decreases the given timeout until it “runs out.” In this implementation, we’re hard-coding Process.sleep(100) and timeout - 100, which means that the assertion is run every 100 milliseconds during the given interval, but we could change this value or turn it into an argument to the wait_for_passing function to make it customizable. wait_for_passing/2 isn’t the most efficient function since it repeatedly runs the assertion and “wastes” a few hundred milliseconds between runs, but it’s a good and effective tool that in the real world we’ve ended up using in more than a few places.
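As an example, the customizable variant mentioned above could look like the following sketch, where a three-argument wait_for_passing/3 (our own addition) takes the retry interval and the two-argument version, replacing the one above, delegates to it:

defp wait_for_passing(timeout, interval, fun) when timeout > 0 do
  fun.()
rescue
  _ ->
    Process.sleep(interval)
    wait_for_passing(timeout - interval, interval, fun)
end

defp wait_for_passing(_timeout, _interval, fun), do: fun.()

# The two-argument version keeps the original 100-millisecond default.
defp wait_for_passing(timeout, fun), do: wait_for_passing(timeout, 100, fun)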

This section has turned out to be more about the code to test than the tests themselves, but we believe it serves the purpose of showing what it means to test cleaning up after crashes. In this instance, the test was small and simple enough. If things become more complicated, you’ll have the tools we learned about in the previous chapters to help you architect your tests using things like dependency doubles and message passing.

Let’s move on to the final enemy of easy testing in the OTP landscape: supervisors.

Testing Supervision Trees

Supervisors are one of the strongest selling points of Erlang/Elixir and the OTP set of abstractions. They allow you to structure the life cycle of the processes inside your application in a resilient way, and they make isolating failures a breeze. However, supervisors are one of the toughest things to test that we’ve come across. The reason for this is that their main job is to allow your application to recover from complex and cascading failures that are hard to trigger on purpose during testing.

Imagine having a complex and “deep” supervision tree (with several levels between the root of the tree and the leaf processes). Now imagine that a child in a corner of the tree starts crashing and doesn’t recover just by being restarted on its own. OTP works beautifully and propagates the failure to the parent supervisor of that child, which starts crashing and restarting all of its children. If that doesn’t solve the problem, then the failure is propagated up and up until restarting enough of your application fixes the problem (or the whole thing crashes, if it’s a really serious problem). Well, how do you test this behavior? Do you even want to extensively test it?

It’s hard to inject a failure during testing that isn’t solved by a simple restart but that also doesn’t bring down the whole application. At the same time, we know that supervisors work: that is, we know that the failure isolation and “bubble up” restarting behavior work. We know that because supervisors have been battle-tested for decades at this point. As already discussed, that’s not what we want to test because it falls in the antipattern of “testing the abstraction” instead of testing your own logic.

Our advice? In practice, we tend to just not have automated testing for supervision trees. We test the individual components of our applications, and sometimes those components are made of a set of processes organized in a “subtree” of the main application supervision tree. In order to stay true to the black box testing model, we often test the component as a whole regardless of whether it’s made of multiple processes living inside a supervision tree. If components are unit-tested and integration tested, then we happily rely on OTP to make sure that supervisors behave in the right way.
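As a sketch of what that looks like in practice, assuming a hypothetical RateLimiter component with its own supervisor and a check/1 function, the test starts the component’s whole subtree but only exercises its public API:

defmodule RateLimiterTest do
  use ExUnit.Case

  setup do
    # Start the component's top-level supervisor; whatever processes it
    # starts underneath are an implementation detail the test ignores.
    start_supervised!(RateLimiter.Supervisor)
    :ok
  end

  test "lets requests through when under the limit" do
    assert :ok = RateLimiter.check("user_123")
  end
end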

However, we don’t fly completely blind. Most of the time, we spend some time every now and then manually firing up our application and just doing some manual testing. We kill processes in the running application and make sure that the supervision tree isolates failures and recovers things as expected. Let’s talk about this approach some more.

Exploratory Manual Testing

As it turns out, we find that in practice there is a way of testing supervision trees that strikes a nice balance between increasing your confidence in the resiliency of your application without having to write complex, convoluted, and fragile test suites. We’re talking about exploratory manual testing. We’re using this terminology exclusively to sound fancy, because what we really mean is this: fire up observer, right-click on random processes in your application, kill them without mercy, and see what happens.

As crude as it sounds, this method is pretty efficient and practical. What we do is start our application and simulate a situation in which it’s operating under normal conditions. We could generate a constant flow of web requests if we’re building an application with an HTTP interface, for example. Then, we kill random processes in the supervision tree and observe what happens. Of course, the expected outcome of a process being killed depends on the process: if we kill the top-level supervisor, we’ll surely see our application stop replying to HTTP requests. However, what happens if we, say, kill a process handling a single web request? We should see one dropped request but nothing else. This isn’t hard to observe. What happens if we kill the whole pool of database connections? We’ll probably start to see a bunch of requests return some 4xx or 5xx HTTP status codes. In most cases you should know what happens when a part of the supervision tree crashes, because that exact knowledge should drive how to design the shape of the supervision tree in the first place. If you want your database connections to fail in isolation while your application keeps serving HTTP requests, for example, then you need to isolate the supervision tree of the database connections from the supervision tree of the HTTP request handlers, and maybe make them siblings in their parent supervision tree.
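If you prefer a console over pointing and clicking, the same kind of fault injection can be done from an IEx session attached to the running application (the MyApp.Worker and MyApp.Supervisor names below are hypothetical):

# Kill a single named process and watch its supervisor bring it back up.
MyApp.Worker |> Process.whereis() |> Process.exit(:kill)

# Kill all children of a supervisor at once to simulate a larger failure.
for {_id, pid, _type, _modules} <- Supervisor.which_children(MyApp.Supervisor),
    is_pid(pid),
    do: Process.exit(pid, :kill)

Supervisor.which_children/1 is handy here because it lets you wipe out a whole branch of the tree in one go and observe how the rest of the application copes.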

Let’s talk about the downsides of this testing technique. The main one is that this type of testing isn’t automated. You need to have a human do all the steps manually: start the application, generate some work for it to perform, run observer, kill processes, and observe the behavior. We’re all about automated testing, as you can imagine, so we aren’t big fans of this. However, in our experience, the supervision trees of many applications tend to be changed relatively rarely compared to the rest of the code. As such, this kind of manual testing might be required only a few times during the life cycle of the application. When comparing the time it takes to run such manual tests to the time it would take to build automated testing for the failure behavior, it might make practical sense to just run those manual tests a few times.

Another observation that we have from experience is that many Elixir and Erlang applications have a relatively “flat” supervision tree, with only a handful of children. The exploratory manual testing technique works significantly better with smaller and simpler supervision trees since, well, there are fewer failure cases to manually test in the first place. Testing a flat supervision tree tends to be easier than testing a nested tree.

Property-Based Testing for Supervision Trees

Property-based testing is a technique that we’ll discuss more extensively in Chapter 7, Property-Based Testing. In short, it revolves around generating random input data for your code and making sure your code maintains a given set of properties regardless of the (valid) input data. This might sound alien, but you’ll have time to dig in and understand these ideas in the chapter. However, we just wanted to get slightly ahead of ourselves and mention a library called sups by Fred Hebert.[27] This library is experimental but can be used with a couple of different Erlang property-based testing frameworks to programmatically run your application, inject random failures in the supervision tree, and monitor some properties of the application that you define.

The library’s README explains this in the best way:

In a scenario where the supervision structure encodes the expected failure semantics of a program, this library can be used to do fault-injection and failure simulation to see that the failures taking place within the program actually respect the model exposed by the supervision tree.

We have never used this library in real-world projects ourselves because we’ve never felt the need to have complex automated testing for our supervision trees (for the reasons mentioned in the previous sections). However, we know Fred and trust his work, so we are referencing this library here for the curious reader who wants or needs a bulletproof test suite for their supervision trees. We want to stress that the library is experimental, so use it at your own risk.

We feel that the lack of more widely used or prominent tooling around testing supervisors is a good sign that the Erlang and Elixir communities aren’t so keen on having robust automated testing of supervision trees either. Maybe one day the communities will figure out a good way of having simple and effective testing for supervision trees. But as far as we can tell, it’s not that day yet.
