A Strategy for Race Condition Reproduction

Before we get into the specific techniques for reproducing race conditions, let’s establish an overall strategy to guide how we approach the problem. Reproducing race conditions requires being able to carefully orchestrate the sequence of events in the critical section, preferably in a way that will not change when we fix the race condition. Understanding the exact nature of the race condition lets us understand the boundaries of the critical section and therefore the scope of our possible control of the execution. We are looking for the seams within the critical section.

In Listing 13-2, our race condition is bounded by testing for the null-ness of resource at line 2 and the assignment to resource on line 3. There isn’t much happening between those two lines, is there? Not really, but understanding the sequence of evaluation gives us more insight and opportunity.

Looking more closely at line 3, we see that a constructor is invoked and that the order of evaluation guarantees that the constructor invocation completes before the variable assignment. This is a seam. Depending on what happens within that constructor, we may have any number of seams to exploit. More complex situations are often rich with seams. Additionally, the underlying threading and execution implementation may offer seams that are not explicitly indicated by a surface inspection of the code.

What do we do with these seams when we find them? We use them to explicitly trigger or simulate the thread suspension that may occur naturally during execution. We will use a collection of techniques to essentially stop a thread until some other execution sequence has completed so that we may then resume the thread to duplicate the race condition behavior.
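As a minimal sketch of this idea, the following example parks one thread inside a construction seam until a second caller has run through the same check. It is not the code from Listing 13-2: the LazyHolder class, the Supplier-based factory seam, and every other name here are hypothetical stand-ins chosen for illustration. The pause is bounded by a timeout on the assumption that a likely fix (synchronizing the accessor) would otherwise leave the two callers waiting on each other.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class LazyInitRaceSketch {

    /** Hypothetical lazily initialized holder with an unsynchronized check-then-assign. */
    static class LazyHolder {
        private final Supplier<Object> factory; // the seam: construction is injectable
        private Object resource;

        LazyHolder(Supplier<Object> factory) {
            this.factory = factory;
        }

        Object getResource() {
            if (resource == null) {        // check
                resource = factory.get();  // construct first, then assign
            }
            return resource;
        }
    }

    /** Forces two callers through the critical section and counts the constructions. */
    static int countConstructions() throws InterruptedException {
        AtomicInteger constructions = new AtomicInteger();
        CountDownLatch firstInsideFactory = new CountDownLatch(1);
        CountDownLatch secondFinished = new CountDownLatch(1);

        LazyHolder holder = new LazyHolder(() -> {
            if (constructions.incrementAndGet() == 1) {
                // First caller: announce arrival, then pause between check and assignment.
                firstInsideFactory.countDown();
                try {
                    // Bounded wait so the test still terminates if a fix serializes callers.
                    secondFinished.await(500, TimeUnit.MILLISECONDS);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
            return new Object();
        });

        Thread first = new Thread(holder::getResource);
        first.start();
        firstInsideFactory.await(); // the first thread is now suspended inside the seam
        holder.getResource();       // the second caller still sees null and constructs again
        secondFinished.countDown(); // resume the first thread
        first.join();
        return constructions.get(); // 2 while the race exists
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("constructions = " + countConstructions());
    }
}
```

A test built on this sketch would assert that exactly one construction occurs. It fails against the unsynchronized holder and, under the assumption above, passes once the accessor is properly synchronized, at the cost of waiting out the bounded pause.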

Let’s generalize the approach first. I will outline an analytical approach to diagnosing race conditions. It takes some time to build the experience and the right perspective to be able to pick them out very easily, and you will need familiarity with the synchronization APIs in your environment, but hopefully this will get you on the right path.

1. Identify your race condition.

• Determine that you may have a race condition.

Sometimes you will suspect a race because of data corruption. Sometimes you will see unexpected failures that only occur sporadically, such as the exceptions generated in the logging example later in this chapter. Other times, things will seem to happen in the wrong order. Once in a while, some portion of the application will freeze. Any or all of these could show up. The key is that the erroneous behavior will happen sporadically, often correlated more strongly with particular types of load, or in a way that is very environment specific, such as only on a slower or faster CPU, browser, or network connection. If you can statistically characterize the occurrence of the problem, keep that data as a baseline.

• Make sure the code works correctly in the single-threaded case, if appropriate.

You will feel really frustrated by spending a lot of time chasing a race condition if it is simply a bug in the algorithm. Occasionally, you will run into code in which the synchronization is so integral to the behavior that correctness verification is difficult. Refactor so that you can verify single-threaded correctness before you go further.

• Create a hypothesis for the nature of the interaction.

Race conditions are sequencing issues. When the problem involves data corruption, the sequence of data modification needs fixing. When it involves strange behavior, the order of operations needs fixing. Often the hypothesis requires a fairly deep understanding of the fundamentals of the system, such as how multitasking and timeslicing work, how locks apply in your runtime environment, or how asynchronous callbacks are processed. Formulate a plausible concept of the problem.

• Identify the critical section(s) responsible for the hypothesis.

This step identifies the code, not just the concept. For data issues, look for places where the code reads and writes the data in separate synchronization scopes. Be aware of whether data structures are thread-safe and what thread-safe guarantees they make. For both data and behavioral issues, look for asynchronous events that have unenforced dependencies on order. For freezes, look for overzealous synchronization, particularly in areas that manage multiple locks and unlock them in an order different from the reverse of the order in which they are locked. Also, look for circular dependencies in synchronized areas.
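To make "reads and writes the data in separate synchronization scopes" concrete, here is a small sketch of the pattern to look for. The Counter class is hypothetical and deliberately tiny. Each accessor is synchronized on its own, but the increment built from them is not atomic. The replay method walks through, on a single thread, the interleaving that two threads could produce, so the lost update is visible without any real concurrency.

```java
public class SplitScopeSketch {

    /** Hypothetical counter whose accessors are individually synchronized. */
    static class Counter {
        private int value;

        synchronized int get() {
            return value;
        }

        synchronized void set(int newValue) {
            value = newValue;
        }

        // Suspect: the read and the write happen in two separate synchronization scopes.
        void incrementUnsafely() {
            set(get() + 1);
        }

        // The whole read-modify-write happens inside one synchronization scope.
        synchronized void incrementSafely() {
            value++;
        }
    }

    /** Replays by hand the interleaving of two threads, A and B, in incrementUnsafely(). */
    static int replayLostUpdate() {
        Counter counter = new Counter();
        int readByA = counter.get();  // A reads 0
        int readByB = counter.get();  // B reads 0 before A has written
        counter.set(readByA + 1);     // A writes 1
        counter.set(readByB + 1);     // B also writes 1, so A's update is lost
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println("after two increments: " + replayLostUpdate());
    }
}
```

Two increments leave the counter at 1 rather than 2. The gap between the get and the set in incrementUnsafely is the critical section for this hypothesis.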

2. Reproduce the race condition.

• Validate the hypothesis.

Once in a while, you can jump directly to code-based tests and skip this step. Most of the time, this step involves the debugger. Pause and run the thread or threads, potentially modifying the data involved. Once you have the recipe for the race, you can proceed. If you cannot duplicate the race in the debugger, either find another critical section or formulate a new hypothesis.

• Identify the seams in the critical section.

You can apply variations on most of the techniques from the prior chapters of this book to force race conditions. The details of those applications fill the rest of this chapter. You can use almost any seam that allows you to inject a test double. Additionally, you should consider any place that allows you to interact with the locks on the critical section either directly or indirectly. A good example of a nonobvious but available lock in Java is the monitor that is available on all objects through the wait/notify APIs.

• Choose the right seam.

From the seams you have identified, eliminate any that you think will disappear when you fix the race condition. You probably have an idea of the fix from your time validating the hypothesis. If the fix will alter a seam, then use of that seam will not result in a test that can validate the fix.

• Exploit the seam to inject synchronization.

When you validated the hypothesis, you stepped through the code in the debugger to reproduce the race. Now that you have chosen a seam to exploit, use it to programmatically reproduce the sequence of control you developed in the debugger. Write the test to perform that sequence and to assert that the erroneous outcome does not happen. For deadlocks, a timeout value on the test handles failure nicely. For example, the JUnit 4 @Test annotation takes an optional timeout parameter.
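The sketch below illustrates the deadlock case using only the JDK, with a timed join standing in for a test framework's timeout. Everything in it is hypothetical: two ReentrantLock instances are taken in opposite orders by two threads, and a CountDownLatch forces both threads to hold their first lock before either asks for its second. Interruptible locking is an assumption made so the sketch can clean up after itself; code that deadlocks on intrinsic locks cannot be unwound this way.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

public class DeadlockTimeoutSketch {

    /** Returns true if the two threads are still stuck after the timeout. */
    static boolean deadlocks(long timeoutMillis) throws InterruptedException {
        ReentrantLock a = new ReentrantLock();
        ReentrantLock b = new ReentrantLock();
        CountDownLatch bothHoldFirst = new CountDownLatch(2);

        // Opposite acquisition orders: the hypothesized cause of the freeze.
        Thread t1 = new Thread(() -> lockInOrder(a, b, bothHoldFirst));
        Thread t2 = new Thread(() -> lockInOrder(b, a, bothHoldFirst));
        t1.start();
        t2.start();

        // The timed joins play the role of the test timeout.
        t1.join(timeoutMillis);
        t2.join(timeoutMillis);
        boolean stuck = t1.isAlive() || t2.isAlive();

        // Break the deadlock so that no threads outlive the check.
        t1.interrupt();
        t2.interrupt();
        t1.join();
        t2.join();
        return stuck;
    }

    static void lockInOrder(ReentrantLock first, ReentrantLock second,
                            CountDownLatch bothHoldFirst) {
        try {
            first.lockInterruptibly();
            try {
                // Forced sequencing: wait until each thread holds its first lock.
                bothHoldFirst.countDown();
                bothHoldFirst.await();
                second.lockInterruptibly();
                second.unlock();
            } finally {
                first.unlock();
            }
        } catch (InterruptedException e) {
            // Interrupted by the checker to unwind the deadlock; nothing more to do.
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("deadlocked = " + deadlocks(200));
    }
}
```

In a JUnit test, the timed joins and the interrupt-based cleanup would give way to the framework's timeout, and the test body would simply run the forced sequence. Note that the latch used here assumes both threads can hold their first lock at the same time. A fix that makes both threads acquire the locks in the same order removes that possibility, so this particular seam would need a bounded wait or a different injection point to validate such a fix.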

3. Fix the race condition.

Now that you have reproduced the race in your test, the test should pass once you fix the race.

4. Monitor the results.

In the first step, I suggested you should characterize the occurrence of the race condition. Unfortunately, you may have found, duplicated, and fixed a race condition different from the one you were trying to find. Ultimately, the proof that you have fixed it is that it occurs less or not at all, but the sporadic nature of race conditions makes that only verifiable through ongoing monitoring. Notice also that I wrote “occurs less.” It is possible that there were multiple causes for similar symptoms, and you only found one.
