Blocked Threads

Managed runtime languages such as C#, Java, and Ruby almost never really crash. Sure, they get application errors, but it’s relatively rare to see the kind of core dump that a C or C++ program would have. I still remember when a rogue pointer in C could reduce the whole machine to a navel-gazing heap. (Anyone else remember Amiga’s “Guru Meditation” errors?) Here’s the catch about interpreted languages, though. The interpreter can be running, and the application can still be totally deadlocked, doing nothing useful.

As often happens, adding complexity to solve one problem creates the risk of entirely new failure modes. Multithreading makes application servers scalable enough to handle the web’s largest sites, but it also introduces the possibility of concurrency errors. The most common failure mode for applications built in these languages is navel-gazing—a happily running interpreter with every single thread sitting around waiting for Godot. Multithreading is complex enough that entire books are written about it. (For the Java programmers: the only book on Java you actually need, however, is Brian Goetz’s excellent Java Concurrency in Practice [Goe06].) Moving away from the “fork, run, and die” execution model brings you vastly higher capacity but only by introducing a new risk to stability.

The majority of system failures I have dealt with do not involve outright crashes. The process runs and runs but does nothing because every thread available for processing transactions is blocked waiting on some impossible outcome.

I’ve probably tried a hundred times to explain the distinction between saying “the system crashed” and “the system is hung.” I finally gave up when I realized that it’s a distinction only an engineer bothers with. It’s like a physicist trying to explain where the photon goes in the two-slit experiment from quantum mechanics. Only one observable variable really matters—whether the system is able to process transactions or not. The business sponsor would frame this question as, “Is it generating revenue?”

From the users’ perspective, a system they can’t use might as well be a smoking crater in the earth. The simple fact that the server process is running doesn’t help the user get work done, books bought, flights found, and so on.

That’s why I advocate supplementing internal monitors (such as log file scraping, process monitoring, and port monitoring) with external monitoring. A mock client somewhere (not in the same data center) can run synthetic transactions on a regular basis. That client experiences the same view of the system that real users experience. If that client cannot process the synthetic transactions, then there is a problem, whether or not the server process is running.

Metrics can reveal problems quickly too. Counters like “successful logins” or “failed credit cards” will show problems long before an alert goes off.
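
For illustration, here is a minimal sketch of such an external probe, assuming Java 11’s HttpClient; the URL, the one-minute interval, and the counter names are placeholders, not part of any particular system.

 import java.net.URI;
 import java.net.http.HttpClient;
 import java.net.http.HttpRequest;
 import java.net.http.HttpResponse;
 import java.time.Duration;
 import java.util.concurrent.Executors;
 import java.util.concurrent.TimeUnit;
 import java.util.concurrent.atomic.AtomicLong;

 public class SyntheticProbe {
   private static final AtomicLong ok = new AtomicLong();
   private static final AtomicLong failed = new AtomicLong();

   public static void main(String[] args) {
     HttpClient client = HttpClient.newBuilder()
         .connectTimeout(Duration.ofSeconds(5))   // the probe itself must never hang
         .build();
     HttpRequest search = HttpRequest.newBuilder()
         .uri(URI.create("https://www.example.com/search?q=widget"))  // hypothetical transaction
         .timeout(Duration.ofSeconds(10))
         .build();
     Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
       try {
         HttpResponse<String> resp = client.send(search, HttpResponse.BodyHandlers.ofString());
         (resp.statusCode() == 200 ? ok : failed).incrementAndGet();
       } catch (Exception e) {
         failed.incrementAndGet();                // timeouts and refused connections count as failures
       }
       System.out.printf("synthetic transactions: ok=%d failed=%d%n", ok.get(), failed.get());
     }, 0, 60, TimeUnit.SECONDS);
   }
 }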

Blocked threads can happen anytime you check resources out of a connection pool, deal with caches or object registries, or make calls to external systems. If the code is structured properly, a thread will occasionally block whenever two (or more) threads try to access the same critical section at the same time. This is normal. If the code was written by someone sufficiently skilled in multithreaded programming, then you can guarantee that the threads will always eventually unblock and continue. If this describes you, then you are in a highly skilled minority.

The problem has four parts:

  • Error conditions and exceptions create too many permutations to test exhaustively.

  • Unexpected interactions can introduce problems in previously safe code.

  • Timing is crucial. The probability that the app will hang goes up with the number of concurrent requests.

  • Developers never hit their application with 10,000 concurrent requests.

Taken together, these conditions mean that it’s very, very hard to find hangs during development. You can’t rely on “testing them out of the system.” The best way to improve your chances is to carefully craft your code. Use a small set of primitives in known patterns. It’s best if you download a well-crafted, proven library.
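
As a small illustration of “known primitives in known patterns,” the sketch below uses java.util.concurrent’s Semaphore to bound how many threads can be inside a slow operation at once, with a timed acquire so no caller waits indefinitely. The class name, the limits, and the slowLookup method are invented for the example.

 import java.util.concurrent.Semaphore;
 import java.util.concurrent.TimeUnit;

 class BoundedLookups {
   // At most 25 callers in the slow section at once, and no caller waits
   // more than two seconds for a permit. Both numbers are placeholders.
   private final Semaphore permits = new Semaphore(25);

   public String lookup(String sku) throws InterruptedException {
     if (!permits.tryAcquire(2, TimeUnit.SECONDS)) {
       throw new IllegalStateException("lookup capacity exhausted for " + sku);
     }
     try {
       return slowLookup(sku);
     } finally {
       permits.release();
     }
   }

   private String slowLookup(String sku) {
     return "availability for " + sku;  // stand-in for a slow remote call
   }
 }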

Incidentally, this is another reason why I oppose anyone rolling their own connection pool class. It’s always more difficult than you think to make a reliable, safe, high-performance connection pool. If you’ve ever tried writing unit tests to prove safe concurrency, you know how hard it is to achieve confidence in the pool. Once you start trying to expose metrics, as I discuss in Designing for Transparency, rolling your own connection pool goes from a fun Computer Science 101 exercise to a tedious grind.

If you find yourself synchronizing methods on your domain objects, you should probably rethink the design. Find a way that each thread can get its own copy of the object in question. This is important for two reasons. First, if you are synchronizing the methods to ensure data integrity, then your application will break when it runs on more than one server. In-memory coherence doesn’t matter if there’s another server out there changing the data. Second, your application will scale better if request-handling threads never block each other.

One elegant way to avoid synchronization on domain objects is to make your domain objects immutable. Use them for querying and rendering. When the time comes to alter their state, do it by constructing and issuing a “command object.” This style is called “Command Query Responsibility Separation,” and it nicely avoids a large number of concurrency issues.
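
Here is the idea in miniature, with invented names: the domain object is immutable, so request threads can share it freely for queries, while changes travel as command objects to a single handler.

 // Immutable read model: safe to hand to any number of request threads.
 final class CatalogItem {
   private final String sku;
   private final long priceInCents;

   CatalogItem(String sku, long priceInCents) {
     this.sku = sku;
     this.priceInCents = priceInCents;
   }

   String sku()           { return sku; }
   long priceInCents()    { return priceInCents; }

   // "Mutation" returns a new instance instead of synchronizing on this one.
   CatalogItem withPrice(long newPriceInCents) {
     return new CatalogItem(sku, newPriceInCents);
   }
 }

 // The command captures the intent to change state; one handler applies it,
 // so threads that only query never block one another.
 final class ChangePrice {
   final String sku;
   final long newPriceInCents;

   ChangePrice(String sku, long newPriceInCents) {
     this.sku = sku;
     this.newPriceInCents = newPriceInCents;
   }
 }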

Spot the Blocking

Can you find the blocking call in the following code?

 String key = (String)request.getParameter(PARAM_ITEM_SKU);
 Availability avl = globalObjectCache.get(key);

You might suspect that globalObjectCache is a likely place to find some synchronization. You would be correct, but the point is that nothing in the calling code tells you that one of these calls is blocking and the other is not. In fact, the interface that globalObjectCache implemented didn’t say anything about synchronization either.

In Java, it’s possible for a subclass to declare a method synchronized that is unsynchronized in its superclass or interface definition. In C#, a subclass can annotate a method as synchronizing on “this.” Both of these are frowned on, but I’ve observed them in the wild. Object theorists will tell you that these examples violate the Liskov substitution principle. They are correct.
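
To make the Java case concrete, here is a tiny, hypothetical example that compiles without so much as a warning. Nothing in the interface or the superclass hints at locking, yet the subclass quietly turns the method into a critical section.

 interface Registry {
   Object get(String id);
 }

 class FastRegistry implements Registry {
   public Object get(String id) {                 // no locking implied here
     return id;
   }
 }

 class LockingRegistry extends FastRegistry {
   @Override
   public synchronized Object get(String id) {    // callers now contend for 'this'
     return super.get(id);
   }
 }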

In object theory, the Liskov substitution principle (see Family Values: A Behavioral Notion of Subtyping [LW93]) states that any property that is true about objects of a type T should also be true for objects of any subtype of T. In other words, a method without side effects in a base class should also be free of side effects in derived classes. A method that throws the exception E in base classes should throw only exceptions of type E (or subtypes of E) in derived classes.

Java and C# do not let you get away with other violations of the substitution principle, so I do not know why this one is allowed. Functional behavior composes, but concurrency does not compose. As a result, though, when subclasses add synchronization to methods, you cannot transparently replace an instance of the superclass with the synchronized subclass. This might seem like nit-picking, but it can be vitally important. The basic implementation of the GlobalObjectCache interface is a relatively straightforward object registry:

 public synchronized Object get(String id) {
   Object obj = items.get(id);
   if (obj == null) {
     obj = create(id);
     items.put(id, obj);
   }
   return obj;
 }

The “synchronized” keyword there should draw your attention. That’s a Java keyword that makes that method into a critical section. Only one thread may execute inside the method at a time. While one thread is executing this method, any other callers of the method will be blocked. Synchronizing the method here worked because the test cases all returned quickly. So even if there was some contention between threads trying to get into this method, they should all be served fairly quickly. But like the end of Back to the Future, the problem wasn’t with this class but its descendants.

Part of the system needed to check the in-store availability of items by making expensive inventory availability queries to a remote system. These external calls took a few seconds to execute. The results were known to be valid for at least fifteen minutes because of the way the inventory system worked. Since nearly 25 percent of the inventory lookups were on the week’s “hot items” and there could be as many as 4,000 (worst case) concurrent requests against the undersized, overworked inventory system, the developer decided to cache the resulting Availability object.

The developer decided that the right metaphor was a read-through cache. On a hit, it would return the cached object. On a miss, it would do the query, cache the result, and then return it. Following good object-oriented design principles, the developer decided to create an extension of GlobalObjectCache, overriding the get method to make the remote call. It was a textbook design. The new RemoteAvailabilityCache was a caching proxy, as described in Pattern Languages of Program Design 2 [VCK96]. It even had a timestamp on the cached entries so they could be expired when the data became too stale. This was an elegant design, but it wasn’t enough.
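
The original class isn’t reproduced here, but a self-contained sketch of the same shape (names invented, the fifteen-minute expiry omitted for brevity) makes the trap visible: the read-through get() stays synchronized while its miss path makes the slow remote call.

 import java.util.HashMap;
 import java.util.Map;

 class RemoteAvailabilityCacheSketch {
   private final Map<String, Object> items = new HashMap<>();

   public synchronized Object get(String id) {
     Object obj = items.get(id);
     if (obj == null) {
       obj = queryRemoteInventory(id);   // a few seconds normally; forever
       items.put(id, obj);               // when the inventory system is down
     }
     return obj;
   }

   private Object queryRemoteInventory(String id) {
     // Stand-in for the expensive remote availability query.
     return "availability of " + id;
   }
 }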

The problem with this design had nothing to do with the functional behavior. Functionally, RemoteAvailabilityCache was a nice piece of work. In times of stress, however, it had a nasty failure mode. The inventory system was undersized (see Unbalanced Capacities), so when the front end got busy, the back end would be flooded with requests. Eventually it crashed. At that point, any thread calling RemoteAvailabilityCache.get would block, because one single thread was inside the create call, waiting for a response that would never come. There they sit, Estragon and Vladimir, waiting endlessly for Godot.

This example shows how these antipatterns interact perniciously to accelerate the growth of cracks. The conditions for failure were created by the blocking threads and the unbalanced capacities. The lack of timeouts in the integration points caused the failure in one layer to become a cascading failure. Ultimately, this combination of forces brought down the entire site.

Obviously, the business sponsors would laugh if you asked them, “Should the site crash if it can’t check availability for in-store pickup?” If you asked the architects or developers, “Will the site crash if it can’t check availability?” they would assert that it would not. Even the developer of RemoteAvailabilityCache would not expect the site to hang if the inventory system stopped responding. No one designed this failure mode into the combined system, but no one designed it out either.

Libraries

Libraries are notorious sources of blocking threads, whether they are open-source packages or vendor code. Many libraries that work as service clients do their own resource pooling inside the library. These often make request threads block forever when a problem occurs. Of course, these never allow you to configure their failure modes, like what to do when all connections are tied up waiting for replies that’ll never come.

If it’s an open source library, then you may have the time, skills, and resources to find and fix such problems. Better still, you might be able to search through the issue log to see if other people have already done the hard work for you.

On the other hand, if it’s vendor code, then you may need to exercise it yourself to see how it behaves under normal conditions and under stress. For example, what does it do when all connections are exhausted?

If it breaks easily, you need to protect your request-handling threads. If you can set timeouts, do so. If not, you might have to resort to some complex structure such as wrapping the library with a call that returns a future. Inside the call, you use a pool of your own worker threads. Then when the caller tries to execute the dangerous operation, one of the worker threads starts the real call. If the call makes it through the library in time, then the worker thread delivers its result to the future. If the call does not complete in time, the request-handling thread abandons the call, even though the worker thread might eventually complete. Once you’re in this territory, beware. Here there be dragons. Go too far down this path and you’ll find you’ve written a reactive wrapper around the entire client library.
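
Here is one way such a wrapper might look, sketched with invented names around java.util.concurrent’s ExecutorService and Future; the quarter-second budget, the pool size, and the “UNKNOWN” fallback are placeholder policy decisions, not recommendations.

 import java.util.concurrent.ExecutorService;
 import java.util.concurrent.Executors;
 import java.util.concurrent.Future;
 import java.util.concurrent.TimeUnit;
 import java.util.concurrent.TimeoutException;

 class GuardedClient {
   private final ExecutorService workers = Executors.newFixedThreadPool(10);

   public String checkAvailability(String sku) throws Exception {
     Future<String> result = workers.submit(() -> dangerousLibraryCall(sku));
     try {
       return result.get(250, TimeUnit.MILLISECONDS);  // bounded wait for the request thread
     } catch (TimeoutException e) {
       result.cancel(true);    // abandon the call; the worker may still finish later
       return "UNKNOWN";       // degrade instead of blocking forever
     }
   }

   private String dangerousLibraryCall(String sku) {
     // Stand-in for the vendor call that can block indefinitely.
     return "IN_STOCK";
   }
 }

Note that cancel(true) only interrupts the worker; if the library call ignores interruption, that thread stays tied up until the call finally returns, which is why the worker pool needs its own bound.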

If you’re dealing with vendor code, it may also be worth some time beating them up for a better client library.

A blocked thread is often found near an integration point. These blocked threads can quickly lead to chain reactions if the remote end of the integration fails. Blocked threads and slow responses can create a positive feedback loop, amplifying a minor problem into a total failure.

Remember This

Recall that the Blocked Threads antipattern is the proximate cause of most failures.

Application failures nearly always relate to Blocked Threads in one way or another, including the ever-popular “gradual slowdown” and “hung server.” The Blocked Threads antipattern leads to the Chain Reactions and Cascading Failures antipatterns.

Scrutinize resource pools.

Like Cascading Failures, the Blocked Threads antipattern usually happens around resource pools, particularly database connection pools. A deadlock in the database can cause connections to be lost forever, and so can incorrect exception handling.
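
For instance, a connection checked out of the pool has to go back on every path, including the exception path. A hypothetical DAO using try-with-resources does this automatically; the table and query are invented for the example.

 import java.sql.Connection;
 import java.sql.ResultSet;
 import java.sql.SQLException;
 import java.sql.Statement;
 import javax.sql.DataSource;

 class OrderDao {
   private final DataSource pool;

   OrderDao(DataSource pool) { this.pool = pool; }

   int countOrders() throws SQLException {
     try (Connection conn = pool.getConnection();
          Statement stmt = conn.createStatement();
          ResultSet rs = stmt.executeQuery("select count(*) from orders")) {
       rs.next();
       return rs.getInt(1);
     } // connection, statement, and result set are closed here, success or failure
   }
 }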

Use proven primitives.

Learn and apply safe primitives. It might seem easy to roll your own producer/consumer queue: it isn’t. Any library of concurrency utilities has more testing than your newborn queue.
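
A bounded producer/consumer hand-off built on the library’s ArrayBlockingQueue takes only a few lines, with timeouts on both ends; the names and sizes below are arbitrary.

 import java.util.concurrent.ArrayBlockingQueue;
 import java.util.concurrent.BlockingQueue;
 import java.util.concurrent.TimeUnit;

 class WorkPipeline {
   private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(100);

   public boolean offerWork(String job) throws InterruptedException {
     return queue.offer(job, 1, TimeUnit.SECONDS);   // false if the queue stays full
   }

   public String takeWork() throws InterruptedException {
     return queue.poll(5, TimeUnit.SECONDS);          // null if nothing arrives in time
   }
 }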

Defend with Timeouts.

You cannot prove that your code has no deadlocks in it, but you can make sure that no deadlock lasts forever. Avoid infinite waits in function calls; use a version that takes a timeout parameter. Always use timeouts, even though it means you need more error-handling code.
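
For example, the locks in java.util.concurrent offer a timed form. The sketch below (invented names, arbitrary budget) fails fast instead of parking the thread forever.

 import java.util.concurrent.TimeUnit;
 import java.util.concurrent.locks.ReentrantLock;

 class TimedLocking {
   private final ReentrantLock lock = new ReentrantLock();

   void update() throws InterruptedException {
     // lock.lock() can park this thread forever; the timed form cannot.
     if (!lock.tryLock(500, TimeUnit.MILLISECONDS)) {
       throw new IllegalStateException("could not acquire lock in time");
     }
     try {
       // ... mutate shared state ...
     } finally {
       lock.unlock();
     }
   }
 }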

Beware the code you cannot see.

All manner of problems can lurk in the shadows of third-party code. Be very wary. Test it yourself. Whenever possible, acquire and investigate the code for surprises and failure modes. You might also prefer open source libraries to closed source for this very reason.
