Tip 5: Fail Gracefully
[White Belt] Writing code that fails well is just as important as writing code that works well.

What happens when code fails? It’s going to. Even if you wrote your part perfectly, there are all kinds of conditions that could cause the overall system to fail:

  • A rogue mail daemon on the computer, busy sending offers of great wealth from some foreign country, consumes all the RAM and swap. Your next call to malloc() returns ETOOMUCHSPAM.

  • Java Update 134,001 fills up the system’s hard drive. You call write(), and the system returns ESWITCHTODECAF.

  • You try to pull data off a tape, but the tape robot is on a ship at sea; rolling waves cause the robot to drop the tape, and the driver returns EROBOTDIZZY.

  • Cosmic rays flip a bit in memory, causing a memory access to return 0x10000001 instead of 0x1, and you discover that this makes for a very bad parameter to pass into memcpy() after it returns EMEMTRASHED.

You may think, “Yeah, right,” but all these cases actually happened. (Yes, I had to fix a tape robot controller because it would drop tapes when on a Navy ship.) Your code cannot naively assume that the world around it is sane; the world will take every opportunity to prove it wrong.

How your code fails is just as important as how it works. You may not be able to fix the failure, but if nothing else, your code should strive to fail gracefully.

Order of Operations

In many textbook programs, the environment is a clean slate, and the program runs to completion. In many messy, nontextbook programs, the environment is a rugby match of threads and resources, all seemingly trying to beat each other into submission.

Consider the following example: you’re creating a list of customer names and addresses that will be fed to a label printer. Your code gets passed a customer ID and a database connection, so you need to query the database for what you need. You create a linked list whose add() method looks like this:

ListUpdate.rb

def add(customer_id) # BAD BAD BAD, see text
  begin
    @mutex.lock
    old_head = @head
    @head = Customer.new
    @head.name =
      @database.query(customer_id, :name)
    @head.address =
      @database.query(customer_id, :address)
    @head.next = old_head
  ensure
    @mutex.unlock
  end
end

(Yes, I know this example is contrived. Bear with me.)

This code works in the happy path: the new element is put at the head of the list, it gets filled in, and everything is happy. But what if one of those queries to the database raises an exception? Take a look at the code again.[9]

This code doesn’t fail gracefully. In fact, it does collateral damage by allowing a database failure to destroy the customer list. The culprit is the order of operations:

  • The list @head and @head.next are absolutely vital to the list’s integrity. These shouldn’t be monkeyed with until everything else is ready.

  • The new object should be fully constructed before it is inserted into the list.

  • The lock should not be held during operations that could block. (Assume there are other threads wanting to read the list.)
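The three points above suggest one possible reordering of add(), sketched here under some assumptions. FlakyDatabase is a hypothetical stand-in for the real database connection, and Customer is reduced to a Struct to keep the listing short. This is only one way to answer the exercise in Actions, so you may want to try that first.

```ruby
# Sketch only: one possible reordering of add(). FlakyDatabase is a
# hypothetical stand-in for the real database connection.
Customer = Struct.new(:name, :address, :next)

class FlakyDatabase
  def initialize(rows)
    @rows = rows
  end

  # Raises RuntimeError when the customer has no row, simulating a failure.
  def query(customer_id, field)
    row = @rows.fetch(customer_id) { raise "database failure for #{customer_id}" }
    row.fetch(field)
  end
end

class ShippingQueue
  attr_reader :head

  def initialize(database)
    @database = database
    @head = nil
    @mutex = Mutex.new
  end

  def add(customer_id)
    # 1. Do the slow, failure-prone work first, with no lock held and
    #    without touching the list.
    customer = Customer.new
    customer.name = @database.query(customer_id, :name)
    customer.address = @database.query(customer_id, :address)

    # 2. Only now take the lock, and hold it just long enough to link
    #    the finished object in.
    @mutex.synchronize do
      customer.next = @head
      @head = customer
    end
  end
end

db = FlakyDatabase.new({ 1 => { name: 'Anand', address: '' } })
queue = ShippingQueue.new(db)
queue.add(1)

begin
  queue.add(2) # no row for customer 2, so query raises
rescue RuntimeError => e
  puts "add(2) failed: #{e.message}"
end

puts queue.head.name # the list still holds only customer 1
```

If a query raises, the exception leaves add() before @head is ever touched, so the list is exactly as it was before the call.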

Transactions

In the previous section, the example had only one essential bit of state that needed to stay consistent. What about cases where there’s more than one? Consider the classic example of moving money between two bank accounts:

Transaction.rb

savings.deduct(100)
checking.deposit(100)

What happens if the database croaks right after the money has been deducted and the deposit into checking fails? Where did the money go? Perhaps you try to solve that case by putting it back into the savings account:

Transaction.rb

savings.deduct(100) # Happily works

begin
  checking.deposit(100) # Fails: database went down!
rescue
  begin
    # Put money back
    savings.deposit(100) # Fails: database still dead
  rescue
    # Now what???
  end
end

Nice try, but that doesn’t help if the second deposit fails, too.

The tool you need here is a transaction. Its purpose is to allow several operations, potentially to several objects, to be either fulfilled completely or rolled back.

Transactions (here in a made-up system) would allow our previous example to look like this:

Transaction.rb

t = Transaction.new(savings, checking)
t.start

# Inject failure
checking.expects(:deposit).with(100).raises

begin
  savings.deduct(100)
  checking.deposit(100)
  t.commit
rescue
  t.rollback
end

You’ll usually find transactions in databases, because our example scenario is exceedingly common in that field. You may find variations on this theme anywhere systems require an all-or-nothing interlock.
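To make the all-or-nothing idea concrete, here is a minimal in-memory sketch of what such a made-up Transaction class might do. Everything in it is an assumption for illustration: Account and Transaction are hypothetical classes, the snapshot-and-restore approach is just the simplest thing that works in memory, and the failure is injected by overriding deposit on one object rather than with a mock framework.

```ruby
# Sketch only: Account and Transaction are hypothetical in-memory classes,
# not a real database API.
class Account
  attr_accessor :balance

  def initialize(balance)
    @balance = balance
  end

  def deduct(amount)
    raise ArgumentError, 'insufficient funds' if amount > @balance
    @balance -= amount
  end

  def deposit(amount)
    @balance += amount
  end
end

class Transaction
  def initialize(*accounts)
    @accounts = accounts
    @snapshot = nil
  end

  # Remember every balance so rollback can restore it.
  def start
    @snapshot = @accounts.map(&:balance)
  end

  # Changes were already applied in place; just discard the snapshot.
  def commit
    @snapshot = nil
  end

  # Put every account back the way it was when start was called.
  def rollback
    return if @snapshot.nil?
    @accounts.zip(@snapshot) { |account, balance| account.balance = balance }
    @snapshot = nil
  end
end

savings = Account.new(500)
checking = Account.new(50)

# Inject a failure: this one object's deposit always raises.
def checking.deposit(_amount)
  raise 'database went down'
end

t = Transaction.new(savings, checking)
t.start
begin
  savings.deduct(100)
  checking.deposit(100)
  t.commit
rescue RuntimeError
  t.rollback
end

puts savings.balance # the deduction was undone by the rollback
```

A real database has to provide the same guarantee even when the machine itself dies partway through, which takes far more machinery than a snapshot in memory. The contract seen by the calling code is the same, though.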

Failure Injection

So far, we’ve talked about how your code responds to likely failures. For purposes of testing, how do you ensure your code responds well when an essential resource dies, passes on, is no more, ceases to be, pushes up daisies, and becomes an ex-resource?

The solution is to inject failures using an automated test harness. This is easiest with a mock object framework, because you can tell the mock to return good data several times and then return something bogus or throw an exception. Likewise, in the test, you assert that the code under test raises the appropriate exception.

Revisiting our list update problem, here’s some test code that simulates a valid database response for key 1 and a failure on the query for key 2:

ListUpdate2.rb

require 'rubygems'
require 'test/unit'
require 'mocha'

class ListUpdateTest < Test::Unit::TestCase
  def test_database_failure
    database = mock()
    database.expects(:query).with(1, :name).
      returns('Anand')
    database.expects(:query).with(1, :address).
      returns('')
    database.expects(:query).with(2, :name).
      raises # (1)

    q = ShippingQueue.new(database)
    q.add(1)

    assert_raise(RuntimeError) do
      q.add(2) # (2)
    end

    # List is still okay
    assert_equal 'Anand', q.head.name # (3)
    assert_equal nil, q.head.next
  end
end

(1) Injection of RuntimeError exception.

(2) Call will raise; the assert_raise is expecting it (and will trap the exception).

(3) Verify that the list is still intact, as if q.add(2) were never called.

Failure injection of this sort allows you to think through—and verify—each potential scenario of doom. Test in this manner just as often as you test the happy path.
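If no mock framework is at hand, the "good data several times, then a failure" pattern can also be hand-rolled. The following is a minimal sketch; FailAfter is a hypothetical test double invented for this illustration, and its query method only mimics the shape of the database calls used earlier.

```ruby
# Sketch only: FailAfter is a hypothetical hand-rolled test double.
class FailAfter
  attr_reader :calls

  # good_calls: how many queries succeed before the failure is injected.
  def initialize(good_calls, value = 'ok')
    @good_calls = good_calls
    @value = value
    @calls = 0
  end

  def query(_customer_id, _field)
    @calls += 1
    raise 'injected database failure' if @calls > @good_calls
    @value
  end
end

db = FailAfter.new(2)
results = []
error = nil

begin
  3.times { results << db.query(1, :name) }
rescue RuntimeError => e
  error = e
end

puts results.inspect # the two answers collected before the failure
puts error.message
```

Varying good_calls from zero upward walks the failure through every call site in turn, which is a cheap way to cover each scenario of doom without writing a separate mock setup for each one.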

Test Monkeys

You can think through scenarios all day long and build tremendously robust code. Yet even the most foolproof program can be foiled by a sufficiently talented fool. If you don’t have such a fool handy, the next best thing is a test monkey.

In my first job working on handheld computers, we had a program called Monkey that would inject random taps and drags into the UI layer, as if they had come from the touchscreen. There was nothing fancier than that. We’d run Monkey until the system crashed.

Monkey may not have been a talented fool, but a whole bunch of monkeys tapping like mad, 24 hours a day, makes up for lack of talent. Alas, no Shakespeare (but perhaps some E. E. Cummings) and a whole bunch of crashes. The crashes were things we couldn’t have envisioned—that was the point.

In the same way, can you create a test harness that beats the snot out of your program with random (but valid) data? Let it run thousands or millions of cycles; you never know what might turn up. I used this technique on a recent project and discovered that once in a blue moon, a vendor API function would return “unknown” for the state of a virtual machine. What do they mean, they don’t know the state? I had no idea the function could return that. My program crashed when it happened. Lesson learned…again.
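A harness of that kind can be quite small. The sketch below is an illustration under assumptions, not the harness from that project: VM_STATES, describe_state, and monkey_test are hypothetical names, and the code under test deliberately forgets the "unknown" state so the monkey has something to find.

```ruby
# Sketch only: VM_STATES, describe_state, and monkey_test are hypothetical.
VM_STATES = %w[running stopped suspended unknown].freeze

# Code under test: deliberately forgets about "unknown".
def describe_state(state)
  case state
  when 'running'   then 'VM is up'
  when 'stopped'   then 'VM is down'
  when 'suspended' then 'VM is paused'
  else raise ArgumentError, "unhandled state: #{state}"
  end
end

# Feed random but valid inputs to the block and count every input that
# makes it raise. A fixed seed makes a failing run reproducible.
def monkey_test(cycles:, seed:)
  rng = Random.new(seed)
  failures = Hash.new(0)
  cycles.times do
    state = VM_STATES.sample(random: rng)
    begin
      yield state
    rescue StandardError => e
      failures["#{state}: #{e.class}"] += 1
    end
  end
  failures
end

failures = monkey_test(cycles: 10_000, seed: 42) { |s| describe_state(s) }
failures.each { |what, count| puts "#{what} x#{count}" }
```

Logging the seed alongside any failure matters more than it looks: a crash that shows up once in a million cycles is only useful if you can replay the run that produced it.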

Actions

Revisit the previous code with the customer list. How would you fix it? Here’s a shell to work with:

ListUpdate2.rb

require 'thread'

class Customer
  attr_accessor :name, :address, :next

  def initialize
    @name = nil
    @address = nil
    @next = nil
  end
end

class ShippingQueue
  attr_reader :head

  def initialize(database)
    @database = database
    @head = nil
    @mutex = Mutex.new
  end

  def add(customer_id)
    # Fill in this part
  end
end

Use the test code from Failure Injection to see whether you got it right.
