Chapter 5. Error Handling and Debugging

Jack Ganssle

Kamal Hyder

Bob Perrin

In this chapter, we look at error handling, error management, managing changes to memory, and debugging techniques.

You have to think through error handling from the outset of designing a product. Can the system recover from a detected error? How? Will you permit a system crash or at least leave some debugging breadcrumbs behind? How? Does a long jump back to main() suitably reset the system after a catastrophic error is detected? Are you sure the peripherals will reset properly? Maybe it makes sense to halt and let the watchdog time out. Above all, error handling is fundamental to building a reliable embedded system.

The following sections discuss various techniques to avoid, find, and handle errors, and offer numerous tips for effective debugging and troubleshooting, including an important section on managing changes to memory. (In 2003 a woman almost died when her rice cooker reprogrammed her pacemaker’s Flash memory, and a man collapsed when a retail store’s security device did the same to his!)

The Zen of Embedded Systems Development and Troubleshooting

Troubleshooting is puzzle solving. When it comes to troubleshooting, a software engineer’s mind-set is 80% of the game. The techniques and tools in the toolbox are the remaining 20%.

Before discussing the nitty-gritty technical details of debugging in Dynamic C, we will touch on some concepts to help the reader develop a state of mind conducive to embedded systems development and troubleshooting.

While developing an embedded system, an engineer wears three distinct hats. They are inscribed: DEVELOPER, FINDER, and FIXER.

Wearing the DEVELOPER hat implies a responsibility to find or create a cost-effective solution to the control problem at hand.

An engineer dons the FINDER hat when a malfunction or bug is observed. The responsibility of the FINDER is to delve deeply into the bug and determine the root cause of the malfunction.

Once a bug is identified, the engineer slips on the FIXER hat. The FIXER, much like the DEVELOPER, must find a cost effective solution to a problem. However, unlike a DEVELOPER, a FIXER is usually constrained to an existing design and seldom has the leeway a DEVELOPER does.

The DEVELOPER Hat

Embedded systems developers have many details with which to contend. An enormously helpful philosophy is that of “baby steps for the engineer.” This philosophy derives from the fact that diagnosing runtime errors in embedded systems is often significantly more difficult than diagnosing problems in more controlled environments.

Software engineers writing middleware for SAP, or C++ or Perl code for web servers, will often knock out a few hundred lines of code in a module before testing it. In embedded systems, a good rule of thumb is to try to test the code every 20 to 50 lines. Start simple. Progress in small steps. Be sure that each function, each module, and each library operates correctly both with “normal” data and with data outside the boundary conditions.

Regression Testing—Test Early and Test Often

Another bit of philosophy admonishes, “Test early and test often.”

When testing embedded systems, it is useful to develop a suite of tests that can be run repeatedly throughout the code development. These tests might be as simple as a sequence of button presses on the MMI, or as complex as replacing hardware drivers with functions (often by substituting a library file) that, instead of acquiring live sensor data, report simulated sensor data.

If the test suite runs successfully on a previous code build, but not on the current code build, and the programmer has only made incremental changes between the two code builds, then we have a good idea where to start looking for defective code.

This test suite is also useful when upgrading a compiler. If the code compiles and successfully completes the tests with a previous version of the compiler, but not with the current version, then we have both an indication that something is wrong, and that changing compiler versions caused the bug.

This technique of running the same old tests on new pieces of code is called regression testing.

Case Study—Big Bang Integration and No Regression Test Suite

Not every system will have difficult-to-find bugs, but when one does crop up, the time it takes to reproduce and find it can be staggering. Here is one company’s experience with an obscure bug:

The company made equipment that printed numbers in magnetic ink on the bottom of checks. The engineering group was long overdue releasing the new high-speed document printer. Marketing had already announced the new product and sales of the older products had subsequently tanked. Customers were holding out for the announced, but unreleased, high capacity, high-speed printer.

The new printer had a hopper that could hold 5000 documents and could process the documents as fast as the fastest human operator could enter the information to be printed.

The document handler generally operated as expected, with one exception: every 16,000 to 47,000 documents, the code would “lock up.” After a few seconds, the onboard watchdog would reset the system. The log files stored in RAM would invariably be corrupted.

The software team had opted for a big-bang integration of software modules written by five talented engineers for the system’s microprocessors. Each engineer had his or her own style and method of “testing” their code. No collaborative regression tests existed.

The hardware design team was confident in the design. Simple bits of code were used to verify that the motors and solenoids were properly under software control. Digital storage oscilloscopes were used to verify noise levels and transients were well within design tolerances.

The software engineers were sure it was an issue associated with the big bang integration. The software team burned many a gallon of midnight oil looking for solutions and spent months developing simulators and regression tests to try to find the bug.

Through robust testing, they found and fixed many bugs that would have eventually caused customers grief. However, through software testing alone, they were unable to reproduce or identify the cause of the “it just crashes every now and again” bug.

After several months, several hundred thousand dollars, and a 30% companywide reduction in force due to lack of sales of existing product, the problem turned out to be ESD (electrostatic discharge). As the checks were pulled along by little rubber rollers, ESD built up on the documents and accumulated on a plastic photo-sensor. If conditions were dry enough, the charge would eventually discharge into a nearby PCB trace and addle the CPU.

The solution was to add a couple of little grounded tinsel brushes to wipe the static buildup off the documents.

This simple little problem almost killed the company. It did cost forty people their jobs. What could have been done?

The hardware group had proceeded with their design in incremental steps. Each piece of hardware was tested. The integrated units were tested. The careful testing and development gave the hardware group, and the engineering management, confidence in the hardware design.

The software group had been much more cavalier. Bright people had coded whole modules quickly—sometimes overnight. The rapid development, coupled with the lack of formal regression testing or integration testing, inspired a lack of confidence in the final code base.

Not only did management feel the issue was a software problem, so did the software engineers. They just knew that some buffer was overflowing or some pointer was errant.

The fact that they found bugs as they developed tests for each module, and for the integration of modules, further reinforced the belief that the code base was unstable.

It wasn’t until months had passed and the software group had found and fixed many bugs in the code base that the company developed enough confidence in the code base to begin to seriously look at potential hardware issues again. That is where the “show stopper” bug was found.

The lesson to be learned here is that not all problems that appear to be software actually are. Additionally, if one is not confident with the code base, a lot of time and money may be spent looking in the wrong place.

The FINDER Hat

Troubleshooting requires a very methodical mind-set. Never assume anything. Never assume that the tools are working flawlessly. Never assume that the hardware is good. Never assume that the code is without bugs.

When troubleshooting, don’t look at anything on the bench as being black or white. Think in terms of gray. Each piece of code or hardware or test gear should be assigned a degree of confidence.

For example, consider a digital multimeter (DMM) that is telling us that we have a low supply voltage. Don’t assume that the DMM is correct. Crosscheck it.

We could measure a fresh 9-volt battery. If we have confidence in the freshness of the 9-volt battery, then our confidence has increased in the DMM’s accuracy (assuming the DMM displayed about 9 volts when connected to the battery).

We could crosscheck the DMM’s measurement of the “low” supply voltage with another DMM or better yet, an analog meter. If we get the same measurement from both instruments, the confidence level improves in both the test instruments (reading similarly) and that the measured supply rail might be low.

However, we’re not done yet. We can hook up an oscilloscope to the supply rail in question. Even a crummy old slow scope will do in most situations. What we want to eliminate is the possibility that the DMM is giving us a false measurement due to alternating current (AC) noise on the supply rail—which would still be a problem, but a different one than a “low voltage” rail.

Each of the above steps gives us a clue. Each step helps us build confidence in our understanding of the situation. Never assume—always double check.

When debugging a system, there are two distinct hats that must be worn. The first hat is inscribed FINDER, the second FIXER. We should always be cognizant of which hat we are wearing.

The FINDER hat is worn during the diagnostic phase. Before we can properly “fix” a problem, we must understand it.

Some engineers approach a malfunctioning system with shotgun solutions like these:

  • Just make the stack bigger

  • Just add an extra time delay here or there

  • Just put a few capacitors on the power supply or in the feedback loop

  • Just ground the cable shields

  • Just disable interrupts

  • Just disable the watchdog

  • Just make everything a global variable

People who don’t understand the distinction between FINDER and FIXER take this sort of shotgun approach.

Shotgun solutions don’t always address the root cause of the problem. They may mask a problem in a particular unit or prototype, or under a given set of conditions, but the problem may reassert itself later in the field.

For example, an engineer may see a problem in a function’s behavior, and determine that “all this passing parameters by address is overly difficult—I’ll just make this function’s variables global.” This engineer may have fixed this problem for the function in question, but nothing was done to address the root cause of the problem. The engineer that wrote the code may not have understood how to pass parameters into and out of functions. Quite likely another place in the code has similar problems. These problems may just not be asserting themselves at the present time.

A much better solution would have been for the engineer to recognize the deficiency, correct the function that was exhibiting the odd behavior, and then carefully comb through the code base looking for similar problems.

The FINDER must carefully amass clues, generate a hypothesis for the problem, and then build confidence in the hypothesis through experimentation.

Part of the process of amassing clues involves never changing more than one thing at a time. If two or three tweaks are made to a system as part of the exploration of a problem, there is no way to tell which change affected the behavior of the problem.

If we make a change to a system and the change doesn’t seem to change the system’s behavior, in most circumstances, the next move should be to restore the system to the original configuration (undo the change) and verify the problem still exists.

In many situations, we may make a change that we feel should be incorporated into the final project, even if it didn’t affect the problem at hand. If that occurs, be disciplined, undo the change, and proceed with the FINDER’s duty. Once the root cause of the problem is found, we can always come back (wearing the DEVELOPER’s hat) and make additional changes that we might consider good engineering practice.

For example, consider a target system that is exhibiting difficulty communicating with another system. We notice that the communications cable’s shield is ungrounded. We ground the shield, but the communications problem still exists. Even though we might consider it a good engineering practice to ground the shield, the system should still be placed back into the original state until we get to the bottom of the communications problem. Always make just one change at a time. Always go back to the initial configuration that caused the problem.

After making a change to a system and observing the behavior, a useful practice is to write a few notes about what change was made and what behavior was observed. The reasons for this are simple. Humans forget. Humans get confused.

An experiment run a couple of minutes ago might be clear in one’s mind, but after another four hours of tracking down a difficult bug, the experiment and results will be difficult to recall with confidence. Time is the enemy of recollection. Take notes. All the best detectives have a little notepad.

Bob Pease is an internationally revered engineer and author. In his book Troubleshooting Analog Circuits Pease introduces Milligan’s Law—“When you are taking data, if you see something funny, Record Amount of Funny.” This is as important in software troubleshooting or system level troubleshooting as it is in analog troubleshooting. Take copious and clear notes.

The FIXER Hat

Once the root cause of a problem is understood, the engineer wearing the FIXER hat can devise a solution. The duty of the FIXER is the same as that of any design engineer—balance the cost and timeliness of a solution with the effectiveness of the solution.

Beyond repairing the bug, the FIXER has an institutional responsibility to provide feedback to the engineer or engineering group that introduced the bug. Without this feedback, the same bug may be introduced into future products.

Depending on the production status of the defective product, the FIXER may have the additional burden of devising material dispositions for existing stock, work in progress, and systems deployed at customer sites. In some situations, the FIXER may be called upon to do as much diplomacy as engineering.

Avoid Debugging Altogether—Code Smart

These are some guidelines that are useful to both the DEVELOPER and the FIXER. As with any guideline, following these to the extreme is probably not going to be either possible or desirable for embedded systems development. An engineer should keep the spirit of these guidelines in mind.

For example, a guideline might say a NULL pointer check before every pointer access is a good idea. On a PC application this might be acceptable, but a NULL pointer check and special handling of NULL pointer cases can bloat and slow down code too much for an 8-bit system. The programmer must take extra care to make sure the situation doesn’t happen in the first place. The spirit of the guideline is clearly “be careful that pointers are initialized correctly.”

Guideline #1: Use Small Functions

Keep functions to a page or less of code when possible. Minimize their side effects; for example, avoid modifying global variables whenever possible. Test functions by themselves to make sure they work as expected for various input values, especially boundary conditions. Use and check return values where invalid input is a possibility (see the sketch after this list). Remember:

  • Baby Steps

  • Test early. Test often.
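For instance, here is a minimal sketch of the return-value idea; the function, its limits, and its conversion constants are hypothetical:

/* Hypothetical helper: convert a raw 10-bit ADC reading to millivolts.
   Returns 0 on success, -1 if the input is invalid - callers must check. */
#define ADC_MAX_COUNTS 1023
#define VREF_MV        3300

int adc_to_millivolts(unsigned int counts, unsigned int *mv)
{
   if (mv == 0 || counts > ADC_MAX_COUNTS)
      return -1;                 /* boundary condition or bad pointer */

   *mv = (unsigned int)(((unsigned long)counts * VREF_MV) / ADC_MAX_COUNTS);
   return 0;
}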

Guideline #2: Use Pointers with Great Care

An uninitialized or badly initialized pointer can point to anywhere in memory and therefore corrupt any location in RAM during a write. Be careful that pointers are initialized correctly.
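A minimal, hypothetical illustration (the structure and function names are invented for this example):

struct reading { int value; };

void bad_practice(void)
{
   struct reading *p;      /* uninitialized: points anywhere in memory */
   p->value = 42;          /* corrupts some unknown RAM location       */
}

void good_practice(struct reading *storage)
{
   struct reading *p = storage;   /* aimed at storage the caller owns */
   if (p != 0)
      p->value = 42;
}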

Guideline #3: Comment Code Well

Write good descriptions for functions that include what inputs are used for and what output is to be expected. Comment code lines where the purpose of the code isn’t obvious. The code that the programmer writes today may not be so familiar in six months when a bug needs to be fixed.

Guideline #4: Avoid “Magic Numbers”

Use a single macro to define a constant value that is or may be reused so that we only have to change it in one place. For example, the following code uses a macro to define BIGARRAYSIZE, which is then used in more than one place:

#define BIGARRAYSIZE 500
char bigarray[BIGARRAYSIZE];
...
memset (bigarray, 0, BIGARRAYSIZE);

The following code segment is an example of how NOT to define arrays. Use of magic numbers (like 500) often leads to confusion—especially during future code maintenance or debugging.

char bigarray[500];
...
memset (bigarray, 0, 500);

Proactive Debugging

Academics who study software engineering have accumulated an impressive number of statistics about where bugs come from and how many a typical program will have once coding stops but before debugging starts. Amazingly, debugging burns about half the total engineering time for most products. That suggests minimizing debugging is the first place to focus our schedule optimization efforts.

Defect reduction is not rocket science. Start with a well-defined specifications/requirements document, use a formal process of inspections on all work products (from the specifications to the code itself), and give the developers powerful and reliable tools. Skip any of these and bugs will consume too much time and sap our spirits.

However, no process, no matter how well defined, will eliminate all defects. We make mistakes!

Typically 5–10% of the source code will be wrong. Inspections catch 70–80% of those errors. A little 10,000-line program will still, after employing the very best software engineering practices, contain hundreds of bugs we’ve got to chase down before shipping. Use poor practices and the number (and development time) skyrockets.

Debugging is hard, slow, and frustrating. Since we know we’ll have bugs, and we know some will be god-awful hard to find, let’s look for things we can do to our code to catch problems automatically or more easily. I call this proactive debugging, which is the art of anticipating problems and instrumenting the code accordingly.

Stacks and Heaps

Do you use the standard, well-known method for computing stack size? After all, undersize the stack and your program will crash in a terribly-hard-to-find manner. Allocate too much and you’re throwing money away. High-speed RAM is expensive.

The standard stack-sizing methodology is to take a wild guess and hope. There is no scientific approach, nor even a rule of thumb. This isn’t too awful in a single-task environment, since we can just stick the stack at the end of RAM and let it grow downwards, all the time hoping, of course, that it doesn’t bang into memory already used. Toss in an RTOS, though, and such casual and sanguine allocation fails, since every task needs its own stack, each of which we’ll allocate using the “take a guess and hope” methodology.

Take a wild guess and hope. Clearly this means we’ll be wrong from time to time; perhaps even usually wrong. If we’re doing something that will likely be wrong, proactively take some action to catch the likely bug.

Some RTOSes, for instance, include monitors that take an exception when any individual stack grows dangerously small. Though there’s a performance penalty, consider turning these on for initial debug.

In a single task system, when debugging with an emulator or other tool with lots of hardware breakpoints, configure the tool to automatically (every time you fire it up) set a memory-write breakpoint near the end of the stack.

As soon as I allocate a stack I habitually fill each one with a pattern, like 0x55aa, even when using more sophisticated stack monitoring tools. After running the system for a minute or a week—whatever’s representative for that application—I’ll stop execution with the debugger and examine the stack(s). Thousands of words loaded with 0x55aa means I’m wasting RAM, and few or none of these words means the end is near, the stack is too small, and a crash is imminent.
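A minimal sketch of that trick follows; where the stack lives, and how big it is, are assumptions that would come from the linker map or the RTOS’s task-creation call:

#define STACK_FILL 0x55aa

/* Call once, before the task (or main loop) starts using the stack. */
void seed_stack(unsigned short *stack_bottom, unsigned int words)
{
   unsigned int i;
   for (i = 0; i < words; i++)
      stack_bottom[i] = STACK_FILL;
}

/* Later, from a debugger script or a background task: count how many
   words still hold the pattern. Few survivors means the stack is nearly
   full and a crash is imminent; thousands means RAM is being wasted.   */
unsigned int stack_words_untouched(const unsigned short *stack_bottom,
                                   unsigned int words)
{
   unsigned int i = 0;
   while (i < words && stack_bottom[i] == STACK_FILL)
      i++;
   return i;
}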

Heaps are even more problematic. Malloc() is a nightmare for embedded systems. As with stacks, figuring the heap size is tough at best, a problem massively exacerbated by multitasking. Malloc() leads to heap fragmentation—though it may contain vast amounts of free memory, the heap may be so broken into small, unusable chunks that malloc() fails.

In simpler systems it’s probably wise to avoid malloc() altogether. When there’s enough RAM, allocating all variables and structures statically yields the fastest and most deterministic behavior, though at the cost of using more memory.

When dynamic allocation is unavoidable, by all means remember that malloc() has a return value! I look at a tremendous amount of firmware yet rarely see this function tested. It must be a guy thing. Testosterone. We’re gonna malloc that puppy, by gawd, and that’s that! Fact is, it may fail, which will cause our program to crash horribly. If we’re smart enough—proactive enough—to test every malloc() then an allocation error will still cause the program to crash horribly, but at least we can set a debug trap, greatly simplifying the task of finding the problem.
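One way to be proactive about it is a trivial wrapper; the function names here are hypothetical, and the infinite loop is simply a convenient place to hang a breakpoint:

#include <stdlib.h>

static void alloc_failed(size_t size)
{
   (void)size;    /* handy to inspect in the debugger */
   for (;;)
      ;           /* set a breakpoint here; a fielded unit might log and reset */
}

void *checked_malloc(size_t size)
{
   void *p = malloc(size);
   if (p == NULL)
      alloc_failed(size);
   return p;
}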

An interesting alternative to malloc() is to use multiple heaps. Perhaps a heap for 100-byte allocations, one for 1000 bytes, and another for 5000. Code a replacement malloc() that takes the heap identifier as its argument. Need 643 bytes? Allocate a 1000-byte block from the 1000-byte heap. Memory fragmentation becomes extinct, your code runs faster, though some RAM will be wasted for the duration of the allocation. A few commercial RTOSes do provide this sort of replacement malloc().
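A sketch of such a replacement is shown below, assuming statically allocated pools; the pool sizes and the bookkeeping are illustrative choices, not any particular RTOS’s API:

#include <stddef.h>

struct pool {
   char *blocks;       /* block_count * block_size bytes of storage */
   char *used;         /* one in-use flag per block                 */
   int   block_size;
   int   block_count;
};

/* A hypothetical 1000-byte pool holding eight blocks. */
static char pool1000_mem[8 * 1000];
static char pool1000_used[8];
struct pool pool1000 = { pool1000_mem, pool1000_used, 1000, 8 };

/* Allocate one fixed-size block from the given pool; NULL if exhausted. */
void *pool_malloc(struct pool *p)
{
   int i;
   for (i = 0; i < p->block_count; i++) {
      if (!p->used[i]) {
         p->used[i] = 1;
         return p->blocks + (long)i * p->block_size;
      }
   }
   return NULL;       /* still check the return value! */
}

/* Return a block to its pool. */
void pool_free(struct pool *p, void *block)
{
   int i = (int)(((char *)block - p->blocks) / p->block_size);
   p->used[i] = 0;
}

Need 643 bytes? Call pool_malloc(&pool1000) and accept that 357 of them are wasted until the block is freed.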

Finally, if you do decide to use the conventional malloc(), at least for debugging purposes link in code to check the success of each allocation.

Walter Bright’s memory allocation test code, put into the public domain many years ago, can be found at www.snippets.org/MEM.TXT (the link is case-sensitive) along with companion files. MEM is a few hundred lines of C that replaces the library’s standard memory functions with versions that diagnose common problems.

MEM looks for out-of-memory conditions, so if you’ve inherited a lot of poorly written code that doesn’t properly check malloc()’s return value, use MEM to pick up errors. It verifies that frees match allocations. Before returning a block it sets the memory to a nonzero state to increase the likelihood that code expecting an initialized data set fails.

An interesting feature is that it detects pointer over- and underruns. By allocating a bit more memory than you ask for, and writing a signature pattern into your pre- and post-buffer memory, when the buffer is freed MEM can check to see if a pointer wandered beyond the buffer’s limits.
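MEM’s actual implementation is more thorough, but the guard-band idea can be sketched like this; the signature bytes and block layout are illustrative, not MEM’s own:

#include <stdlib.h>
#include <string.h>

#define GUARD_SIZE 4
static const unsigned char guard[GUARD_SIZE] = { 0xDE, 0xAD, 0xBE, 0xEF };

/* Allocate a block with signature bytes before and after the user area,
   plus a hidden copy of the requested size. */
void *guarded_malloc(size_t size)
{
   unsigned char *raw = malloc(sizeof(size_t) + GUARD_SIZE + size + GUARD_SIZE);
   if (raw == NULL)
      return NULL;
   memcpy(raw, &size, sizeof(size_t));
   memcpy(raw + sizeof(size_t), guard, GUARD_SIZE);
   memcpy(raw + sizeof(size_t) + GUARD_SIZE + size, guard, GUARD_SIZE);
   return raw + sizeof(size_t) + GUARD_SIZE;
}

/* On free, check both guards; a damaged guard means a pointer wandered
   past an end of the buffer at some point during its lifetime. */
int guarded_free(void *user)
{
   unsigned char *raw = (unsigned char *)user - GUARD_SIZE - sizeof(size_t);
   size_t size;
   int damaged;

   memcpy(&size, raw, sizeof(size_t));
   damaged = memcmp(raw + sizeof(size_t), guard, GUARD_SIZE) != 0 ||
             memcmp((unsigned char *)user + size, guard, GUARD_SIZE) != 0;
   free(raw);
   return damaged;     /* nonzero: over- or underrun detected */
}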

Geodesic Systems (www.geodesic.com) builds a commercial and much more sophisticated memory allocation monitor targeted at desktop systems. They claim that 99% of all PC programs suffer from memory leaks (mostly due to memory that is allocated but never freed). I have no idea how true this statement really is, but the performance of my PC sure seems to support their proposition. On a PC a memory leak isn’t a huge problem, since the programs are either closed regularly or crash sufficiently often to let the OS reclaim that leaked resource.

Firmware, though, must run for weeks, months, even years without crashing. If 99% of PC applications suffer from leaks, I’d imagine a large number of embedded projects share similar problems. One of MEM’s critical features is that it finds these leaks, generally before the system crashes and burns.

MEM is a freebie and requires only a small amount of extra code space, yet will find many classes of very common problems. The wise developer will link it, or other similar tools, into every project proactively, before the problems surface.

Seeding Memory

Bugs lead to program crashes. A “crash,” though, can be awfully hard to find. First we notice a symptom—the system stops responding. If the debugger is connected, stopping execution shows that the program has run amok. But why? Hundreds of millions of instructions might elapse between our seeing the problem and starting troubleshooting. No trace buffer is that large. So we’re forced to recreate the problem—if we can—and use various strategies to capture the instant when things fall apart. Yet “crash” often means that the code branches to an area of memory where it simply should not be. Sometimes this is within the body of code itself; often it’s in an address range where there’s neither code nor data.

Why do we continue to leave our unused ROM space initialized to some default value that’s a function of the ROM technology and not what makes sense for us? Why don’t we make a practice of setting all unused memory, both ROM and RAM, to a software interrupt instruction that immediately vectors execution to an exception handler?

Most CPUs have single-byte or single-word opcodes for a software interrupt. The Z80’s RST 7 was one of the most convenient, as its opcode is 0xff, the default state of unprogrammed EPROM. x86 processors all support the single-byte INT3 software interrupt. Motorola’s 68k family, and other processors, have an illegal instruction word.

Set all unused memory to the appropriate instruction, and write a handler that captures program flow if the software interrupt occurs. The stack often contains a wealth of clues about where things were and what was going on when the system crashed, so copy it to a debug area. In a multitasking application the OS’s task control block and other data structures will have valuable hints. Preserve these critical tidbits of information.

Make sure the exception handler stops program flow: lock up in an infinite loop or something similar, and ensure all interrupts and DMA are off, to stop the program from wandering away.
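The details are inevitably processor and compiler specific, but a sketch might look like the following; the trap opcode, the debug-area address, and the helper routines get_sp() and disable_all() are placeholders for whatever your part and tool chain provide:

#include <string.h>

#define TRAP_OPCODE   0xff                       /* e.g., Z80 RST 7        */
#define DEBUG_AREA    ((unsigned char *)0x8000)  /* reserved RAM for clues */
#define DEBUG_AREA_SZ 256

extern unsigned char *get_sp(void);    /* platform specific: read the SP  */
extern void disable_all(void);         /* mask interrupts, stop DMA       */

/* Fill a region of unused RAM with the trap opcode. Unused ROM is best
   seeded at build time by the linker or a post-processing tool. */
void seed_region(unsigned char *start, unsigned long len)
{
   while (len--)
      *start++ = TRAP_OPCODE;
}

/* The trap/illegal-instruction handler: preserve the breadcrumbs, then
   stop dead so the wandering code can't do any more damage. */
void trap_handler(void)
{
   memcpy(DEBUG_AREA, get_sp(), DEBUG_AREA_SZ);  /* copy the stack       */
   disable_all();                                /* interrupts and DMA   */
   for (;;)
      ;                                          /* breakpoint goes here */
}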

There’s no guarantee that seeding memory will capture all crashes, but if it helps in even a third of the cases you’ve got a valuable bit of additional information to help diagnose problems.

But there’s more to initializing memory than just seeding software interrupts. Other kinds of crashes require different proactive debug strategies. For example, a modern microprocessor might support literally hundreds of interrupt sources, with a vector table that dispatches ISRs for each. Yet the average embedded system might use a few, or perhaps a dozen, interrupts. What do we do with the unused vectors in the table?

Fill them, of course, with a vector aimed at an error handler! It’s ridiculous to leave the unused vectors aimed at random memory locations. Sometime, for sure, you’ll get a spurious interrupt, something awfully hard to track down. These come from a variety of sources, such as glitchy hardware (you’re probably working on a barely functional hardware prototype, after all).

More likely is a mistake made in programming the vectors into the interrupting hardware. Peripherals have gotten so flexible that they’re often impossible to manage. I’ve used parts with hundreds of internal registers, each of which has to be set just right to make the device function properly. Motorola’s TPU, which is just a lousy timer, has a 142-page data book that documents some 36 internal registers. For a timer. I’m not smart enough to set them correctly first try, every time. Misprogramming any of these complex peripherals can easily lead to spurious interrupts.

The error handler can be nothing more than an infinite loop. Be sure to set up your debug tool so that every time you load the debugger it automatically sets a breakpoint on the handler. Again, this is nothing more than anticipating a tough problem, writing a tiny bit of code to capture the bug, and then configuring the tools to stop when and if it occurs.
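A sketch of the idea for a part with a RAM-based table of function pointers follows; the table layout, vector count, and the real ISR names are assumptions about a hypothetical CPU:

#define NUM_VECTORS 64

typedef void (*isr_t)(void);
extern isr_t vector_table[NUM_VECTORS];     /* location fixed by the linker */

extern void uart_rx_isr(void);              /* the interrupts actually used */
extern void timer0_isr(void);
#define UART_RX_VECTOR 12                   /* hypothetical vector numbers  */
#define TIMER0_VECTOR  14

/* Any interrupt we never expect lands here; breakpoint it at load time. */
void unexpected_interrupt(void)
{
   for (;;)
      ;
}

void init_vectors(void)
{
   int i;
   for (i = 0; i < NUM_VECTORS; i++)
      vector_table[i] = unexpected_interrupt;    /* every slot covered     */

   vector_table[UART_RX_VECTOR] = uart_rx_isr;   /* then the real handlers */
   vector_table[TIMER0_VECTOR]  = timer0_isr;
}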

Wandering Code

Embedded code written in any language seems determined to exit the required program flow and miraculously start running from data space or some other address range a very long way from code store. Sometimes keeping the code executing from ROM addresses feels like herding a flock of sheep, each of whom is determined to head off in its own direction.

In assembly a simple typographical error can lead to a jump to a data item; C, with support for function pointers, means state machines not perfectly coded might execute all over the CPU’s address space. Hardware issues—like interrupt service routines with improperly initialized vectors and controllers—also lead to sudden and bizarre changes in program context.

Over the course of a few years I checked a couple of dozen embedded systems sent into my lab. The logic analyzer showed writes to ROM (surely an exercise in futility and a symptom of a bug) in more than half of the products.

Though there’s no sharp distinction between wandering code and wandering pointers (as both often come from the same sorts of problems), diagnosing the problems requires different strategies and tools.

Quite a few companies sell products designed to find wandering code, or that can easily be adapted to this use. Some emulators, for instance, let you set up rules for the CPU’s address space: a region might be enabled as execute-only, another for data read-writes but no executions, and a third tagged as no accesses allowed. When the code violates a rule the emulator stops, immediately signaling a serious bug. If your emulator includes this sort of feature, use it!

One of the most frustrating parts of being a tool vendor is that most developers use 10% of a tool’s capability. We see engineers fighting difficult problems for hours, when a simple built-in feature might turn up the problem in seconds. I found that less than 1% of people I’ve worked with use these execution monitors, yet probably 100% run into crashes stemming from code flaws that the tools would pick up instantly.

Developers fall into four camps when using an execution-monitoring device: the first bunch don’t have the tool. Another group has one but never uses it, perhaps because they have simply not learned its fundamentals. To have unused debugging power seems a great pity to me. A third segment sets up and arms the monitoring tool only when it’s obvious the code indeed wanders off somewhere, somehow.

The fourth, and sadly tiny, group builds a configuration file loaded by their ICE or debugger on every startup, that profiles what memory is where. These, in my mind, are the professional developers, the ones who prepare for disaster long before it inevitably strikes. Just like with make files, building a configuration file takes tens of minutes, so it is too often neglected.

If your debugger or ICE doesn’t come with this sort of feature, then adapt something else! A simple trick is to monitor the address bus with a logic analyzer programmed to look for illegal memory references. Set it to trigger on accesses to unused memory (most embedded systems use far less than the entire CPU address space; any access to an unused area indicates something is terribly wrong), or data-area executes, and so on.

A couple of years ago I heard an interesting tale: an engineer, searching out errant writes, connected an analyzer to his system and quickly found the bug. Most of us would stop there, disconnect the tool, and continue traditional debugging. Instead, he left the analyzer connected for a number of weeks till completing debug. In that time he caught seven—count ‘em—similar problems that may very well have gone undetected. These were weird writes to code or unused address spaces, bugs with no immediately apparent symptoms.

What brilliant engineering! He identified a problem, then developed a continuous process to always find new instances of the issue. Without this approach the unit would have shipped with these bugs undetected.

I believe that one of the most insidious problems facing firmware engineers is these lurking bugs. Code that does things it shouldn’t is flawed, even if the effects seem benign.

Various studies suggest that up to half the code in a typical system never gets tested. Deeply nested ifs, odd exception/error conditions, and similar ills defy even well-designed test plans. A rule of thumb predicts that (uninspected) code has about a 5% error rate. This suggests that a 10,000 line project (not big by any measure) likely has 250 bugs poised to strike at the worst possible moment.

Special Decoders

Another option is to build a special PAL or PLD, connected to address and status lines that flags errant bus transactions. Trigger an interrupt or a wait state when, say, the code writes to ROM.

If your system already uses a PAL, PLD or FPGA to decode memory chip selects, why not add an output that flags an error? The logic device—required anyway to enable memory banks—can also flash an LED or interrupt the CPU when it finds an error. A single added bit creates a complete self-monitoring system. More than just a great debug aid, it’s a very intriguing adjunct to high-reliability systems. Cheap, too.

A very long time ago—way back in the 20th century, actually—virtually all microprocessor designs used an external programmable device to decode addresses. These designs were thus all ripe for this simple yet powerful addition. Things have changed; now CPUs sport sometimes dozens of chip select outputs. With so many selects the day of the external decoder is fading away.

Few modern designs actually use all of those chip selects, though, opening up some interesting possibilities. Why not program one to detect accesses to unused memory? Fire off an interrupt or generate a wait state when such a condition occurs. The cost: zero. Effort needed: a matter of minutes. Considering the potential power of this oh-so-simple tool, the cost/benefit ratio is pretty stunning.

If you’re willing to add a trivial amount of hardware, then exploit more of the unused chip selects. Have one monitor the bottom of the stack; AND it with write to detect stack overflows.

Be sure to AND the ROM chip select with write to find those silly code overwrites as well.

MMUs

If your system has an integrated memory management unit, more options exist. Many MMUs both translate addresses and provide memory protection features. Though MMUs are common only on higher end CPUs, where typically people develop pretty big applications, most of the systems I see pretty much bypass the MMU.

The classic example of this is protected mode in the x86 architecture. Intel’s 16-bit x86 processors are all limited to a 1 MB address range, using what’s called “real mode.” It’s a way of getting direct access to 64 KB chunks of memory; with a little segment register fiddling, the entire 1 MB space is available.

But segmenting memory into 64 KB blocks gets awfully awkward as memory size increases. “Protected mode,” Intel’s answer to the problem, essentially lets you partition 4 GB of memory into thousands of variable-size segments.

Protected mode resulted from a very reasonable design decision, that of preserving as much software compatibility as possible with the old real mode code. The ugly feelings about segments persisted, though, so even now most of the x86 protected mode embedded applications I see map memory into a single, huge, 4 GB segment. Doing this—a totally valid way of using the device’s MMU—indeed gives the designer the nice flat address space we’ve all admired in the 68k family. But it’s a dumb idea.

The idea of segments, especially as implemented in protected mode, offers some stunning benefits. Break tasks into their own address spaces, with access rights for each. The x86 MMU associates access rules with each segment. A cleverly written program can ensure that no one touches routines or data unless entitled to do so. The MMU checks each and every memory reference automatically, creating an easily trapped exception when a violation occurs.

Most RTOSes targeted at big x86 applications explicitly support segmentation for just this reason. Me, I’d never build a 386 or bigger system without a commercial RTOS.

Beyond the benefits accrued from using protected mode to ensure data and code doesn’t fall victim to cruddy code, why not use this very powerful feature for debugging?

First, do indeed plop code into segments whose attributes prohibit write accesses. Write an exception handler that traps these errors, and that logs the source of the error.

Second, map every unused area of memory into its own segment, which has access rights guaranteeing that any attempt to read or write from the region results in an exception.

It’s a zero cost way to both increase system reliability in safety-critical applications, and to find those pesky bugs that might not manifest themselves as symptoms for a very long time.

Conclusion

With the exception of the logic analyzer—though a tool virtually all labs have anyway—all of the suggestions above require no tools. They are nearly zero-cost ways to detect a very common sort of problem. We know we’ll have these sorts of problems; it’s simply amateurish to not instrument your system before the problems manifest a symptom.

Solving problems is a high-visibility process; preventing problems is low-visibility. This is illustrated by an old parable: In ancient China there was a family of healers, one of whom was known throughout the land and employed as a physician to a great lord. The physician was asked which of his family was the most skillful healer. He replied, “I tend to the sick and dying with drastic and dramatic treatments, and on occasion someone is cured and my name gets out among the lords.”

“My elder brother cures sickness when it just begins to take root, and his skills are known among the local peasants and neighbors.”

“My eldest brother is able to sense the spirit of sickness and eradicate it before it takes form. His name is unknown outside our home.”

Great developers recognize that their code will be flawed, so instrument their code, and create tool chains designed to sniff out problems before a symptom even exists.

Implementing Downloadable Firmware with Flash Memory

The problem with any approach to in-situ firmware updates is that when such a feature contains a flaw, the target system may become an expensive doorstop—and perhaps injure the user in the process. Many of the potential pitfalls are obvious and straightforward to correct, but other, insidious defects may not appear until after a product has been deployed in its application environment.

Users are unequaled in their abilities to expose and exploit product defects, and to make matters worse, users also generally fail to heed warnings like “system damage will occur if power is interrupted while programming is underway.” They will happily attempt to reboot an otherwise functional system in the middle of the update process, and then file a warranty claim for the now “defective” product.

Any well-designed, user-ready embedded system must include the ability to recover from user errors and other catastrophic events to the fullest extent possible. The best way to accomplish this is to implement a fundamentally sound software update strategy that avoids these problems entirely. This chapter presents one such design.

The Microprogrammer

The following sections describe a microprogrammer-based approach to the implementation of a downloadable firmware feature for embedded systems. This approach is suitable for direct implementation as described, but can also be modified for use in situations where some of its features are not needed, or must be avoided.

A microprogrammer is defined by a system-level description of how the embedded system behaves before, during, and after the firmware update process. This behavior is carefully defined to help avoid many of the problems associated with other approaches to downloadable firmware. A careful implementation of this behavior eliminates the remaining concerns.

The first step in a microprogrammer-based firmware update process is to place the embedded system into a state where it is expecting the download process to occur. The transition to this state could be caused by the user pressing a button marked UPGRADE on the device’s interface panel, by the system detecting the start or end of a file transfer, or by some other means. In any case, the target system now realizes that its firmware will soon be updated, and brings any controlled processes to a safe and stable halt configuration.

Next, the target system is sent a small application called a microprogrammer. The microprogrammer assumes control of the system, and begins receiving the new application firmware and programming it into flash. At the conclusion of this process, the target system begins running the new firmware.

Advantages of Microprogrammers

One of the biggest selling points of a microprogrammer-based approach to downloadable firmware is its flexibility. The microprogrammer’s implementation can be changed even in the final moments before the firmware update process begins, which allows bug fixes and enhancements to be applied retroactively to deployed systems.

A microprogrammer does not consume resources in the target system except when programming is actually underway. Furthermore, since an effective microprogrammer can be as small as 10 KB of code, the target system does not require a bulky, sophisticated communications protocol in order to receive the microprogrammer at the start of the process—a simple text file transfer is often reliable enough.

The safety of a microprogrammer-based firmware update process is sometimes its most compelling advantage. When the target system’s flash chip is not used for any purpose other than firmware storage, the code needed to erase and program the flash chip does not exist in the system until it is actually needed. As such, the system is highly unlikely to accidentally erase its own firmware, even in severe cases of program runaway.

Disadvantages of Microprogrammers

One of the drawbacks of microprogrammers is that the microprogrammer itself is usually implemented as a stand-alone program, which means that its code is managed separately from both the application that downloads it to the target system, and the application that the microprogrammer delivers to the target system. This management effort requires additional developer resources, the quantity of which is highly dependent on how closely coupled these applications are to each other. Careful attention to program modularity helps to minimize the workload.

A microprogrammer is generally downloaded to and run from RAM, which means that the target system needs some quantity of memory available to hold the microprogrammer. This memory can be shared with the preexisting target application when necessary, but an embedded system that only has a few hundred bytes of RAM in total will probably need a different strategy.

And finally, in a single-flash system the microprogrammer approach requires the target system to be able to run code from RAM. This simply isn’t possible for certain microprocessor architectures, in particular the venerable 8051 and family. Hardware-oriented workarounds exist for these cases, but the limitations they impose often outweigh their benefits.

Receiving a Microprogrammer

The code in Listing 5.1 illustrates the functionality needed by a target system to download and run a microprogrammer. In the example, the target is sent a plain text Motorola S Record file from some I/O channel (perhaps a serial port), which is decoded and written to RAM. The target then activates the microprogrammer by jumping into the downloaded code at the end of the transfer.

Notice that the programmer_buf[] memory space is allocated as an automatic variable, which means that it has no fixed location in the target system’s memory image. This implies both that the addresses in the incoming S Records are relative rather than absolute, and that the incoming code is position-independent. If your compiler cannot produce position-independent code, then programmer_buf[] must be assigned to a fixed location in memory and the addresses in the incoming S Records must be located within that memory space.

The incoming microprogrammer image can be placed on top of other data if the target system does not have the resources to permanently allocate programmer_buf[]. At this point the embedded system has suspended normal operations anyway, making the bulk of its total RAM space available for use by the microprogrammer.

Example 5.1. Code that Downloads and Runs a Microprogrammer

enum srec_type_t {
   SREC_TYPE_S1, SREC_TYPE_S2, SREC_TYPE_S3,
   SREC_TYPE_S7, SREC_TYPE_S9
};
typedef void (*entrypoint_t)(void);

void microprogrammer()
{
   char programmer_buf[8192];
   int len;
   char sbuf[256];
   unsigned long addr;
   enum srec_type_t type;
   entrypoint_t entrypoint;

   while (1) {
      if (read_srecord(&type, &len, &addr, sbuf)) {
          switch (type) {
              case SREC_TYPE_S1:
              case SREC_TYPE_S2:
              case SREC_TYPE_S3:
                 /* record contains data (code) */
                 memcpy(programmer_buf + addr, sbuf, len);
                 break;

              case SREC_TYPE_S7:
                 /* record contains address of downloaded main() */
                 entrypoint = (entrypoint_t)(programmer_buf + addr);
                 break;

              case SREC_TYPE_S9:
                 /* record indicates end of data (code)— run it */
                 entrypoint();
                 break;

         }
      }
   }
}

A Basic Microprogrammer

The top-level code for a microprogrammer is shown in Listing 5.2. For consistency with the previous example, this code also receives an S Record file from some source and decodes it. The microprogrammer writes the incoming data to flash, and the system is rebooted at the end of the file transfer. Although overly simplistic (a plain text file transfer is probably not reliable enough for large programs), this code illustrates all the important features of a microprogrammer.

Example 5.2. A Basic Microprogrammer

void programmer()
{
   int len;
   char buf[256];
   unsigned long addr;
   enum srec_type_t type;

   while (1) {
      if (read_srecord(&type, &len, &addr, buf)) {
         switch (type) {
             case SREC_TYPE_S1:
             case SREC_TYPE_S2:
             case SREC_TYPE_S3:
                /* record contains data or code— program it */
                if (!is_section_erased(addr, len))
                   erase_flash(addr, len);
                write_flash(addr, len, buf);
                break;

             case SREC_TYPE_S9:
                /* this record indicates end of data—
                   execute system reset to run new application */
                reset();
                break;
         }
      }
   }
}

In addition to actually erasing flash, the function erase_flash() also manages a simple data structure that keeps track of which sections of the flash chip need to be erased, and which ones have already been erased. This data structure is checked by the is_section_erased() function, which prevents multiple erasures of flash sections when data arrives out of order—which is a common occurrence in an S Record file.
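A minimal sketch of that bookkeeping, assuming a hypothetical flash device with uniform 4 KB sectors, might look like this:

#define SECTOR_SIZE 4096UL
#define NUM_SECTORS 128                  /* e.g., a 512 KB device */

static char sector_erased[NUM_SECTORS];  /* cleared when the update starts */

/* Nonzero only if every sector touched by [addr, addr+len) has already
   been erased during this update session (len is assumed to be >= 1). */
int is_section_erased(unsigned long addr, int len)
{
   unsigned long s;
   for (s = addr / SECTOR_SIZE; s <= (addr + len - 1) / SECTOR_SIZE; s++)
      if (!sector_erased[s])
         return 0;
   return 1;
}

void erase_flash(unsigned long addr, int len)
{
   unsigned long s;
   for (s = addr / SECTOR_SIZE; s <= (addr + len - 1) / SECTOR_SIZE; s++) {
      /* ...issue the device's sector-erase command for sector s... */
      sector_erased[s] = 1;
   }
}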

Common Problems and Their Solutions

Regardless of how you modify the microprogrammer-based system description to fit your own requirements, you will encounter some common problems in the final implementation. These problems, along with their solutions, are the subject of this section.

Debugger Doesn’t Like Writeable Code Space

Some debuggers, emulators in particular, struggle with the idea of code space that can be written to by the target microprocessor. Most debugging tools treat code space as read-only, and some will generate error messages or simply disallow a writing operation when they detect one. In general, the debugger’s efforts to protect code space are well-intentioned. A program writing to its code space usually signals a serious programming error, except when a firmware update is in progress. The debugger can’t tell the difference, of course. The remedy is to implement a memory alias for the flash chip, in a different memory space that is not considered read-only by the debugger.

Consider a 512 KB flash chip that starts at address 0 in the target’s memory space. By utilizing a chip select line that responds to an address in the range of 0–1024 KB, you can access the first byte in the flash chip by writing to address 0x80000 instead of address 0, and simultaneously avoid any intrusion from the debugger or emulator.

The memory region between 512 KB and 1024 KB is sometimes called an alias of the region 0–512 KB, because the underlying physical hardware cannot distinguish between the two addresses and thus maps them to the same physical location in the flash chip. The debugger can distinguish between the two address ranges, however, and can therefore be configured to ignore (and thereby permit) write accesses in the alias region.

The typical implementation of a memory alias is straightforward: simply double the size of the chip select used to activate the device, and apply an offset to any address that refers to a location in the region designated as the “physical” address region, to move it to the “alias” region. The best place to apply this offset is usually in the flash chip driver itself, as shown in the hypothetical write_flash() function in Listing 5.3.

Example 5.3. Implementing Memory Aliasing in a write_flash() Function

#define PHYS_BASE_ADDRESS 0 // physical base address
#define ALIAS_BASE_ADDRESS 0x80000 // alias base address

void write_flash (unsigned long addr, int len, unsigned char* data)
{
   /* move the address from the "physical" region into its alias region */
   addr = (addr - PHYS_BASE_ADDRESS + ALIAS_BASE_ADDRESS);
   while (len) {
      ...
   }
}

Debugger Doesn’t Like Self-relocating Code

One variation on the microprogrammer approach is to build the microprogrammer’s functionality into the target system—a so-called integral programmer—instead of downloading it to RAM as the first step of the firmware update process. This strategy has its advantages, but it creates the need to copy the programmer’s code from flash to RAM before the flash chip is erased and reprogrammed. In other words, the code must self-relocate to RAM at some point during its operation, so that it can continue to run after the flash chip is erased. This approach also requires that the code involved be position-independent.

The code in Listing 5.4 illustrates how to copy the integral programmer’s code into RAM, and how to find the RAM version of the function programmer(). The symbols RAM_PROG_START, PROG_LEN and ROM_PROG_START mark the regions in RAM and ROM where the programmer’s code (which may be a subset of the total application) is located, and can often be computed automatically by the program’s linker. The complicated-looking casting in the entry point calculation forces the compiler to do byte-sized address calculations when computing the value of entrypoint.

Example 5.4. Copying Code into RAM

typedef int(*entrypoint_t)(void);
entrypoint_t relocate_programmer()
{
   entrypoint_t entrypoint;

   /* relocate the code */
   memcpy(RAM_PROG_START, ROM_PROG_START, PROG_LEN);

   /* find programmer() in ram: its location is the same
      offset from RAM_PROG_START as the rom version is
      from ROM_PROG_START */
   entrypoint = (entrypoint_t)((char*)programmer
      -(char*)ROM_PROG_START + (char*)RAM_PROG_START);

   return entrypoint;
}

When the caller invokes the function at the address returned by relocate_programmer(), control passes to the RAM copy of the microprogrammer code—and your debugger, if in use, stops showing any symbolic information related to the programmer() function. Why? Because programmer() is now running from an address that is different from the address it was originally located at, so the symbol information provided to the debugger by the linker is now meaningless.

One solution to this problem is to relink the application with programmer() at the RAM address, and then import this symbol information into the debugger. This would be a convenient fix, except that not all debuggers support incremental additions to their symbol table. Another option is to simply suffer until the debugging of programmer() is complete, at which point you don’t need to look at the code any more, in theory at least.

If the development environment is based on a hardware emulator rather than a self-hosted debugging agent, then you can completely avoid the hassles of code relocation by simply not relocating programmer() when an emulator is in use. When an emulator is present such relocation is, in fact, unnecessary: the opcodes associated with programmer() are actually located in the emulator’s memory, rather than flash, so there is no worry that these instructions will disappear when the flash chip is erased and reprogrammed. This may also be the case for a self-hosted debugger setup, if the code being debugged and the debugging agent itself are both running from RAM.

Listing 5.5 illustrates an enhanced relocate_programmer() that does not copy code to RAM when an emulator is in use. Instead of using a #if compilation block to selectively enable or disable code copying, the function checks for an emulator at runtime, and skips the code relocation steps if it finds one.

Example 5.5. A Smarter Code Relocation Strategy, Which Does Not Move Code Except When Necessary

typedef int(*entrypoint_t)(void);
entrypoint_t relocate_programmer()
{
   entrypoint_t entrypoint;

   /* test for an emulator, and only relocate code if necessary */
   if (memcmp(FLASH_START, FLASH_START + FLASH_SIZE, FLASH_SIZE))
      entrypoint = programmer;
   else {
      /* no emulator; copy programmer's memory section to ram */
      memcpy(RAM_PROG_START, ROM_PROG_START, PROG_LEN);
      entrypoint = (entrypoint_t)((char*)programmer
           -(char*)ROM_PROG_START + (char*)RAM_PROG_START);
   }
   return entrypoint;
}

The test for the presence of an emulator exploits the nature of the memory alias strategy discussed in the previous section. Without an emulator attached, the contents of memory in the region FLASH_START to (FLASH_START+FLASH_SIZE) must be identical to the memory in the region’s memory alias, in this case assumed to start at (FLASH_START+FLASH_SIZE), because the two address regions actually resolve to the same physical addresses in the flash chip.

With an emulator attached, however, a portion of flash memory is remapped to the emulator’s internal memory space, so differences in comparisons can and do occur. These differences cause the memcmp() call to return nonzero, disclosing the presence of the emulator to relocate_programmer(). To improve performance, the region used for comparison can be reduced to just a handful of bytes that are known to change during compilation (a text string containing the time and date of the compilation, for example), or a known blank location in the flash chip that is preinitialized in the emulator’s memory space to a value other than 0xff (the value of an erased flash memory cell).
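As a small hypothetical example of the reduced comparison, a build-date string placed in flash by the build system could be checked against its alias; the alias is assumed, as above, to sit FLASH_SIZE bytes above the physical address:

#include <string.h>

extern const char build_date[];   /* e.g., a __DATE__/__TIME__ string in flash */

/* Nonzero if an emulator has remapped this part of flash. */
int emulator_present(void)
{
   return memcmp(build_date, build_date + FLASH_SIZE,
                 strlen(build_date) + 1) != 0;
}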

Testing a blank flash memory cell also discloses the presence of an emulator when a memory alias of the flash chip is not available, which is useful in cases where the flash chip is so large that it occupies more than half the target processor’s total address space—thereby making a complete memory alias impossible.

Can’t Generate Position-independent Code

Not all embedded development tool chains can produce code that is relocatable to arbitrary locations at runtime. Such position-dependent code must run from the address it was placed at by the linker, or the program will crash.

When a position-dependent microprogrammer is all that is available, then it obviously must be downloaded to the location in RAM that the linker intended it to run from. This implies that the memory reserved to hold the microprogrammer must be allocated at a known location.

With an integral programmer approach, there are two options. The first option is to compile the programmer code as a stand-alone program, located at its destination address in RAM. This code image is then included in the application image (perhaps by translating its binary image into a constant character array), and copied to RAM at the start of the firmware update process. This is a lot like a microprogrammer implementation, with the microprogrammer “downloaded” from the target’s onboard memory instead of from a serial port.
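A minimal sketch of this first option follows. It assumes the programmer was built separately and linked to run at RAM_PROG_START, and that a bin-to-C utility produced the (hypothetical) programmer_image[] array and its length:

#include <string.h>

extern const unsigned char programmer_image[];   /* prebuilt, position-dependent code */
extern const unsigned long programmer_image_len;

typedef int (*entrypoint_t)(void);

entrypoint_t load_integral_programmer(void)
{
   /* copy the canned image to the address it was linked to run from */
   memcpy((void *)RAM_PROG_START, programmer_image, programmer_image_len);

   /* the image's entry point sits at the start of the RAM region */
   return (entrypoint_t)RAM_PROG_START;
}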

The second option is to handle the integral programmer code as initialized data, and use the runtime environment’s normal initialization procedures to copy the code into RAM. The GNU compiler supports this using its __attribute__ language extension, and several commercial compilers provide this capability as well. The only limitation of this strategy is that it requires enough RAM space to hold the integral programmer code plus the program’s other data.
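A sketch of the second option using the GNU extension might look like the following; it assumes the linker script gives the .ramprog section a run address in RAM and a load address in flash, and that the C startup code copies it out just as it copies initialized data:

/* the section name is an assumption and must match your linker script */
__attribute__((section(".ramprog")))
void programmer(void)
{
   /* the flash erase and reprogram sequence runs here, entirely from RAM */
}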

No Firmware at Boot Time

Even in the most carefully designed downloadable firmware feature, the possibility still exists that the target hardware could attempt a startup using an accidentally blank flash chip. The outcome to avoid in this case is the sudden activation of the application’s rotating machinery, which subsequently gobbles up an unprepared user.

The prescription for this case—which is best applied before a user is actually eaten—involves a careful study of the target processor’s reaction to the illegal instructions and/or data represented by the 0xff’s of an unprogrammed section of flash memory. Many embedded processors eventually halt processing, and tri-state their control signals to let them float to whatever value is dictated by external hardware. Without pull-up resistors or other precautions in place to force these uncontrolled signals to safe states, unpredictable and potentially lethal results are likely.

Persistent Watchdog Time-out

In systems that support downloadable firmware, an unavoidable application defect that forces a watchdog time-out and system reset can lock the microprogrammer out of the embedded system. The extreme case is an accidental while(1); statement in a program’s main()function: the loop-within-a-loop that results (the infinite program loop, wrapped by the infinite watchdog time-out and system reset loop) keeps the target from responding to the UPGRADE button, because the system restarts before the button is even checked.

Systems that support downloadable firmware must carefully examine all available status circuitry to determine the reason the system is starting up, and force a transition to UPGRADE mode in the event that an excessive number of watchdog or other resets are detected. Many embedded systems do not interrupt power to RAM when a watchdog timeout occurs, so it is safe to store a “magic number” and count of the number of resets there; the counter is incremented on each reset, and once a certain number is reached the system halts the application in an effort to avoid the code that forces the reset.
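One way to keep such a count, sketched below, is a magic number plus counter kept in a RAM section the startup code does not zero; the section name, magic value, and reset limit are all assumptions:

#define RESET_MAGIC  0x600DF00DUL
#define MAX_RESETS   3

struct reset_record {
   unsigned long magic;
   unsigned int  count;
};

/* assumed to live in a ".noinit" section that survives watchdog resets */
static struct reset_record resets __attribute__((section(".noinit")));

int should_force_upgrade_mode(void)
{
   if (resets.magic != RESET_MAGIC) {    /* cold boot: RAM contents are random */
      resets.magic = RESET_MAGIC;
      resets.count = 0;
   }
   return ++resets.count > MAX_RESETS;   /* too many restarts: halt the application */
}

Once the application has run normally for a while it should clear the counter, so that occasional, unrelated resets do not slowly accumulate into a lockout.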

Unexpected Power Interruption

If power is lost unexpectedly during flash reprogramming, then the target’s flash chip is left in a potentially inconsistent state when power is restored: maybe the programming operation finished and everything is fine, but probably not. The best case is when the system’s boot code is intact, but portions of application code are missing; the worst case is where the flash chip is completely blank.

In the first case, a checksum can be used to detect the problem, and the system can force a transition to UPGRADE mode whether the user is requesting one or not. The only solution to the second case is to avoid ever having it happen.
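For the first case, the power-up check might look like the sketch below; APP_START, APP_LEN, APP_CRC_ADDR, and enter_upgrade_mode() are placeholders, and the CRC routine is the rom_crc() shown later in this chapter:

#define APP_START    0x00002000UL   /* start of the application region (placeholder) */
#define APP_LEN      0x0000E000UL   /* length of the checked region (placeholder) */
#define APP_CRC_ADDR 0x00010000UL   /* where the build tool stored the CRC (placeholder) */

extern unsigned short rom_crc(char *rom, unsigned short length);  /* see ROM Tests */
extern void enter_upgrade_mode(void);                             /* hypothetical */

void check_application(void)
{
   unsigned short computed = rom_crc((char *)APP_START, (unsigned short)APP_LEN);
   unsigned short stored   = *(const unsigned short *)APP_CRC_ADDR;

   if (computed != stored)
      enter_upgrade_mode();   /* sit in the programmer and wait for a new download */
}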

One way to avoid producing a completely blank flash chip is to never erase the section of flash that contains the system’s boot and programmer firmware. By leaving this code intact at all times, a hopefully-adequate environment is maintained even if power is interrupted. This approach may not always be an option, however: the flash chip may be a single-sector part that can only do an all-or-nothing erase, or the “boot sector” of the flash chip may be larger than the boot code it contains, and the wasted space is needed for application code.

A careful strategy for erasing and reprogramming the boot region of a flash chip can minimize the risk of damage to a system from unexpected power interruption, and in some cases eliminate it entirely.

The strategy works as follows: When a request to reprogram the section of flash containing boot code is detected, the programmer first creates a copy of the position-independent code in that section to another section in flash, or assumes that a prepositioned clone of the system’s boot code exists. The boot code is then erased, and the target’s reset vector is quickly restored to point to the temporary copy of the boot code. Once the boot sector is programmed, the reset vector is rewritten to point to the new startup code.

One of the two keys to the success of this strategy hinges on careful selection of the addresses used for the temporary and permanent copies of the boot code. If the permanent address is the temporary address with a single set bit cleared (that is, lower by exactly a power of two), then rewriting the vector from the temporary value to the permanent one only changes bits from one to zero, which flash memory permits without a new erase of the sector. This makes the switch from the temporary reset vector to the permanent one an atomic operation: no matter when power is interrupted, the vector still points to either one location or the other.

Obviously, the other critical element of this strategy is to eliminate the risk of power interruption in the moment between when the boot sector is erased and when the temporary reset vector is written. The amount of energy required to complete this operation can be computed by looking at the power consumption and timing specifications in the datasheet for the flash chip and microprocessor, and is usually in the range where additional capacitance in the system’s power supply circuitry can bridge the gap if a power loss occurs. By checking that power has not been already lost at the start of the operation, and running without interruption until the sector is erased and the temporary reset vector is written, the remaining opportunity for damage is eliminated.

The limitation of this strategy is that it depends on a microprocessor startup process that reads a reset vector to get the value of the initial program counter. In processors that simply start running code from a fixed address after a reset, it may be possible to modify this strategy to accomplish the same thing with clever combinations of jmp and similar opcodes.

Hardware Alternatives

The microprogrammer-based firmware update strategy and its variations are all firmware-based, because they require that some code exist in the target system before that code can be reprogrammed. This creates a sort of chicken-and-egg problem: if you need code in the embedded system to put new code into the embedded system, how do you get the initial code there in the first place?

At least two existing hardware-based methods can be used to jump-start this process: BDM and JTAG.

A processor with a BDM port provides what is in essence a serial port tied directly to the guts of the microprocessor itself. By sending the right commands and data through this port, you can push a copy of your microprogrammer into the target’s RAM space and hand over control to it. The BDM port can also be used to stimulate the I/O lines of the flash chip, thus programming it directly. Many BDM-based development systems include scripts and programs that implement this functionality, but it can also be implemented by hand with a few chips, a PC’s printer port, and a careful study of the processor’s datasheets.

JTAG is a fundamentally different technology designed to facilitate reading and writing of a chip’s I/O lines, usually while the chip’s microprocessor (if it has one) is held in reset. Like BDM, however, this capability can be used to stimulate a RAM or flash chip to push a microprogrammer application into it. And also like BDM, a JTAG interface can be built with just a few components and some persistent detective work in the target processor’s manual.

A JTAG bus transceiver chip, versions of which are available from several vendors, can be added to systems that lack JTAG support.

Separating Code and Data

The ultimate goal of any downloadable firmware implementation effort is exactly that: a working downloadable firmware feature. Once this capability is safely in place, however, it is important to consider some other capabilities that the system can suddenly offer.

By separating an application’s code and data into separate flash sectors, the possibility exists to update the two independently. This is useful if an application uses custom data like tuned parameters or accumulated measurements that cannot be easily replaced if lost. Such data tables must contain version information, however, so that later versions of an application can read old table formats—and so that old applications are not confused by new table formats. The code in Listing 5.6 demonstrates one way to define a data table containing version information, and how to select from one of several data table formats at runtime.

Flexible and Safe

A microprogrammer-based downloadable firmware feature, when properly implemented, can safely add considerable flexibility to an embedded system that uses flash memory. The techniques described here will help you avoid common mistakes and reap the uncommon benefits that microprogrammers and flash memory can offer.

Example 5.6. Supporting Multiple Data Table Formats

/* number of table entries; the value here is only illustrative */
#define N 16

/* the original table format,
   a.k.a. version 1 */
typedef struct {
  int x;
  int y;
} S_data_ver_1;

/* version 2 of the table format,
   which adds a 'z' field */
typedef struct {
  int x;
  int y;
  int z;
} S_data_ver_2;

/* the data, which always starts
   with a version identifier */
typedef struct{
  const int ver;
  union {
    S_data_ver_1 olddata[N];
    S_data_ver_2 newdata[N];
  };
} data_table;

void foo ( data_table* dt )
{
  int x, y, z, wdata;
  S_data_ver_1* dv1;
  S_data_ver_2* dv2;
  switch(dt->ver) {
  case 1:
    for( wdata = 0,
         dv1 = dt->olddata;
         wdata < N; wdata++, dv1++ ) {
      x = dv1->x;
      y = dv1->y;
      /* old data format did not include 'z',
         impose a default value */
      z = 0;
    }
    break;
  case 2:
    for( wdata = 0,
         dv2 = dt->newdata;
         wdata < N; wdata++, dv2++ ) {
      x = dv2->x;
      y = dv2->y;
      z = dv2->z;
    }
    break;
  default:
    /* unsupported format,
       select reasonable defaults */
    x = y = z = 0;
  }
}

Memory Diagnostics

In “A Day in the Life” John Lennon wrote, “He blew his mind out in a car; he didn’t notice that the lights had changed.” As a technologist this always struck me as a profound statement about the complexity of modern life. Survival in the big city simply doesn’t permit even a very human bit of daydreaming. Twentieth-century life means keeping a level of awareness and even paranoia that our ancestors would have found inconceivable.

Since this song’s release in 1967, survival has become predicated on much more than the threat of a couple of tons of steel hurtling though a red light. Software has been implicated in many deaths, for example, plane crashes, radiation overexposures, and pacemaker misfires. Perhaps a single bit, something so ethereal that it is nothing more than the charge held in an impossibly small well, is incorrect—that’s all it takes to crash a system. Today’s version of the Beatles song might include the refrain “He didn’t notice that the bit had flipped.”

Beyond software errors lurks the specter of a hardware failure that causes our correct code to die. Many of us write diagnostic code to help contain the problem.

ROM Tests

It doesn’t take much to make at least the kernel of an embedded system run. With a working CPU chip, memories that do their thing, perhaps a dash of decoder logic, you can count on the code starting off . . . perhaps not crashing until running into a problem with I/O.

Though the kernel may be relatively simple, with the exception of the system’s power supply it’s by far the portion of an embedded system least tolerant of any sort of failure. The tiniest glitch, a single bit failure in a huge memory array, or any problem with the processor pretty much guarantees that nothing in the system stands a chance of running.

Nonkernel failures may not be so devastating. Some I/O troubles will cause just part of the system to degrade, leaving much of the rest up. My car’s black box seems to have forgotten how to run the cruise control, yet it still keeps the fuel injection and other systems running.

In the minicomputer era, most booted with a CPU test that checked each instruction. That level of paranoia is no longer appropriate, as a highly integrated CPU will generally fail disastrously. If the processor can execute any sort of a self test, it’s pretty much guaranteed to be intact.

Dead decoder logic is just as catastrophic. No code will execute if the ROMs can’t be selected.

If your boot ROM is totally misprogrammed or otherwise nonfunctional, then there’s no way a ROM test will do anything other than crash. The value of a ROM test is limited to dealing with partially programmed devices (due, perhaps, to incomplete erasure, or inadvertently removing the device before completion of programming).

There’s a small chance that ROM tests will pick up an addressing problem, if you’re lucky enough to have a failure that leaves the boot and ROM test working. The odds are against it, and somehow Mother Nature tends to be very perverse.

Some developers feel that a ROM checksum makes sense to insure the correct device is inserted. This works best only if the checksum is stored outside of the ROM under test. Otherwise, inserting a device with the wrong code version will not show an error, as presumably the code will match the (also obsolete) checksum.

In multiple-ROM systems a checksum test can indeed detect misprogrammed devices, assuming the test code lives in the boot ROM. If this one device functions, and you write the code so that it runs without relying on any other ROM, then the test will pick up many errors.

Checksums, though, are passé. It’s pretty easy for a couple of errors to cancel each other out. Compute a CRC (Cyclic Redundancy Check), a polynomial with terms fed back at various stages. CRCs are notoriously misunderstood but are really quite easy to implement. The best reference I have seen to date is “A Painless Guide to CRC Error Detection Algorithms,” by Ross Williams. It’s available via anonymous FTP from ftp.adelaide.edu.au/pub/rocksoft/crc_v3.txt.

The following code computes the 16 bit CRC of a ROM area (pointed to by rom, of size length) using the CRC-CCITT polynomial x^16 + x^12 + x^5 + 1:

#define CRC_P 0x8408

typedef unsigned short WORD;   /* 16-bit unsigned; adjust if WORD is already defined */

WORD rom_crc(char *rom, WORD length)
{
   unsigned char i;
   unsigned int value;
   unsigned int crc = 0xffff;

   do
   {
      for (i = 0, value = (unsigned int)0xff & *rom++;
           i < 8;
           i++, value >>= 1)
      {
         if ((crc & 0x0001) ^ (value & 0x0001))
            crc = (crc >> 1) ^ CRC_P;
         else
            crc >>= 1;
      }
   } while (--length);

   /* invert and byte-swap the accumulated CRC */
   crc = ~crc;
   value = crc;
   crc = (crc << 8) | ((value >> 8) & 0xff);

   return (crc);
}

It’s not a bad idea to add death traps to your ROM. On a Z80 0xff is a call to location 38. Conveniently, unprogrammed areas of ROMs are usually just this value. Tell your linker to set all unused areas to 0xff; then, if an address problem shows up, the system will generate lots of spurious calls. Sure, it’ll trash the stack, but since the system is seriously dead anyway, who cares? Technicians can see the characteristic double write from the call, and can infer pretty quickly that the ROM is not working.

Other CPUs have similar instructions. Browse the op code list with a creative mind.

RAM Tests

Developers often adhere to beliefs about the right way to test RAM that are as polarized as disparate feelings about politics and religion. I’m no exception, and happily have this forum for blasting my own thoughts far and wide . . . so will I shamelessly do so.

Obviously, a RAM problem will destroy most embedded systems. Errors reading from the stack will surely crash the code. Problems, especially intermittent ones, in the data areas may manifest bugs in subtle ways. Often you’d rather have a system that just doesn’t boot, rather than one that occasionally returns incorrect answers.

Some embedded systems are pretty tolerant of memory problems. We hear of NASA spacecraft from time to time whose core or RAM develops a few bad bits, yet somehow the engineers patch their code to operate around the faulty areas, uploading the corrections over the distances of billions of miles.

Most of us work on systems with far less human intervention. There are no teams of highly trained personnel anxiously monitoring the health of each part of our products. It’s our responsibility to build a system that works properly when the hardware is functional.

In some applications, though, a certain amount of self-diagnosis either makes sense or is required; critical life support applications should use every diagnostic concept possible to avoid disaster due to a submicron RAM imperfection.

So, my first belief about diagnostics in general, and RAM tests in particular, is to define your goals clearly. Why run the test? What will the result be? Who will be the unlucky recipient of the bad news in the event an error is found, and what do you expect that person to do?

Will a RAM problem kill someone? If so, a very comprehensive test, run regularly, is mandatory.

Is such a failure merely a nuisance? For instance, if it keeps a cell phone from booting, if there’s nothing the customer can do about the failure anyway, then perhaps there’s no reason for doing a test. As a consumer I could care less why the damn phone stopped working, if it’s dead I’ll take it in for repair or replacement.

Is production test—or even engineering test—the real motivation for writing diagnostic code? If so, then define exactly what problems you’re looking for and write code that will find those sorts of troubles.

Next, inject a dose of reality into your evaluation. Remember that today’s hardware is often very highly integrated. In the case of a microcontroller with onboard RAM the chances of a memory failure that doesn’t also kill the CPU are small. Again, if the system is a critical life support application it may indeed make sense to run a test, as even a minuscule probability of a fault may spell disaster.

Does it make sense to ignore RAM failures? If your CPU has an illegal instruction trap, there’s a pretty good chance that memory problems will cause a code crash you can capture and process. If the chip includes protection mechanisms (like the x86 protected mode), count on bad stack reads immediately causing protection faults your handlers can process. Perhaps RAM tests are simply not required given these extra resources.

Too many of us use the simplest of tests—writing alternating 0x55 and 0xAA values to the entire memory array, and then reading the data to ensure it remains accessible. It’s a seductively easy approach that will find an occasional problem (like, someone forgot to load all of the RAM chips), but that detects few real world errors.

Remember that RAM is an array divided into columns and rows. Accesses require proper chip selects and addresses sent to the array—and not a lot more. The 0x55/0xAA symmetrical pattern repeats massively all over the array; accessing problems (often more common than defective bits in the chips themselves) will create references to incorrect locations, yet almost certainly will return what appears to be correct data.

Consider the physical implementation of memory in your embedded system. The processor drives address and data lines to RAM—in a 16 bit system there will surely be at least 32 of these. Any short or open on this huge bus will create bad RAM accesses. Problems with the PC board are far more common than internal chip defects, yet the 0x55/0xAA test is singularly poor at picking up these, the most likely failures.

Yet, the simplicity of this test and its very rapid execution have made it an old standby used much too often. Isn’t there an equally simple approach that will pick up more problems? If your goal is to detect the most common faults (PCB wiring errors and chip failures more substantial than a few bad bits here or there), then indeed there is. Create a short string of almost random bytes that you repeatedly send to the array until all of memory is written. Then, read the array and compare against the original string.

I use the phrase “almost random” facetiously, but in fact it little matters what the string is, as long as it contains a variety of values. It’s best to include the pathological cases, like 0x00, 0xaa, 0x55, and 0xff. The string is something you pick when writing the code, so it is truly not random, but other than these four specific values you can fill the rest of it with nearly any set of values, since we’re just checking basic write/read functions (remember: memory tends to fail in fairly dramatic ways). I like to use very orthogonal values—those with lots of bits changing between successive string members—to create big noise spikes on the data lines.

To make sure this test picks up addressing problems, ensure the string’s length is not a factor of the length of the memory array. In other words, you don’t want the string to be aligned on the same low-order addresses, which might cause an address error to go undetected. Since the string is much shorter than the length of the RAM array, you ensure it repeats at a rate that is not related to the row/column configuration of the chips.

For 64 k of RAM, a string 257 bytes long is perfect. 257 is prime, and its square is greater than the size of the RAM array. Each instance of the string will start on a different low order address. 257 has another special magic: you can include every byte value (00 to 0xff) in the string without effort. Instead of manually creating a string in your code, build it in real time by incrementing a counter that overflows at 8 bits.

Critical to this, and every other RAM test algorithm, is that you write the pattern to all of RAM before doing the read test. Some people like to do nondestructive RAM tests by testing one location at a time, then restoring that location’s value, before moving onto the next one. Do this and you’ll be unable to detect even the most trivial addressing problem.
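A minimal sketch of the test just described appears below. It writes the repeating 257-byte pattern across the entire array before reading anything back; the test is destructive, so run it only on memory the runtime is not yet using. The function name and parameters are illustrative:

#define PATTERN_LEN 257   /* prime, and not a factor of the array length */

int ram_pattern_test(volatile unsigned char *ram, unsigned long size)
{
   unsigned long i;

   /* first pass: fill all of RAM with the repeating pattern */
   for (i = 0; i < size; i++)
      ram[i] = (unsigned char)(i % PATTERN_LEN);   /* 8-bit counter with period 257 */

   /* second pass: read everything back and compare */
   for (i = 0; i < size; i++)
      if (ram[i] != (unsigned char)(i % PATTERN_LEN))
         return 0;   /* failure: bad cell or addressing fault */

   return 1;          /* pass */
}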

This algorithm writes and reads every RAM location once, so is quite fast. Improve the speed even more by skipping bytes, perhaps writing and reading every third or fifth entry. The test will be a bit less robust yet will still find most PCB and many RAM failures.

Some folks like to run a test that exercises each and every bit in their RAM array. Though I remain skeptical of the need since most semiconductor RAM problems are rather catastrophic, if you do feel compelled to run such a test, consider adding another iteration of the algorithm just described, with all of the data bits inverted.

Sometimes, though, you’ll want a more thorough test, something that looks for difficult hardware problems at the expense of speed.

When I speak to groups I’ll often ask “What makes you think the hardware really works?” The response is usually a shrug of the shoulders, or an off-the-cuff remark about everything seeming to function properly, more or less, most of the time.

These qualitative responses are simply not adequate for today’s complex systems. All too often, a prototype that seems perfect harbors hidden design faults that may only surface after you’ve built a thousand production units. Recalling products due to design bugs is unfair to the customer and possibly a disaster to your company.

Assume the design is absolutely ridden with problems. Use reasonable methodologies to find the bugs before building the first prototype, but then use that first unit as a test bed to find the rest of the latent troubles.

Large arrays of RAM memory are a constant source of reliability problems. It’s indeed quite difficult to design the perfect RAM system, especially with the minimal margins and high speeds of today’s 16 and 32 bit systems. If your system uses more than a couple of RAM parts, count on spending some time qualifying its reliability via the normal hardware diagnostic procedures. Create software RAM tests that hammer the array mercilessly.

Probably one of the most common forms of reliability problems with RAM arrays is pattern sensitivity. Now, this is not the famous pattern problems of yore, where the chips (particularly DRAMs) were sensitive to the groupings of ones and zeroes. Today the chips are just about perfect in this regard. No, today pattern problems come from poor electrical characteristics of the PC board, decoupling problems, electrical noise, and inadequate drive electronics.

PC boards were once nothing more than wiring platforms, slabs of tracks that propagated signals with near perfect fidelity. With very high speed signals, and edge rates (the time it takes a signal to go from a zero to a one or back) under a nanosecond, the PCB itself assumes all of the characteristics of an electronic component—one whose virtues are almost all problematic. It’s a big subject (refer to High Speed Digital Design—a Handbook of Black Magic by Howard Johnson and Martin Graham [1993 PTR Prentice Hall, NJ] for the canonical words of wisdom on this subject), but suffice to say a poorly designed PCB will create RAM reliability problems.

Equally important are the decoupling capacitors chosen, as well as their placement. Inadequate decoupling will create reliability problems as well.

Modern DRAM arrays are massively capacitive. Each address line might drive dozens of chips, with 5 to 10 pf of loading per chip. At high speeds the drive electronics must somehow drag all of these pseudo-capacitors up and down with little signal degradation. Not an easy job! Again, poorly designed drivers will make your system unreliable.

Electrical noise is another reliability culprit, sometimes in unexpected ways. For instance, CPUs with multiplexed address/data buses use external address latches to demux the bus. A signal, usually named ALE (Address Latch Enable) or AS (Address Strobe) drives the clock to these latches. The tiniest, most miserable amount of noise on ALE/AS will surely, at the time of maximum inconvenience, latch the data part of the cycle instead of the address. Other signals are also vulnerable to small noise spikes.

Many run-of-the-mill RAM tests, run for several hours as you cycle the product through its design environment (temperature and so forth), will show intermittent RAM problems. These are symptoms of the design faults I’ve described, and always show a need for more work on the product’s engineering.

Unhappily, all too often the RAM tests show no problem when hidden demons are indeed lurking. The algorithm I’ve described, as well as most of the others commonly used, trade-off speed versus comprehensiveness. They don’t pound on the hardware in a way designed to find noise and timing problems.

Digital systems are most susceptible to noise when large numbers of bits change all at once. This fact was exploited for data communications long ago with the invention of the Gray Code, a variant of binary counting, where no more than one bit changes between codes. Your worst nightmares of RAM reliability occur when all of the address and/or data bits change suddenly from zeroes to ones.

For the sake of engineering testing, write RAM test code that exploits this known vulnerability. Write 0xffff to 0x0000 and then to 0xffff, and do a read-back test. Then write zeroes. Repeat as fast as your loop will let you go.

Depending on your CPU, the worst locations might be at 0x00ff and 0x0100, especially on 8 bit processors that multiplex just the lower 8 address lines. Hit these combinations, hard, as well.

Other addresses often exhibit similar pathological behavior. Try 0x5555 and 0xaaaa, which also have complementary bit patterns.

The trick is to write these patterns back-to-back. Don’t simply test all of RAM on the assumption that 0x0000 and 0xffff will eventually show up somewhere in the test. You’ll stress the system most effectively by driving the bus massively up and down all at once.

Don’t even think about writing this sort of code in C. Any high level language will inject too many instructions between those that move the bits up and down. Even in assembly the processor will have to do fetch cycles from wherever the code happens to be, which will slow down the pounding and make it a bit less effective.

There are some tricks, though. On a CPU with a prefetcher (all x86, 68 k, and so on) try to fill the execution pipeline with code, so the processor does back-to-back writes or reads at the addresses you’re trying to hit. And, use memory-to-memory transfers when possible. For example:

mov si, 0xaaaa          ; source address with alternating bits
mov di, 0x5555          ; destination address, the complement
mov byte ptr [si], 0xff ; write the test pattern to the source
movsb                   ; memory-to-memory copy: [ds:si] -> [es:di]

Nonvolatile Memory

Many of the embedded systems that run our lives try to remember a little bit about us, or about their application domain, despite cycling power, brownouts, and all of the other perils of fixed and mobile operation. In the bad old days before microprocessors we had core memory, a magnetic medium that preserved its data when powered or otherwise.

Today we face a wide range of choices. Sometimes Flash or EEPROM is the natural choice for nonvolatile applications. Always remember, though, that these devices have limited numbers of write cycles. Worse, in some cases writes can be very slow.

Battery-backed up RAMs still account for a large percentage of nonvolatile systems. With robust hardware and software support they’ll satisfy the most demanding of reliability fanatics; a little less design care is sure to result in occasional lost data.

Supervisory Circuits

In the early embedded days we were mostly blissfully unaware of the perils of losing power. Virtually all reset circuits were nothing more than a resistor/capacitor time constant. As Vcc ramped from 0 to 5 volts, the time constant held the CPU’s reset input low—or lowish—long enough for the system’s power supply to stabilize at 5 volts.

Though an elegantly simple design, RC time constants were flawed on the back end, when power goes away. Turn the wall switch off, and the 5 volt supply quickly decays to zero. Quickly only in human terms, of course, as many milliseconds went by while the CPU was powered by something between 0 and 5. The RC circuit is, of course, at this point at a logic one (not-reset), so it allows the processor to run.

And run they do! With Vcc down to 3 or 4 volts most processors execute instructions like mad. Just not the ones you’d like to see. Run a CPU with out-of-spec power and expect random operation. There’s a good chance the machine is going wild, maybe pushing and calling and writing and generally destroying the contents of your battery backed up RAM.

Worse, brown-outs, the plague of summer air conditioning, often cause small dips in voltage. If the AC mains decline to 80 volts for a few seconds a power supply might still crank out a few volts. When AC returns to full rated values the CPU is still running, back at 5 volts, but now horribly confused. The RC circuit never notices the dip from 5 to 3 or so volts, so the poor CPU continues running in its mentally unbalanced state. Again, your RAM is at risk.

Motorola, Maxim, and others developed many ICs designed specifically to combat these problems. Though features and specs vary, these supervisory circuits typically manage the processor’s reset line, battery power to the RAM, and the RAM’s chip selects.

Given that no processor will run reliably outside of its rated Vcc range, the first function of these chips is to assert reset whenever Vcc falls below about 4.7 volts (on 5 volt logic). Unlike an RC circuit that limply drools down as power fails, supervisory devices provide a snappy switch between a logic zero and one, bringing the processor to a sure, safe stopped condition.

They also manage the RAM’s power, a tricky problem since it’s provided from the system’s Vcc when power is available, and from a small battery during quiescent periods. The switchover is instantaneous to keep data intact.

With RAM safely provided with backup power and the CPU driven into a reset state, a decent supervisory IC will also disable all chip selects to the RAM. The reason? At some point after Vcc collapses you can’t even be sure the processor, and your decoding logic, will not create rogue RAM chip selects. Supervisory ICs are analog beasts, conceived outside of the domain of discrete ones and zeroes, and will maintain safe reset and chip select outputs even when Vcc is gone.

But check the specs on the IC. Some disable chip selects at exactly the same time they assert reset, asynchronously to what the processor is actually doing. If the processor initiates a write to RAM, and a nanosecond later the supervisory chip asserts reset and disables chip select, that write cycle will be one nanosecond long. You cannot play with write timing and expect predictable results. Allow any write in progress to complete before doing something as catastrophic as a reset.

Some of these chips also assert an NMI output when power starts going down. Use this to invoke your “oh_my_god_we’re_dying” routine.

Since processors usually offer but a single NMI input, when using a supervisory circuit never have any other NMI source. You’ll need to combine the two signals somehow; doing so with logic is a disaster, since the gates will surely go brain dead due to Vcc starvation.

Check the specifications on the parts, though, to ensure that NMI occurs before the reset clamp fires. Give the processor a handful of microseconds to respond to the interrupt before it enters the idle state.

There’s a subtle reason why it makes sense to have an NMI power-loss handler: you want to get the CPU away from RAM. Stop it from doing RAM writes before reset occurs. If reset happens in the middle of a write cycle, there’s no telling what will happen to your carefully protected RAM array. Hitting NMI first causes the CPU to take an interrupt exception, first finishing the current write cycle if any. This also, of course, eliminates troubles caused by chip selects that disappear synchronously to reset.

Every battery-backed up system should use a decent supervisory circuit; you just cannot expect reliable data retention otherwise. Yet, these parts are no panacea. The firmware itself is almost certainly doing things destined to defeat any bit of external logic.

Multibyte Writes

There’s another subtle failure mode that afflicts all too many battery-backed up systems. In a kinder, gentler world than the one we inhabit, all memory transactions would require exactly one machine cycle; but here on Earth, 8 and 16 bit machines constantly manipulate large data items. Floating point variables are typically 32 bits, so any store operation requires two or four distinct memory writes. Ditto for long integers.

The use of high-level languages accentuates the size of memory stores. Setting a character array, or defining a big structure, means that the simple act of assignment might require tens or hundreds of writes.

Consider the simple statement:

a=0x12345678;

An x86 compiler will typically generate code like:

mov [bx], 0x5678
mov [bx+2], 0x1234

which is perfectly reasonable and seemingly robust.

In a system with a heavy interrupt burden it’s likely that sooner or later an interrupt will switch CPU contexts between the two instructions, leaving the variable “a” half-changed, in what is possibly an illegal state. This serious problem is easily defeated by avoiding global variables—as long as “a” is a local, no other task will ever try to use it in the half-changed state.

Power-down concerns twist the problem in a more intractable manner. As Vcc dies off a seemingly well-designed system will generate NMI while the processor can still think clearly. If that interrupt occurs during one of these multibyte writes—as it eventually surely will, given the perversity of nature—your device will enter the power-shutdown code with data now corrupt. It’s quite likely (especially if the data is transferred via CPU registers to RAM) that there’s no reasonable way to reconstruct the lost data.

The simple expedient of eliminating global variables has no benefit to the power-down scenario.

Can you imagine the difficulty of finding a problem of this nature? One that occurs maybe once every several thousand power cycles, or less? In many systems it may be entirely reasonable to conclude that the frequency of failure is so low the problem might be safely ignored. This assumes you’re not working on a safety-critical device, or one with mandated minimal MTBF numbers.

Before succumbing to the temptation to let things slide, though, consider implications of such a failure. Surely once in a while a critical data item will go bonkers. Does this mean your instrument might then exhibit an accuracy problem (for example, when the numbers are calibration coefficients)? Is there any chance things might go to an unsafe state? Does the loss of a critical communication parameter mean the device is dead until the user takes some presumably drastic action?

If the only downside is that the user’s TV set occasionally—and rarely—forgets the last channel selected, perhaps there’s no reason to worry much about losing multibyte data. Other systems are not so forgiving.

One suggestion is to implement a data integrity check on power-up, to ensure that no partial writes have left big structures partially changed. I see two different directions this approach might take.

The first is a simple power-up check of RAM to make sure all data is intact. Every time a truly critical bit of data changes, update the CRC, so the boot-up check can see if data is intact. If not, at least let the user know that the unit is sick, data was lost, and some action might be required.

A second, and more robust, approach is to complete every data item write with a checksum or CRC of just that variable. Power-up checks of each item’s CRC then reveals which variable was destroyed. Recovery software might, depending on the application, be able to fix the data, or at least force it to a reasonable value while warning the user that, while all is not well, the system has indeed made a recovery.

Though CRCs are an intriguing and seductive solution I’m not so sanguine about their usefulness. Philosophically it is important to warn the user rather than to crash or use bad data. But it’s much better to never crash at all.

We can learn from the OOP community and change the way we write data to RAM (or, at least the critical items for which battery back-up is so important).

First, hide critical data items behind drivers. The best part of the OOP triptych mantra “encapsulation, inheritance, polymorphism” is “encapsulation.” Bind the data items with the code that uses them. Avoid globals; change data by invoking a routine, a method that does the actual work. Debugging the code becomes much easier, and reentrancy problems diminish.

Second, add a “flush_writes” routine to every device driver that handles a critical variable. “Flush_writes” finishes any interrupted write transaction. Flush_writes relies on the fact that only one routine—the driver—ever sets the variable.

Next, enhance the NMI power-down code to invoke all of the flush_write routines. Part of the power-down sequence then finishes all pending transactions, so the system’s state will be intact when power comes back.
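Here is a minimal sketch of the idea for a single, purely illustrative critical variable. The driver stages the new value, flags the transaction, performs the real multibyte store, and a matching flush routine lets the power-down code finish anything that was interrupted:

static long total_flow;                 /* the protected item; never touched directly */
static long staged_value;               /* a complete copy of the value being written */
static volatile int write_in_progress;

void set_total_flow(long new_value)
{
   staged_value = new_value;            /* stage the whole item first */
   write_in_progress = 1;
   total_flow = staged_value;           /* the actual multibyte store */
   write_in_progress = 0;
}

void flush_writes_total_flow(void)      /* called from the NMI power-down code */
{
   if (write_in_progress) {
      total_flow = staged_value;        /* finish the interrupted transaction */
      write_in_progress = 0;
   }
}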

The downside to this approach is that you’ll need a reasonable amount of time between detecting that power is going away, and when Vcc is no longer stable enough to support reliable processor operation. Depending on the number of variables that need flushing, this might mean hundreds of microseconds.

Firmware people are often treated as the scum of the earth, as they inevitably get the hardware (late) and are still required to get the product to market on time. Worse, too many hardware groups don’t listen to, or even solicit, requirements from the coding folks before cranking out PCBs. This, though, is a case where the firmware requirements clearly drive the hardware design. If the two groups don’t speak, problems will result.

Some supervisory chips do provide advanced warning of imminent power-down. Maxim’s (www.maxim-ic.com) MAX691, for example, detects Vcc failing below some value before shutting down RAM chip selects and slamming the system into a reset state. It also includes a separate voltage threshold detector designed to drive the CPU’s NMI input when Vcc falls below some value you select (typically by selecting resistors). It’s important to set this threshold above the point where the part goes into reset. Just as critical is understanding how power fails in your system. The capacitors, inductors, and other power supply components determine how much “alive” time your NMI routine will have before reset occurs. Make sure it’s enough.

I mentioned the problem of power failure corrupting variables to Scott Rosenthal, one of the smartest embedded guys I know. His casual “yeah, sure, I see that all the time” got me interested. It seems that one of his projects, an FDA-approved medical device, uses hundreds of calibration variables stored in RAM. Losing any one means the instrument has to go back for readjustment. Power problems are just not acceptable.

His solution is a hybrid between the two approaches just described. The firmware maintains two separate RAM areas, with critical variables duplicated in each. Each variable has its own driver.

When it’s time to change a variable, the driver sets a bit that indicates “change in process.” It’s updated, and a CRC is computed for that data item and stored with the item. The driver unasserts the bit, and then performs the exact same function on the variable stored in the duplicate RAM area.

On power-up the code checks to insure that the CRCs are intact. If not, that indicates the variable was in the process of being changed, and is not correct, so data from the mirrored address is used. If both CRCs are OK, but the “being changed” bit is asserted, then the data protected by that bit is invalid, and correct information is extracted from the mirror site.
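A sketch of that scheme for one variable appears below. The struct layout, the crc16() helper, and the names are all illustrative; in the real instrument every critical variable has its own driver and both copies sit in battery-backed RAM:

/* assumed 16-bit CRC over an arbitrary buffer; any decent CRC routine will do */
extern unsigned short crc16(const unsigned char *buf, unsigned int len);

struct protected_var {
   volatile unsigned char changing;     /* the "change in process" flag */
   long value;
   unsigned short crc;                  /* CRC of 'value' alone */
};

static struct protected_var copy_a, copy_b;   /* primary and mirror areas */

static void write_copy(struct protected_var *p, long v)
{
   p->changing = 1;
   p->value = v;
   p->crc = crc16((const unsigned char *)&p->value, sizeof(p->value));
   p->changing = 0;
}

void set_variable(long v)
{
   write_copy(&copy_a, v);              /* finish one copy completely... */
   write_copy(&copy_b, v);              /* ...before starting the other */
}

long recover_variable(void)             /* the power-up check */
{
   int a_ok = !copy_a.changing &&
              copy_a.crc == crc16((const unsigned char *)&copy_a.value,
                                  sizeof(copy_a.value));
   return a_ok ? copy_a.value : copy_b.value;   /* fall back to the mirror */
}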

The result? With thousands of instruments in the field, over many years, not one has ever lost RAM.

Testing

Good hardware and firmware design leads to reliable systems. You won’t know for sure, though, if your device really meets design goals without an extensive test program. Modern embedded systems are just too complex, with too much hard-to-model hardware/firmware interaction, to expect reliability without realistic testing.

This means you’ve got to pound on the product, and look for every possible failure mode. If you’ve written code to preserve variables around brown-outs and loss of Vcc, and don’t conduct a meaningful test of that code, you’ll probably ship a subtly broken product.

In the past I’ve hired teenagers to mindlessly and endlessly flip the power switch on and off, logging the number of cycles and the number of times the system properly comes to life. Though I do believe in bringing youngsters into the engineering labs to expose them to the cool parts of our profession, sentencing them to mindless work is a sure way to convince them to become lawyers rather than techies.

Better, automate the tests. The Poc-It, from Microtools (www.microtoolsinc.com/products.htm) is an indispensable $250 device for testing power-fail circuits and code. It’s also a pretty fine way to find uninitialized variables, as well as isolating those awfully hard to initialize hardware devices like some FPGAs.

The Poc-It brainlessly turns your system on and off, counting the number of cycles. Another counter logs the number of times a logic signal asserts after power comes on. So, add a bit of test code to your firmware to drive a bit up when (and if) the system properly comes to life. Set the Poc-It up to run for a day or a month; come back and see if the number of power cycles is exactly equal to the number of successful assertions of the logic bit. Anything other than equality means something is dreadfully wrong.

Conclusion

When embedded processing was relatively rare, the occasional weird failure meant little. Hit the reset button and start over. That’s less of a viable option now. We’re surrounded by hundreds of CPUs, each doing its thing, each affecting our lives in different ways. Reliability will probably be the watchword of the next decade as our customers refuse to put up with the quirks that are all too common now.

The current drive is to add the maximum number of features possible to each product. I see cell phones that include games. Features are swell . . . if they work, if the product always fulfills its intended use. Cheat the customer out of reliability and your company is going to lose. Power cycling is something every product does, and is too important to ignore.

Building a Great Watchdog

Launched in January 1994, the Clementine spacecraft spent two very successful months mapping the moon before leaving lunar orbit to head toward near-Earth asteroid Geographos.

A dual-processor Honeywell 1750 system handled telemetry and various spacecraft functions. Though the 1750 could control Clementine’s thrusters, it did so only in emergency situations; all routine thruster operations were under ground control.

On May 7 the 1750 experienced a floating point exception. This wasn’t unusual; some 3000 prior exceptions had been detected and handled properly. But immediately after the May 7 event downlinked data started varying wildly and nonsensically. Then the data froze. Controllers spent 20 minutes trying to bring the system back to life by sending software resets to the 1750; all were ignored. A hardware reset command finally brought Clementine back online.

Alive, yes, even communicating with the ground, but with virtually no fuel left.

The evidence suggests that the 1750 locked up, probably due to a software crash. While hung the processor turned on one or more thrusters, dumping fuel and setting the spacecraft spinning at 80 RPM. In other words, it appears the code ran wild, firing thrusters it should never have enabled; they kept firing till the tanks ran nearly dry and the hardware reset closed the valves. The mission to Geographos had to be abandoned.

Designers had worried about this sort of problem and implemented a software thruster time-out. That, of course, failed when the firmware hung.

The 1750’s built-in watchdog timer hardware was not used, over the objections of the lead software designer. With no automatic “reset” button, success of the mission rested in the abilities of the controllers on Earth to detect problems quickly and send a hardware reset. For the lack of a few lines of watchdog code the mission was lost.

Though such a fuel dump had never occurred on Clementine before, roughly 16 times before the May 7 event hardware resets from the ground had been required to bring the spacecraft’s firmware back to life. One might also wonder why some 3000 previous floating point exceptions were part of the mission’s normal firmware profile.

Not surprisingly, the software team wished they had indeed used the watchdog, and had not implemented the thruster time-out in firmware. They also noted, though, that a normal, simple, watchdog may not have been robust enough to catch the failure mode.

Contrast this with Pathfinder, a mission whose software also famously hung, but which was saved by a reliable watchdog. The software team found and fixed the bug, uploading new code to a target system 40 million miles away, enabling an amazing roving scientific mission on Mars.

Watchdog timers (WDTs) are our fail-safe, our last line of defense, an option taken only when all else fails—right? These missions (Clementine had been reset 16 times prior to the failure) and so many others suggest to me that WDTs are not emergency outs, but integral parts of our systems. The WDT is as important as main() or the runtime library; it’s an asset that is likely to be used, and maybe used a lot.

Outer space is a hostile environment, of course, with high intensity radiation fields, thermal extremes, and vibrations we’d never see on Earth. Do we have these worries when designing Earth-bound systems?

Maybe so. Intel revealed that the McKinley processor’s ultra fine design rules and huge transistor budget mean that cosmic rays may flip on-chip bits. The Itanium 2 processor, also sporting an astronomical transistor budget and small geometry, includes an onboard system management unit to handle transient hardware failures. The hardware ain’t what it used to be—even if our software were perfect.

But too much (all?) firmware is not perfect. Consider this unfortunately true story from Ed VanderPloeg:

The world has reached a new embedded software milestone: I had to reboot my hood fan. That’s right, the range exhaust fan in the kitchen. It’s a simple model from a popular North American company. It has six buttons on the front: 3 for low, medium, and high fan speeds and 3 more for low, medium, and high light levels. Press a button once and the hood fan does what the button says. Press the same button again and the fan or lights turn off. That’s it. Nothing fancy. And it needed rebooting via the breaker panel.

Apparently the thing has a micro to control the light levels and fan speeds, and it also has a temperature sensor to automatically switch the fan to high speed if the temperature exceeds some fixed threshold. Well, one day we were cooking dinner as usual, steaming a pot of potatoes, and suddenly the fan kicks into high speed and the lights start flashing. “Hmm, flaky sensor or buggy sensor software,” I think to myself.

The food happened to be done so I turned off the stove and tried to turn off the fan, but I suppose it wanted things to cool off first. Fine. So after ten minutes or so the fan and lights turned off on their own. I then went to turn on the lights, but instead they flashed continuously, with the flash rate depending on the brightness level I selected.

So just for fun I tried turning on the fan, but any of the three fan speed buttons produced only high speed. “What ‘smart’ feature is this?,” I wondered to myself. Maybe it needed to rest a while. So I turned off the fan and lights and went back to finish my dinner. For the rest of the evening the fan and lights would turn on and off at random intervals and random levels, so I gave up on the idea that it would self-correct. So with a heavy heart I went over to the breaker panel, flipped the hood fan breaker to and fro, and the hood fan was once again well-behaved.

For the next few days, my wife said that I was moping around as if someone had died. I would tell everyone I met, even complete strangers, about what happened: “Hey, know what? I had to reboot my hood fan the other night!” The responses were varied, ranging from “Freak!” to “Sounds like what happened to my toaster . . .” Fellow programmers would either chuckle or stare in common disbelief.

What’s the embedded world coming to? Will programmers and companies everywhere realize the cost of their mistakes and clean up their act? Or will the entire world become accustomed to occasionally rebooting everything they own? Would the expensive embedded devices then come with a “reset” button, advertised as a feature? Or will programmer jokes become as common and ruthless as lawyer jokes? I wish I knew the answer. I can only hope for the best, but I fear the worst.

One developer admitted to me that his consumer products company could care less about the correctness of firmware. Reboot—who cares? Customers are used to this, trained by decades of desktop computer disappointments. Hit the reset switch, cycle power, remove the batteries for 15 minutes, even preteens know the tricks of coping with legions of embedded devices.

Crummy firmware is the norm, but in my opinion is totally unacceptable. Shipping a defective product in any other field is like opening the door to torts. So far the embedded world has been mostly immune from predatory lawyers, but that Brigadoon-like isolation is unlikely to continue. Besides, it’s simply unethical to produce junk.

But it’s hard, even impossible, to produce perfect firmware. We must strive to make the code correct, but also design our systems to cleanly handle failures. In other words, a healthy dose of paranoia leads to better systems.

A Watchdog Timer is an important line of defense in making reliable products. Well-designed watchdog timers fire off a lot, daily and quietly saving systems and lives without the esteem offered to other, human, heroes. Perhaps the developers producing such reliable WDTs deserve a parade. Poorly-designed WDTs fire off a lot, too, sometimes saving things, sometimes making them worse. A simple-minded watchdog implemented in a nonsafety critical system won’t threaten health or lives, but can result in systems that hang and do strange things that tick off our customers. No business can tolerate unhappy customers, so unless your code is perfect (whose is?) it’s best in all but the most cost-sensitive applications to build a really great WDT.

An effective WDT is far more than a timer that drives reset. Such simplicity might have saved Clementine, but would it fire when the code tumbles into a really weird mode like that experienced by Ed’s hood fan?

Internal WDTs

Internal watchdogs are those that are built into the processor chip. Virtually all highly integrated embedded processors include a wealth of peripherals, often with some sort of watchdog. Most are brain-dead WDTs suitable for only the lowest-end applications.

Let’s look at a few. Toshiba’s TMP96141AF is part of their TLCS-900 family of quite nice microprocessors, which offers a wide range of extremely versatile onboard peripherals. All have pretty much the same watchdog circuit. As the data sheet says, “The TMP96141AF is containing watchdog timer of Runaway detecting.”

Ahem. And I thought the days of Jinglish were over. Anyway, the part generates a nonmaskable interrupt when the watchdog times out, which is either a very, very bad idea or a wonderfully clever one. It’s clever only if the system produces an NMI, waits a while, and only then asserts reset, which the Toshiba part unhappily cannot do. Reset and NMI are synchronous.

A nice feature is that it takes two different I/O operations to disable the WDT, so there are slim chances of a runaway program turning off this protective feature.

Motorola’s widely-used 68332 variant of their CPU32 family (like most of these 68 k embedded parts) also includes a watchdog. It’s a simple-minded thing meant for low-reliability applications only. Unlike a lot of WDTs, user code must write two different values (0x55 and 0xaa) to the WDT control register to ensure the device does not time out. This is a very good thing—it limits the chances of rogue software accidentally issuing the command needed to appease the watchdog. I’m not thrilled with the fact that any amount of time may elapse between the two writes (up to the time-out period). Two back-to-back writes would further reduce the chances of random watchdog tickles, though one would have to ensure no interrupt could preempt the paired writes. And the 0x55/0xaa twosome is often used in RAM tests; since the 68 k I/O registers are memory mapped, a runaway RAM test could keep the device from resetting.
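For reference, servicing the CPU32 software watchdog looks something like the sketch below. The register address is an assumption based on the default SIM base; check it against your part’s memory map:

/* software service register (SWSR); the address here is an assumption */
#define SWSR (*(volatile unsigned char *)0x00FFFA27)

void kick_watchdog(void)
{
   SWSR = 0x55;   /* first key */
   SWSR = 0xaa;   /* second key, written immediately afterward */
}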

The 68332’s WDT drives reset, not some exception handling interrupt or NMI. This makes a lot of sense, since any software failure that causes the stack pointer to go odd will crash the code, and a further exception-handling interrupt of any sort would drive the part into a “double bus fault.” The hardware is such that it takes a reset to exit this condition.

Motorola’s popular Coldfire parts are similar. The MCF5204, for instance, will let the code write to the WDT control registers only once. Cool! Crashing code, which might do all sorts of silly things, cannot reprogram the protective mechanism. However, it’s possible to change the reset interrupt vector at any time, pretty much invalidating the clever write-once design.

Like the CPU32 parts, a 0x55/0xaa sequence keeps the WDT from timing out, and back-to-back writes aren’t required. The Coldfire datasheet touts this as an advantage since it can handle interrupts between the two tickle instructions, but I’d prefer less of a window. The Coldfire has a fault-on-fault condition much like the CPU32’s double bus fault, so reset is also the only option when WDT fires—which is a good thing.

There’s no external indication that the WDT timed out, perhaps to save pins. That means your hardware/software must be designed so at a warm boot the code can issue a from-the-ground-up reset to every peripheral to clear weird modes that may accompany a WDT time-out.

Philips’ XA processors require two sequential writes of 0xa5 and 0x5a to the WDT. But like the Coldfire there’s no external indication of a time-out, and it appears the watchdog reset isn’t even a complete CPU restart—the docs suggest it’s just a reload of the program counter. Yikes—what if the processor’s internal states were in disarray from code running amok or a hardware glitch?

Dallas Semiconductor’s DS80C320, an 8051 variant, has a very powerful WDT circuit that generates a special watchdog interrupt 128 cycles before automatically—and irrevocably—performing a hardware reset. This gives your code a chance to safe the system, and leave debugging breadcrumbs behind before a complete system restart begins. Pretty cool.

External WDTs

Many of the supervisory chips we buy to manage a processor’s reset line include built-in WDTs.

TI’s UCC3946 is one of many nice power supervisor parts that does an excellent job of driving reset only when Vcc is legal. In a nice small 8 pin SMT package it eats practically no PCB real estate. It’s not connected to the CPU’s clock, so the WDT will output a reset to the hardware safing mechanisms even if there’s a crystal failure. But it’s too darn simple: to avoid a time-out just wiggle the input bit once in a while. Crashed code could do this in any of a million ways.

TI isn’t the only purveyor of simplistic WDTs. Maxim’s MAX823 and many other versions are similar. The catalogs of a dozen other vendors list equally dull and ineffective watchdogs.

But both TI and Maxim do offer more sophisticated devices. Consider TI’s TPS3813 and Maxim’s MAX6323. Both are “Window Watchdogs.” Unlike the internal versions described above that avoid time-outs using two different data writes (like a 0x55 and then 0xaa), these require tickling within certain time bands. Toggle the WDT input too slowly, too fast, or not at all, and a time-out will occur. That greatly reduces the chances that a program run amok will create the precise timing needed to satisfy the watchdog. Since a crashed program will likely speed up or bog down if it does anything at all, errant strobing of the tickle bit will almost certainly be outside the time band required.


Figure 5.1. TI’s TPS3813 Is Easy to Use and Offers a Nice Windowing WDT Feature


Figure 5.2. Window Timing of Maxim’s Equally Cool MAX6323

Characteristics of Great WDTs

What’s the rationale behind an awesome watchdog timer? The perfect WDT should detect all erratic and insane software modes. It must not make any assumptions about the condition of the software or the hardware; in the real world anything that can go wrong will. It must bring the system back to normal operation no matter what went wrong, whether from a software defect, RAM glitch, or bit flip from cosmic rays.

It’s impossible to recover from a hardware failure that keeps the computer from running properly, but at the least the WDT must put the system into a safe state. Finally, it should leave breadcrumbs behind, generating debug information for the developers. After all, a watchdog time-out is the yin and yang of an embedded system. It saves the system, keeping the customer happy, yet demonstrates an inherent design flaw that should be addressed. Without debug information, troubleshooting these infrequent and erratic events is close to impossible.

What does this mean in practice?

An effective watchdog is independent from the main system. Though all WDTs are a blend of interacting hardware and software, something external to the processor must always be poised, like the sword of Damocles, ready to intervene as soon as a crash occurs. Pure software implementations are simply not reliable.

There’s only one kind of intervention that’s effective: an immediate reset to the processor and all connected peripherals. Many embedded systems have a watchdog that initiates a nonmaskable interrupt. Designers figure that firing off NMI rather than reset preserves some of the system’s context. It’s easy to seed debugging assets in the NMI handler (like a stack capture) to aid in resolving the crash’s root cause. That’s a great idea, except that it does not work.

All we really know when the WDT fires is that something truly awful happened. Software bug? Perhaps. Hardware glitch? Also possible. Can you ensure that the error wasn’t something that totally scrambled the processor’s internal logic states? I worked with one system where a motor in another room induced so much EMF that our instrument sometimes went bonkers. We tracked this down to a subnanosecond glitch on one CPU input, a glitch so short that the processor went into an undocumented weird mode. Only a reset brought it back to life.

Some CPUs, notably the 68 k and ColdFire, will throw an exception if a software crash causes the stack pointer to go odd. That’s not bad, except that any watchdog circuit that then drives the CPU’s nonmaskable interrupt will unavoidably invoke code that pushes the system’s context, creating a second stack fault. The CPU halts, staying halted till a reset, and only a reset, comes along.

Drive reset; it’s the only reliable way to bring a confused microprocessor back to lucidity. Some clever designers, though, build circuits that drive NMI first, and then after a short delay pound on reset. If the NMI works then its exception handler can log debug information and then halt. It may also signal other connected devices that this unit is going offline for a while. The pending reset guarantees an utterly clean restart of the code. Don’t be tempted to use the NMI handler to safe dangerous hardware; that task always, in every system, belongs to a circuit external to the possibly confused CPU.
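If you build the NMI-then-delayed-reset circuit, the NMI handler can be very small. The sketch below is only illustrative; the breadcrumb logging, the peer notification, and the halt are placeholders whose real form depends entirely on your hardware.

    /* Illustrative NMI handler for a watchdog circuit that asserts NMI first
       and reset shortly after. Every called function is a placeholder. */
    extern void log_crash_breadcrumbs(void);      /* e.g., copy registers/stack to nonvolatile memory */
    extern void tell_peers_going_offline(void);   /* optional message to connected devices */

    void watchdog_nmi_handler(void)
    {
        log_crash_breadcrumbs();     /* leave debugging breadcrumbs for later analysis */
        tell_peers_going_offline();
        for (;;)
            ;                        /* halt here; the external circuit drives reset soon */
    }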

Don’t forget to reset the whole computer system; a simple CPU restart may not be enough. Are the peripherals absolutely, positively, in a sane mode? Maybe not. Runaway code may have issued all sorts of I/O instructions that placed complex devices in insane modes. Give every peripheral a hardware reset; software resets may get lost in all of the I/O chatter.

Consider what the system must do to be totally safe after a failure. Maybe a pacemaker needs to reboot in a heartbeat (so to speak), or maybe backup hardware should issue a few ticks if reboots are slow.

One thickness gauge that beams high energy gamma rays through 4 inches of hot steel failed in a spectacular way. Defective hardware crashed the code. The WDT properly closed the protective lead shutter, blocking off the 5 curie cesium source. I was present, and watched incredulously as the engineering VP put his head in the path of the beam; the crashed code, still executing something, tricked the watchdog into opening the shutter, beaming high intensity radiation through the veep’s forehead. I wonder to this day what eventually became of the man.

A really effective watchdog cannot use the CPU’s clock, which may fail. A bad solder joint on the crystal, poor design that doesn’t work well over temperature extremes, or numerous other problems can shut down the oscillator. This suggests that no WDT internal to the CPU is really safe. All (that I know of) share the processor’s clock.

Under no circumstances should the software be able to reprogram the WDT or any of its necessary components (like reset vectors, I/O pins used by the watchdog, and so on). Assume runaway code runs under the guidance of a malevolent deity.

Build a watchdog that monitors the entire system’s operation. Don’t assume that things are fine just because some loop or ISR runs often enough to tickle the WDT. A software-only watchdog should look at a variety of parameters to ensure the product is healthy, kicking the dog only if everything is OK. What is a software crash, after all? Occasionally the system executes a HALT and stops, but more often the code vectors off to a random location, continuing to run instructions. Maybe only one task crashed. Perhaps only one is still alive—no doubt that which kicks the dog.

Think about what can go wrong in your system. Take corrective action when that’s possible, but initiate a reset when it’s not. For instance, can your system recover from exceptions like floating point overflows or divides by zero? If not, these conditions may well signal the early stages of a crash. Either handle these competently or initiate a WDT time-out. For the cost of a handful of lines of code you may keep a 60 Minutes camera crew from appearing at your door.

It’s a good idea to flash an LED or otherwise indicate that the WDT kicked. A lot of devices automatically recover from time-outs; they quickly come back to life with the customer totally unaware a crash occurred. Unless you have a debug LED, how do you know if your precious creation is working properly, or occasionally invisibly resetting? One outfit complained that over time, and with several thousand units in the field, their product’s response time to user inputs degraded noticeably. A bit of research showed that their system’s watchdog properly drove the CPU’s reset signal, and the code then recognized a warm boot, going directly to the application with no indication to the users that the time-out had occurred. We tracked the problem down to a floating input on the CPU that caused the software to crash—up to several thousand times per second. The processor was spending most of its time resetting, leading to apparently slow user response. An LED would have shown the problem during debug, long before customers started yelling.

Everyone knows we should include a jumper to disable the WDT during debugging. But few folks think this through. The jumper should be inserted to enable debugging, and removed for normal operation. Otherwise if manufacturing forgets to install the jumper, or if it falls out during shipment, the WDT won’t function. And there’s no production test to check the watchdog’s operation.

Design the logic so the jumper disconnects the WDT from the reset line (possibly through an inverter so an inserted jumper sets debug mode). Then the watchdog continues to function even while debugging the system. It won’t reset the processor but will flash the LED. The light will blink a lot when breakpointing and single-stepping, but should never come on during full-speed testing.

Using an Internal WDT

Most embedded processors that include high integration peripherals have some sort of built-in WDT. Avoid these except in the most cost-sensitive or benign systems. Internal units offer minimal protection from rogue code. Runaway software may reprogram the WDT controller, many internal watchdogs will not generate a proper reset, and any failure of the processor will make it impossible to put the hardware into a safe state. A great WDT must be independent of the CPU it’s trying to protect.

However, in systems that really must use the internal versions, there’s plenty we can do to make them more reliable. The conventional model of kicking a simple timer at erratic intervals is too easily spoofed by runaway code.

A pair of design rules leads to decent WDTs: kick the dog only after your code has done several unrelated good things, and make sure that erratic execution streams that wander into your watchdog routine won’t issue incorrect tickles.

This is a great place to use a simple state machine. Suppose we define a global variable named “state.” At the beginning of the main loop set state to 0x5555. Call watchdog routine A, which adds an offset—say 0x1111—to state and then ensures the variable is now 0x6666. Return if the compare matches; otherwise halt or take other action that will cause the WDT to fire.

Later, maybe at the end of the main loop, add another offset to state, say 0x2222. Call watchdog routine B, which makes sure state is now 0x8888. Set state to zero. Kick the dog if the compare worked. Return. Halt otherwise.
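In C the whole scheme might look something like the sketch below. The constants come straight from the description above; shutdown() and hardware_kick() are placeholders for whatever stops the tickling (or halts) and whatever actually services your particular watchdog.

    /* Sketch of the two-stage state-machine tickle; shutdown() and
       hardware_kick() are placeholders. */
    static unsigned int state;

    extern void shutdown(void);        /* stop tickling (or halt) so the WDT fires */
    extern void hardware_kick(void);   /* the only real service of the watchdog */

    void watchdog_a(void)              /* called near the top of the main loop */
    {
        state += 0x1111;
        if (state != 0x6666)
            shutdown();                /* we did not get here by the expected path */
    }

    void watchdog_b(void)              /* called at the end of the main loop */
    {
        if (state != 0x8888) {
            shutdown();
        } else {
            state = 0;                 /* back to a "bad" value until main sets it again */
            hardware_kick();
        }
    }

    void main_loop(void)
    {
        for (;;) {
            state = 0x5555;
            watchdog_a();
            /* ... the real work of the loop ... */
            state += 0x2222;
            watchdog_b();
        }
    }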

This is a trivial bit of code, but now runaway code that stumbles into any of the tickling routines cannot errantly kick the dog. Further, no tickles will occur unless the entire main loop executes in the proper sequence. If the code just calls routine B repeatedly, no tickles will occur because it sets state to zero before exiting.

Add additional intermediate states as your paranoia or fear of litigation dictates.

Normally I detest global variables, but this is a perfect application. Cruddy code that mucks with the variable, errant tasks doing strange things, or any error that steps on the global will make the WDT time-out.

Do put these actions in the program’s main loop, not inside an ISR. It’s fun to watch a multitasking product crash—the entire system might be hung, but one task still responds to interrupts. If your watchdog tickler stays alive as the world collapses around the rest of the code, then the watchdog serves no useful purpose.

If the WDT doesn’t generate an external reset pulse (some processors handle the restart internally) make sure the code issues a hardware reset to all peripherals immediately after start-up. That may mean working with the EEs so an output bit resets every resettable peripheral.

If you must take action to safe dangerous hardware, well, since there’s no way to guarantee the code will come back to life, stay away from internal watchdogs. Broken hardware will obviously cause this—but so can lousy code. A digital camera was recalled recently when users found that turning the device off when in a certain mode meant it could never be turned on again. The code wrote faulty information to flash memory that created a permanent crash.


Figure 5.3. Pseudo Code for Handling an Internal WDT

An External WDT

The best watchdog is one that doesn’t rely on the processor or its software. It’s external to the CPU, shares no resources, and is utterly simple, thus devoid of latent defects.

Use a PIC, a Z8, or other similar dirt-cheap processor as a system health monitor. These parts have an independent clock, onboard memory, and the built-in timers we need to build a truly great WDT. Since the watchdog is external, you can connect an output to hardware interlocks that put dangerous machinery into safe states.

But when selecting a watchdog CPU, check the part’s specifications carefully. Tying the tickle to the watchdog CPU’s interrupt input, for instance, may not work reliably. A slow part—like most PICs—may not respond to a tickle of short duration. Consider TI’s MSP430 family of processors. They’re a very inexpensive (half a buck or so) series of 16 bit processors that use virtually no power and no PCB real estate.


Figure 5.4. The MSP430—a 16 Bit Processor that Uses No PCB Real Estate. For Metrically-Challenged Readers, This Is about 1/4″ x 1/8″

Tickle it using the same sort of state-machine described above. Like the windowed watchdogs (TI’s TPS3813 and Maxim’s MAX6323), define min and max tickle intervals, to further limit the chances that a runaway program deludes the WDT into avoiding a reset.
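The monitor firmware on the watchdog CPU can be a handful of lines. Here’s a hedged sketch that enforces a tickle window in software; the pin, delay, and reset calls, and the 50/500 msec window, are all assumptions to be replaced with whatever the chosen part and your system timing require.

    /* Sketch of a windowed monitor loop on the external watchdog CPU. The
       pin, delay, and reset functions and the window limits are assumptions. */
    #define MIN_TICKLE_MS  50
    #define MAX_TICKLE_MS  500

    extern int  read_tickle_pin(void);    /* GPIO wired to the main CPU's tickle output */
    extern void delay_ms(unsigned int ms);
    extern void assert_reset(void);       /* drive reset and the hardware safety interlocks */

    void monitor_main(void)
    {
        unsigned int ms_since_tickle = 0;
        int last = read_tickle_pin();

        for (;;) {
            int now;

            delay_ms(1);
            ms_since_tickle++;

            now = read_tickle_pin();
            if (now != last) {                     /* an edge counts as one tickle */
                if (ms_since_tickle < MIN_TICKLE_MS)
                    assert_reset();                /* too fast: code is likely running wild */
                last = now;
                ms_since_tickle = 0;
            }
            if (ms_since_tickle > MAX_TICKLE_MS)
                assert_reset();                    /* too slow or silent: crashed or hung */
        }
    }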

Perhaps it seems extreme to add an entire computer just for the sake of a decent watchdog. We’d be fools to add extra hardware to a highly cost-constrained product. Most of us, though, build lower volume higher margin systems. A fifty cent part that prevents the loss of an expensive mission, or that even saves the cost of one customer support call, might make a lot of sense.

In a multiprocessor system it’s easy to turn all of the processors into watchdogs. Have them exchange “I’m OK” messages periodically. The receiver resets the transmitter if it stops speaking. This approach checks a lot of hardware and software, and requires little circuitry.


Figure 5.5. Watchdog for a Dual-Processor System—Each CPU Watches the Other

WDTs for Multitasking

Tasking turns a linear bit of software into a multidimensional mix of tasks competing for processor time. Each runs more or less independently of the others, which means each can crash on its own, without bringing the entire system to its knees.

You can learn a lot about a system’s design just by observing its operation. Consider a simple instrument with a display and various buttons. Press a button and hold it down; if the display continues to update, odds are the system multitasks.

Yet in the same system a software crash might go undetected by conventional watchdog strategies. If the display or keyboard tasks die, the main line code or a WDT task may continue to run.

Any system that uses an ISR or a special task to tickle the watchdog, but that does not examine the health of all other tasks, is not robust. Success lies in weaving the watchdog into the fabric of all of the system’s tasks, which is happily much easier than it sounds.

First, build a watchdog task. It’s the only part of the software allowed to tickle the WDT. If your system has an MMU, mask off all I/O accesses to the WDT except those from this task, so rogue code traps on an errant attempt to output to the watchdog.

Next, create a data structure that has one entry per task, with each entry being just an integer.

When a task starts it increments its entry in the structure. Tasks that only start once and stay active forever can increment the appropriate value each time through their main loops.

Increment the data atomically—in a way that cannot be interrupted with the data half-changed. ++TASK[i] (if TASK is an integer array) on an 8 bit CPU might not be atomic, though it’s almost certainly OK on a 16 or 32 bitter. The safest way to both encapsulate and ensure atomic access to the data structure is to hide it behind another task. Use a semaphore to eliminate concurrent shared accesses. Send increment messages to the task, using the RTOS’s messaging resources.

As the program runs the number of counts for each task advances. Infrequently but at regular intervals the watchdog task runs. Perhaps once a second, or maybe once a msec—it’s all a function of your paranoia and the implications of a failure.

The watchdog task scans the structure, checking that the count stored for each task is reasonable. One that runs often should have a high count; another which executes infrequently will produce a smaller value. Part of the trick is determining what’s reasonable for each task; stick with me—we’ll look at that shortly.

If the counts are unreasonable, halt and let the watchdog time-out. If everything is OK, set all of the counts to zero and exit.
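Under a generic RTOS the scheme might be sketched as below. NUM_TASKS, the critical-section and delay calls, and count_is_reasonable() are all placeholders; the last one is exactly the “what’s reasonable?” question we’ll get to next.

    /* Sketch of the per-task count scheme; the RTOS calls and
       count_is_reasonable() are placeholders. */
    #define NUM_TASKS 8

    static volatile unsigned int task_counts[NUM_TASKS];

    extern void rtos_enter_critical(void);
    extern void rtos_exit_critical(void);
    extern void rtos_delay_ms(unsigned int ms);
    extern int  count_is_reasonable(int task, unsigned int count);
    extern void halt_and_let_wdt_fire(void);
    extern void hardware_kick(void);      /* only ever called from the watchdog task */

    void task_checkin(int task_id)        /* each task calls this once per loop or start */
    {
        rtos_enter_critical();            /* keep the increment atomic */
        task_counts[task_id]++;
        rtos_exit_critical();
    }

    void watchdog_task(void)
    {
        int i;
        for (;;) {
            rtos_delay_ms(1000);          /* scan rate: set by your paranoia */
            for (i = 0; i < NUM_TASKS; i++)
                if (!count_is_reasonable(i, task_counts[i]))
                    halt_and_let_wdt_fire();
            for (i = 0; i < NUM_TASKS; i++)
                task_counts[i] = 0;       /* back to a "bad" state */
            hardware_kick();              /* everything looked sane */
        }
    }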

Why is this robust? Obviously, the watchdog monitors every task in the system. But it’s also impossible for code that’s running amok to stumble into the WDT task and errantly tickle the dog; by zeroing the array we guarantee it’s in a “bad” state.

I skipped over a critical step—how do we decide what’s a reasonable count for each task? It might be possible to determine this analytically. If the WDT task runs once a second, and one of the monitored tasks starts every 50 msec, then surely a count of around 20 is reasonable.

Other activities are much harder to ascertain. What about a task that responds to asynchronous inputs from other computers, say data packets that come at irregular intervals? Even in cases of periodic events, if these drive a low-priority task they may be suspended for rather long intervals by higher-priority problems.

The solution is to broaden the data structure that maintains count information. Add minimum (min) and maximum (max) fields to each entry. Each task must run at least min, but no more than max times.

Now redesign the watchdog task to run in one of two modes. The first is the one already described, and is used during normal system operation.

The second mode is a debug environment enabled by a compile-time switch that collects min and max data. Each time the WDT task runs it looks at the incremented counts and sets new min and max values as needed. It tickles the watchdog each time it executes.
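One way to express the two modes is a compile-time switch around the scan, as in the hedged sketch below. The structure fields and WDT_PROFILE_MODE are illustrative, and for the learning pass to work each min field must start out at a huge value and each max at zero.

    /* Sketch of the dual-mode scan: learn min/max when WDT_PROFILE_MODE is
       defined, enforce them otherwise. Names and fields are illustrative. */
    #define NUM_TASKS 8

    struct task_monitor {
        unsigned int count;    /* bumped by the task */
        unsigned int min;      /* start at a huge value before profiling */
        unsigned int max;      /* start at zero before profiling */
    };

    static struct task_monitor mon[NUM_TASKS];

    extern void halt_and_let_wdt_fire(void);
    extern void hardware_kick(void);

    void watchdog_scan(void)              /* called once per watchdog period */
    {
        int i;
        for (i = 0; i < NUM_TASKS; i++) {
    #ifdef WDT_PROFILE_MODE
            if (mon[i].count < mon[i].min) mon[i].min = mon[i].count;
            if (mon[i].count > mon[i].max) mon[i].max = mon[i].count;
    #else
            if (mon[i].count < mon[i].min || mon[i].count > mon[i].max)
                halt_and_let_wdt_fire();  /* out of band: stop tickling */
    #endif
            mon[i].count = 0;
        }
        hardware_kick();                  /* profile mode always tickles */
    }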

Run the product’s full test suite with this mode enabled. Maybe the system needs to operate for a day or a week to get a decent profile of the min/max values. When you’re satisfied that the tests are representative of the system’s real operation, manually examine the collected data and adjust the parameters as seems necessary to give adequate margins to the data.

What a pain! But by taking this step you’ll get a great watchdog—and a deep look into your system’s timing. I’ve observed that few developers have much sense of how their creations perform in the time domain. “It seems to work” tells us little. Looking at the data acquired by this profiling, though, might tell a lot. Is it a surprise that task A runs 400 times a second? That might explain a previously-unknown performance bottleneck.

In a real time system we must manage and measure time; it’s every bit as important as procedural issues, yet is oft ignored until a nagging problem turns into an unacceptable symptom. This watchdog scheme forces you to think in the time domain, and by its nature profiles—admittedly with coarse granularity—the time-operation of your system.

There’s yet one more kink, though. Some tasks run so infrequently or erratically that any sort of automated profiling will fail. A watchdog that runs once a second will miss tasks that start only hourly. It’s not unreasonable to exclude these from watchdog monitoring. Or, we can add a bit of complexity to the code to initiate a watchdog time-out if, say, the slow tasks don’t start even after a number of hours elapse.

Summary and Other Thoughts

I remain troubled by the fan failure described earlier. It’s easy to dismiss this as a glitch, an unexplained failure caused by a hardware or software bug, cosmic rays, or meddling by aliens. But others have written about identical situations with their vent fans, all apparently made by the same vendor.

When we blow off a failure, calling it a “glitch” as if that name explains something, we’re basically professing our ignorance. There are no glitches in our macroscopically deterministic world. Things happen for a reason.

The fan failures didn’t make the evening news and hurt no one. So why worry? Surely the customers were irritated, and the possible future sales of that company at least somewhat diminished. The company escalated the general rudeness level of the world, and thus the sum total incipient anger level, by treating their customers with contempt. Maybe a couple more Valiums were popped, a few spouses yelled at, some kids cowered until dad calmed down. In the grand scheme of things perhaps these are insignificant blips. Yet we must remember the purpose of embedded control is to help people, to improve lives, not to help therapists garner new patients.

What concerns me is that if we cannot even build reliable fan controllers, what hope is there for more mission-critical applications?

I don’t know what went wrong with those fan controllers, and I have no idea if a WDT—well designed or not—is part of the system. I do know, though, that the failures are unacceptable and avoidable. But maybe not avoidable by the use of a conventional watchdog. A WDT tells us the code is running. A windowing WDT tells us it’s running with pretty much the right timing. No watchdog, though, flags software executing with corrupt data structures, unless the data is so bad it grossly affects the execution stream.

Why would a data structure become corrupt? Bugs, surely. Strange conditions the designers never anticipated will also create problems, like the never-ending flood of buffer overflow conditions that plague the net, or unexpected user inputs (“We never thought the user would press all 4 buttons at the same time!”).

Is another layer of self-defense, beyond watchdogs, wise? Safety critical applications, where the cost of a failure is frighteningly high, should definitely include integrity checks on the data. Low threat equipment—like this oven fan—can and should have at least a minimal amount of code for trapping possible failure conditions.

Some might argue it makes no sense to “waste” time writing defensive code for a dumb fan application. Yet the simpler the system, the easier and quicker it is to plug in a bit of code to look for program and data errors.

Very simple systems tend to translate inputs to outputs. Their primary data structures are the I/O ports. Often several unrelated output bits get multiplexed to a single port. To change one bit means either reading the port’s current status, or maintaining a copy of the port in RAM. Both approaches are problematic.

Computers are deterministic, so it’s reasonable to expect that, in the absence of bugs, they’ll produce correct results all the time. So it’s apparently safe to read a port’s current status, AND off the unwanted bits, OR in new ones, and output the result. This is a state machine; the outputs evolve over time to deal with changing inputs. But the process works only if the state machine never incorrectly flips a bit. Unfortunately, output ports are connected to the hostile environment of the real world. It’s entirely possible that a bit of energy from starting the fan’s highly inductive motor will alter the port’s setting. I’ve seen this happen many times.

So maybe it’s more reliable to maintain a memory image of the port. The downside is that a program bug might corrupt the image. Most of the time these are stored as global variables, so any bit of sloppy code can accidentally trash the location. Encapsulation solves that problem, but not the one of a wandering pointer walking over the data, or of a latent reentrancy issue corrupting things. You might argue that writing correct code means we shouldn’t worry about a location changing, but we added a WDT to, in part, deal with bugs. Similar concerns about our data are warranted.

In a simple system look for a design that resets data structures from time to time. In the case of the oven fan, whenever the user selects a fan speed reset all I/O ports and data structures. It’s that simple.
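For the fan that might amount to nothing more than the following sketch: rebuild the port image and the port itself from the user’s selection rather than trusting whatever happens to be there. The port address and speed encoding are, of course, made up.

    /* Sketch of refreshing a port and its RAM image on every user action.
       FAN_PORT's address and the speed encoding are hypothetical. */
    #define FAN_PORT (*(volatile unsigned char *)0x4000u)

    static unsigned char fan_port_image;   /* shadow copy of the output port */

    void user_selected_speed(unsigned char speed_bits)
    {
        fan_port_image = speed_bits;       /* rebuild the image from scratch */
        FAN_PORT = fan_port_image;         /* and rewrite the whole port */
    }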

In a more complicated system the best approach is the oldest trick in software engineering: check the parameters passed to functions for reasonableness. In the embedded world we chose not to do this for three reasons: speed, memory costs, and laziness. Of these, the third reason is the real culprit most of the time.
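Even in a dumb fan controller the check costs almost nothing. Here’s a hedged example: validate the argument on the way in and treat a wild value as an early warning of corruption. fan_fault() is a placeholder that might log a breadcrumb and then stop tickling the watchdog.

    /* Sketch of cheap parameter checking; MAX_FAN_SPEED and fan_fault()
       are illustrative. */
    #define MAX_FAN_SPEED 3

    extern void fan_fault(unsigned int bad_value);   /* log it, then let the WDT fire */

    void fan_set_speed(unsigned int speed)
    {
        if (speed > MAX_FAN_SPEED) {       /* impossible input: something is corrupt */
            fan_fault(speed);
            return;
        }
        /* ... normal speed-setting code ... */
    }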

Note

Cycling power is the oldest fix in the book; it usually means there’s a lurking bug and a poor WDT implementation. Embedded developer Peter Putnam wrote:

Last November, I was sitting in one of a major airline’s newer 737-900 aircraft on the ramp in Cancun, Mexico, waiting for departure when the pilot announced there would be a delay due to a computer problem. About twenty minutes later a group of maintenance personnel arrived. They poked around for a bit, apparently to no avail, as the captain made another announcement. “Ladies and Gentlemen,” he said, “we’re unable to solve the problem, so we’re going to try turning off all aircraft power for thirty seconds and see if that fixes it.”

Sure enough, after rebooting the Boeing 737, the captain announced that “All systems are up and running properly.”

Nobody saw fit to leave the aircraft at that point, but I certainly considered it.

 
