Some Light Relief—Making an Exception for You

In June 2003, NASA launched two robot geologist missions to Mars. After a space voyage of six months, the Spirit Rover was the first to arrive safely and start its exploration, looking for signs of water and other geological features. For two and a half weeks on the surface, Spirit sent a series of stunning pictures. Suddenly and inexplicably, it dropped contact with mission control.

The team back at California's Jet Propulsion Lab who designed and built the rovers were mystified and despondent. The Spirit mission manager, Jennifer Trosper, talked to her husband on the phone. “I asked him first how his day was. He said it was okay. And then he asked me how my day was going,” recalled Trosper. “Well… I think I'm personally responsible for the loss of a $400 million national asset,” she confessed. I hate those kinds of “rainy Monday” days, don't you?

The Rover is a reprogrammable embedded system, directed by instructions sent over a radio link from Earth. If Spirit didn't make contact, it couldn't receive instructions for more tasks, or even instructions to diagnose the failure. Things perked up a little the next day when JPL sent a command, and Spirit acknowledged it before once again falling silent.

The Spirit rover is built from off-the-shelf hardware and software. Wind River's VxWorks real-time embedded OS runs on top of a radiation-hardened RAD6000 CPU chip from Lockheed Martin. This chip is based on the same POWER architecture CPU design that IBM used in its RS/6000 Unix workstations. It's a 32-bit RISC processor clocked at 33 MHz with 128 MB of main memory, which is plenty for the tasks it has to do. Mars is a hostile environment for anything mechanical, so the rover has a 256 MB flash memory filesystem instead of a disk.

The flash filesystem stores data files (such as photo files), and also executable programs. It's designed so that when the system boots up, some filesystem data is copied from flash into a main memory cache. The amount of RAM reserved for that cache therefore places an overall limit on the size and number of files the flash filesystem can hold. JPL was aware of this limit, and had calculated that regular operations would stay well within it.

Everything about the software was designed to be maintained over radio links from planet Earth. Indeed, the mission team had uploaded a completely new software revision a week after the June 2003 launch, as the Spirit rover raced towards Mars. You never want to delete the old software until you've had a chance to check that the new software works in all circumstances, so the flash memory now held two sets of executables.

The software upload fixed the bugs that had been identified, and mission control made a note that they'd have to delete the old files after the new ones had been shown to work on the ground. Six months later, Spirit landed on Mars, and started collecting data. Each set of data, each image, each instrument reading was stored in a new file in the flash filesystem. On Martian day 15 of the mission, the ground team began uploading, in two parts, a utility that would clean out the obsolete files and free up a lot of room in the flash filesystem.

Unhappily, the second half of the upload failed, and was rescheduled for the next communications window between Earth and Mars, four days later. In the meantime, Spirit continued taking pictures and measurements, and storing the results in the rapidly filling filesystem. Before the file delete utility could be uploaded, the portion of RAM dedicated to the flash filesystem filled up completely.

No new files could be written. The very next task that tried to write a file got a “memory allocation failure” exception instead (this part of the Rover is written in C++, but it was an exception of the same type discussed in this chapter). The exception handler put the task on hold, waiting for space. That in turn eventually led to a system reboot. And when the system rebooted, it tried to mount the flash filesystem and build the RAM cache. Again, it ran out of RAM space while making the attempt. That led to a reboot, and the cycle now repeated over and over again. Spirit was unable to complete a reboot and spent most of the time resetting itself instead of listening to signals from Earth.
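The reboot cycle is easier to see in a few lines of Java. What follows is only a toy model, assuming a cache that holds a fixed number of file entries; the names `RebootLoopSketch`, `CacheFullException`, and `CACHE_SLOTS` are invented for illustration, and none of this is the rover's actual flight software.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model: a flash filesystem whose file entries must all fit in a fixed RAM cache. */
class RebootLoopSketch {

    /** Stand-in for the memory allocation failure (hypothetical name). */
    static class CacheFullException extends RuntimeException {
        CacheFullException(String message) { super(message); }
    }

    static final int CACHE_SLOTS = 1000;   // assumed capacity, not Spirit's real figure

    /** Mounting copies one entry per flash file into RAM; it fails if there are too many. */
    static List<String> mount(List<String> flashFiles) {
        List<String> cache = new ArrayList<>();
        for (String name : flashFiles) {
            if (cache.size() == CACHE_SLOTS) {
                throw new CacheFullException("no RAM left for " + name);
            }
            cache.add(name);
        }
        return cache;
    }

    /** Boots up to maxAttempts times; returns the attempt that succeeded, or -1 if none did. */
    static int bootUntilMounted(List<String> flashFiles, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                mount(flashFiles);
                return attempt;          // mounted; normal operation can begin
            } catch (CacheFullException e) {
                // The only "recovery" here is another reboot. Nothing in flash
                // has changed, so the next attempt fails in exactly the same way.
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> flash = new ArrayList<>();
        for (int i = 0; i < CACHE_SLOTS + 1; i++) {
            flash.add("file" + i);       // one file more than the cache can hold
        }
        System.out.println("attempts with full flash: " + bootUntilMounted(flash, 5));
        flash.remove(flash.size() - 1);  // free a single entry
        System.out.println("attempts after deleting one file: " + bootUntilMounted(flash, 5));
    }
}
```

Run as written, the first call reports -1 (every boot attempt fails) and the second reports 1. The handler never changes the condition that caused the exception, so retrying can never succeed until something outside the loop removes a file.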

As the team on Earth searched for possible causes of failure, someone recalled the incomplete upload of the delete utility. Analyzing this further, the team recalled that they had used the VxWorks option that causes a task to be suspended on a memory allocation failure. The team uploaded a new program that instructed Spirit to reboot without mounting the flash memory filesystem. They transmitted it repeatedly until they hit the narrow window when Spirit was able to listen during its reboot attempts. Spirit acknowledged the transmission just as it was supposed to. The message had gotten through!

They then wrote code to go through the flash memory image, and delete the obsolete files there, without using the RAM cache. Finally, they ran an fsck (filesystem check) utility. To everyone's great relief, Spirit started functioning again! But, just in case, the software team is updating the exception handler to recover from a memory allocation failure more gracefully! So there you have it: whether you are in the computer room down the hall, or on the surface of another planet that orbits 1.52 astronomical units from the Sun, exception handlers really matter.
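What might "more gracefully" look like? Here is one possible shape, sketched in Java under loud assumptions: the store holds a fixed number of files, obsolete files are recognizable by an "old-" name prefix, and the class and method names are all made up for this example. It is not a description of what JPL actually changed.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch: on "storage full", free space and retry instead of suspending the task. */
class GracefulStoreSketch {

    static class StorageFullException extends Exception {
        StorageFullException(String message) { super(message); }
    }

    private final int capacity;                    // assumed limit on the number of files
    private final List<String> files = new ArrayList<>();

    GracefulStoreSketch(int capacity) { this.capacity = capacity; }

    /** Low-level write: throws a checked exception when the store is full. */
    void write(String name) throws StorageFullException {
        if (files.size() >= capacity) {
            throw new StorageFullException("no room for " + name);
        }
        files.add(name);
    }

    /** Deletes files whose names start with "old-"; returns how many were removed. */
    int purgeObsolete() {
        int removed = 0;
        for (Iterator<String> it = files.iterator(); it.hasNext(); ) {
            if (it.next().startsWith("old-")) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    /** Writes a file, recovering once by purging; returns false if the data had to be dropped. */
    boolean store(String name) {
        try {
            write(name);
            return true;
        } catch (StorageFullException first) {
            purgeObsolete();                       // change the condition before retrying
            try {
                write(name);
                return true;
            } catch (StorageFullException second) {
                return false;                      // report the loss, but keep the task running
            }
        }
    }

    int fileCount() { return files.size(); }

    public static void main(String[] args) {
        GracefulStoreSketch store = new GracefulStoreSketch(3);
        store.store("old-exec-1");
        store.store("old-exec-2");
        store.store("image-1");
        System.out.println("stored image-2: " + store.store("image-2"));
        System.out.println("files now: " + store.fileCount());
    }
}
```

With a capacity of three, the fourth write fails, the two "old-" files are purged, and the retry succeeds, so the program prints `stored image-2: true` and `files now: 2`. The design point is that the handler does something to change the failing condition, and if that is not enough it gives up on one piece of data, not on the whole system.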
