Appendix 2
Incidents and Horror Stories Involving Software

In 1962, NASA's Mariner 1 space probe had to be destroyed shortly after launch. According to a widely repeated account, a period had been typed where a comma should have been in a FORTRAN DO-loop statement.

In 1979, the problems at the Three Mile Island nuclear plant were caused in part by operators misinterpreting the control room's user interface.

In 1982, a Soviet gas pipeline exploded because of a defect in its control software. The explosion was reportedly the largest non-nuclear explosion since the end of World War II.

In 1982, during the Falklands War, HMS Sheffield was sunk by a French-built Exocet missile launched by Argentine forces. According to a widely repeated account, the radar on the Sheffield was programmed to classify the Exocet as “friend”, since these missiles were also used by British forces.

In 1983, a false alarm in the Soviet nuclear attack detection system came close to causing a nuclear war.

Between 1985 and 1987, the Therac-25 radiation therapy machine delivered massive radiation overdoses in six known accidents, several of them fatal.

In 1988, a civilian Airbus aircraft full of passengers was shot down because the pattern-recognition software had not clearly identified the aircraft.

In 1990, the loss of long-distance service at AT&T plunged the United States into an unprecedented telephone crisis.
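
The outage is widely attributed to a misplaced break statement in the C code of AT&T's 4ESS switches: a break inside an if that is nested in a switch exits the entire switch, silently skipping logic that was meant to run. The following C sketch is a hypothetical illustration of that pattern; the message names and handler are invented for the example.

```c
#include <stdio.h>

/* Hypothetical sketch of the widely reported 4ESS bug pattern: a `break`
 * inside an `if` nested in a `switch` leaves the whole switch, silently
 * skipping the code that was meant to run afterwards. */
enum msg { MSG_OK, MSG_BUSY };

static void handle(enum msg m, int buffer_empty)
{
    switch (m) {
    case MSG_BUSY:
        if (buffer_empty) {
            break;  /* intended: leave the if; actual: leave the switch */
        }
        /* message-forwarding logic, skipped whenever buffer_empty is true */
        printf("forwarding queued message\n");
        break;
    default:
        printf("normal processing\n");
        break;
    }
}

int main(void)
{
    handle(MSG_BUSY, 1);  /* forwarding silently skipped */
    handle(MSG_BUSY, 0);  /* forwarding runs */
    return 0;
}
```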

In 1992, the London Ambulance Service switched to new call-tracking software and lost control of dispatching when the number of calls increased.

The maiden flight of the Ariane 5, which took place on June 4, 1996, ended in failure. About 40 seconds after the start of the flight sequence, the rocket, then at an altitude of some 3,700 meters, deviated from its trajectory, broke up, and exploded. The failure was caused by the total loss of guidance and attitude information 37 seconds after the start of the main engine ignition sequence (30 seconds after lift-off). This loss of information was due to errors in the specification and design of the inertial reference system software.
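
The inquiry traced the exception to an unprotected conversion of a 64-bit floating-point value (the horizontal bias) to a 16-bit signed integer in the Ada flight software. The sketch below is a minimal C illustration of that failure pattern, not the flight code; the variable names and the sample value are assumptions.

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical C illustration of the Ariane 5 failure pattern: a 64-bit
 * floating-point value (the horizontal bias, "BH") converted to a 16-bit
 * signed integer. In the Ada flight software the unprotected conversion
 * raised an unhandled Operand Error when the value exceeded 32767,
 * shutting down both inertial reference units. */
int main(void)
{
    double horizontal_bias = 64000.0;   /* assumed value, out of int16_t range */

    /* Unprotected: `(int16_t)horizontal_bias` is undefined behavior in C
     * when the value does not fit; in the Ada original it raised an
     * unhandled exception. A protected conversion checks the range first: */
    int16_t bh;
    if (horizontal_bias > INT16_MAX || horizontal_bias < INT16_MIN) {
        fprintf(stderr, "BH out of range: %.1f\n", horizontal_bias);
        bh = (horizontal_bias > 0) ? INT16_MAX : INT16_MIN;  /* saturate */
    } else {
        bh = (int16_t)horizontal_bias;
    }
    printf("BH = %d\n", bh);
    return 0;
}
```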

The objective of the review process, in which all major partners involved with Ariane 5 participated, was to validate design decisions and obtain flight certification. During this process, the limitations of the alignment software were not fully analyzed, and the consequences of allowing this function to continue operating in flight were not assessed. The Ariane 5 trajectory data were not included in the specification of the inertial reference system, nor in the equipment-level tests. Consequently, the realignment function was not tested under simulated Ariane 5 flight conditions, and the design error was not detected.

Based on its analysis and conclusions, the Commission of Inquiry issued the following recommendations:

  • organize a special qualification review for all equipment that includes software;
  • redefine critical components taking into account failures that can originate from software;
  • treat qualification documents with as much attention as the code;
  • improve techniques to ensure consistency between the code and its qualification;
  • include external project participants in reviews of specifications, code, and documents; review all flight software;
  • establish a team responsible for developing the software qualification process, proposing strict rules for confirming qualification, and ensuring that the specification, verification, and testing of software are of a consistently high level of quality throughout the Ariane 5 program.

In 1994, the crash of China Airlines flight 140 (aircraft B-1816) at Nagoya was caused in part by the pilot's misinterpretation of the autopilot interface.

In 1998, a crew member of the USS Yorktown mistakenly entered a zero, which resulted in a division by zero. This error caused a failure of the navigation system, and the propulsion system was shut down. The outage lasted several hours, as there was no means to restart the system.
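
Published accounts attribute the failure to an unvalidated zero used as a divisor. A minimal C sketch of the pattern and an obvious guard; the function and values are invented for illustration.

```c
#include <stdio.h>

/* Hypothetical sketch of the Yorktown failure pattern: a value read from
 * a database field is used as a divisor with no validation. */
static double rate(double total, double count)
{
    if (count == 0.0) {
        fprintf(stderr, "invalid divisor: count is zero\n");
        return 0.0;   /* degrade gracefully instead of crashing */
    }
    return total / count;
}

int main(void)
{
    double entered = 0.0;               /* the operator's bad input */
    printf("%f\n", rate(1500.0, entered));
    return 0;
}
```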

In 1999, the $125 million Mars Climate Orbiter was lost because different development teams used different systems of measurement (metric vs. imperial).
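
The widely reported specifics: one team produced thruster impulse data in pound-force seconds while the navigation software expected newton seconds. A hypothetical C sketch of how distinct types make such a conversion explicit; the type names are invented.

```c
#include <stdio.h>

/* Hypothetical sketch of the unit-mismatch pattern behind the Mars
 * Climate Orbiter loss. Wrapping raw doubles in distinct types forces
 * the conversion to be written down explicitly. */
typedef struct { double value; } newton_seconds;
typedef struct { double value; } pound_force_seconds;

static newton_seconds to_si(pound_force_seconds lbf_s)
{
    newton_seconds n = { lbf_s.value * 4.44822 };  /* 1 lbf = 4.44822 N */
    return n;
}

int main(void)
{
    pound_force_seconds reported = { 100.0 };      /* ground software output */
    newton_seconds expected = to_si(reported);     /* navigation input */
    printf("%.2f lbf·s = %.2f N·s\n", reported.value, expected.value);
    return 0;
}
```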

In January 2000, a US spy satellite stopped working due to the Year 2000 problem.

In March 2000, a communications satellite for cellular telephones was destroyed 8 minutes after launch. The explosion was caused by a logic error in a single line of code.

In May 2001, Apple warned its users not to use its new iTunes software because it could erase the entire contents of their hard drives.

In 2004, the unofficial results count for a precinct in Franklin County, Ohio showed Bush with 4,258 votes and Kerry with 260 votes. An audit of the results showed that only 638 people had voted in that precinct.

The FDA's analysis of 3,140 medical device recalls conducted between 1992 and 1998 revealed that 242 of them (7.7%) were attributable to software failures. Of those software-related recalls, 192 (or 79%) were caused by software defects that were introduced when changes were made to the software after its initial production and distribution. Software validation and other related good software engineering practices discussed in the FDA's guidance are a principal means of avoiding such defects and resultant recalls [FDA 02].

In 2003, T-Mobile's computers were hacked and a large number of customer passwords were stolen.

In May 2003, a Soyuz TMA-1 spacecraft carrying a Russian cosmonaut and two American astronauts landed nearly 500 kilometers off course after the craft unexpectedly switched to a ballistic re-entry trajectory. Preliminary indications were that the problem was caused by software in the guidance computer of the new, modified version of the spacecraft [PAR 03].

In September 2004, air traffic controllers in the Los Angeles area lost voice contact with over 400 airplanes in the airspace around LA. The main system used to communicate with pilots, the Voice Switching and Control System (VSCS), failed, and a backup system also failed. Pilots were essentially flying blind, not knowing what other planes were in their path; there were at least five documented cases where planes came within the minimum separation distance mandated by the FAA. Inside the VSCS control subsystem unit (VCSU) is a countdown timer that ticks off time in milliseconds; the VCSU uses the timer as a pulse to send out periodic queries to the VSCS. The timer starts at the highest number the system's server and its software can handle, just over 4 billion milliseconds. When the counter reaches zero, the system runs out of ticks, can no longer time itself, and shuts down. To keep the system working, a technician must manually reboot it every 30 days [GEP 04].
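
The numbers in this account match a 32-bit millisecond counter: 2^32 − 1 milliseconds is about 49.7 days, which is why a mandated 30-day reboot keeps the counter from ever reaching zero. A small C sketch of the arithmetic; the tick logic is an assumption.

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical sketch of the VCSU pattern described above: a 32-bit
 * millisecond counter initialized to its maximum value and decremented
 * on every tick. Without the 30-day reboot, the counter reaches zero
 * after roughly 49.7 days and the system stops. */
int main(void)
{
    uint32_t ticks = UINT32_MAX;        /* "just over 4 billion" ms */
    double days = ticks / (1000.0 * 60 * 60 * 24);
    printf("time to shutdown: %.1f days\n", days);   /* ~49.7 days */

    /* ...on each millisecond tick: if (--ticks == 0) shut_down(); */
    return 0;
}
```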

In 2005, it was reported that 160,000 Toyota Prius hybrids were recalled for a software update after 13 cases in which car engines stalled abruptly without the driver turning off the ignition. Toyota asked its dealers to install a new software version.

In 2005, the CryoSat satellite was destroyed when the second stage of its Russian launch vehicle failed. The flight control software had a flaw.

In 2006, the BMW 745i contained some 70 processors, and this car also had to undergo a recall because the software could produce incorrect valve timing and suddenly stall the engine at full speed.

In 2006, Nicola Clark reported in the International Herald Tribune that a problem with the industrial design software used for the Airbus A380 had led to incorrect cable lengths and connections, and that production delays were due to numerous software problems. The cable length problem was caused by the use of two different versions of the design software at two manufacturing sites.

On December 21, 2007, three businessmen from California, New York, and Florida, all users of QuickBooks Pro, filed a motion to institute a class action suit against the manufacturer Intuit, which specializes in the design of accounting and financial software. The plaintiffs accused Intuit of sending erroneous code over the weekend of December 15–16, 2007, which caused the loss of files and sales data. The problem occurred when the QuickBooks program ran an automatic update sent by Intuit, causing the loss of files containing financial information and, with them, hundreds of hours of re-work. The same situation occurred for all users of this software who were trying to close their books at year-end. The plaintiffs claimed compensation for the loss of their data, as well as for the time and money spent trying to recover the data, for themselves and all other victims. This case shows how faulty code, sent in an automatic update, can cause significant damage on a national or even global scale. Permanent loss of financial data can be disastrous for the survival of most small businesses.

On May 16, 2007, five IT projects in the state of Colorado, worth a total of $325 million, failed to deliver results and were canceled.

On May 16, 2007, 2,900 students from the state of Virginia had to retake their standardized tests because of a software problem.

On May 18, 2007, in London, a judge admitted to not knowing what a website is. This revelation immediately halted the trial he was presiding over (a case concerning terrorists who used the internet).

On May 19, 2007, Alcatel-Lucent lost a hard drive containing detailed data on 200,000 customers.

In August 2005, a Malaysia Airlines Boeing 777 flying from Perth to Kuala Lumpur had problems with its autopilot software, and the aircraft had to be flown manually for the remainder of the flight.

In 2009, the Ontario Lottery and Gaming Corporation refused to pay $42.9 million to Mr. Kusznirewicz for a win on a slot machine. The winning signal was wrong: the payout should not have exceeded $9,025. A software error was the cause of this faulty display during play.

On May 13, 2009, a plaintiff managed to obtain and analyze the source code of the 7110 MKIII-C breathalyzer that he believed was the source of errors. He showed that the calculations were incorrect and that thousands of people had been judged on the basis of these erroneous results.

In November 2009, the first Airbus A380 flying for Air France had to turn back to New York following a “minor” computer failure. The Airbus A380, which was flying from New York to Paris with 530 passengers on board, had to turn back an hour and a half after takeoff. According to the National Union of Airline Pilots (Syndicat National des Pilotes de Ligne, SNPL) at Air France, the problem was with the autopilot. “It would not come on,” said the SNPL spokesman.

On December 1, 2009, the Telegraph newspaper reported the “black screen of death” in Windows 7. This general failure of the new Microsoft operating system was attributed to an error in a registry entry.

A software error in the Ford Explorer engine controls limited vehicle speed to 110 miles per hour instead of the specified 99 miles per hour. At 110 miles per hour, the Firestone tires on the Ford Explorers had a rated life of 10 minutes [HUM 02].

Ford warned its dealers that software might disable the continuously variable transmissions in some 30,000 of its new Ford Five Hundred sedans and Freestyle sport wagons. The mechanical parts are fine, but a computer control meant to detect dirty transmission fluid was putting some cars into sluggish “limp home” mode. Ford had to rewrite software to fix the problem, which it says was caught before any vehicles reached customers [MOR 05].

A Boeing 777-200 en route from Perth to Kuala Lumpur presented the pilot with contradictory reports of airspeed: that the aircraft was over-speeding and at the same time was at risk of stalling. The pilot disconnected the autopilot and attempted to descend, but the auto-throttle caused the aircraft to climb 2000 feet. He was eventually able to return to Perth and land the aircraft safely. The incident was attributed to a failed accelerometer. The Air Data Inertial Reference Unit (ADIRU) had recorded the failure of the device in its memory, but because of a software flaw, it failed to recheck the device's status after power cycling [JAC 07].
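
The flaw described in [JAC 07] amounts to a fault status that is persisted across power cycles but never revalidated. The following C sketch is a hypothetical illustration of that pattern; the type and function names are invented for the example.

```c
#include <stdio.h>
#include <stdbool.h>

/* Hypothetical sketch of the ADIRU flaw pattern: a sensor failure is
 * latched in non-volatile memory, but after a power cycle the software
 * never rechecks the device's status, so the stale record governs
 * whether the device is trusted. */
typedef struct {
    bool latched_failed;   /* persisted across power cycles */
} accelerometer;

static bool usable(accelerometer *a, bool recheck_on_boot)
{
    if (recheck_on_boot) {
        /* correct behavior: consult the latched fault (and, in a real
         * system, re-run a self-test) on every power cycle */
        return !a->latched_failed;
    }
    /* flawed behavior: the status recorded before the power cycle is
     * never revalidated, so the device is treated as healthy */
    return true;
}

int main(void)
{
    accelerometer acc = { .latched_failed = true };
    printf("flawed path says usable:  %d\n", usable(&acc, false));
    printf("correct path says usable: %d\n", usable(&acc, true));
    return 0;
}
```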

An Airbus A340-642 en route from Hong Kong to London suffered from a failure in a data bus belonging to a computer that monitors and controls fuel levels and flow. One engine lost power and a second began to fluctuate; the pilot diverted the aircraft and landed safely in Amsterdam. The subsequent investigation noted that a backup slave computer was available that was working correctly but that, due to faulty logic in the software, the failing computer remained selected as the master. A second report recommended an independent low-fuel warning system and noted the risks of a computerized management system that might fail to provide crew with appropriate data, preventing them from taking appropriate actions [JAC 07].
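
The “faulty logic” described here amounts to a master-selection routine that ignores the health of the candidate computers. A hypothetical C sketch of the pattern follows; the names are invented for the example.

```c
#include <stdio.h>
#include <stdbool.h>

/* Hypothetical sketch of the master-selection flaw described in
 * [JAC 07]: the selection logic keeps the current master even when it
 * is reporting faults and a healthy backup is available. */
typedef struct { const char *name; bool healthy; } fuel_computer;

static const fuel_computer *select_master(const fuel_computer *current,
                                          const fuel_computer *backup,
                                          bool consider_health)
{
    if (consider_health && !current->healthy && backup->healthy)
        return backup;      /* correct: fail over to the working unit */
    return current;         /* flawed: sticky master, health ignored */
}

int main(void)
{
    fuel_computer master = { "UNIT-1", false };  /* failing data bus */
    fuel_computer slave  = { "UNIT-2", true  };  /* working correctly */

    printf("flawed:  %s\n", select_master(&master, &slave, false)->name);
    printf("correct: %s\n", select_master(&master, &slave, true)->name);
    return 0;
}
```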

Toyota identified a software flaw that caused Prius hybrid cars to stall or shut down when traveling at high speed; 23,900 vehicles were affected [FRE 05].

“In 1998, researchers were bringing two subcritical chunks of plutonium together in a ‘criticality’ experiment that measured the rate of change of neutron flux between the two halves. It would be a Real Bad Thing if the two bits actually got quite close, so they were mounted on small controllable cars, rather like a model railway. An operator uses a joystick to cautiously nudge them toward each other. The experiment proceeded normally for a time, the cars moving at a snail's pace. Suddenly both picked up speed, careening towards each other at full speed. No doubt with thoughts of a mushroom cloud in his head, the operator hit the ‘shut down’ button mounted on the joystick. Nothing happened. The cars kept accelerating. Finally, after he actuated an emergency SCRAM control, the operator's racing heart (happily sans defective embedded pacemaker) slowed when the cars stopped and moved apart. The joystick had failed. A processor reading this device recognized the problem and sent an error message, a question mark, to the main controller. Unhappily, ‘?’ is ASCII 63, the largest number that fits in a 6-bit field. The main CPU interpreted the message as a big number meaning go real fast. Two issues come to mind: the first is to test everything, even exception handlers. The second is that error handling is intrinsically difficult and must be designed carefully.” [GAN 04]
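
The core of this story is an error code that shares a data field with valid commands: ‘?’ is ASCII 63, which is also the largest value a 6-bit speed field can hold. A minimal, hypothetical C sketch of the collision:

```c
#include <stdio.h>

/* Hypothetical sketch of the failure mode in the criticality-experiment
 * story: an error reply ('?', ASCII 63) travels over the same 6-bit
 * field as speed commands, and 63 is also the maximum valid speed. */
#define SPEED_FIELD_MAX 63              /* 6-bit field: 0..63 */
#define ERROR_REPLY     '?'             /* ASCII 63 -- collides! */

static void command_motor(unsigned value)
{
    /* the receiver cannot tell an error from "full speed" */
    printf("motor speed set to %u of %d\n", value & 0x3F, SPEED_FIELD_MAX);
}

int main(void)
{
    command_motor(ERROR_REPLY);         /* interpreted as full speed */
    /* the fix: move errors out of the data range, e.g. a separate
       status flag or an out-of-band error channel */
    return 0;
}
```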

“In 1997 Guidant announced that one of its new pacemakers occasionally drives the patient's heartbeat to 190 beats per minute (BPM). Now, I don't know much about cardiovascular diseases, but suspect 190 BPM to be a really bad thing for a person with a sick heart. The company reassured the pacemaker-buying public that there wasn't really a problem; they had fixed the code and were sending disks across the country to doctors. The pacemaker, however, is implanted subcutaneously. There's no Internet connection, no USB port, no PCMCIA slot. Turns out that it's possible to hold an inductive loop over the implanted pacemaker to communicate with it. A small coil in the device normally receives energy to charge the battery. It's possible to modulate the signal and upload new code into flash. The robopatients were reprogrammed and no one was hurt. The company was understandably reluctant to discuss the problem so it's impossible to get much insight into the nature of what went wrong. But there was clearly inadequate testing. Guidant is far from alone. A study in the August 15, 2001 Journal of the American Medical Association (‘Recalls and Safety Alerts Involving Pacemakers and Implantable Cardioverter-Defibrillator Generators’) showed that more than 500,000 implanted pacemakers and cardioverters were recalled between 1990 and 2000. (This month's puzzler: how do you recall one of these things?) Forty-one percent of those recalls were due to firmware problems. The recall rate increased in the second half of that decade compared with the first. Firmware is getting worse. All five U.S. pacemaker vendors have an increasing recall rate. The study said, ‘engineered (hardware) incidents [are] predictable and therefore preventable, while system (firmware) incidents are inevitable due to complex processes combining in unforeseeable ways.’” [GAN 04]
