Chapter 6

Abstract data types

Abstract

This chapter introduces the concept of abstract data types and explains how they are used to improve software reliability while reducing the cost of development and maintenance. An example is presented for a mixed C and assembly implementation of an ADT to count word frequencies. The chapter ends with an ethical case study, highlighting the importance of structured design techniques.

Keywords

Abstract data types (ADTs); Assembly language; Word frequency counts; Therac-25; Sorting; Indexing; Performance

An Abstract Data Type (ADT) is composed of data and the operations that work on that data. The ADT is one of the cornerstones of structured programming. Proper use of ADTs has many benefits. Most importantly, abstract data types help to support information hiding. An ADT hides information by encapsulating the information into a module or other construct, and providing a well-defined interface. The interface typically consists of the names of data types provided by the ADT and a set of subroutine definitions, or function prototypes, for operating on the data types. The implementation of the ADT is hidden from the client code that uses the ADT.

A common use of information hiding is to hide the physical storage layout for data so that if it is changed, the change is restricted to a small subset of the total program. For example, if a three-dimensional point (x,y,z)Image is represented in a program with three floating point scalar variables and later, the representation is changed to a single array variable of size three, a module designed with information hiding in mind would protect the remainder of the program from such a change.

Information hiding reduces software development risk by shifting the code's dependency on an uncertain implementation onto a well-defined interface. Clients of the interface perform operations purely through the interface, which does not change. If the implementation changes, the client code does not have to change.

Encapsulating software and data structures behind an interface allows the construction of objects that mimic the behavior and interactions of objects in the real world. For example, a simple digital alarm clock is a real-world object that most people can use and understand. They can understand what the alarm clock does, and how to use it through the provided interface (buttons and display) without having to understand every part inside of the clock. If the internal circuitry of the clock were to be replaced with a different implementation, people could continue to use it in the same way, provided that the interface did not change.

6.1 ADTs in assembly language

As with all other structured programming concepts, ADTs can be implemented in assembly language. In fact, most high level compilers convert structured programming code into assembly during compilation. All that is required is that the programmer define the data structure(s), and the set of operations that can be used on the data. Listing 6.1 gives an example of an ADT interface in C. The type Image is not fully defined in the interface. This prevents client software from accessing the internal structure of the image data type. Therefore, programmers using the ADT can modify images only by using the provided functions. Other structured programming and object-oriented programming languages such as C++, Java, Pascal, and Modula 2 provide similar protection for data structures so that client code can access the data structure only through the provided interface. Note that only the Image definition is exposed, indicating to client programs that the red, green, and blue components of a pixel must be a number between 0 and 255. In C, as with other structured programming languages, the implementation of the subroutines can also be hidden by placing them in separate compilation modules. Only the ADT implementation code will have access to the internal structure of the Image data type.

Image

Image

Image

Assembly language does not have the ability to define a data structure as such, but it does provide the mechanisms needed to specify the location of each field with respect to the beginning of a data structure, as well as the overall size of the data structure. With a little thought and effort, it is possible to implement ADTs in Assembly language.

Listing 6.2 shows the private implementation of the Image data type, which is included by the C files which implement the Image abstract data type. A pixel is defined as a C struct with one byte each for the red, blue, and green color components. Since the largest item in the struct is a byte, the C compiler will create a structure that is exactly three bytes long, without any extra bytes for alignment. The image struct contains three integer values, so it will occupy twelve bytes. Listing 6.3 shows how the data structures from the previous listings can be defined in assembly language. With those definitions, any of the functions declared in Listing 6.1 can be written in assembly language.

6.2 Word frequency counts

Counting the frequency of words in written text has several uses. In digital forensics, it can be used to provide evidence as to the author of written communications. Different people have different vocabularies, and use words with differing frequency. Word counts can also be used to classify documents by type. Scientific articles from different fields contain words specific to the field, and historical novels will differ from western novels in word frequency.

Listing 6.4 shows the main function for a simple C program which reads a text file and creates a list of all of the words contained in a file, along with their frequency of occurrence. The program has been divided into two parts: the main program, and an ADT. The ADT is used to keep track the words and their frequencies, and to print a table of word frequencies.

Image

The interface for the ADT is shown in Listing 6.5. There are several ways that the ADT could be implemented. Note that the interface given in the header file does not show the internal fields of the word list data type. Thus, any file which includes this header are allowed to declare pointers to Image data types, but cannot access or modify any internal fields. The list of words could be stored in an array, a linked list, a binary tree, or some other data structure. Also, the subroutines could be implemented in C or in some other language, including assembly. Listing 6.6 shows an implementation in C using a linked list. Note that the function for printing the word frequency list in numerical order has not been implemented. It will be written in assembly language. Since the program is split into multiple files, it is a good idea to use the Image utility to build the executable program. A basic makefile is shown in Listing 6.7.

Image

Image

Image

Suppose we wish to implement one of the functions from Listing 6.6 in AArch64 assembly language. We would

  1. 1.  delete the function from the C file,
  2. 2.  create a new file with the assembly version of the function, and
  3. 3.  modify the makefile so that the new file is included in the program.

The header file and the main program file would not require any changes. The header file provides function prototypes that the C compiler uses to determine how parameters should be passed to the functions. As long as our new assembly function conforms to its C header definition, the program will work correctly.

6.2.1 Sorting by word frequency

The linked list is created in alphabetical order, but the Image function is required to print it sorted in reverse order of number of occurrences. There are several ways in which this could be accomplished, with varying levels of efficiency. Possible approaches include, but are not limited to:

  • •  Re-ordering the linked list using an insertion sort. This approach creates a complete new list by removing each item, one at a time, from the original list, and inserting it into a new list sorted by the number of occurrences rather than the words themselves. The time complexity for this approach would be O(N2)Image, but would require no additional storage. However, if the list were later needed in alphabetical order, or any more words were to be added, then it would need to be re-sorted in the original order.
  • •  Sorting the linked list, using a merge sort algorithm. Merge sort is one of the most efficient sorting algorithms known, and can be efficiently applied to data in files and linked lists. The merge sort works as follows:
    1. 1.  The sub-list size, i, is set to 1.
    2. 2.  The list is divided into sub-lists, each containing i elements. Each sub-list is assumed to be sorted. (A sub-list of length one is sorted by definition.)
    3. 3.  The sub-lists are merged together to create a list of sub-lists of size 2i, where each sub-list is sorted.
    4. 4.  The sub-list size, i is set to 2i.
    5. 5.  The process is repeated from step 2 until iNImage, where N is the number of items to be sorted.
    The time complexity for the merge sort algorithm is O(NlogN)Image, which is far more efficient than the insertion sort. This approach would also require no additional storage. However, if the list were later needed in alphabetical order, or any more words were to be added, then it would need to be re-sorted in the original alphabetical order.
  • •  Create an index, and sort the index rather than rebuilding the list. Since the number of elements in the list is known, we can allocate an array of pointers. Each pointer in the array is then initialized to point to one element in the linked list. The array forms an index, and the pointers in the array can be re-sorted in any desired order, using any common sorting method such as bubble sort (O(N2)Image), in-place insertion sort (O(N2)Image), quick sort (O(NlogN)Image), or others. This approach requires additional storage, but has the advantage that it does not modify the original linked list.

There are many other possibilities for re-ordering the list. Regardless of which method is chosen, the main program and the interface (header file) need not be changed. Different implementations of the sorting function can be substituted without affecting any other code, because the implementation of the ADT is hidden from the code that uses it. This separation of inteface and implementation provides many benefits.

The wl_print_numerical function can be implemented in assembly as shown in Listing 6.8. The function operates by re-ordering the linked list using an insertion sort as described above. Listing 6.9 shows the change that must be made to the makefile. Now, when make is run, it compiles the two C files and the assembly file into object files, then links them all together. The C implementation of Image in Image must be deleted or commented out or the linker will emit an error indicating that it found two versions of Image.

Image

Image

6.2.2 Better performance

The word frequency counter, as previously implemented, takes several minutes to count the frequency of words in this textbook on a Raspberry Pi. Most of the time is spent building the list of words, and re-sorting the list in order of word frequency. Most of the time for both of these operations is spent in searching for the word in the list before incrementing its count or inserting it in the list. There are much more efficient ways to build ordered lists of data.

Since the code is well modularized using an ADT, the internal mechanism of the list can be modified without affecting the main program. A major improvement can be made by changing the data structure from a linked list to a binary tree. Fig. 6.1 shows an example binary tree storing word frequency counts. The time required to insert into a linked list is O(N)Image, but the time required to insert into a binary tree is O(log2N)Image. To give some perspective, the author's manuscript for this textbook contains about 125,000 words. Since log2(125,000)<17Image, we would expect the linked list implementation to require about 125000177353Image times as long as a binary tree implementation to process the author's manuscript for this textbook. In reality, there is some overhead to the binary tree implementation. Even with the extra overhead, we should see a significant speedup. Listing 6.10 shows the C implementation using a balanced binary tree instead of a linked list.

Image
Figure 6.1 Binary tree of word frequencies.

Image

Image

Image

Image

Image

6.2.3 Indexing and sorting by frequency

With the tree implementation, Image could build a new tree, sorted on the word frequency counts. However, it may be more efficient to build a separate index, and sort the index by word frequency counts. The assembly code will allocate an array of pointers, and set each pointer to one of the nodes in the tree, as shown in Fig. 6.2. Then, it will use a quicksort to rearrange the pointers into descending order by word frequency count, as shown in Fig. 6.3. This implementation is shown in Listing 6.11.

Image
Figure 6.2 Binary tree of word frequencies with index added.
Image
Figure 6.3 Binary tree of word frequencies with sorted index.

Image

The tree-based implementation gets most of its speed improvement through using two O(NlogN)Image algorithms to replace O(N2)Image algorithms. These examples show how a small part of a program can be implemented in assembly language, and how to access C data structures from assembly language. The functions could just as easily have been written in C rather than assembly, without greatly affecting performance. Later chapters will show examples where the assembly implementation does have significantly better performance than the C implementation.

6.3 Ethics case study: Therac-25

The Therac-25 was a device designed for radiation treatment of cancer. It was produced by Atomic Energy of Canada Limited (AECL), which had previously produced the Therac-6 and Therac-20 units in partnership with CGR of France. It was capable of treating tumors close to the skin surface using electron beam therapy, but could also be configured for Megavolt X-ray therapy to treat deeper tumors. The X-ray therapy required the use of a tungsten radiation shield to limit the area of the body that was exposed to the potentially lethal radiation produced by the device.

The Therac-25 used a double pass accelerator, which provided more power, in a smaller space, at less cost, compared to its predecessors. The second major innovation was that computer control was a central part of the design, rather than an add-on component as in its predecessors. Most of the hardware safety interlocks that were integral to the designs of the Therac-6 and Therac-20, were seen as unnecessary, because to software would perform those functions. Computer control was intended to allow operators to set up the machine more quickly, allowing them to spend more time communicating with patients and to treat more patients per day. It was also seen as a way to reduce production costs by relying on software, rather than hardware, safety interlocks.

There were design issues with both the software and the hardware. Although this machine was built with the goal of saving lives, between 1985 and 1986, three deaths and other injuries were attributed to the hardware and software design of this machine. Death due to radiation exposure is usually slow and painful, and the problem was not identified until the damage had been done.

6.3.1 History of the Therac-25

AECL was required to obtain US Food and Drug Administration (FDA) approval before releasing the Therac-25 to the US market. They obtained approval quickly by declaring “pre-market equivalence,” effectively claiming that the new machine was not significantly different from its predecessors. This practice was common in 1984, but was overly optimistic, considering that most of the safety features had been changed from hardware to software implementations. With FDA approval, AECL made the Therac-25 commercially available and performed a Fault Tree Analysis to evaluate the safety of the device.

Fault Tree Analysis, as its name implies, requires building a tree to describe every possible fault, and assign probabilities to those faults. After building the tree, the probabilities of hazards, such as overdose, can be calculated. Unfortunately, the engineers assumed that the software (much of which was re-used from the previous Therac models) would operate correctly. This turned out not to be the case, because the hardware interlocks present in the previous models had hidden some of the software faults. The analysts did consider some possible computer faults, such as an error being caused by cosmic rays, but assigned extremely low probabilities to those faults. As a result, the assessment was very inaccurate.

When the first report of an overdose was reported to AECL in 1985, they sent an engineer to the site to investigate. They also filed a report with the FDA and the Canadian Radiation Protection Board (CRPB). AECL also notified all users of the fact that there had been a report and recommended that operators should visually confirm hardware settings before each treatment. The AECL engineers were unable to reproduce the fault, but suspected that it was due to the design and placement of a microswitch. They redesigned the microswitch and modified all of the machines that had been deployed. They also retracted their recommendation that operators should visually confirm hardware settings before each treatment.

Later that year, a second incident occurred. In this case, there is no evidence that AECL took any action. In January of 1986, AECL received another incident report. An employee at AECL responded by denying that the Therac-25 was at fault, and stated that no other similar incidents had been reported. Another incident occurred in March of that year. AECL sent an engineer to investigate. The engineer was unable to determine the cause, and suggested that it was due to an electrical problem, which may have caused an electrical shock. An independent engineering firm was called to examine the machine and reported that it was very unlikely that the machine could have delivered an electrical shock to the patient. In April of 1986, another incident was reported. In this case, the AECL engineers, working with the medical physicist at the hospital, were able to reproduce the sequence of events that lead to the overdose.

As required by law, AECL filed a report with the FDA. The FDA responded by declaring the Therac-25 defective. AECL was ordered to notify all of the sites where the Therac-25 was in use, investigate the problem, and file a corrective action plan. AECL notified all sites, and recommended removing certain keys from the keyboard on the machines. The FDA responded by requiring them to send another notification with more information about the defect and the consequent hazards. Later in 1986, AECL filed a revised corrective action plan.

Another overdose occurred in January 1987, and was attributed to a different software fault. In February, the FDA and CRPB, both ordered that all Therac-25 unites be shut down, pending effective and permanent modifications. AECL spent six months developing a new corrective action plan, which included a major overhaul of the software, the addition of mechanical safety interlocks, and other safety-related modifications.

6.3.2 Overview of design flaws

The Therac-25 was controlled by a DEC PDP-11 computer. The PDP-11 was the most popular minicomputer ever produced. Around 600,000 were produced between 1970 and 1990 and used for a variety of purposes, including embedded systems, education, and general data processing. It was a 16-bit computer and was far less powerful than a Raspberry Pi. The Therac-25 computer was programmed in assembly language by one programmer and the source code was not documented. Documentation for the hardware components was written in French. After the faults were discovered, a commission concluded that the primary problems with the Therac-25 were attributable to poor software design practices, and not due to any one of several specific coding errors. This is probably the best known case where a poor overall software design, and insufficient testing, led to loss of life.

The worst problems in the design and engineering of the machine were:

  • •  The code was not subjected to independent review.
  • •  The software design was not considered during the assessment of how the machine could fail or malfunction.
  • •  The operator could ignore malfunctions and cause the machine to proceed with treatment.
  • •  The hardware and software were designed separately and not tested as a complete system until the unit was assembled at the hospitals where it was to be used.
  • •  The design of the earlier Therac-6 and Therac-20 machines included hardware interlocks which would ensure that the X-ray mode could not be activated unless the tungsten radiation shield was in place. The hardware interlock was replaced with a software interlock in the Therac-25.
  • •  Errors were displayed as numeric codes, and there was no indication of the severity of the error condition.

The operator interface consisted of a keyboard and text-mode monitor, which was common in the early 1980s. The interface had a data entry area in the middle of the screen and a command line at the bottom. The operator was required to enter parameters in the data entry area, then move the cursor to the command line to initiate treatment. When the operator moved the cursor to the command line, internal variables were updated and a flag variable was set to indicate that data entry was complete. That flag was cleared when a command was entered on the command line. If the operator moved the cursor back to the data entry area without entering a command, then the flag was not cleared, and any subsequent changes to the data entry area did not affect the internal variables.

A global variable was used to indicate that the magnets were currently being adjusted. This variable was modified by two functions, and did not always contain the correct value. Adjusting the magnets required about eight seconds, and the flag was correct for only a small period at the beginning of this time period.

Due to the errors in the design and implementation of the software, the following sequence of events could result in the machine causing injury to, or even death of, the patient:

  1. 1.  The operator mistakenly specified high power mode during data entry.
  2. 2.  The operator moved the cursor to the command line area.
  3. 3.  The operator noticed the mistake, and moved the cursor back to the data entry area without entering a command.
  4. 4.  The operator corrected the mistake and moved the cursor back to the command line.
  5. 5.  The operator entered the command line area, left it, made the correction, and returned within the eight-second window required for adjusting the magnets.

If the above sequence occurred in less than eight seconds, then the operator screen could indicate that the machine was in low power mode, although it was actually set in high power mode. During a final check before initiating the beam, the software would find that the magnets were set for high power mode but the operator setting was for low power mode. It displayed a numeric error code and prevented the machine from starting. The operator could clear the error code by resetting the computer (which only required one key to be pressed on the keyboard). This caused the tungsten shield to withdraw but left the machine in X-ray mode. When the operator entered the command to start the beam, the machine could be in high power mode without having the tungsten shield in place. X-rays were applied to the unprotected patient.

It took some time for this critical flaw to appear. The failure only occurred when the operator initially made a one-keystroke mistake in entering the prescription data, moved to the command area, and then corrected the mistake within eight seconds. Initially, operators were slow to enter data, and spent a lot of time making sure that the prescription was correct before initiating treatment. As they became more familiar with the machine, they were able to enter data, and correct mistakes more quickly. Eventually, operators became familiar enough with the machine that they could enter data, make a correction, and return to the command area within the critical eight-second window. Also, the operators became familiar with the machine reporting numeric error codes, without any indication of the severity of the code. The operators were given a table of codes and their meanings. The code reported was “no dose” and indicated “treatment pause.” There is no reason why the operator should consider that to be a serious problem; they had become accustomed to frequent malfunctions that did not have any consequences to the patient.

Although the code was written in assembly language, that fact was not cited as an important factor. The fundamental problems were poor software design and over-confidence. The re-use of code in an application for which it was not initially designed also may have contributed to the system flaws. A proper design using established software design principles, including structured programming and abstract data types, would almost certainly have avoided these fatalities.

6.4 Chapter summary

The abstract data type is a structured programming concept which contributes to software reliability, eases maintenance, and allows for major revisions to be performed in as safe way. Many high-level languages enforce, or at least facilitate, the use of ADTs. Assembly language does not. However, the ethical assembly language programmer will give the extra effort to write code that conforms to the standards of structured programming, using abstract data types to help ensure safety, reliability, and maintainability.

ADTs also facilitate the implementation of software modules in more than one language. The interface specifies the components of the ADT, but not the implementation. The implementation can be in any language. As long as assembly programmers and compiler authors generate code that conforms to a well-known standard, their code can be linked to code written in other languages.

Poor coding practices and poor design can lead to dire consequences, including loss of life. It is the responsibility of the programmer, regardless of the language used, to make ethical decisions in the design and implementation of software. Above all, the programmer must be aware of the possible consequences of the decisions they make.

Exercises

  1. 6.1.  What are the advantages of designing software using abstract data types?
  2. 6.2.  Why is the Image data type hidden from client code in Listing 6.2?
  3. 6.3.  High level languages provide mechanisms for information hiding, but assembly does not. Why should the assembly programmer not simply bypass all information hiding and access the internal data structures directly?
  4. 6.4.  The assembly code in Image accesses the internal structure of the Image data type. Why is it allowed to do so? Should it be allowed to do so?
  5. 6.5.  Given the following definitions for a stack ADT:
Image
Image
  1. Write the Image function in AArch64 assembly language.
  2. 6.6.  Referring to the previous question, write the Image function in AArch64 assembly language.
  3. 6.7.  Referring to the previous two questions, write the Image function in AArch64 assembly language.
  4. 6.8.  Referring to the previous three questions, write the Image function in AArch64 assembly language.
  5. 6.9.  Referring to the previous three questions, write the Image function in AArch64 assembly language.
  6. 6.10.  Re-implement all of the previous stack functions using a linked list rather than a static array.
  7. 6.11.  The “Software Engineering Code of Ethics And Professional Practice” states that a responsible software engineer should “Approve software only if they have well-founded belief that it is safe, meets specifications, passes appropriate tests...” (sub-principle 1.03) and “Ensure adequate testing, debugging, and review of software...on which they work.” (sub-principle 3.10). Unfortunately, defects did make their way into the system.
  8. The software engineering code of ethics also states that a responsible software engineer should “Treat all forms of software maintenance with the same professionalism as new development.”
    1. a.  Explain how the Software Engineering Code of Ethics And Professional Practice were violated by the Therac 25 developers.
    2. b.  How should the engineers and managers at AECL have responded when problems were reported?
    3. c.  What other ethical and non-ethical considerations may have contributed to the deaths and injuries?

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.145.173.112