11 Safety-Critical Development

Mark Kraeling, CTO Office, GE Transportation, Melbourne, FL, United States

Abstract

Embedded systems that are sold or upgraded may need to comply with a variety of safety standards based on the market and intended use. These standards can also outline requirements that need to be met based on international standards. Standards, such as ones based on IEC, attempt to develop a common set of guidelines, so that each individual country and/or market doesn’t have separate requirements.

Keywords

Safety-defensive strategies; Fault; Error; Hazard; Risk analysis; Safety architecture; Static code analysis; Failure mode and effects analysis (FMEA)

1 Introduction

Embedded systems that are sold or upgraded may need to comply with a variety of safety standards based on the market and intended use. These standards can also outline requirements that need to be met based on international standards. Standards, such as ones based on IEC, attempt to develop a common set of guidelines, so that each individual country and/or market doesn’t have separate requirements.

This chapter is devoted to looking at various safety-critical software development strategies that could be used with a variety of safety requirements. Some of the strategies may not make sense for your product or market segment.

The first part of the chapter goes over some basic strategies that can be used for the up-front project planning for a safety-critical project.

The second part discusses fault, hazard, and risk analyses. For safety-critical projects the early and continuous focus should be on what fault scenarios exist, the hazard that could occur if failures occur, and what risk it poses to the product and its environment.

The third part of the chapter goes over the basics of safety-critical architectures that are used and the pros/cons of each.

The last part concentrates on strategies in software development and implementation.

Getting a clear picture of the various standards that your project will need to meet up-front, following the appropriate implementation strategies listed, and watching out for the certification “killers” will help to make safety-critical product launch more successful.

1.1 Which Safety Requirements?

One of the most important aspects of developing safety-critical software is determining which requirements and standards are going to be followed.

Depending on the understanding of the safety requirements of your product or the intended market, you may need to get outside help to determine what needs to be met. Consider following the steps below to aid in your safety certification effort:

  1. Customer interaction—If you are entering a new market, the intended customer for that market probably knows the starting point of which safety requirements need to be met. They may be able to provide information on the safety standards that a similar product already meets. If the end customer is just using the product without a lot of technical background, then this step should be skipped.
  2. Similar product in same intended market—It may be more straightforward to see what safety requirements and standards a similar product meets. For instance, if your company or a partner already sells a medical device to the same market and your product is similar, this may be a good place to start.
  3. Competitive intelligence—Doing basic research on the Internet or from freely available open information from marketing materials may help determine a good starting point as well. Often, paperwork needs to be filed with agencies on which specific standards should be met.
  4. Professional assistance—Each market or market segment normally has agencies or contract facilities that can aid in determining which standards need to be met. Paying a little up-front, especially after gathering necessary information from steps 1–3, will help make this step pay off in the long run.

After gathering this information you should have a good idea about which standards need to be met. During this investigation also determine whether it is a self-certification activity, a standardized assessment activity, or a full-fledged independent assessment certification.

For the sets of requirements that need to be met the team should develop a strategy and initial analysis of how they will comply with the requirements. This means of compliance could be by design, by analysis, or by testing.

As an example, for design there may be requirements for redundancy. The design could include a dual-processor design or redundant communications paths. As an example, for analysis, if a certain bit error rate needs to be detected, then an appropriate length cyclic redundancy check (CRC) could be calculated. Through a mathematical equation the bit error rate could be determined. Using testing as a means of compliance is self-explanatory, as each of the requirements listed would have a corresponding test plan and procedure.

Finally, the team should determine how the evidence should be organized and presented.

All standards and delivery dates need to be listed and agreed on regardless of whether a self-certification, utilizing an auditor, or full-fledged independent assessment is performed. If this specific list can be put into a contract and signed, then it should be done to protect the team. If there is no customer contract that would list this type of documentation, then even an agreement between the project team and management could be written.

If using an independent assessor (which is normally paid for by the product team), then agree to the set of documentation, the means of compliance, and the evidence that needs to be provided up-front. Also agree on which party will determine if a newer standard is released while in the early project stages. Also agree in principle (and in writing) on when the project is far enough along so that the standards list and specification dates can be frozen. If this is all discussed and agreed upon up-front, safety certification becomes much easier.

1.2 Certification Killers

There are also items to watch out for during the safety certification process. These were learned the same way as the key strategies presented in this chapter: many of them were battles lost on past projects. Across multiple products and assessor providers, these are the items that will most certainly hinder or even kill your certification effort:

  •  Failure to recognize safety requirements as real.
  •  Unclear requirements or requirements never agreed upon up-front.
  •  Lack of clear evidence of compliance.
  •  Not doing homework up-front and finding more safety standards that need to be met throughout the development process.
  •  Lack of dedicated resources, or resources that jump between projects.
  •  Scope and requirements creep.
  •  Trying to safety-certify too many things—not developing a boundary diagram and having everyone agree to it.
  •  Not accounting for enough resources to document the safety case and test based on those requirements.
  •  Not using a single contact to interface with the assessor (too many cooks!).
  •  Not being honest with the weaknesses of the proposed system.
  •  Waiting until the last minute to submit documentation.
  •  Failure to develop a relationship with the local country where the product will be deployed.
  •  Failure to properly sequence certification tasks.
  •  Failing to qualify software tools and the OS to the appropriate safety level.

2 Project-Planning Strategies

The following rules can be applied to help safety-critical software development projects. The strategies listed are typically considered very early in the project development life cycle, before the software is written. These strategies were developed and refined during multiple product certification efforts, and following them helps reduce the amount of money and resources spent on the overall effort.

2.1 Strategy 1: Determine the Project Certification Scope Early

Following some of the guidelines and directives listed in Section 1.1, identify which standards your product needs to meet. Determining whether it is a consumer product, has safety implications for the public, and/or which particular certification guidelines satisfy the customer are all part of this step.

2.2 Strategy 2: Determine the Feasibility of Certification

Answer questions up-front whether the product and solution are technically and commercially feasible. By evaluating the top-level safety hazards and the safety objectives, basic defensive strategies can be brainstormed and developed up-front. Involve engineering to determine the type of architecture that is required to meet those defensive strategies, because drastic architectural differences from the base product’s current architecture could increase risk and cost.

2.3 Strategy 3: Select an Independent Assessor (if Used)

Find an assessor who has experience with your market segment. Various assessors have specialty industries and areas, so find out if they have experience in certifying products in your industry. Once an assessor becomes comfortable with your overall process and the development procedures it makes certification of subsequent products much easier.

2.4 Strategy 4: Understand Your Assessor’s Role (if Used)

The assessor’s job is to assess your product with respect to compliance to standards and norms. Do not rely on the assessor to help design your system; the assessor is neither responsible nor obligated to tell you that you are heading down a wrong path! The assessor’s role is to determine if the safety requirements and objectives have been met resulting in a report of conformity at the end of the project.

2.5 Strategy 5: Assessment Communication is Key

Having a clear line of communication between your team and the group controlling the standards that need to be met is extremely important. Be sure to document all meetings and action items. Document decisions that have been mutually decided during the development process, so that the assessor and team stay on the same page. Ask for a position on any unclear issues or requirements as early as possible. Insist on statements of approval for each document or artifact that will be used.

2.6 Strategy 6: Establish a Basis of Certification

List all the standards and directives that your product needs to comply with, including issue dates of the documents. In conjunction with your assessor, agree, disagree, or modify them on a paragraph-by-paragraph basis. Consider placing all the requirements in a compliance matrix, so they can be tracked with the project team. Finally, do not be afraid to propose an “alternate means of compliance” if required.

2.7 Strategy 7: Establish a “Fit and Purpose” for Your Product

Establishing a fit and purpose up-front will prevent future headaches! The “fit” for your product is the space that you plan on selling into. If selling a controller for an overhead crane, then state that up-front and don’t incorporate requirements needed for an overhead lighting system. The “purpose” is what the product is supposed to do or how it is going to be used. Clearly define the system boundary and what portions of your overall system and product are included in the certification. Consider things such as user environment, operating environment, and integration with other products. Also, considerations such as temperature and altitude can impact the circuit design, so those should be defined well in advance for successful product certification.

2.8 Strategy 8: Establish a Certification Block Diagram

Generate a hardware block diagram of the system with the major components such as modules and processing blocks. Include all the communication paths as well as a summary of the information flow between the blocks. Identify all the external interfaces including the “certification boundary” for the system on the diagram. The certification boundary shows what is being certified and what is not.

2.9 Strategy 9: Establish Communication Integrity Objectives

Before the system design, determine up-front the “residual error” rate objectives for each digital communication path. Defining CRCs and Hamming distance requirements for the paths also helps determine the integrity levels required. Also discuss with the assessor up-front how the residual error rate will be calculated, as this could drive specific design constraints or necessary features.

2.10 Strategy 10: Identify All Interfaces Along the Certification Boundary

Generate up-front a boundary “Interface Control Document.” From this document identify all the required safety integrity levels for each of the interfaces. At this point research with the potential parties that own the source or destination side of the interface can begin, to make sure they can comply. Quantify and qualify the interface, including definitions of acceptable ranges, magnitudes, CRC requirements, and error checking.

2.11 Strategy 11: Identify the Key Safety-Defensive Strategies

Identify and implement the safety-defensive strategies to achieve the safety objectives for the program. Define key terms such as fault detection, fault accommodation, and fail-safe states. During initial architecture and design keep track of early failure scenarios that could occur. It is difficult to find all of them early in the project, but changes in the architecture and system design are easier made on the front end of the project.

2.12 Strategy 12: Define Built-in-Test (BIT) Capability

Identify the planned BIT coverage including initialization and periodic, conditional, and user-initiated tests. Define a manufacturing test strategy to check for key safety hardware components before shipping to the end user. After identifying each of these built-in-test functions review with the assessor and get agreement.

2.13 Strategy 13: Define Fault Annunciation Coverage

While keeping the system and user interface in mind, define which faults get annunciated. Determine when they should be announced to the operator or put into the log. Determine the level of information that is given to the end user and what is logged. Define the conditions that spawn a fault and what clears that particular fault. Define any fault annunciation color, text, sound, etc. After these are defined make sure the assessor agrees!

2.14 Strategy 14: Define Reliance and Expectation of the Operator/User

Clearly define any reliance that is placed on the operator or user to keep the system safe. Determine the user’s performance and skill level, and the human factors involved with safety and vigilance. When placing safety expectations on the user make sure the end customer agrees with the assessment. And, as stated, make sure your assessor agrees with it as well.

2.15 Strategy 15: Define Plan for Developing Software to Appropriate Integrity Level

For each of the formal methods address the compliance with each objective element of the applicable standard you are certifying to. The software safety strategy should include both control of integrity and the application of programming-defensive strategies. The plan should include coding standards, planned test coverage, use of COTS, software development rules, OS integrity requirements, and development tools. Finally, define and agree on software performance metrics up-front.

2.16 Strategy 16: Define Artifacts to be Used as Evidence of Compliance

List all the documents and artifacts you plan to produce as part of the system safety case. List how you plan to cross-reference them to requirements in the system. Make sure any document used as evidence of compliance is approved for release via your configuration control process. Test documentation must have a signature and date for tracking documentation and execution of each test case. Above all, make sure your assessor agrees with your document and artifact plan up-front.

2.17 Strategy 17: Plan for Labor-Intensive Analyses

Plan on conducting a piece-part failure mode and effects analysis (FMEA), which is very labor intensive. Also plan on a system-level FMEA and a software error analysis. It is recommended that probabilistic fault trees be used to justify key defensive strategies and to address systematic failures. More information on FMEAs is in Section 3.4.

2.18 Strategy 18: Create User-Level Documentation

Plan on having a user manual that includes the following information: system narrative, normal operating procedures, abnormal operating procedures, emergency procedures, and safety alerts. Also include a comprehensive maintenance manual that contains the following: safety-related maintenance, required inspections and intervals, life-limited components, dormancy elimination tasks, and instructions on loading software and validating the correct load.

2.19 Strategy 19: Plan on Residual Activity

Any change to your certification configuration must be assessed for the impact on your safety certification. There could be features added to the original product, or part changes that need to be made that could affect the safety case. Some safety certifications also require annual independent inspections of manufacturing and/or quality assurance groups. Residual activity (and thus residual cost) will occur after the certification effort is complete.

2.20 Strategy 20: Publish a Well-Defined Certification Plan

Document a certification plan that includes all previous rules, timeline events, resources, and interdependencies. Include a certification “road map” that can be referred to throughout the development process, to have a snapshot of the documentation that is required for the required certification process.

3 Faults, Failures, Hazards, and Risk Analysis

Once the project-planning phase of the project is complete it is important to make an assessment of where the risks may be for the system being designed. To measure the overall risk for the product a risk analysis needs to be performed.

Before getting to the risk analysis, a few key safety-critical terms will be explored.

3.1 Faults, Errors, and Failures

A fault is a characteristic of an embedded system that could lead to a system error. An example of a fault is a software pointer that is not initialized correctly under specific conditions, where use of the pointer could lead to a system error. There are also faults that could exist in software that never manifest themselves as an error and are not necessarily seen by the end user.

An error is erroneous system behavior that is unexpected by the end user. It is the behavior the system exhibits when one or more faults occur. An example could be a subprocess that quits running within the system because of a software pointer that is not initialized correctly. An error may not necessarily lead to a system failure, especially if the error has been mitigated by having a process check whether this subtask is running and restart it if necessary.

For an embedded system a failure is best described as the system not performing its intended function or service as expected by its users at some point in time. Since this is largely based on the user’s perception or usage of the system, the issue itself could be in the initial system requirements or customer specification, not necessarily the software. However, a failure could also occur because of an individual error or erroneous system functionality resulting from multiple errors in the system. Following the example discussed at the start of this section, the software pointer initialization fault could result in a subtask error, which in turn could cause a system failure such as a crash or a user interface that does not perform correctly.

An important aspect is that for the progression of these terms they may not necessarily ever manifest themselves at the next level. An uninitialized software pointer is a fault, but if it is never used then an error would not occur (and neither would the failure). There may also need to be multiple instances of faults and errors, possibly on completely different fault trees, to progress to the next state. Fig. 1 shows the progression for faults, errors, and failures.

Fig. 1
Fig. 1 Faults, errors, and failure progression.

For safety-critical systems there are techniques that can be used to minimize the progression of faults to errors to failures. All these techniques impact the reliability of the system, as discussed in the next section.

3.2 Availability and Reliability

Availability and reliability are related terms but are not the same. Availability is a measure of how much the embedded system will be running and delivering the expected services of the design. Examples of high-availability systems include network switches for voice and data, power distribution, and television delivery systems. Reliability is the probability that an embedded system will deliver the requested services at a given point in time. Even though these terms are related a high-availability system does not necessarily mean the system will also be highly reliable.

An example of a system that could have high availability, but low reliability, is a home network system that has faults. In this example, if every 100th packet is dropped causing a retry to occur, to the user it will seem like a very available system. There aren’t any periods of a total outage, but just periods in the background where packets need to be resent. The fault could be causing an error to occur, with delays waiting for processes to restart. The system itself stays up, the browser or whatever user interface stays running, so the user doesn’t perceive it as a system failure.

Safety-critical systems are examples of high-reliability systems and, in the case of systems that are monitoring other safety-critical systems, highly available. Systems that are both highly reliable and highly available are said to be dependable. A dependable system provides confidence to the user that the system will perform when they want it to, as it is supposed to. Addressing how to handle faults, which could lead to a system failure, is the source for system dependability.

3.3 Fault Handling

There are four aspects of fault handling that should be evaluated as part of a safety-critical system: fault avoidance, fault tolerance, fault removal, and fault prediction.

Fault avoidance in a safety-critical system is largely an exercise in developing a system that helps mitigate the introduction of software and hardware faults into the system. Formal design and development practices help developers avoid faults. One approach to fault avoidance is designing a software system with a single thread of execution, as opposed to a multitasking preemptive type of task scheduling. This helps avoid issues of parallelism and timing issues that could occur if one section of code is impacted in a negative way by another. With a preemptive design it would be unreasonable to cover every timing aspect or execution order as part of normal system testing. Safety-critical programming practices that target fault avoidance are listed in Section 5.

Fault tolerance is a layer of software that can “intercept” faults that occur in the system and address them so that they do not become system failures. An important characteristic of fault-tolerant systems is excellent fault detection. Once an individual hardware or software component has been evaluated as “failed,” the system can take appropriate action. Performing fault detection at a high level in the software should only be done when multiple variables need to be evaluated to determine a fault. One example of good fault detection is evaluation of a temperature sensor. If the sensor has an out-of-range low or high value coming into an A/D converter, the software should clearly not use this value. Depending on the criticality of this input to the system there may be a redundant sensor that should be used (higher criticality). If this input is not critical to the system, then another possible solution could be to use another temperature sensor as an approximation to this one. Architecture, hardware, and software designs all have an impact on a system’s ability to be fault tolerant.
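
As a minimal sketch of the temperature-sensor example above (not a prescribed implementation), the fragment below assumes hypothetical helpers Read_Primary_Temp_Raw(), Read_Backup_Temp_Raw(), and Log_Error(), along with illustrative A/D range limits:

#include <stdint.h>
#include <stdbool.h>

#define AD_COUNT_MIN        50u   /* assumed lower plausibility limit for the A/D reading */
#define AD_COUNT_MAX      4000u   /* assumed upper plausibility limit for the A/D reading */
#define TEMP_SENSOR_FAULT    42   /* placeholder error code */

/* Hypothetical platform hooks - the names are placeholders, not a real API */
extern uint16_t Read_Primary_Temp_Raw(void);
extern uint16_t Read_Backup_Temp_Raw(void);
extern void Log_Error(int code, const char *text);

static bool Raw_In_Range(uint16_t raw)
{
    return ((raw >= AD_COUNT_MIN) && (raw <= AD_COUNT_MAX));
}

/* Returns true and writes a usable raw reading, or returns false when no
   trustworthy reading exists and the caller must fall back to a safe state. */
bool Get_Temperature_Raw(uint16_t *raw_out)
{
    uint16_t raw = Read_Primary_Temp_Raw();

    if (Raw_In_Range(raw))
    {
        *raw_out = raw;
        return true;
    }

    /* Primary reading failed its range check: flag it, try the redundant sensor */
    Log_Error(TEMP_SENSOR_FAULT, "Primary temp sensor out of range");

    raw = Read_Backup_Temp_Raw();
    if (Raw_In_Range(raw))
    {
        *raw_out = raw;
        return true;
    }

    return false; /* neither reading is trustworthy - do not use them */
}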

Fault removal consists of either modifying the state of the system to account for the fault condition or removing the fault through debugging and testing. Dynamic fault removal is when the system falls back to a less faulty state when a fault is detected. The most difficult aspect of “dynamic” fault removal is to safely determine how to do it. A typical way of doing this is to change the fault state by adopting noncritical data that are part of the safety-critical system. An example of this concept is a safety-critical system that logs a temperature value for environmental evaluation at a later date. It is not used in any control loops or decisions. If the ambient temperature sensor has failed, the system switches over to a less accurate ambient sensor that is integrated into the hardware. For the logs, having a less accurate temperature value is evaluated as being better than having no ambient temperature value at all. Testing and debugging of the system is the primary method for fault removal in safety-critical systems. System test procedures cover all the functionality of the safety-critical system, and often require 100% coverage of the lines of code in the system as well as different combinations of execution and inputs. Iterating and addressing faults in this manner is much easier than dealing with the complexity involved in dynamically removing the fault.

Finally, fault prediction is an often-missed aspect of safety-critical systems. Being able to predict a fault that may occur in the future and alerting maintenance personnel is very valuable in increasing the dependability of the system. Examples include sensors that may have spurious out-of-range values, where tossing out those values keeps the system running. However, if the number of out-of-range values increases from a typical rate of one occurrence per day to an unacceptable rate of one occurrence per minute, we are possibly getting nearer to having a failed sensor. Flagging that occurrence and repairing the sensor at a planned time is much more dependable than having the sensor fail and cause the system to be unavailable when the user expects it to be running.
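
A hedged sketch of this idea follows: it counts rejected (out-of-range) samples and raises a maintenance request when an assumed hourly threshold is exceeded. The window length, threshold, and Request_Maintenance() hook are illustrative assumptions.

#include <stdint.h>

#define WINDOW_SECONDS     3600u  /* evaluate the reject rate once per hour (assumed) */
#define MAX_REJECTS_PER_HR   10u  /* assumed acceptable rejects before flagging       */

extern void Request_Maintenance(const char *reason); /* placeholder hook */

static uint32_t reject_count   = 0u;
static uint32_t window_elapsed = 0u;

/* Call this whenever a sensor sample is discarded as out of range. */
void Note_Rejected_Sample(void)
{
    reject_count++;
}

/* Call this once per second from a periodic task. */
void Fault_Prediction_Tick(void)
{
    window_elapsed++;
    if (window_elapsed >= WINDOW_SECONDS)
    {
        if (reject_count > MAX_REJECTS_PER_HR)
        {
            /* The sensor still "works," but it is degrading - ask for service
               before it fails outright and makes the system unavailable. */
            Request_Maintenance("Sensor reject rate too high");
        }
        reject_count   = 0u;
        window_elapsed = 0u;
    }
}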

3.4 Hazard Analysis

The design of safety-critical systems should address hazards that could cause the system to have a failure leading to tragic accidents or unwanted damage. A hazard is any potential failure that causes damage. Safety-critical systems must be designed so that system operation itself is always safe. Even if an aspect of the system fails, it should still operate in a safe state.

The term fail-safe is used to describe a result where the system always maintains a safe state, even if something goes terribly wrong. A safe state for a locomotive train would be to stop. Some systems, such as aircraft fly-by-wire, do not have a fail-safe state. When dealing with these types of systems multiple levels of redundancy and elimination of single points of failure need to occur as part of system design.

Performing a hazard analysis is key to the design of safety-critical embedded systems. This analysis involves identifying the hazards that exist in your embedded system. It is based on a preliminary design or architecture that has been developed—even if it is just a sketch in its preliminary form.

In this process an architecture for the safety-critical system is proposed and iterated until the architecture could support being highly reliable and available with possible mitigations for the hazards that could be present. Once this is complete, additional hazard analyses will need to be performed on all aspects of the safety-critical system in more detail.

During subsystem design hazard analysis will continue to be performed. When more details are known, one effective way to do this is by performing an FMEA. This is a systematic approach to numerically evaluating each of the failures in the system and provides clarity for the classification of each of the failures. Once the failures are understood with their effects, then mitigations can be performed such as detection, removal, or functional additions to the system to mitigate the condition.

An example work product from an FMEA is tabulated below:

Function | Potential Failure | Potential Effects of Failure | Severity Rating | Potential Cause | Occurrence Rating | Mitigation Plan | Detection Rating | RPN
Vehicle speed sensing | Sensor fails high (out of range) | Cruise control goes off from on | 5 | Sensor high side shorts high | 2 | Add overmold to sensor connection | 3 | 30

For an FMEA each function is evaluated for potential failures and each failure condition is evaluated for how often it occurs, the severity of the consequence when it occurs, and how often it can be detected when it occurs. These are typically ranked from 1 to 10, and then an overall score for each failure (risk priority number) is calculated by multiplying these numbers together. This number helps rank the order in which to evaluate the failures, but by no means should any of the failures be ignored! Rules should be set up and put in place for failures that cannot be detected or have serious consequences when they occur. Another important aspect of an FMEA is that it tends to focus on individual failures, where bigger issues could occur when multiple failures happen at the same time.
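
For reference, the RPN arithmetic behind the table row is just severity × occurrence × detection (5 × 2 × 3 = 30); a trivial helper such as the following could be used in an internal FMEA tool:

#include <stdint.h>

/* Each rating is on the usual 1-10 FMEA scale. */
typedef struct
{
    uint8_t severity;
    uint8_t occurrence;
    uint8_t detection;
} fmea_ratings_t;

/* Risk priority number: 1..1000, higher means look at it sooner. */
static uint16_t Fmea_Rpn(const fmea_ratings_t *r)
{
    return (uint16_t)(r->severity * r->occurrence * r->detection);
}

/* Example from the table row above: {5, 2, 3} gives an RPN of 30. */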

A fault tree analysis is a top-down approach to hazard analysis. It helps discern how different combinations of individual faults could cause a system failure to occur. The fault tree isn’t focused on just software but includes hardware and user interaction that could cause the failure. The top-down approach starts with the undesired system failure at the top of the tree and works downward through the logical combinations of individual faults that could lead to it.

Fig. 2 shows a fault tree.

Fig. 2
Fig. 2 Fault tree.

An event tree analysis is done in the opposite way to a fault tree analysis, as it is a bottom-up approach to hazard analysis. It starts with an initiating fault or event, such as “engine quits running,” and then works forward through the possible system responses to determine the outcomes that could result.

Fig. 3 shows an example event tree, with the numbers representing the probability that taking each branch could occur:

Fig. 3
Fig. 3 Event tree.

In safety-critical systems hazards that can result in accidents, damage, or harm are classified as risks and should require a risk analysis.

3.5 Risk Analysis

A risk analysis is a standard method where each of the hazards identified is evaluated more carefully. As part of this process each hazard is evaluated based on the likelihood of the failure occurring, along with the potential for damage or harm when it occurs. A risk analysis helps determine if the given hazard is acceptable, how much risk we are willing to accept, and if there needs to be any mitigation or redesign that needs to occur as a result of that failure.

The initial step for risk analysis is evaluation. In this step potential failures from the FMEA are used as inputs to make certain that the risk classification is correct. Things like failure probability, estimated risk of the failure to the overall system, and failure severity are evaluated. Discussion regarding how to evaluate a single failure should go on as long as it needs to, primarily because FMEAs tend to go over many singular failures while risk evaluation assesses multiple elements together. Once the evaluation is done, the failure should be given an acceptability rating. The rating can have a value of unacceptable, acceptable, or tolerable. Unacceptable means the failure must be eliminated, requiring redesign or further design efforts. Acceptable means the team accepts the hazard and its mitigation as currently designed (accepting the risk). The third, “tolerable,” rating means a more careful evaluation of the risk should be done to mitigate or eliminate it.

The other steps for risk analysis use the same tools as the hazard analysis, including performing a fault tree or event tree analysis as discussed in the previous section. Risk analysis has the added benefit of looking at the system as a whole—whereas an FMEA tends to look at individual failures. Changes in the architecture of the entire system or even just a subsystem may be required to eliminate or mitigate the risks that are identified.

Redundancy is a key strategy used in architectural design to help mitigate or eliminate risks. Redundancy simply means doing the same thing in more than one way. This could include a combination of the same software running on two processors, multiple processors, or even a combination of hardware and software. The next section discusses various safety-critical architectures that could be used to mitigate risk.

4 Safety-Critical Architectures

A large part of creating a safety-critical system is deciding on the system and/or software architecture that is going to be used. Consider the processor architecture shown in Fig. 4.

Fig. 4
Fig. 4 Single-processor architecture.

If we are running safety-critical software in this configuration, what happens if the processor does something that is unexpected? What if the processor runs variable data out of a bad memory location, or there is a latent failure that only exhibits itself after some period of time?

This processor by itself wouldn’t be able to satisfy a truly safety-critical system. Depending on the safety level there may be external components that can be added around the processor to perform the desired safety function in parallel if the processor cannot do so. As the complexity of an interface goes up, replicating with circuitry may not act as a successful mitigation for failures that can happen in your system. This would especially be true if the nature of the critical data is contained within serial messages or Ethernet frames. When the amount of safety-critical data increases or the number of safety mechanisms increases it is time for a different architecture.

The following sections outline various architectures that could be used for a safety-critical system. For each architecture notes are included to outline various aspects including positives and negatives.

4.1 “Do-Er”/“Check-Er”

In the architecture in Fig. 5 one processor is still performing most of the embedded system work. In this case a second processor is added to look at the safety-related data to make assessments about that data. It then looks at the output of the main processor and determines if that processor is doing what it is supposed to do.

Fig. 5
Fig. 5 Dual-processor architecture.

For example, say there is a bit of information in the serial stream that means “Stop” and a separate discrete input signal that also means “Stop.” Both processors could be designed to have visibility of both pieces of data. The main processor would process the safety-critical “Stop” logic along with all the other operations it is performing. The secondary processor would simply look to see if the main processor ordered a stopping process based on these data and would act if the main processor did not. Maybe the main processor stops in a more graceful way, whereas the secondary processor does something more abrupt (like turning the driveline off).
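
A minimal sketch of the checker side of this example is shown below. The accessors for the serial “Stop” bit, the discrete “Stop” input, the observed state of the main processor, and the Cut_Driveline() fallback are all hypothetical names used for illustration.

#include <stdbool.h>

#define MAIN_CPU_MISSED_STOP 7 /* placeholder error code */

/* Placeholder inputs visible to BOTH processors - names are assumptions */
extern bool Serial_Stop_Bit(void);       /* "Stop" bit from the serial stream */
extern bool Discrete_Stop_Input(void);   /* hard-wired "Stop" input           */
extern bool Main_Cpu_Is_Stopping(void);  /* observed output of the main CPU   */
extern void Cut_Driveline(void);         /* abrupt, independent stop action   */
extern void Log_Error(int code, const char *text);

/* Runs periodically on the checker processor. A real checker would also
   allow the main processor a bounded reaction time before acting. */
void Checker_Task(void)
{
    bool stop_requested = (Serial_Stop_Bit() || Discrete_Stop_Input());

    if (stop_requested && (Main_Cpu_Is_Stopping() == false))
    {
        /* The "do-er" did not act on the safety data: force the safe state */
        Cut_Driveline();
        Log_Error(MAIN_CPU_MISSED_STOP, "Main CPU ignored Stop request");
    }
}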

This architecture lends itself to systems where there is a “safe” state that the system can reside in. It is also good because the complexity on the secondary processor side is limited to just the safety functions of the system. The main processor still runs all the other nonsafety code (the secondary does not).

When the complexity of safety goes up or the safety-critical level goes up, then a different architecture is needed to process data.

4.2 Two Processors

In the architecture in Fig. 6 there are two processors, which could be identical, performing the safety aspects of the system. The processors labeled A and B perform the same operations and handle the same data. The other processor, labeled C, performs cleanup tasks and executes code that has nothing to do with the safety aspects of the system. The two safety processors operate on the same input data.

Fig. 6
Fig. 6 Triple-processor architecture.

Various tricks can be done on the two processors to make them a little different. First, the memory maps of the processors can be shifted so that a software error dealing with memory on one processor wouldn’t hit the same memory location on the other processor. They could also be clocked and operated separately—maybe there isn’t a requirement to have the processors execute instructions in lockstep with each other. For this architecture, if the processors disagree, then the system would be driven to a safe state. For this and the previous architectures it is assumed there is a stop or safe state for the embedded system. If the system must continue to operate, then an even more complex system architecture is needed.

4.3 “Voter”

The architecture in Fig. 7 shows a “voter” type of system.

Fig. 7
Fig. 7 Voter architecture.

For this type of system the processors vote on what should be done next. Information is compared between all of them, and the decision with the greatest number of votes wins. Disagreement between processors is logged and flagged, so that maintenance can be done on the system. The voting mechanism itself also needs to be checked periodically, so that it is known to work and doesn’t have a latent failure.
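
A simplified two-out-of-three bitwise voter is sketched below; a real voter design also needs timing alignment of the three inputs and the periodic self-test mentioned above. The error code and Log_Error() hook are assumptions.

#include <stdint.h>
#include <stdbool.h>

#define VOTER_DISAGREEMENT 11 /* placeholder error code */

extern void Log_Error(int code, const char *text);

/* Majority vote of three command words; reports whether any channel disagreed
   so that maintenance can be flagged. */
uint32_t Vote_2oo3(uint32_t a, uint32_t b, uint32_t c, bool *disagreement)
{
    /* Bitwise majority: a result bit is set only if at least two inputs set it */
    uint32_t majority = (a & b) | (a & c) | (b & c);

    *disagreement = ((a != b) || (b != c));
    if (*disagreement)
    {
        Log_Error(VOTER_DISAGREEMENT, "Voter channel mismatch");
    }

    return majority;
}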

This type of architecture represents a large jump in complexity. There are numerous test cases that need to be performed to evaluate this system, and the number of possibilities greatly increases. Embedded engineers can spend entire careers dealing with the intricacies of systems like this, and development is neither quick nor predictable in schedule.

Selecting the right architecture up-front based on safety requirements is extremely important. Having to shift from one architecture to another after development has started is expensive and complicated.

5 Software Implementation Strategies

After the project planning, hazard/risk analysis, and architecture are complete there should be a good understanding of which requirements are safety critical. For software development it is important to treat these as special—even following a separate process to make sure they are designed, coded, and unit-tested correctly.

It is a difficult and unreasonable expectation to have a single process that fits every type of safety-critical application or project. This section’s intent is to point out different strategies that should be considered when doing development. The safety requirements for your project may require many of the items listed here, so this provides a good starting point for things to consider. If you are using an independent assessor there may be particular and specific items that need to be included as well.

5.1 Strategy 1: Have a Well-Defined, Repeatable Peer Review Process

A critical part of the development of safety-critical software is having a well-defined peer review process. There must be a process for peer review and consistency regarding what information is provided and the amount of time available to review prior to the review meeting. The reviewers may include people in systems engineering, systems test, safety, and configuration management.

There must also be recognition by the peer review leader that the reviewers may not have had sufficient time to prepare. In this case the meeting should be rescheduled. For safety-critical code development and code sections it is important to have an independent assessment of the source code, so that a single person isn’t walking the group through biased code where their opinion could come into play. Such an independent assessment might involve someone external to the organization or someone who reports in a different chain of command in the organization.

An example software peer review process is shown in Fig. 8.

Fig. 8
Fig. 8 Review process.

5.2 Strategy 2: Consider Using Existing Safety Coding Standards

In addition to the strategies listed here safety standards exist that define rules for programmers to follow when implementing safety-critical code.

One standard, called MISRA C, initially established 127 guidelines for using C in safety-critical applications. Its rules guard against mistakes that are entirely “legal” in the C programming language but have unintended consequences when executed. Based in the United Kingdom, the Motor Industry Software Reliability Association (MISRA) felt there were areas of automobile design where safety was extremely important. Their first standard was developed in 1998 and included 93 required rules of the 127 total guidelines. The remaining 34 were advisory.

The MISRA standards were updated in 2004 to include additional guidelines. MISRA increased the number to 121 required rules and 20 advisory rules to bring the total to 141. This newer version of the standard also split the rules into categories such as “runtime failures.” Currently, the latest standard released in 2012 added more rules and cross-referenced ISO 26262, a safety standard for electrical and electronic systems. The MISRA C standard document is available at their website (http://www.misra.org.uk). There is also a set of guidelines for C++ in a separate MISRA C++ document.

Let’s give an example of a rule: “All code shall conform to ISO 9899 standard C, with no extensions permitted.” In simple terms this means using extensions or in-line assembly would be considered nonconformant with this rule. However, accompanying this rule is the following comment: “It is recognized that it may be necessary to raise deviations to permit certain language extensions, for example to support hardware specific features.” So, with this caveat the standard permits low-level hardware manipulation or handling of interrupt routines—as long as it is in a localized, standard area and done in a repeatable way. This standard was written with embedded systems in mind!

All the 93 required rules can be checked using a static code analyzer (see Strategy 18: Static code analysis). Many embedded compiler vendors include support for various sets of rule-checking standards. There are also separate programs that can be run during the build for each of the source files. These programs can check compliance and print reports for software modules. These tools are strictly run to cover compliance and only compliance—there is no MISRA certification process that software can go through.

5.3 Strategy 3: Handle All Combinations of Input Data

For data that are processed by a safety-critical system it is important to address and account for every combination of input value including both external data and intermediate data.

Checking the external data that are coming into your system for all possible values certainly makes sense. For example, say your system (System A) has an interface specification that was written to interface with another system (System B). It may state that a data item can only have certain values, but it is important to check system behavior if it receives different values. This could come about later in the life cycle of the product, where System B’s baseline is updated with new software (and hence new data values) while System A is missed. Or it could come about because of a misinterpretation of the interface specification by either party implementing it.

For example, if a data element can have value “0” meaning “Stop,” and a value of “1” which means “Go,” then what happens if the variable is any other value? Maybe someone adds a new value “2” at a later time that means “Proceed with caution.” In such an event logic should be put together not only to specifically check for each known case, but also to catch any other case. In this situation notifying someone and/or logging the mismatch is important to help correct the situation in the future. An example of this is:

if ( input_data_byte == 0 )
{
    Movement = STOP;
}
else if ( input_data_byte == 1 )
{
    Movement = GO;
}
else
{
    Movement = STOP; // Most restrictive case here
    Log_Error( INP_DATA_BYTE_INV, "Unknown Value" );
}

For an intermediate variable declared in your system it is also important to do the same type of checking. Every “if” statement should have an “else” clause, and every “switch” statement should have a default case encompassing the values that are unexpected. More complex combinations of conditions for “if” statements should also have the same “else” condition covered. Having these alternate paths that should never be executed helps to better understand the system as a whole. It also helps the programmer explore alternate paths and corner cases that may exist.
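
For the “switch” case, a sketch in the same style as the listing above (it reuses Movement, STOP, GO, Log_Error, and INP_DATA_BYTE_INV from that listing, and assumes a hypothetical PROCEED_WITH_CAUTION value for the “2” case):

switch ( input_data_byte )
{
    case 0:
        Movement = STOP;
        break;
    case 1:
        Movement = GO;
        break;
    case 2:
        Movement = PROCEED_WITH_CAUTION;
        break;
    default:
        Movement = STOP; // Most restrictive case here
        Log_Error( INP_DATA_BYTE_INV, "Unknown Value" );
        break;
}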

5.4 Strategy 4: Specific Variable Value Checking

When writing safety-critical software code it is important to check for a specific value for the permissive condition you are looking for. Consider the following code:

if ( relay_status != RELAY_CLOSED )
{
    DO_Allow_Movement(); // Let the vehicle move, everything OK
}
else
{
    DO_Stop(); // The relay isn't positioned correctly, stop!
}

In this example the code wishes to look for the relay being open to allow movement. The variable “relay_status” has two defined values, RELAY_OPEN and RELAY_CLOSED. But, depending on the size of the variable that was declared, there are many more values that it can hold! What if the memory has a value of something else? With the above code movement would be allowed. This isn’t good practice. For the most permissive state always check for a single value (or range when appropriate). The following code is the correct way to write this code block:

if ( relay_status == RELAY_OPEN )
{
    DO_Allow_Movement(); // Let the vehicle move, everything OK
}
else if ( relay_status == RELAY_CLOSED )
{
    DO_Stop(); // It is closed, so we need to stop
}
else // This case shouldn't happen
{
    DO_Stop(); // The relay isn't positioned correctly, stop!
    Log_Error( REL_DATA_BYTE_INV, "Unknown Value" );
}

Another way that the code block could be written based on how the code is structured is to set the most restrictive case in the code at the start of execution. Then specific steps are taken and values are checked to allow movement. For the simple code block above DO_Stop() would be moved outside the conditional “if” and then the code would allow movement if certain checks passed.
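
Following that description, the block might be restructured as sketched below; whether commanding DO_Stop() first and then allowing movement afterward is acceptable depends on how the outputs are actually driven in your system.

DO_Stop(); // Default to the most restrictive state first

// Movement is only allowed when the specific permissive check passes
if ( relay_status == RELAY_OPEN )
{
    DO_Allow_Movement(); // Checks passed, everything OK
}
else if ( relay_status != RELAY_CLOSED )
{
    // Unexpected value: a stop has already been commanded, just record it
    Log_Error( REL_DATA_BYTE_INV, "Unknown Value" );
}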

5.5 Strategy 5: Mark Safety-Critical Code Sections

For code sections that are safety critical in your code there should be a special way that the code section is marked. The main purpose of this is to aid maintenance of the code later, or reuse of the code by another group for their project. The real safety-critical sections should be marked with comment blocks that say why the section is safety critical and refer to the specific safety requirements that were written. This would also be an appropriate place to refer to any safety analysis documentation that was done as well.

The following is an example of a header that could be used for a safety-critical code section:

/*****************************************************
*****************************************************
** SAFETY-CRITICAL CODE SECTION
** See SRS for Discrete Inputs for Requirements
** Refer to Document #20001942 for Safety Analysis
**
** This code is the only place that checks to make
** sure the lowest priority task is being allowed
** to run. If it hasn't run, then our system is
** unstable!
*********** START SAFETY-CRITICAL SECTION **********/
// LOW_PRIO_RUN is defined as 0x5A3C
if ( LP_Flag_Set == LOW_PRIO_RUN )
{
    LP_Flag_Set = 0;
}
else
{
    // The system is unstable, reset now
    Reset_System();
}
/*********** STOP SAFETY-CRITICAL SECTION ***********/

In this example you can see that we are just resetting the system. This may not be appropriate depending on what task your system is performing and where it is installed. Code like this may be appropriate for a message protocol translation device that has safety-critical data passing through it but is likely not appropriate for a vehicle!

5.6 Strategy 6: Timing Execution Checking

For processors that run safety-critical code it is important to check that all intended software can run in a timely manner. For a task-based system the highest priority task should check to make sure that all the other lower priority tasks are able to run. Time blocks can be created for the other tasks such that, if one lower priority task runs every 10 ms and another runs every 1 s, the checking is done appropriately. One method is to check to make sure tasks are not running more than 20% slower or faster than their intended execution rate.

The rate at which the task timings are checked would be dependent on the safety aspect of the code in the task that is being checked.
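
A sketch of that kind of check is shown below, assuming the 10-ms task increments a heartbeat counter and the highest priority task evaluates it once per second; the ±20% window (80 to 120 counts) and the Enter_Safe_State() reaction are illustrative assumptions.

#include <stdint.h>

extern void Enter_Safe_State(void); /* placeholder reaction to a timing fault */

/* Incremented by the 10 ms task each time it completes one cycle. */
volatile uint32_t task_10ms_heartbeat = 0u;

/* Called once per second from the highest priority task.
   Nominal count is 100; +/-20% gives an accepted window of 80..120. */
void Check_10ms_Task_Rate(void)
{
    static uint32_t last_count = 0u;
    uint32_t now   = task_10ms_heartbeat;
    uint32_t delta = now - last_count; /* unsigned math handles counter wrap */

    last_count = now;

    if ((delta < 80u) || (delta > 120u))
    {
        /* The 10 ms task is running too slowly or too quickly */
        Enter_Safe_State();
    }
}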

Another system check is to make sure that the entire clock rate of the system hasn’t slowed down and fooled the entire software baseline. Checks like this need to look at off-core timing sources so that clock and execution rates can be compared. Depending on how the different timers are clocked on the system, the reference could come from an on-die internal check—but only if what you are checking against is not running from the same master clock input. For example, if the execution timing of a task is running from an external crystal or other input, it could be compared with the real-time clock on the chip. When taking this route there may also be a requirement that the source code not know how (or not be memory-mapped) to change the clock input for the RTC chip.

5.7 Strategy 7: Stale Data

Another safety-critical aspect of a system is performing operations to ensure that we do not have stale data in the system. Making decisions on data that are older than we expect could have serious consequences while running!

One example of this is an interface between a processor and an external logic device such as an FPGA. The logic device is accessed through a parallel memory bus, and the processor uses the interface to read input data from a memory buffer. If the data are used as an input to any kind of safety-critical checking, we do not want the data to be stale as this could impact the safety of the system. In this example this could occur if the process that collects the data on the FPGA stops or has a hardware fault on its input side. The interface should add a counter for the data blocks or have a handshaking process where the memory is cleared after it is read. Additional checks may also need to be put in place, such as the processor verifying that the memory block was actually cleared after the request.

There are many ways to get rid of stale data, largely based on how the data are coming in and where they are coming from. There may be a circular buffer that is filled with DMA or memory accesses. In this case it is important to check to make sure data are still coming in. There may also be serial streams of data that are again placed in a specific memory location. It is here that our safety application comes along and operates on these data. Here are some things to consider:

  •  First, determine if there is a way to delete incoming data once your safety-critical code has run and generated outputs. Clearing out this memory space is a great way to make sure that the data are gone before the functionality is run again.
  •  Second, when dealing with serial data or sets of data consider using sequence numbers to order the data. This will allow the software to remember which set was processed last time, so that an expected increase in sequence number shows the data are newer (see the sketch after this list).
  •  Third, for large blocks of data where it is impractical to clear the entire memory space, and there is also no sequence number, things are a little more difficult. For these large blocks there should be a CRC or error check of the data themselves to make sure they are correct. After processing the data, selectively modifying a few bytes can deliberately invalidate the CRC so that the same block cannot be mistaken for fresh data later. Although the probability is small, every effort should be made to avoid a situation in which data are changed and the CRC is still good.
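
The sketch below combines the sequence-number and CRC ideas from the list above for a shared-memory input block; the block layout, Crc_Is_Good(), Read_Input_Block(), and Use_Safe_Defaults() are assumptions made for illustration.

#include <stdint.h>
#include <stdbool.h>

typedef struct
{
    uint16_t sequence; /* incremented by the producer for every new block */
    int16_t  temp_c;   /* example payload */
    uint16_t crc;      /* integrity check over sequence and payload       */
} input_block_t;

/* Hypothetical helpers - names are placeholders */
extern bool Crc_Is_Good(const input_block_t *blk);
extern void Read_Input_Block(input_block_t *blk);
extern void Use_Safe_Defaults(void);

void Process_Inputs(void)
{
    static uint16_t last_sequence = 0u;
    input_block_t blk;

    Read_Input_Block(&blk);

    if (Crc_Is_Good(&blk) && (blk.sequence != last_sequence))
    {
        last_sequence = blk.sequence;
        /* ...operate on fresh, verified data here... */
    }
    else
    {
        /* Corrupt or stale block: fall back to the most restrictive behavior */
        Use_Safe_Defaults();
    }
}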

5.8 Strategy 8: Outputs Comparison

Depending on the processor architecture being used, when there is more than one processor the outputs of the safety-critical functions should be cross-checked. This allows each of the processors in the architecture to make sure the other processor is taking appropriate action based on the inputs. There are a variety of ways to check this.

One of the easier ways is for the outputs of one processor to also be run in parallel as inputs on another processor. Again, depending on the architecture, this would be a check to make sure that the other processor(s) is doing what you expect when presented with the same data. For an output that is a serial stream this could also be run in parallel to the intended target as well as fed back into the other processor as an input. A comparison can be done to make sure the other processor is doing the same thing as your processor (as shown in Fig. 9).
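
A sketch of the read-back comparison follows, with hypothetical accessors for the locally commanded output and the value read back from the other processor; the tolerance of a few mismatched cycles (to allow for sampling skew) is an assumption.

#include <stdint.h>

#define OUTPUT_MISMATCH      13  /* placeholder error code              */
#define MAX_MISMATCH_CYCLES   3u /* assumed tolerance for sampling skew */

extern uint8_t My_Commanded_Output(void);       /* what this CPU decided    */
extern uint8_t Other_Cpu_Output_Readback(void); /* what the other CPU drove */
extern void    Enter_Safe_State(void);
extern void    Log_Error(int code, const char *text);

/* Runs each control cycle on each processor. */
void Compare_Outputs(void)
{
    static uint32_t mismatch_cycles = 0u;

    if (My_Commanded_Output() != Other_Cpu_Output_Readback())
    {
        mismatch_cycles++;
        if (mismatch_cycles >= MAX_MISMATCH_CYCLES)
        {
            /* Persistent disagreement: stop trusting the pair of outputs */
            Log_Error(OUTPUT_MISMATCH, "Cross-check failed");
            Enter_Safe_State();
        }
    }
    else
    {
        mismatch_cycles = 0u;
    }
}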

Fig. 9
Fig. 9 Processor architecture example.

Another way this can occur is to send serial or memory-mapped data directly between the two processors. This allows more checking to be done at more granular, intermediate steps in the software process as opposed to when it comes out of the other processor. If one of the safety-critical outputs was “ignite,” then it is a little late for another processor to be checking for this. In this case having more checks between the processors would be beneficial before ever getting to the final output case. The latency of the communications channel between them directly corresponds with the regularity and periodicity of the checking. Fig. 10 shows the basics of this serial or memory-mapped communication.

Fig. 10
Fig. 10 Processor architecture example 2.

5.9 Strategy 9: Initialize Data to Least Permissive State

Initializing data to the least permissive state forces the software and its architecture to continually make decisions on whether to allow any state to be more permissive than the least permissive one. In safety-critical systems least permissive means “the safest condition” for the system to be in. This starts with initialization of the code itself—it should be set to start in a safe state without any inputs or incoming data streams. Once the inputs are processed and a decision is made, consider setting the internal variables back to being most restrictive again. When the software runs the next time it is making the same sort of decision, “can I be more permissive based on inputs,” as opposed to having logic that says, “we are not restrictive, but should we be?”

Architectures that start from a restrictive state tend to be more understandable when the logic is followed than when looking for instances where we should be more restrictive after not being so.

For this case it is permissible to use variables to remember what our last output state was, but that should be used as an input into the logic (“last_output_state”) as opposed to the output that the code is generating (“Output_State”).
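
A sketch of this pattern is shown below, reusing the last_output_state and Output_State naming from above; the permissive check itself is a placeholder.

#include <stdbool.h>

typedef enum { STATE_STOP = 0, STATE_GO = 1 } output_state_t;

/* Placeholder evaluation of all inputs; the previous output is only an input */
extern bool All_Permissive_Checks_Pass(output_state_t last_output_state);
extern void Drive_Output(output_state_t state);

static output_state_t last_output_state = STATE_STOP; /* least permissive at init */

void Control_Cycle(void)
{
    /* Start every cycle from the least permissive state... */
    output_state_t Output_State = STATE_STOP;

    /* ...and only relax it when the inputs explicitly justify doing so */
    if (All_Permissive_Checks_Pass(last_output_state))
    {
        Output_State = STATE_GO;
    }

    Drive_Output(Output_State);
    last_output_state = Output_State;
}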

5.10 Strategy 10: Order of Execution

If there are requirements for one code section to run before other code sections, safety checks need to be in place to make sure that this has occurred. This certainly comes into play when software is running in different threads of execution or when tasks and an RTOS are involved.

Consider a simple safety-critical application in which one task takes raw data and converts them to a filtered, more meaningful set of data. A second task then takes those data, performs calculations, and produces an output of the embedded system. In this case it is important to process the input data before attempting to come up with a suitable output.

A check should be in place for this and more complex activities to make sure the order of execution is precisely what is expected. Often failure to execute things in order results in unexpected behavior if not handled appropriately. This can happen with interrupts that execute (or don’t execute) when they are expected. These types of errors tend to be very timing dependent—so it may be something that happens every Xth time and is hard to catch.

This can be mitigated by putting together a checker to make sure things are done in order and that the task was allowed to complete (if this is a requirement) before other ordered tasks are run. This can be done using a simple sequence number for the tasks, where it is set to a fixed value to let the next task know that it ran to the appropriate completion point. Then the next task (illustrating the importance of this order) checks the value and proceeds only if it matches what it expects.
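
A sketch of that sequence-value handshake is shown below; the magic completion value, task names, and the Enter_Safe_State() reaction are assumptions for illustration.

#include <stdint.h>

#define FILTER_DONE_MAGIC 0xA55Au /* assumed "ran to completion" marker */

extern void Enter_Safe_State(void);

static volatile uint16_t filter_done = 0u;

/* First task: filters the raw inputs, then marks that it ran to completion. */
void Input_Filter_Task(void)
{
    /* ...convert raw data to filtered data... */
    filter_done = FILTER_DONE_MAGIC;
}

/* Second task: must only run on data the filter task finished producing. */
void Output_Calc_Task(void)
{
    if (filter_done != FILTER_DONE_MAGIC)
    {
        /* Ordering was violated: do not compute outputs from unfiltered data */
        Enter_Safe_State();
        return;
    }
    filter_done = 0u; /* consume the marker so a stale completion is not reused */

    /* ...perform calculations and drive the output... */
}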

Another way to mitigate these types of errors is to use more of the features of the RTOS to help with ordered execution; semaphores and/or flags can be used as an alternative. Be careful with this type of solution, because the more operating system features you depend on, the larger the operating system's role becomes in your safety case.

Finally, depending on the safety nature of the code, another idea is to use a simple timer and run the tasks in frames with your own homespun scheduler, as sketched below. If all the task priorities are the same and you are comfortable writing interrupts for the code that needs to "run right now," then ensuring execution order becomes as simple as the order of the function calls.
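
The sketch below assumes a hypothetical periodic timer interrupt that sets a frame flag; the four application functions are placeholders for the ordered work.

#include <stdbool.h>

/* Application functions, assumed to exist elsewhere; the execution
   order is simply the order they are called in the loop below.      */
extern void read_inputs(void);
extern void filter_inputs(void);
extern void compute_outputs(void);
extern void drive_outputs(void);

/* Set by a hypothetical 10 ms timer interrupt, cleared by the main loop. */
static volatile bool frame_tick = false;

void frame_timer_isr(void)        /* wired to the periodic timer interrupt */
{
    frame_tick = true;
}

void scheduler_main_loop(void)
{
    for (;;)
    {
        while (!frame_tick)
        {
            /* wait for the start of the next frame */
        }
        frame_tick = false;

        read_inputs();
        filter_inputs();
        compute_outputs();
        drive_outputs();
    }
}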

5.11 Strategy 11: Volatile Data Checking

Any data received from a source off board the processor should have their integrity checked. Common ways of doing this involve CRCs of various lengths. Depending on the safety criticality level of the software, a CRC other than an established standard may be needed, as described below.

In embedded networks the CRC property that receives the most attention is the Hamming distance. This property specifies the minimum number of bit inversions that can be injected into a message without being detected by the CRC calculation. For a given message bit length, a Hamming distance of 4 means there is no combination of 1-, 2-, or 3-bit errors in that message that would go undetected by the CRC calculation.

The use of CRCs and other types of data checking is ideal for streams of data coming into your system. When data arrive, a check can be performed to make sure the message is good before it is placed into memory. Additional periodic checks should be made on the data after they are placed in memory to make sure they are not altered by errant pointers or other memory conditions.
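
As an illustration, the sketch below applies a bitwise CRC-16-CCITT check (polynomial 0x1021, initial value 0xFFFF) to an incoming message before it is accepted. Your safety case may require a different polynomial whose Hamming distance is adequate for your message length, and the assumption here is that the sender appends the CRC in the last two bytes of the message.

#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Bitwise CRC-16-CCITT: polynomial 0x1021, initial value 0xFFFF. */
static uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFFu;

    for (size_t i = 0u; i < len; i++)
    {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);
        for (int bit = 0; bit < 8; bit++)
        {
            crc = (crc & 0x8000u) ? (uint16_t)((crc << 1) ^ 0x1021u)
                                  : (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* Accept a message only if the CRC carried in its last two bytes matches
   the CRC calculated over the rest of the message.                       */
bool message_is_valid(const uint8_t *msg, size_t len)
{
    if (len < 3u)
    {
        return false;
    }

    uint16_t received = (uint16_t)(((uint16_t)msg[len - 2u] << 8) | msg[len - 1u]);

    return crc16_ccitt(msg, len - 2u) == received;
}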

As a strategy, all volatile data considered safety critical (meaning data whose value can alter the safety mechanisms of the system) should be checked. Data updated by calculation should either be set to the least permissive state before the calculation or have a check performed to make sure the variable is actually updated during execution when we expect it to be. This could involve setting the variable to an invalid state and then checking at the end of the code to make sure it is no longer in that invalid state.

Safety-critical data that cannot be updated via calculation should have their variables checked using a CRC. For example, suppose there is a test variable that is set to "on" to output data to a maintenance port, and a remote tool can set this variable to "on" or "off" as requested. What happens if this volatile memory region becomes corrupted? With no calculation to change it back to the correct value, we could start sending data out the maintenance port. If this variable is safety critical in nature, we need a check to keep that from happening. Grouping this variable with others and computing a CRC over the set is a good way to see if this memory has become corrupted by an errant pointer or other situation. The program can then periodically check the CRCs for these data blocks to know that they are set correctly. Having these CRC calculations on data blocks is, as discussed, especially important for data that are not updated continuously.
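
A sketch of grouping such rarely changing variables into a CRC-protected block follows; it reuses a CRC routine such as the crc16_ccitt() sketch shown earlier, and the structure and function names are illustrative.

#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* CRC routine such as the crc16_ccitt() sketch shown earlier. */
extern uint16_t crc16_ccitt(const uint8_t *data, size_t len);

/* Safety-relevant settings that are changed only on request and never
   recalculated, grouped so that one CRC covers the whole set.          */
typedef struct
{
    uint8_t  maintenance_output_on;   /* the "test variable" example */
    uint8_t  reserved[3];
    uint16_t crc;                     /* CRC over the fields above   */
} protected_settings_t;

static protected_settings_t settings;

static void settings_update_crc(void)
{
    settings.crc = crc16_ccitt((const uint8_t *)&settings,
                               offsetof(protected_settings_t, crc));
}

/* Every legitimate write goes through a function that re-seals the block. */
void settings_set_maintenance_output(uint8_t on)
{
    settings.maintenance_output_on = on;
    settings_update_crc();
}

/* Called periodically; returns false if the block has been corrupted by an
   errant pointer or other memory problem, so the caller can react.         */
bool settings_check(void)
{
    return crc16_ccitt((const uint8_t *)&settings,
                       offsetof(protected_settings_t, crc)) == settings.crc;
}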

Lastly, safety-critical data should be sanity checked before they are used in calculations throughout the code. This would not include a variable that was just set to one value or another in the preceding statements, but rather variables that could be influenced from outside the current function; this is certainly the case if the variable is modified by other threads of execution. For example, say we want to execute a section of code every six times a function is called. This can be done by setting a counter to a maximum of five or six (depending on the decrement logic) and then decrementing it each time the function is called. If we are executing code that decides whether we should perform our task (a value of zero), what else should we check? Does it make sense to remember the "last" value this variable had and confirm it differs from the "current" value? At a minimum, it makes sense to make sure the variable is currently set no higher than six!
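
For the every-sixth-call example, here is a sketch with the range sanity check applied before the counter is used; the reload value of five reflects the decrement-then-check logic below.

#include <stdint.h>

#define COUNTDOWN_RELOAD  5u   /* five decrements, then run on the sixth call */

static uint8_t call_countdown = COUNTDOWN_RELOAD;

void periodic_function(void)
{
    /* Sanity check: the counter should never exceed its reload value.
       If it does, memory was corrupted or another thread misbehaved, so
       recover to a known value (or enter the safe state) before using it. */
    if (call_countdown > COUNTDOWN_RELOAD)
    {
        call_countdown = COUNTDOWN_RELOAD;
    }

    if (call_countdown == 0u)
    {
        call_countdown = COUNTDOWN_RELOAD;

        /* ... perform the every-sixth-call work here ... */
    }
    else
    {
        call_countdown--;
    }
}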

A large part of how the volatile data are checked depends on the safety criticality level of the application. Keep these strategies in mind to lower the chance of dealing with stale, corrupted, or invalid data.

5.12 Strategy 12: Nonvolatile Data Checking

Nonvolatile data are a little easier to check because they are not supposed to change. A useful strategy is to have your makefile calculate a CRC for the program image at build time. If a single CRC for the entire image does not provide enough bit error detection for the length of the image, then use multiple CRCs over different sections of the code space. One approach could be to have one CRC cover the first third of the image, another cover the first two-thirds, and another cover the whole image; other variations could be used as well.

The primary reason for using multiple CRCs like this is to be able to keep the CRC length the same as the atomic size of the processing unit. This will help speed CRC processing.

The safety case will drive how often the image is checked. Inputs to the safety case include the MTBF data for the hardware involved, how often the system is executing, and the safety criticality of the code on that single processor itself. If image checking is assigned to the lowest priority task (which is typical), then there should be a check at a higher priority to make sure that it is able to run and that it completes in the time expected.

Another part of nonvolatile data checking is checking the image before running it. A bootloader or initial program function should check the CRCs upon initialization and only run the image if its integrity is verified.
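
A sketch of such a boot-time check follows. It assumes the build places the image at a known address and appends the expected CRC in the last four bytes; the addresses, lengths, CRC routine, and hand-off function are all placeholders for what the real build and bootloader provide.

#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* CRC routine matching whatever algorithm the makefile used at build time. */
extern uint32_t image_crc32(const uint8_t *data, size_t len);

/* Bootloader's normal hand-off into the application image. */
extern void jump_to_application(void);

/* Hypothetical layout: the image starts at IMAGE_START and the build
   appends the expected CRC in the last four bytes of IMAGE_LENGTH.    */
#define IMAGE_START   ((const uint8_t *)0x08000000u)
#define IMAGE_LENGTH  (0x00040000u)

static bool image_is_valid(void)
{
    const uint8_t *crc_location = IMAGE_START + IMAGE_LENGTH - 4u;

    uint32_t stored = ((uint32_t)crc_location[0] << 24) |
                      ((uint32_t)crc_location[1] << 16) |
                      ((uint32_t)crc_location[2] << 8)  |
                      ((uint32_t)crc_location[3]);

    return image_crc32(IMAGE_START, IMAGE_LENGTH - 4u) == stored;
}

void bootloader_main(void)
{
    if (image_is_valid())
    {
        jump_to_application();
    }
    else
    {
        /* Do not run a corrupt image; stay here until the watchdog or an
           operator intervenes.                                            */
        for (;;) { }
    }
}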

5.13 Strategy 13: Make Sure Entire System Can Run

For a safety-critical system it may not make sense to run a real-time operating system at all. Depending on the safety requirements for the software being written, it may be too cost prohibitive or complicated to include an RTOS and the additional complexity it introduces. A simple scheduler may meet the needs instead, with interrupts to handle time-critical data coming in or going out. Regardless of which type of tasking system is used, the system needs to be checked to make sure everything is running correctly.

For a task-based RTOS-type system this involves making sure that all the tasks are being allowed to run. It is straightforward to have a high-priority task make sure the lowest priority task is running. It gets a little more difficult to make sure that all the tasks in the system are running correctly, have enough time to run, and are not hung doing something unexpected. More of this run checking will need to be performed for tasks that have safety-critical code within them. Tasks that contain code that is not safety critical or part of the safety case probably don't need as much checking.

For a simple scheduler system, where the code runs in a constant loop with some sort of delay at the end of the loop waiting for the frame time to end, checking whether everything was able to run is a little easier. If function pointers are used (with "const" qualifiers for safety-critical systems), the checking does become a little more difficult. Because this type of software architecture can get stuck at one code location in the main loop forever, it is important to have a periodic interrupt check that the main code is still able to run.

For both types of systems it is always good to have an external watchdog circuit that can reset the system (or take other action) if the code appears to quit running altogether.
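
One way to sketch the "main code can run" check for a simple scheduler is shown below; it assumes a periodic timer interrupt, a hypothetical hardware watchdog service routine, and an application-defined safe-state action.

#include <stdint.h>

extern void watchdog_kick(void);      /* hypothetical hardware service  */
extern void enter_safe_state(void);   /* application fail-safe action   */

#define MAX_MISSED_CHECKS  3u

static volatile uint32_t main_loop_counter = 0u;

/* Called once per pass through the main loop. */
void main_loop_alive(void)
{
    main_loop_counter++;
}

/* Periodic timer interrupt: the watchdog is fed only while the main loop
   keeps making progress.                                                  */
void alive_check_isr(void)
{
    static uint32_t last_count = 0u;
    static uint32_t missed     = 0u;

    if (main_loop_counter != last_count)
    {
        last_count = main_loop_counter;
        missed     = 0u;
        watchdog_kick();
    }
    else if (++missed >= MAX_MISSED_CHECKS)
    {
        enter_safe_state();   /* or stop feeding and let the watchdog reset */
    }
}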

Another aspect of execution is making sure that the timing you expect is real. In other words, if an important sequence of code needs to run every 10 ms, how do you know it really is 10 ms, plus or minus some margin? For this type of case it is a good idea to have an external timing circuit that provides a reference that can be checked. A waveform that is an input to your processor could be checked to make sure it matches. For instance, if you have a simple scheduler that runs every 10 ms, you could provide an input signal with a period of 10 ms. Based on the acceptable margin, a calculation can be made of how many consecutive "low" or "high" samples should be read at the start of the loop for the timing to be "accurate enough." When the input is read it should be either low or high for several consecutive samples, and then shift to the other value for a consecutive number of samples. Any condition where the input changes more often than the expected number of consecutive samples could indicate a case where a small time shift is required because the clock input is aligned with our loop timing.

If the timing accuracy has some flexibility, an output at the start of the main loop could be used to charge an RC circuit. Based on the tolerances and accuracy, if the circuit is not recharged through the output in time, an input could be latched to show that the time expired, much like an external watchdog circuit but without the reset.

Any of these or other methods could be used; the important thing is to check that all the code is able to run and that its execution rate matches what is required.

5.14 Strategy 14: Remove “Dead” Code

Another strategy is to remove any code and/or functions that are not currently called by the system. The reason is to ensure that these functions cannot accidentally start executing; they are not covered by the testing that is being done, so executing them could certainly lead to unexpected results!

The easiest way to remove "dead" code that is not currently executed is to put conditional compiles around the block of code. There may be special debug or unit test code that you want to include in internal builds but never intend to release in the final product. Consider the following block of code:

#if defined (LOGDEBUG)
index = 20;
LOG_Data_Set( *local_data, sizeof( data_set_t ));
#endif

This code block is compiled whenever a debug version is created, where the conditional definition "LOGDEBUG" is defined at build time. However, a situation could arise where a developer defines "LOGDEBUG" elsewhere for another purpose and this code gets included unexpectedly! Where there are multiple conditional compiles associated with internal releases, consider doing something like the following code block:

#if defined (LOGDEBUG)
#if !defined(DEBUG)
neverCompile
#else
index = 20;
LOG_Data_Set( *local_data, sizeof( data_set_t ));
#endif
#endif

This code block helps when multiple conditional compiles exist for different features. If "LOGDEBUG" gets defined but is not part of an overall "DEBUG" build, then a compiler error occurs when it is compiled. Never allowing "DEBUG" to be defined in externally released software deliverables is a good way to make sure these code segments do not end up in the final deliverable, and an excellent way to add extra protection to conditional compiles.

5.15 Strategy 15: Fill Unused Memory

For nonvolatile memory that contains program code, filling unused memory with meaningful data is a good idea. One older processor family chose to have the machine opcode 0xFF act as a "no operation" instruction: it would use a clock cycle and then go on to execute the next instruction! For any processor architecture it is good to protect yourself against the unexpected condition where the program counter gets set to an invalid address.

When the program image is built and linked, it is a good strategy to fill the unused memory with instructions that cause the processor to reset. There are opcodes that can be used to trigger an illegal or undefined-instruction interrupt, and the handler for that unexpected interrupt can perform the reset. Alternatively, depending on the processor core, the fill could consist of instructions that perform a software reset directly. Executing in invalid locations is never a good situation, so determine the best course of action for your safety case!
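
As an illustration of the unexpected-interrupt approach, the handler below is installed for the illegal or undefined-instruction vector (and any other unused vectors); the reset mechanism is a placeholder because the real one is processor specific.

#include <stdint.h>

/* Placeholder for the processor-specific software reset, for example
   writing a key value to a reset-control register.                    */
extern void request_system_reset(void);

/* If the program counter ever ends up in filled (unused) memory, the fill
   opcodes trap to this handler and the system is forced through a clean
   reset rather than executing garbage.                                    */
void unexpected_interrupt_handler(void)
{
    /* Optionally latch a fault code in no-init RAM for diagnostics first. */
    request_system_reset();

    for (;;) { }   /* never return into the invalid code path */
}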

5.16 Strategy 16: Static Code Analysis

The last strategy to use with safety-critical code development is to run a static code analyzer when the code is compiled. Many different static code analysis packages exist for C and C++ that also conform to published standards such as MISRA C (discussed in Strategy 4: Specific variable value checking). Irrespective of the checking done as part of static code analysis, there should not be any warnings when the analysis is complete.

Static code checkers typically include a way to "ignore" certain warnings in the code. Since many of the checks can be considered "optional" or "good practice," there will be instances where the code really is intended to be the way it is written and changing it to match the checking standard is not optimal or feasible. In these situations it is important to document in the code exactly why it is written that way and then include the appropriate "ignore" directive immediately preceding the offending line of code.
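
The suppression syntax is tool specific, so the sketch below uses a placeholder directive purely for illustration; substitute the "ignore" mechanism your analyzer actually provides, and keep the written justification next to it.

#include <stdbool.h>

extern bool flush_log(void);   /* hypothetical helper used for this example */

void enter_safe_state_logging(void)
{
    /* Deviation: the return value of flush_log() is deliberately ignored
       because the system is already transitioning to the safe state and no
       recovery action remains. Recorded per the project deviation process. */

    /* ANALYZER_IGNORE(unused-return-value) -- placeholder directive; use the
       suppression syntax of your static analysis tool here.                 */
    (void)flush_log();
}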

Exercises

  1. Q: Describe the progression leading to system failures.
    A: Faults could lead to errors, which could lead to failures.
  2. Q: Why would you want to add an additional processor to a safety-critical system's architecture?
    A: To provide a way for functionality to continue in case the primary processor becomes unusable or corrupted.
  3. Q: What does the word "fail-safe" mean?
    A: The term "fail-safe" is used to describe a result where the system always maintains a safe state, even if something goes terribly wrong.