Antony Gratus Varuvel and Rajendra Prasath
Indian Institute of Information Technology (IIIT), Sri City, Chittoor, Andhra Pradesh, India
Reliability, Availability, Maintainability, and System Safety (RAMS) are the driving factors in modern scenarios for sustenance and market share, wherein an equal amount of technological expertise is assumed to be available within competing concerns/firms. Notwithstanding this, RAMS in the military, nuclear, medical, and aerospace sectors is always treated as being equally important to functionality [1]. A design primarily relies on the technical expertise of the people involved. The inherent reliability and safety of the system are both largely determined during the early design phase of a capital project. Though the RAMS requirements are, in most cases, not stated explicitly by the customer/end user, the expectation is to incorporate built-in RAMS in the products by design. Involvement of RAMS at the early stage of design supports the organization in achieving strategic goals, ensuring core competency, maintaining excellence in service, and improving cost-effectiveness [2]. On the other hand, not having the RAMS concept in place in the early stages of design could burden the programme with the following consequences: a) rejection by the customer; b) upgrades required to meet RAMS demands; c) failure to comply with certification norms; d) loss of market share and competitiveness; e) higher modification cost; f) loss of trust and loyalty among customers; g) time overrun; h) cost overrun; i) inability to attain the break-even point; and j) increased warranty and support cost.
Hence, it is essential to cater for RAMS, as part of design, from the early design stages through development and production, and even during phase-out of the products. Extensive use of RAMS as part of the design process has a positive effect that is far greater than is commonly anticipated. With this perspective, details of RAMS in the area of optical systems and networks are elaborated in the following sections. An extensive outline of the usage of optical networks for very high speed transfer of data is provided in [3]. This chapter aims to incorporate RAMS capabilities within the system, so as to ensure that the optical network/system is highly dependable.
The need for speed in data transfer, with the highest possible accuracy and without any data dropout, has necessitated deep research and development in the digital communication arena. While improvement of existing physical layers towards more secure and deterministic data transfer is being undertaken (e.g., Ethernet media), the requirement to transfer extremely high density data mandated a new physical communication medium that can transfer data at the speed of electromagnetic waves. With these demanding requirements, the optical medium has been explored and is being standardized. Many technological advancements have been achieved by researchers in this domain. Beyond the fact that the speed and bandwidth of an optical communication system are higher than those of other demonstrated data transfer protocols and media, such as High Speed Deterministic Time-Sensitive Ethernet, it is the reliability of optical networks that ensures compliance with user/usage requirements. Unlike other physical media, a single strand of optical fiber can serve a large number of connected terminals/users. There are many technologies in the optical domain, from the perspective of physical components and communication protocols. This chapter is dedicated to analysing all such technologies of optical networks through the RAMS lens. More attention is paid to Reliability, which is the critical determinant for the other disciplines: realisation of a product with the highest achievable reliability brings significant advantages in the Availability, Maintainability, and Safety aspects as well. A trade-off study between energy efficiency and reliability for optical networks has been carried out in [4]. Detailed emphasis on RAMS attributes is given in this chapter.
Lack of adherence to, or ignorance of, the tasks that must be carried out during the different stages of a product life cycle with RAMS capabilities/features will reduce the usefulness of the products. It is therefore essential to identify clearly the scope of any RAMS project, in line with the other functional requirements. This chapter focuses on the following objectives:
Most commercially available products are designed for a short life span, whereas in the case of optical networks and components, the entire life cycle of the product is treated with the utmost importance owing to its critical functionality and the impacts resulting from downtime. Phasing out of products is also a prime concern for products whose constituents are hazardous and/or not environmentally friendly in nature. RAMS aspects are applicable to, and should be considered in, every phase of the product or project life cycle.
Building a product with the RAMS concept is found to be a challenge in comparison to building a product with functionality and features alone. Hence, this chapter does not address functionality features and their incorporation as part of the product life cycle. Instead, importance is given to a detailed study and analysis of the RAMS concepts during the life cycle of the product. Treatment of RAMS problems at a later stage of a project calls for exorbitant resources. The curve shown in Figure 15.2 depicts the cost associated with any minor change in the RAMS program at a later stage of a project. The typical cost incurred to achieve high reliability is discussed in [5]. A flowchart showing the interrelations among RAMS and the realization process is given in Figure 15.1.
Where the objective is to improve the cost-effectiveness of a program constrained by cost and timeline, adoption of the RAMS concept offers the way forward. The advantages of carrying out RAMS activities in parallel with design activities are: a) uncovering design deficiencies; b) improving cost-effectiveness; c) increasing system availability; d) achieving optimal design; e) reducing system down time; f) meeting or exceeding user requirements; and g) complying with certification norms. The subsections of this section deal with the attributes of RAMS and their importance.
By definition, reliability is the probability that an item will perform its intended function, under the stated environmental conditions, for a specified period of time. One has to consider all the variables in the above definition, such as environmental conditions (usually, a combination of variables) and duration. Achieving the reliability goals requires strategic vision, proper planning, sufficient organizational resource allocation and the integration of reliability practices into development projects.
Figure 15.2 shows the relationship between Reliability and the associated Cost at various phases of a product timeline, extending from T0 to product maturity. From the figure, it is clearly evident that the additional budget required is exorbitantly high in comparison with the small increase in reliability obtained, as the cost of reliability improvement increases exponentially over time. It should also be noted that reliability cannot be improved to 1, which is idealistic in nature. The total cost of improvement rises ever more steeply as the present system reliability increases from 0. From the curve it is clearly evident that the same incremental change in reliability requires an additional budget of C1 initially and C2 at a later stage in the program. Hence reliability improvement should be initiated, studied, and analyzed only by considering the following critical factors: a) present system reliability; b) economic zone of operation; c) minimum essential reliability required; d) maximum desirable reliability feasible; and e) optimum achievable reliability.
Availability often signifies the state of the system "On Demand". In a simple view, availability will be higher if the reliability of the component is higher: for a repairable item with exponentially distributed failure and repair times, the steady-state availability is MTBF/(MTBF + MTTR), so a larger MTBF (higher reliability) pushes availability towards 1. But reliability is not the only factor which ensures the availability of the system. Maintainability is the other important factor which is to be considered and accounted for when predicting/estimating/assessing system availability. Availability could be pulled down towards "0" if maintainability is not ensured, resulting in a very high Mean Down Time (MDT). Availability is always greater than or equal to reliability. For items which show more failures, availability can be improved by reducing the repair/replacement time.
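As a minimal sketch of this relationship, the steady-state (inherent) availability of a repairable item can be computed from its MTBF and MTTR; the numeric values below are assumptions chosen purely for illustration, not data for any particular optical component.

```python
# Sketch: steady-state (inherent) availability of a repairable item.
# The numeric values are illustrative assumptions, not field data.

def inherent_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """A = MTBF / (MTBF + MTTR) for a repairable item."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

if __name__ == "__main__":
    mtbf = 50_000.0   # assumed MTBF of an optical line card, hours
    mttr = 4.0        # assumed mean time to repair, hours
    print(f"Availability = {inherent_availability(mtbf, mttr):.6f}")
    # Reducing MTTR (better maintainability) raises availability further.
    print(f"With MTTR = 1 h: {inherent_availability(mtbf, 1.0):.6f}")
```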
The third parameter of RAMS is Maintainability. Maintainability and Reliability are the basic design drivers in addition to functionality. All three parameters (Functionality, Reliability & Maintainability) are considered in parallel. Ignoring maintenance factors in the initial phases of design will result in time and cost overruns at a later stage, which would not be acceptable to customers beyond a certain threshold. As a rule of thumb, a lower MTTR should be allocated to those components whose failure rates are much higher, and vice versa: components with a higher failure rate (lower MTBF, in general) are statistically more likely to fail frequently, and hence those components are to be repaired/replaced within the shortest possible time to maintain the availability of the system. For an optical system, maintenance poses a major challenge, and hence sufficient design features are to be built in to identify, locate, confine, and report a fault, preferably well in advance of the onset of failure.
From the perspective of reliability analyses, proper functioning of any system should be ensured. However, improper/inadvertent functioning of the components should also be eliminated/mitigated to ensure the effectiveness of the system. Safety analysis/assessment plays a vital role in this context. To ensure safe operation, all unsafe conditions are to be mitigated by implementing various safety features such as serial or parallel redundancy of components, fault tolerance techniques, safety interlocks, etc. As a result, reliability will be higher if safety is higher, whereas the reverse is not necessarily true. In safety-critical systems/applications, where the risks and casualties involved are higher, it is often the case that a risk-based decision would be considered for the implementation of the proposed design.
Risk treatment is an important aspect as regards the acceptability of the system design, which is safety‐critical in nature. Detailed exploration in terms of severity (consequences) and estimation/prediction of likelihood of occurrence is necessary to quantify the risk. With the increasing trend towards use of optical networks, even for safety‐critical applications, aspects related to risk should not be ignored.
Faults and failures downgrade reliability and hence the dependability measure of optical switches and networks [6]. There are many physical components utilized in an optical network, depending on the technology adopted. Irrespective of the technology adopted, the basic circuitry at the source and destination of an optical network is assumed to be the same for all types of optical network.
Physical or optical damage to optical devices and components is a major concern. Owing to the sensitivity of light propagation to defects, repair and/or replacement of devices is costly and time-consuming, depending on the location of the fault. Any mechanical break in the circuitry is generally easily detected using (automated) supervisory techniques, and repairs can be initiated quickly if the fault is localized. When extensive damage has occurred over a long length of optical fibre, the link will be unfit for carrying traffic and repair may be almost impossible; under these conditions, replacement is often undertaken. This results in major downtime of the system and, depending on the application, severe economic impact.
Propagation delay, power dissipation, and crosstalk noise are affected due to less reliable interconnects as discussed in [7]. The reliability of optical fiber and waveguide sensors becomes increasingly important as they are more frequently used in applications where a failure of the (often inaccessible) sensor might have greater consequences on cost and/or safety. The reliability of the system is directly related to the operational effectiveness of the system. System/cost‐effectiveness is a measure of the ability of an item to meet service requirements of defined quantitative characteristics.
Optical networks are usually chosen for applications where high bandwidth and fast communication of data are required. The fields adopting optical switches and systems will usually demand high reliability. The initial cost of optical switches and systems is comparatively higher than that of conventional methods of communication and switching. Because of this relatively costly investment, operational effectiveness is expected to be greatly enhanced, and failures greatly reduced, compared with conventional switches and transmission media. Given the bandwidth and the number of users/systems connected, optical networks and switches are expected to be more reliable.
In summary, reliability is highly significant in optical networks and switches due to the following aspects:
Reliability could be defined as the successful operation of all functional components. The type and number of components required to realize the intended function will determine the reliability of the functional block. Hence, in order to establish the reliability of a given circuit or functional block, the first step is to list all the components associated with it. The significance of the reliability of an individual component varies based on the configuration within which the component is embodied: low reliability in a serial connection causes greater deterioration in system reliability than if the same part formed part of a parallel connection. A typical (representative) list of components of an optical system is shown in Figure 15.3.
As most raw data originate in the analog domain, an electrical signal is expected to be transmitted through the media. Electrical signals that are to be transmitted to some destination through an optical interconnect system must be converted to the optical domain for transmission. A brief description of each of the functional components is given below, for ready reference.
Electrical input is the raw data, acquired either from sensors or from other physical/behavioural changes, converted into an electrical signal. This is the signal or data which is required to be transferred through the media.
Driver circuitry processes the electrical signal and converts it into a bit stream using suitable encoding techniques. The data are then ready for transmission if the medium is electrical, or undergo optical modulation if the medium is optical.
The optical modulator converts the (often digitized and encoded) electrical signal into an optical signal according to the bit sequence in the electrical signal. After the optical signal has been generated, it is fed into the optical transmission path. Multiplexing techniques, typically WDM, are employed before feeding the signal to the optical path.
Optical couplers are structures used to inject the light into the optical system. The wavelength of the optical signal is changed in order to enable the receiver to respond selectively to the transmitted signal and hence receive only the intended signal.
Optical interconnects use guided wave and free space for signal transmission. Guided wave optics involves the use of waveguides to contain the optical signals within a board or package, or on a chip which consists of materials with a high index of refraction surrounded by a material with a lower refractive index. Free‐space optics utilize diffractive optics and conventional lenses or microlens arrays to guide single or multiple parallel optical beams in free space.
Optical switches are devices that can selectively switch light signals running through optical fibers or integrated optical circuits from one circuit to another. They are used in optical routing networks to route the light travelling in waveguides to a different location. An optical switch is simply a switch which accepts a photonic signal at one of its ports and sends it out through another port based on the routing decision made. An optical switch has one or more input ports and two or more output ports, and is usually referred to as a 1 × N or N × N optical switch.
The receiver side of the optical interconnect system is responsible for reconstruction of the electrical signal. Suitable decoding is required to be carried out at the detector.
An optical detector is the device that detects the light pulses and converts them into a photocurrent. An amplifier is finally used to amplify the photocurrent and provide the digital signal in the form of a conventional voltage/electrical output signal.
The degree of amplification is decided based on the strength of the demodulated signal and the requirements of the end receiver circuitry. A suitable amplifier can then be selected based on these parametric requirements.
To ensure compliance of an optical system with RAMS requirements, it is essential to fully understand the technologies of the components, the possible failure modes, and the expected failure rates. These parameters are typically influenced by the usage profile and environment. Depending on the technology, the optical path is often changed based on requirements and capabilities. With improvements and advancements in technology, the optical path/system is often offered in a single package encompassing the optical modulator and switch. While this option is highly miniaturized, failure modes and effects associated with any of the tiny fabricated components cannot be ruled out. Hence it is highly essential to identify the internal building blocks of the device(s) and the possible failure modes which can creep into them.
In general, there are two kinds of optical switches: O‐E‐O (optical‐electrical‐optical) switch and O‐O‐O (optical‐optical‐optical) switch, also known as all‐optical switch. An OEO switch requires the analog light signal first to be converted to a digital form, then to be processed and routed before being converted back to an analog light signal. An all‐optical fiber‐optic switching device maintains the signal as light from input to output. Traditional switches that connect optical fiber lines are electro‐optic.
The functional and component block diagrams within each type of switch are to be critically examined, and possible failure modes are to be considered well in advance. The failure rate, and hence the reliability, of each of these types of switch varies dramatically.
Although the optical signal is immune to electromagnetic interference such as electrical noise and lightning, the physical disturbance to any of the constituent parts of the optical switch and hence the system degrades the reliability significantly. The factors that can generally affect Reliability in an optical network are operating temperature, humidity, chemical environment, radiation loads, mechanical stress, vibration levels, and components operational conditions: optical power, current, voltage, frequency, etc.
Optical system components are to be precision‐controlled so as to ensure that the optical parameters are within acceptable limits. The functional components shall ensure that the expected optical wavelength is unaffected either by way of attenuation, distortion, or loss.
For the purposes of carrying out a RAMS study, the complete working mechanism of the optical components within the system shall be understood, along with significant details of interfaces and interconnects. The following details are further required to carry out the reliability tasks.
Effective RAMS studies/analyses/feedback will only be possible with a mutual understanding of the processes, failures, and failure modes by both the RAMS Team and the Design Team. Analyses carried out without the involvement of either team will not yield fruitful results in terms of optimization and compliance. The following section is aimed at ensuring compliance of the design with RAMS requirements.
During the various phases of the product life cycle, in this case the optical system, there are a number of Reliability, Availability, Maintainability, and System Safety analyses to be carried out which will add value to the development of the product. Activities are categorized under the acronym R A M S. The applicability of each activity differs depending on the phase of the life cycle. The list given below is elaborate but not exhaustive.
Reliability is conventionally defined as the probability of achieving the intended functions, for a duration of time, within the operating environmental conditions. Functional capability along with the required reliability requirements are to be captured and "designed in" at the conceptual stages of the project/product development, failing which an enormous amount of the life cycle cost will be spent on maintenance and availability enhancements. Reliability itself is a probability rather than a duration; when a time measure is needed, the MTBF is commonly used to characterize the life of components. For an exponentially distributed time to failure, the MTBF is the time by which approximately 63% of the population of components would have failed, i.e., only about 37% would survive.
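For the constant-failure-rate (exponential) case described above, a short sketch shows why only about 37% of a population survives to its MTBF; the MTBF value used is an assumption for illustration only.

```python
# Sketch: exponential reliability model, R(t) = exp(-t / MTBF).
import math

def reliability_exponential(t_hours: float, mtbf_hours: float) -> float:
    """R(t) for a constant failure rate lambda = 1 / MTBF."""
    return math.exp(-t_hours / mtbf_hours)

mtbf = 100_000.0  # assumed MTBF in hours (illustrative only)
print(reliability_exponential(mtbf, mtbf))        # ~0.368: about 37% survive to t = MTBF
print(reliability_exponential(0.1 * mtbf, mtbf))  # ~0.905 at 10% of the MTBF
```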
Owing to ignorance of the failure process and limited insight into the failure phenomena, resulting from the following factors, reliability is treated using both probability theory and random processes:
In general, failure events (malfunction/broken/open/short/change in value…) are treated as a probability and occurrences (time/cycle/km… to failure) are treated as a random variable.
Random variables are assumed to take values in accordance with some probability distribution. There are two major types of probability distributions. Commonly used probability distributions with which to depict the occurrence of failures are listed below. Details of each of the distributions can be found in any statistical handbook.
Those distributions are representative of failure phenomena over time. Components may exhibit any or several of the following failure-rate trends: a decreasing failure rate (DFR), a constant failure rate (CFR), or an increasing failure rate (IFR).
A typical bathtub curve (Figure 15.4) represents various stages of the component and applicable failure rates.
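The three regions of the bathtub curve can be illustrated with the Weibull hazard function, where the shape parameter beta selects a decreasing (beta < 1), constant (beta = 1), or increasing (beta > 1) failure rate; the parameter values below are assumptions for illustration only.

```python
# Sketch: Weibull hazard (failure) rate h(t) = (beta/eta) * (t/eta)**(beta - 1).
# beta < 1 -> decreasing failure rate (infant mortality)
# beta = 1 -> constant failure rate (useful life)
# beta > 1 -> increasing failure rate (wear-out)
def weibull_hazard(t: float, beta: float, eta: float) -> float:
    return (beta / eta) * (t / eta) ** (beta - 1)

eta = 10_000.0  # assumed characteristic life, hours
for beta in (0.5, 1.0, 3.0):
    rates = [weibull_hazard(t, beta, eta) for t in (100.0, 1_000.0, 5_000.0)]
    print(beta, [f"{r:.2e}" for r in rates])
```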
Typically, three types of configurations are possible for optical interconnect systems or optical networks, namely: a) series; b) parallel; and c) complex.
Series: An arrangement of functional components in series which fulfils the end objective(s) is referred to as a reliability configuration or Reliability Block Diagram. Failure of any one (or more) of the components of the serial link results in non-availability of the intended function. The greater the number of physically connected or logically and serially interconnected components, the lower the reliability. Mathematically, the reliability of "n" serially connected (independent) components is R_system = R_1 × R_2 × … × R_n.
Reliability is higher if the number of serially connected components is as low as possible. Adding functionally non-essential components as part of the serial link would adversely affect the reliability of the system. Parallel: Multiple components are added in parallel in order to ensure that the required functionality is achieved even when one or more of the success paths fail. A success path is a set of serially connected non-failed components which ensures that the end functionality of the system is achieved. There could be many such success paths in a parallel (redundant) system. The optimal number of such success paths is decided by the user requirements together with the criticality of the end system functionality. It should be clearly understood that multiple parallel paths require additional resources such as volume, power, weight, space, cost, maintenance effort, spare parts, etc. It is well established that the reliability enhancement obtained by adding parallel paths shows diminishing returns: each additional (independent, identical) path reduces the system unreliability by a factor equal to the path unreliability, so the incremental gain becomes progressively smaller.
Complex: There are often requirements wherein a simple series and/or parallel reliability configuration would result in either under-design or over-design. Complex configurations, which are a mix of interconnections, utilize fewer resources and are optimized for the given target requirements. The inherent reliability of each of the components also plays a major role in determining the need for additional redundant components. For the purposes of optical networks, the level of redundancy is determined by the criticality of the application in which the system is employed.
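A minimal sketch of how the three configurations compare numerically is given below, assuming independent components and an illustrative per-component reliability; the values are assumptions, not data for any specific optical part.

```python
# Sketch: series, parallel, and a simple "complex" reliability configuration.
from math import prod

def series(rels):
    """R_series = product of component reliabilities (independent components)."""
    return prod(rels)

def parallel(rels):
    """R_parallel = 1 - product of component unreliabilities (active redundancy)."""
    return 1.0 - prod(1.0 - r for r in rels)

r = 0.95  # assumed reliability of one optical component over the mission time
print("3 in series:  ", series([r, r, r]))          # ~0.857
print("3 in parallel:", parallel([r, r, r]))        # ~0.999875
# A simple complex mix: two parallel pairs feeding one common element in series.
print("complex:      ", series([parallel([r, r]), parallel([r, r]), r]))
```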
The design and configuration for reliability is obtained as given in Figure 15.5.
Various metrics of reliability are to be considered while drafting the reliability requirements for an optical network. Reliability requirements are often ignored at the initial stages of conceptualization, yet design corrections during the initial stages are cost-effective and easily achievable, rather than at a later stage of the project or product development. Where reliability requirements are not specified by the customer, it is advisable to conceive reliability requirements which are either better than those of competitors or captured using Quality Function Deployment (QFD) methods.
The following are the parameters which are significant to specify:
Reliability:
The reliability of the optical system shall be specified explicitly, stating all the relevant factors. Typically, it is best explained and specified when the reliability "R" at "t" hours with confidence level "CL" is given. Omission of any of these parameters would lead to misinterpretation and disputed justifications. A clear definition of reliability eases demonstration of the parameters without ambiguity and supports compliance with contractual obligations.
MTBF:
Although reliability is a useful measure for defining the target, it is more customary to specify the Mean Time Between Failures (MTBF) for a repairable component and the Mean Time To Failure (MTTF) in the case of non-repairable components. Another metric, the Mean Time Between Critical Failures (MTBCF), is also defined for cases where minor/acceptable failures are tolerated within the performance.
FFOP:
The failure-free operating period is the continuous period of operation in which no major, unacceptable system failure occurs. There may, however, be minor degradations/deviations which are still within the acceptable limits of operation. Faults are permitted, but failures are not allowed to occur during the FFOP. The major difference between a conventional failure-rate-based approach and the FFOP is that the former accepts no faults, whereas the latter can tolerate errors/faults; in both cases there are no failures.
MFOP:
Instead of using the more common terms such as MTBM/MTTF/MTBCF, another way of defining the mean time with an "acceptable level of degradation" is the Maintenance Free Operating Period (MFOP). There is an increasing trend towards using this term, especially in the aerospace industry, where fault-tolerant systems are embodied. If MFOP is specified as one of the reliability metrics, it is to be noted that minimal maintenance can be performed during the MFOP, in the period called the Maintenance Recovery Period (MRP). During the MRP only essential checks and inspections are carried out, such as CLAIR (Clean, Lubricate, Analyse, Inspect & Repair). If detailed checks or repair are required, they shall be performed during detailed investigations, and the duration of such extensive checks falls outside the purview of the MFOP (Figure 15.6).
MTTR:
Repair or replacement actions are usually performed once diagnosis has been carried out, with the choice between repair and replacement made from an economic point of view. For a capital item, the repair scheme is often chosen as the optimal solution, rather than replacement. MTTR (Mean Time To Repair) is the terminology used to specify how quickly a failed component can be restored to an operational state when maintenance actions are performed in accordance with the prescribed procedures and processes. There are well-established and documented maintenance schemes available which reduce the downtime of the system.
Confidence Level & Limits: The Confidence Level is a useful measure to define the expected range of a reliability parameter. It is a statistical measure which defines the analyst's confidence in the estimated statistic. The higher the Confidence Level, the wider the acceptable region (interval); a lower Confidence Level yields a narrower interval, but with less assurance that it contains the true value. The Confidence Level is bounded by limits on either side of the statistical distribution, called the confidence limits. Three types of confidence limits are specified in statistical terms: one-sided lower limits, one-sided upper limits, and two-sided limits.
While any of the reliability parameters are estimated from field trials, the Confidence Level aims to locate the true statistic. From the data, any of the limits mentioned above may be utilized in relating the estimated value to the true value of the statistic/parameter. Any estimated parameter will be suspect as regards its correctness and accuracy, owing to the variability involved in data collection, sample selection, model selection, appropriateness of the model, assumptions made, analysis parameters chosen, and interpretation of results. It is quite impossible to nullify all this variability, some of which is subjective and some objective. In statistics, a confidence interval is a type of interval estimate of a population parameter and is used to indicate the reliability of an estimate, whereas in the case of Reliability and MTBF, the confidence bounds are used to provide the acceptable variations in the estimated parameter. The variations could be either single-sided (one-tailed) or two-sided (two-tailed); Figures 15.7 & 15.8 refer to these respectively, where x refers to the LCL (Lower Confidence Limit) and y refers to the UCL (Upper Confidence Limit). To be more precise, confidence bounds provide information about the outcome of a trial if an experiment is repeated again and again. How frequently the observed interval contains the population parameter is determined by the confidence level (confidence coefficient). If confidence intervals are constructed across many separate data analyses of repeated (and possibly different) experiments, the proportion of such intervals that contain the true value of the parameter will match the Confidence Level; the confidence level thus indicates the probability that the interval captures the true population parameter, given the distribution of samples.
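As a hedged illustration of how such bounds are computed in practice, the sketch below derives a one-sided lower confidence limit on MTBF from a time-terminated test using the standard chi-square relationship for exponentially distributed failure times; the test exposure and failure count are assumed values.

```python
# Sketch: one-sided lower confidence limit on MTBF for a time-terminated test,
# assuming exponentially distributed times to failure.
# MTBF_lower = 2*T / chi2(confidence, 2r + 2), with T = total test time, r = failures.
from scipy.stats import chi2

def mtbf_lower_bound(total_test_hours: float, failures: int, confidence: float) -> float:
    dof = 2 * failures + 2
    return 2.0 * total_test_hours / chi2.ppf(confidence, dof)

T, r = 20_000.0, 2          # assumed test exposure (hours) and observed failures
print("Point estimate:", T / r)                         # 10,000 h
print("90% LCL:       ", mtbf_lower_bound(T, r, 0.90))  # noticeably below the point estimate
```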
Certain factors affect the confidence interval, including the size of the sample, the level of confidence, and the population variability. A larger sample size normally leads to a better estimate of the population parameter. In order to ensure that reliability is built into the optical system during design, the following activities are to be carried out.
Brief descriptions of the activities listed above are presented in the next subsections.
The starting point of the reliability programme, after capturing the user requirements, is Reliability Apportionment. This exercise results in the assignment of reliability at the second level of the system hierarchy. This level could further be apportioned to the third level, and so on, until the desired indenture level is reached for realization. There are several methods available to apportion the top-level reliability to the next level. In the aircraft industry, for example, the aircraft-level reliability requirement for the stated duration is further assigned/apportioned to the next (system) level. The design objective of each system should therefore consider the reliability goal specified, in addition to the functional requirements to be realized. Reliability Apportionment should be followed by Reliability Prediction during the preliminary design stage of systems/subsystems/items, to ensure that the chosen design would satisfy the reliability requirements apportioned. To achieve DfR (Design for Reliability), an apportionment methodology such as that shown in the flow chart of Figure 15.9, which represents a typical initial apportionment exercise, shall be adopted. The Weibull distribution can be effectively utilized to represent IFR, DFR, & CFR behaviour. For a detailed allocation methodology, [8] may be referred to.
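As one example of such a method (not necessarily the one depicted in Figure 15.9), an ARINC-style apportionment weights each subsystem's share of the system failure-rate budget by its currently predicted failure rate; all numeric values below are assumptions for illustration.

```python
# Sketch: ARINC-style reliability apportionment for a series system.
# The system failure-rate budget is split in proportion to each subsystem's
# currently predicted failure rate. Values are illustrative assumptions.
import math

def arinc_apportion(target_reliability: float, mission_hours: float,
                    predicted_lambdas: dict) -> dict:
    lam_budget = -math.log(target_reliability) / mission_hours  # allowed system failure rate
    total = sum(predicted_lambdas.values())
    return {name: lam_budget * lam / total for name, lam in predicted_lambdas.items()}

predicted = {"transmitter": 2e-6, "optical_switch": 5e-6, "receiver": 3e-6}  # failures/hour
allocated = arinc_apportion(target_reliability=0.99, mission_hours=1_000.0,
                            predicted_lambdas=predicted)
for name, lam in allocated.items():
    print(f"{name}: allocated lambda = {lam:.2e} per hour")
```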
Very often Reliability Prediction and estimation are misunderstood, if not erroneously interpreted, and on most occasions used interchangeably. Reliability Prediction is based on the "handbook data" approach on most occasions, where vendor-tested data are not available or the specific part of the design is not finalized. The failure rate/MTBF data required for Reliability Prediction would be drawn from any of the following sources. In order of priority, i.e., closeness to the real value with the fewest assumptions, these are: field tested, vendor tested, and handbook data.
It is impractical for any procuring agency to field test each and every component of the design, owing to programme priorities and the resources required. Hence, it is desirable to use vendor-provided data, if available. Failing that, the data for reliability prediction could be tailored from handbooks. MIL-HDBK-756B provides guidelines for carrying out reliability prediction. The following are typical methods used for the purpose:
Early prediction of reliability of a proposed design would be helpful in
There are well established tools and techniques available to predict, estimate, assess, and verify hardware reliability. However, the same is not applicable in the case of the software domain, for many reasons. Moreover, the definition of Software Reliability is the probability of failure‐free software operation for a specified period of time in a specified environment. Although Software Reliability is defined as a probabilistic function, and with the notion of time, it is different from traditional Hardware Reliability. Software Reliability is not a direct function of time.
As most of the present systems are being run in conjunction with software, Software Reliability is also an important factor affecting system reliability. It is very difficult to identify a system, which may be simple or complex, without being controlled by software. However, software reliability differs from hardware reliability in that it reflects design perfection, rather than manufacturing perfection. The high complexity of software is the major contributing factor to Software Reliability problems. There are many models to choose from for software reliability prediction. Careful selection of an appropriate model that can best suit the case is essential. However, measurement in software is still in its infancy, meaning that the models have excessive assumptions and limitations.
Software Reliability is an important attribute of software quality, together with functionality, usability, performance, serviceability, capability, installability, maintainability, and documentation. It is difficult to achieve high Software Reliability because the complexity of the software tends to be high. Developers tend to include more and more complexity in the software layers with the rapid growth of system size and functionality requirements, upgrading the software every now and then. Although the complexity associated with an optical system is expected to be much lower than that of many other complex applications, the role of software failures cannot be ignored. There are many SRGMs (Software Reliability Growth Models) published in the literature, each applicable for a given set of assumptions and boundary conditions, and they shall be appropriately selected and utilized for quantifying software reliability. A survey of software reliability growth models can be found in [9].
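As one commonly cited SRGM, used here purely for illustration, the Goel-Okumoto model expresses the expected number of faults detected by test time t as m(t) = a(1 - e^(-bt)); the parameter values below are assumed, not fitted to any data set.

```python
# Sketch: Goel-Okumoto software reliability growth model (one of many SRGMs).
# m(t) = a * (1 - exp(-b*t)); parameters a and b are assumed for illustration.
import math

def expected_faults(t: float, a: float, b: float) -> float:
    return a * (1.0 - math.exp(-b * t))

def conditional_reliability(x: float, t: float, a: float, b: float) -> float:
    """Probability of no failure in (t, t + x) given testing up to time t."""
    return math.exp(-(expected_faults(t + x, a, b) - expected_faults(t, a, b)))

a, b = 120.0, 0.02          # assumed total fault content and fault-detection rate per test hour
print(expected_faults(100.0, a, b))                 # faults expected after 100 test hours
print(conditional_reliability(1.0, 100.0, a, b))    # chance of a fault-free next test hour
```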
Derating is defined as the practice of operating a component/module/assembly below its rated or design limit, in order to have a wider tolerance margin in the case of excursions and to increase the life of the item. The main objective of derating is to reduce the stress levels applied to components, in this case the components of the optical circuitry. Derating analysis is carried out during the design stage of the item, and the reliability of the items improves accordingly. Stresses such as voltage, temperature, current, power, duty cycle, load, frequency of operation, etc. could be derated. Derating is one of the methods used to improve reliability.
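A minimal sketch of how a derating rule might be applied in a parts-stress review is shown below; the 0.7 derating factor and the part values are assumptions, not values taken from any specific derating standard.

```python
# Sketch: simple derating check - applied stress must stay below the derated limit.
# The 70% derating factor and part parameters are illustrative assumptions only.
def derating_ok(applied: float, rated: float, derating_factor: float = 0.7) -> bool:
    return applied <= derating_factor * rated

laser_driver_power_w = {"applied": 0.45, "rated": 1.0}   # assumed values, watts
print(derating_ok(laser_driver_power_w["applied"], laser_driver_power_w["rated"]))  # True
print(derating_ok(0.85, 1.0))  # False: exceeds the 0.7 * rated limit
```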
From the perspective of electrical components, detailed methodology has been drafted in [10].
For mechanical components, where the failure phenomenon is influenced by the mechanical and thermal loads/stresses applied and by the inherent characteristics of the material, reliability is determined by stress versus strength interference. This is because the failure mechanisms of mechanical components, such as fatigue, leakage, wear, thermal shock, creep, impact, corrosion, erosion, lubrication failure, elastic deformation, radiation damage, de-lamination, buckling, etc., depend upon the characteristics of the components chosen. These parameters can be described by probability distributions. When the strength of a material/component is less than the applied stress, the material fails. If the mean values of stress and strength are far apart and the variations about the means are small, the probability of failure can be reduced to a minimum. A plot using the standardized Normal distribution for both stress and strength, indicating the region of interference, is given in Figure 15.10.
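For the normal-normal case pictured in Figure 15.10, the interference reliability can be sketched directly from the means and standard deviations of stress and strength; the numbers below are assumed values in consistent units.

```python
# Sketch: stress-strength interference for normally distributed stress and strength.
# R = P(strength > stress) = Phi((mu_strength - mu_stress) / sqrt(sd_strength^2 + sd_stress^2))
from scipy.stats import norm

def interference_reliability(mu_stress, sd_stress, mu_strength, sd_strength):
    margin = (mu_strength - mu_stress) / ((sd_strength**2 + sd_stress**2) ** 0.5)
    return norm.cdf(margin)

# Assumed values, e.g. applied load vs. fibre strength in consistent units
print(interference_reliability(mu_stress=300.0, sd_stress=30.0,
                               mu_strength=450.0, sd_strength=40.0))  # ~0.9987
```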
Estimating the reliability of an item later during the design phase provides more insight than the Reliability Prediction carried out during the conceptual or PDR stage. The former uses logical and functional modeling, whereas the latter ignores these features of the design. This means that Reliability Prediction provides only an approximate (often too conservative) value of the reliability or failure rate of a system, by assuming the "all series" basic reliability modeling concept and ignoring the functional importance of each item involved in the design. With Reliability Estimation, the MTBF of a system/item can be established and compared with the apportioned values of the reliability metrics. The data for Reliability Estimation could be field tested, vendor specified, or handbook tailored, based on the appropriateness and availability of data.
In the process of reliability modeling, the logical and functional relationship with the system failure are given due importance. Understanding system functionality is a prerequisite for the RBD modeling. Complexity and inter‐relationship among the components are indicated and modeled carefully for a meaningful RBD‐based estimation. This would be a better and more accurate approach for estimating system reliability, compared to Reliability Prediction.
FMECA is a design validation tool which, when carried out in the initial stages of design, provides valuable information to the designers in identifying the potential critical failure modes, their effects on overall system functionality, and the possible mitigation methods. The potential failure modes could be identified for each component or function; accordingly, the FMECA could have functional failure modes or engineering failure modes as its starting point. There are two major types of FMECA: functional FMECA and design (piece-part) FMECA.
The outcome of an FMEA can be expressed in terms of the Risk Priority Number (RPN), which ranks the failure modes based on the Severity, Occurrence (criticality), and Detection provisions. Those components with a higher RPN, above the limit of acceptance from the risk point of view, could be considered for improvement/change in design, fault monitoring, fault tolerance, or redundancy.
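In the commonly used formulation, the RPN is the product of the Severity, Occurrence, and Detection rankings, each typically on a 1 to 10 scale; the failure modes, rankings, and acceptance threshold below are assumptions used only to illustrate the ranking step.

```python
# Sketch: Risk Priority Number in the common S x O x D formulation.
# Rankings and the acceptance threshold are illustrative assumptions.
def rpn(severity: int, occurrence: int, detection: int) -> int:
    return severity * occurrence * detection

failure_modes = {                      # mode: (S, O, D) - assumed rankings
    "connector_contamination": (7, 5, 6),
    "laser_wear_out":          (8, 3, 4),
    "fibre_micro_bend":        (5, 4, 3),
}
threshold = 120  # assumed acceptance limit
for mode, (s, o, d) in sorted(failure_modes.items(), key=lambda kv: -rpn(*kv[1])):
    value = rpn(s, o, d)
    print(f"{mode}: RPN = {value}{'  <- review' if value > threshold else ''}")
```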
If carried out by a team consisting of members from design, RAMS, and manufacturing, an FMEA will help in developing a highly reliable product. This exercise helps ensure that the potential failure modes do not appear in the field and that, if they do appear, there will not be any major safety/mission consequences, as a result of the detection and mitigation features having been incorporated. This is a proactive analysis for validating the design. If carried out systematically, FMECA provides valuable information such as potential failures, root causes of identified failures, critical components, safety/reliability of critical components, detection provisions, mitigation plans, and the (non-)acceptability of the failure modes in the design.
Criticality Analysis is a part of FMECA. This analysis is carried out to find potentially critical components with respect to the loss of functionality. Severity and probability of failure are the two governing factors for criticality analysis. Two criticality numbers can be obtained from the CA: the Item Criticality Number and the Mode Criticality Number. These numbers are plotted on a 2D chart with severity and probability of failure on the X and Y axes, respectively (see Figure 15.11). Those items/modes posing a danger to the system functionality with higher severity (classification I: catastrophic) and a higher probability of occurrence (> 0.7) are to be addressed with due concern.
FMET is a method which reveals inherent design weaknesses first predicted by the FMECA process. By exposing a design to a single or combined set of environmental/input conditions, a distribution of multiple failure modes could be obtained.
FMET could be useful to:
Assessment is a technique by which identification, quantification, and ranking could be carried out on any system under study. From the perspective of RAMS, assessment indicates the establishment of realistic/achieved objectives with actual data pertaining to it. A quantitative assessment of reliability usually employs mathematical modeling, applicable results of tests conducted, failure data, estimated reliability values, and non‐statistical engineering estimates. Testing is a prerequisite for any reliability assessment. The assessment could be carried out for hardware or software or both. Accordingly, the failure data need to be segregated and categorized.
Reliability assessment could be carried out in a sequential manner split into phases according to programme priority and attained system maturity. In every phase, achieved reliability would be established and system design improvements may be initiated to eliminate the failures observed, if applicable, and/or improvements/features could be added at the subsequent stages.
It is a known fact that to err is human, and it has been proved in many fatal accidents that a slip and/or mistake in human action was the cause. Owing to the severity of aircraft accidents, it is very much essential to assess the probability of human error and thus human reliability. On these grounds, the assumption that human reliability equals '1' is to be discarded. Methods have been established to assess human reliability for nuclear applications, and the approach for other safety-critical applications could be established along similar lines. A methodology for assessing human error probability for flight control applications is to be devised and assessed. This activity may be carried out if the prerequisites are met and the required data are available.
There are many databases compiled to find out the approximate value of human reliability (HEP – Human Error Probability). In simpler terms, it is the ratio of the number of errors to total opportunities for making the error.
Human error probabilities are studied in most domains because of the effects of the outcome. Sun et al. [11] discuss the quantification of HEP in railway dispatching tasks. Usually, HEP is given significance if the outcome of the slip/mistake results in loss of life, irrecoverable damage, or huge economic losses [12].
Reliability growth techniques enable management to plan, evaluate, and control the reliability of a system during its development stage. Any product developed with the reliability goal set forth would possibly not meet the goal at the first instance of development due to the introduction of errors, deficiencies in design & development, and manufacturing flaws [13]. It is a common practice to produce engineering items/prototypes to demonstrate the design and correct the design/manufacturing errors/flaws to meet the objectives in terms of functionality and reliability. With the above assumptions, reliability achieved during the initial phase of product development would certainly be lesser than the desired goal. Hence, further corrections/modifications would be introduced to enhance/improve the reliability or to achieve the set goal.
If the areas of improvement are identified clearly and modifications are implemented correctly, then there could be an improvement/growth in reliability compared with the initial design. An analysis called Reliability Growth Analysis (RGA) captures and estimates the growth established by way of design improvements. Testing is carried out on the prototypes to assess the achieved reliability and to identify the gap between the goal and the achieved reliability. Further to this, the causes of the lower reliability and the areas of improvement can be identified through failure analysis. FRACAS (Failure Reporting, Analysis, and Corrective Action System) results would be integrated with the RGA to drive system design improvements.
RGA calls for planned and unplanned upgrades of the system design. Well-defined requirements tracking and a FRACAS plan would result in a better RGA process.
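One widely used growth model, shown here only as an illustration, is the Crow-AMSAA (NHPP) form, in which the expected cumulative number of failures is lambda·t^beta and beta < 1 indicates reliability growth; the parameter values below are assumed, not fitted to test data.

```python
# Sketch: Crow-AMSAA (NHPP) reliability growth model.
# Expected cumulative failures N(t) = lam * t**beta; beta < 1 indicates growth.
# Parameter values are assumed for illustration, not fitted to test data.
def cumulative_failures(t: float, lam: float, beta: float) -> float:
    return lam * t**beta

def instantaneous_mtbf(t: float, lam: float, beta: float) -> float:
    return 1.0 / (lam * beta * t ** (beta - 1.0))   # 1 / instantaneous failure intensity

lam, beta = 0.5, 0.6   # assumed scale and growth parameters
for t in (100.0, 1_000.0, 10_000.0):  # cumulative test hours
    print(f"t={t:>8.0f} h  N(t)={cumulative_failures(t, lam, beta):6.1f}  "
          f"MTBF_inst={instantaneous_mtbf(t, lam, beta):8.1f} h")
```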
Life Data Analysis is an effective method in characterizing the life of a product. The unit of life could be expressed in terms of hours, cycles, meters, or any other metric which represents the life of an item. Data from testing and field usage could be combined appropriately to calculate life or estimate remaining life of an item.
Using LDA, it is possible to predict the life of all products in the population by fitting a statistical distribution (model) to life data from a representative sample of units. Parameterized distribution for the data set can then be used to estimate important life characteristics of the product, such as reliability or probability of failure at a specific time, the mean life, and the failure rate. Selecting a lifetime distribution that will fit the data is an important task in LDA. Life is to be modelled with the selected distribution and the parameters estimated from the distribution depict various parameters of life of the product.
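A minimal sketch of fitting a two-parameter Weibull model to a small set of assumed, complete failure times and reading off a reliability estimate is shown below; a real life data analysis would also handle censored/suspended units, which this basic fit does not.

```python
# Sketch: fit a 2-parameter Weibull distribution to complete failure-time data
# and estimate reliability at a time of interest. Failure times are assumed values;
# suspensions/censoring (common in real life data) are not handled here.
from scipy.stats import weibull_min

failure_hours = [1200, 1850, 2300, 2900, 3500, 4100, 5200, 6100]  # assumed data
shape, loc, scale = weibull_min.fit(failure_hours, floc=0)        # beta, (0), eta

t = 1000.0
reliability_at_t = weibull_min.sf(t, shape, loc=loc, scale=scale)  # R(t) = 1 - F(t)
print(f"beta = {shape:.2f}, eta = {scale:.0f} h, R({t:.0f} h) = {reliability_at_t:.3f}")
```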
Physics of Failure (PoF) is a technique that leverages knowledge and understanding of the processes and mechanisms that induce failure in order to predict reliability and improve product performance. This approach to reliability assessment is probabilistic in nature, based on modelling and simulation, and relies on understanding the physical processes contributing to the onset of critical failures. Evaluating materials, structures, and technologies is within the scope of the PoF methodology. Identification and elimination of susceptibility to potential failure mechanisms, in order to prevent operational failures, drives PoF. It focuses on characterizing the life-cycle usage and environmental stress/load profiles of an application and on understanding the cause-and-effect physical processes and mechanisms that cause degradation and failure in materials and components. PoF modelling facilitates reliability and durability assessment of design alternatives and trade-offs. Failures could be due to over-stress, wear-out, or variations.
The PoF approach to reliability aims to address the following aspects:
Designing any product involves cost, which often has to be traded off against other performance parameters. A trade-off is a scenario in which improvement in one aspect pulls down another, and hence a proper balancing act is needed to reach the optimally desirable point. It often implies a decision to be made with full comprehension of both the advantages and disadvantages of a particular choice. In the context of RAMS, every aspect – Reliability, Availability, Maintainability, and System Safety – is either directly or indirectly related to cost, and the relationship is generally directly proportional. For this reason, the various design alternatives are to be studied from the cost perspective, and an optimal value of the RAMS parameters can be achieved, with acceptable reductions, to contain the cost within the budget. If cost is not the criterion for selecting among alternative design methodologies (particularly applicable to military products, where RAMS is given the greatest importance), then the design with the best RAMS combination can be selected for realization. In order to solve these problems in a simpler way, functional requirements and RAMS requirements are to be kept separate initially: the former are addressed first, within the given budget, and the RAMS parameters are then studied further for improvement.
Availability of an item is determined by both Reliability and Maintainability. Analyses based on simulation of statistical behaviour of the systems under consideration are essentially carried out for the purposes of establishing the Availability of any system. Typical availability analyses include:
Assessment of availability plays a vital role in establishing the "on-demand reliability". The inherent availability of any item is dictated primarily by its inherent reliability and maintainability factors. The availability of simple configurations can be assessed using conventional methods. Markov analysis is one of the methodologies usually used to ascertain the availability of a system which contains n items, each with m states; the total number of possible system states is then m^n. While estimating availability using Markov analysis, the failure rates and repair rates are required as inputs, because the transition from a healthy state to an unhealthy state is governed by the failure rate and the reverse transition by the repair rate. For each state, the rates of transition into and out of that particular state are used to write differential equations as functions of the failure and repair rates. The differential equations formed for all the intermediate and final states can be solved to obtain the time-dependent probability of being in each of the states.
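A minimal two-state example (one repairable item, up/down) illustrates the approach; the failure and repair rates are assumed values, and the steady-state solution reduces to the familiar A = mu/(lambda + mu).

```python
# Sketch: steady-state availability of a single repairable item via a 2-state Markov model.
# States: 0 = up, 1 = down. lam = failure rate, mu = repair rate (assumed values).
import numpy as np

lam, mu = 1.0e-4, 0.25   # per hour (assumed: MTBF = 10,000 h, MTTR = 4 h)

# Generator (transition-rate) matrix Q: rows sum to zero.
Q = np.array([[-lam,  lam],
              [  mu,  -mu]])

# Steady state: p @ Q = 0 with the probabilities summing to 1.
A = np.vstack([Q.T, np.ones(2)])
b = np.array([0.0, 0.0, 1.0])
p, *_ = np.linalg.lstsq(A, b, rcond=None)

print("P(up) =", p[0])                   # numerical steady-state availability
print("mu/(lam+mu) =", mu / (lam + mu))  # closed-form check
```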
Markov Analysis is one of the means of analyzing the reliability and availability of systems whose components exhibit strong dependencies between their states. Other analyses for the availability assessment generally assume component independence. Typical cases of dependency are given below, for example:
There are usually two variables involved in Markov Analysis: the time and the state of the system. The time parameter could be in either the continuous domain or the discrete domain. The system may be in an acceptable operable condition or in an inoperable condition. When the state is unacceptably inoperable, a maintenance action is initiated. This enhances the availability of the system where redundancy is involved and failures occur. The major drawback of Markov methods is that Markov diagrams for large systems are generally exceedingly large, complicated, and difficult to construct and analyze. However, Markov models may be used to analyze smaller systems with strong dependencies requiring accurate evaluation.
The state transition diagram of Markov model identifies all the discrete states of the system and the possible transitions between those states. In a Markov process the transition frequencies between states depend only on the current state probabilities and the constant transition rates between states. In this way the Markov model does not need to know about the history of how the state probabilities have evolved in time in order to calculate future state probabilities. Although a true Markovian process would only consider constant transition rates, computer programs allow time‐varying transition rates. These time‐varying rates must be defined with respect to absolute time or phase time.
Reliability‐Centred Maintenance provides a structured framework for analysing the functions and potential failures for physical assets in order to develop a maintenance plan that will provide an acceptable limit of operability, with an acceptable limit of risk, in an efficient and cost‐effective manner. RCM ensures higher availability, with the available reliability designed in. This is an analytical process to determine the optimum failure maintenance strategies based on data/information to ensure safety and cost‐effectiveness. The maintenance strategies could be adopted and modified dynamically based on the behavior of the system. The main goal of RCM may not necessarily be avoiding the failure from occurring, but rather be avoiding the consequence of the functional failure. Task Evaluation helps in determining the appropriate proactive tasks for the functional failure and determines the best interval in which the Repair/Replacement tasks have to be performed. With the implementation of RCM, the availability, downtime, maintainability, cost‐effectiveness and efforts can be improved.
The dynamic RCM process involves systematic execution and evaluation of system design and life through FMECA and FTA. The maintenance plan gets modified according to the present state of the system.
This analysis can be carried out when failure time data are available for all possible failure modes of a product. There may be several failure modes for every component in a system which can result in failure of the system. The product could fail due to the occurrence of any one of the failure modes, but, for a non-repairable product, predominantly due to a single failure mode. From the perspective of reliability analysis, the failure modes compete to cause the failure of the product. To carry out this analysis, all the potential failure modes of the product must be known a priori, in addition to the failure rate and failure mode ratio. This can be represented in a reliability block diagram as a series system in which a block represents each failure mode. Pulido [14] details life data analysis using competing failure modes.
Competing failure modes analysis segregates data pertaining to each failure mode and then combines the results to provide an overall reliability model for the product. The first step in analyzing data sets with more than one competing failure mode is to perform a separate analysis for each failure mode. In the analysis for each failure mode, the failure times for the mode being analyzed are considered to be failures and the failure times for all other modes are considered to be suspensions. These are suspension times because the units would have continued to operate for some unknown amount of time if they had not been removed from the test when they failed due to another mode.
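A minimal sketch of the final combination step is shown below: once each mode has been characterized separately (here with assumed Weibull parameters), the modes are treated as a series system and the per-mode reliabilities are multiplied.

```python
# Sketch: combining independently analysed failure modes (competing risks / series model).
# Each mode's Weibull parameters (beta, eta) are assumed values from a per-mode analysis.
import math

def weibull_reliability(t: float, beta: float, eta: float) -> float:
    return math.exp(-((t / eta) ** beta))

modes = {"laser_degradation": (2.5, 60_000.0),
         "connector_failure": (1.0, 200_000.0),
         "solder_fatigue":    (3.0, 90_000.0)}

t = 20_000.0  # hours of interest
r_modes = {m: weibull_reliability(t, b, e) for m, (b, e) in modes.items()}
r_product = math.prod(r_modes.values())   # product of per-mode reliabilities
print(r_modes)
print(f"Overall R({t:.0f} h) = {r_product:.4f}")
```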
A major outcome of the analysis is to eliminate, if not mitigate/tolerate, the competing failure modes of the product, in order to improve availability.
Reliability is the most important dictating factor in the case of warranty. The cost involved in maintaining the warranty (by way of repair or replacement) depends on the failure rate of the product. The cost incurred for maintaining the warranty depends on the type of warranty policy which is adopted. In determining the value of a warranty, a Cost Benefit Analysis is used to measure the life cycle costs of the system with and without the warranty, to determine whether the warranty will be cost‐beneficial to the producer. Following are some of the essential factors to be considered when developing a warranty policy:
The type of warranty can be arrived at by taking into account all of the above factors, based on the optimal balance between the warranty cost to the company and the benefit to the customers. Detailed studies are required for the establishment of the warranty of items, covering failure rate, repair rate, repair cost, logistics, repair resources, inventory, etc.
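A highly simplified sketch of such a study is shown below: assuming a constant failure rate and a free-replacement policy, the expected warranty cost per unit is approximated by the cost per claim times the expected number of failures in the warranty period; all figures are assumptions.

```python
# Sketch: rough expected warranty cost per unit for a free-replacement policy,
# assuming a constant failure rate (expected claims ~= lambda * warranty period).
# All numeric values are illustrative assumptions.
def expected_warranty_cost(failure_rate_per_hour: float, warranty_hours: float,
                           cost_per_claim: float) -> float:
    expected_claims = failure_rate_per_hour * warranty_hours
    return expected_claims * cost_per_claim

lam = 5.0e-6        # assumed failure rate (MTBF = 200,000 h)
warranty = 26_280.0 # assumed 3-year warranty of continuous operation, hours
cost = 1_500.0      # assumed repair/replacement cost per claim
print(f"Expected warranty cost per unit: {expected_warranty_cost(lam, warranty, cost):.2f}")
```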
RAMS analyses help to reduce the LCC of the product/system. Based on the quantitative outcomes of the analyses, the performance of a product could be evaluated and its life cycle cost could be predicted. Trends in reliability could indicate growth or deterioration, based on the improvements carried out on the product/system. From a detailed study of the trends, the LCC could be forecast and the life of the product could be established before its failure. Recommendations in respect of maintenance policies and management decisions are also derived from the trend analysis, for effective management of the project. The data source for the trend analyses may be warranty, test, process, or field data. Trend analyses play a vital role in implementing an active reliability‐centred maintenance (RCM) system and enabling the prognostic nature of an Integrated Vehicle Health Management (IVHM) system. The following are typical parameters of interest from trend analyses, from a RAMS perspective: fault propagation, types of failure modes, competing failure mode, life consumed, remaining life, expected time to failure, achieved reliability.
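One simple and widely used trend check is the Laplace test on cumulative failure times. The sketch below applies it to hypothetical field data (the failure times and observation window are assumed) to indicate whether the data suggest reliability growth or deterioration.

# Sketch of a Laplace trend test on cumulative failure times, a common way
# to detect reliability growth or deterioration from field/warranty data.
# Observation is time-truncated at T; all values are hypothetical.
import math

failure_times = [120.0, 450.0, 800.0, 1500.0, 2100.0, 2600.0]  # cumulative hours (assumed)
T = 3000.0  # end of observation window, hours (assumed)

n = len(failure_times)
u = (sum(failure_times) / n - T / 2.0) / (T * math.sqrt(1.0 / (12.0 * n)))

# U well below 0 indicates reliability growth (inter-failure times lengthening);
# U well above 0 indicates deterioration; |U| < ~1.96 is consistent with no trend.
print(f"Laplace statistic U = {u:.2f}")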
Similarly to the concept and analytical activities related to Reliability, Maintainability also plays a significant role in dictating the dependability of the product. The following is a list of essential activities related to maintainability which would be carried out during the life cycle of the product:
As with Reliability, the Maintainability of the LRUs/systems/subsystems should be addressed from the initial stages of design. Maintainability Apportionment is the starting point of the maintainability analyses, given that the stated overall maintainability requirements are known. Both Reliability and Maintainability are to be built into the system from the design stage itself, to ensure better Availability of the system. The apportioned value of Maintainability could be expressed as MTTR, which consists of defect diagnosis, rectification, and retest, assuming that everything else required is immediately available. The other maintainability metric, MMH/FH, could also be apportioned.
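A common way to roll up (or, read in reverse, to apportion) MTTR is to weight each LRU's repair time by its failure rate. The sketch below illustrates this with hypothetical LRUs and figures; the names and numbers are assumptions only.

# Minimal sketch of MTTR roll-up: the system mean time to repair is the
# failure-rate-weighted average of the LRU repair times.
lrus = [
    # (name, failure rate per 1e6 h, MTTR in hours: diagnose + rectify + retest)
    ("optical transceiver", 25.0, 0.8),
    ("fibre patch panel",    5.0, 1.5),
    ("network switch card", 40.0, 0.6),
]

total_rate = sum(rate for _, rate, _ in lrus)
system_mttr = sum(rate * mttr for _, rate, mttr in lrus) / total_rate
print(f"System MTTR = {system_mttr:.2f} h")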
Maintainability Apportionment is needed for:
Assessment is more important for those cases where the parameters are demonstrable. As in the case of Reliability, Maintainability can also be assessed statistically by modelling. To assess maintainability statistically, a list of maintainability factors for assessment should be devised. The outcome of the maintainability assessment is the maintenance measure (MM), established with weightings assigned to the maintainability factors. Items with a higher MM should be attended to for further improvement in maintainability or a better design. The threshold for MM can be set based on customer requirements and the experience of the RAMS team with the maintainability factors.
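A minimal weighted‐scoring sketch of such a maintenance measure is shown below. The factor names, weights, scores, and threshold are all assumptions chosen only to illustrate how items might be flagged for maintainability improvement.

# Hypothetical weighted-scoring sketch for the maintenance measure (MM):
# each maintainability factor is scored (1 = good, 5 = poor) and weighted;
# items whose MM exceeds a threshold are flagged for design review.
weights = {"accessibility": 0.30, "testability": 0.25,
           "skill_required": 0.20, "special_tools": 0.15, "spares_lead_time": 0.10}

items = {
    "optical amplifier module": {"accessibility": 4, "testability": 2,
                                 "skill_required": 3, "special_tools": 4, "spares_lead_time": 2},
    "line terminal card":       {"accessibility": 2, "testability": 2,
                                 "skill_required": 2, "special_tools": 1, "spares_lead_time": 3},
}

THRESHOLD = 3.0  # assumed, set from customer requirements / RAMS team experience
for name, scores in items.items():
    mm = sum(weights[f] * scores[f] for f in weights)
    flag = "review design" if mm > THRESHOLD else "acceptable"
    print(f"{name}: MM = {mm:.2f} ({flag})")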
Maintainability Demonstration is a methodology to verify whether the stated and designed‐in maintainability requirements have been achieved. Maintainability Demonstration should be part of the Maintenance Test Plan, which should clearly bring out the phases of the plan, starting with Maintainability Verification through Demonstration to Evaluation of the stated Maintainability. The demonstration environment shall be representative of the working conditions, viz., the tools, support equipment, spares, facilities, and technical publications that would be required during operational service use.
The impact of the actual operational, maintenance, and support environment on the maintainability parameters of the system will be evaluated during this stage of the Maintainability Plan. Correction of deficiencies, if any, could also be part of the estimation. Evaluation is usually carried out in an integrated manner, simulating the actual environment. This exercise should follow on from maintainability prediction.
Similar to the case of the RBD, the logical maintainability activities are modelled in the Maintainability Verification phase during the design stage, commencing with the initial design and continuing through hardware development from components to the configuration item. The maintainability model is developed at this stage to verify the claim on Maintainability. At the least, a minimal maintainability verification test could also be carried out, if required, to support the design from the Maintainability point of view.
A better approach in realizing a product is to have both Reliability and Maintainability features built in. Addressing Maintainability early in the design stage is justified in order to achieve a higher level of accessibility, testability, and availability. Hence, Maintainability should be predicted very early in the design stage. The predicted numerical value indicates the acceptability of the design as regards maintainability and serviceability. A design without Maintainability designed in would eventually result in an exorbitant increase in life cycle cost in general, and in maintenance cost and effort in particular.
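The payoff of building both attributes in can be expressed through inherent availability, A = MTBF/(MTBF + MTTR). The short sketch below uses assumed figures to show how a designed‐in MTBF and MTTR translate into availability.

# Sketch linking designed-in reliability (MTBF) and maintainability (MTTR)
# to inherent availability, A = MTBF / (MTBF + MTTR). Figures are assumed.
mtbf = 20000.0  # hours (assumed)
mttr = 1.5      # hours (assumed)
availability = mtbf / (mtbf + mttr)
print(f"Inherent availability = {availability:.6f}")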
Maintenance activities fall under either proactive maintenance or reactive maintenance. If the maintenance activities are based on calendar time or operating time, this should be clearly highlighted in the maintenance philosophy or manual. The maintenance strategy could be arrived at based on numerical computations or, in cases where the necessary data are not available, on experience. The requirements for the maintenance tasks, such as maintenance facility, maintenance personnel (skills and skill level), tools, test equipment, spares, supporting items, and logistics, should also be outlined clearly.
In order to support the maintenance actions identified in the maintenance philosophy, a minimum level of inventory must be maintained. This affects the life cycle cost, including the warranty associated with the item. Hence the spare parts inventory should be optimized by considering all the other applicable factors of LCC. Standardization is one of the simpler means of achieving spare parts optimization. As in the case of warranty, spare parts optimization is also dictated by Reliability and Maintainability. In addition, interchangeability helps in reducing the inventory. These are design factors which should be conceptualized during the initial stages of product development.
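A frequently used starting point for spares optimization is Poisson‐based provisioning: stock the smallest number of spares that meets a target probability of not running out before resupply. The sketch below illustrates this with an assumed failure rate, fleet size, resupply interval, and confidence target.

# Hypothetical sketch of Poisson-based spares provisioning: the smallest
# stock level such that the probability of meeting demand over the
# resupply interval reaches a target confidence.
import math

failure_rate = 30.0e-6     # failures per hour per installed unit (assumed)
installed_units = 200      # fleet size (assumed)
resupply_hours = 2190.0    # ~3 months resupply interval (assumed)
target_confidence = 0.95   # assumed

expected_demands = failure_rate * installed_units * resupply_hours

def prob_enough(spares, mean):
    """P(demand <= spares) for Poisson-distributed demand."""
    return sum(math.exp(-mean) * mean**k / math.factorial(k) for k in range(spares + 1))

spares = 0
while prob_enough(spares, expected_demands) < target_confidence:
    spares += 1
print(f"Expected demands = {expected_demands:.1f}, stock {spares} spares "
      f"(confidence {prob_enough(spares, expected_demands):.3f})")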
Sustaining the RAMS objectives is possible only with continued monitoring and improvement throughout the life cycle. Irrespective of the maturity level of the system, failures may happen randomly due to the various factors associated with it. FRACAS is a closed‐loop system with a tracking and implementation mechanism for the failures encountered. In FRACAS, failures related to hardware and software are formally reported without losing evidence. Analyses are performed to the extent possible to understand and identify the failure cause, and positive corrective actions are identified, implemented, and verified to prevent recurrence of the failure. This system plays a major role in consolidating failure data from systems deployed across various territories.
Hence, in order to satisfy the safety goals, the risk assessment should be carried out by including software logic, interlocks, redundancy management, and failures.
Risk assessment is the major safety analysis process (Figures 15.12 and 15.13), which is essentially carried out to identify and mitigate/control safety hazards:
Common Cause Analyses (CCA) are gaining momentum among RAMS techniques owing to the severe effects that common causes can have across all systems. Independent failures have a low probability of simultaneous occurrence compared with dependent failures. Common cause failure modes result in the loss of independence, which dramatically increases the probability of failure. Zonal Safety Analysis (ZSA) is one of the most important system safety analytical methods used as part of CCA. ZSA, combined with two other methods – Particular Risks Analysis (PRA) and Common Mode Analysis (CMA) – forms the CCA. In summary, CCA consists of the following:
The common cause could either be internal or external. CCA is used to find and eliminate/mitigate common causes for multiple failures, usually external to the system/LRU under consideration. Intersystem effects are of major concern for the external CCA. Examples of common causes which are external are:
There are cases where the common causes are resident within, or applicable to, a single LRU (intra‐LRU). Those effects are also to be studied, owing to their criticality. The following are some typical cases of internal CCA:
From the aspect of the common power supply and grounding scheme for the systems on an aircraft, common causes such as over‐voltage, under‐voltage, and grounding faults, though applicable intra‐system, could be covered under external CCA.
The basic assumption of failure independence in the safety analysis is not always valid, owing to the system design and implementation. One of the most important modes of failure, and one which can severely degrade actual safety, is the common mode failure. This type of failure involves the simultaneous outage of two or more components due to a common cause. Common Mode Analysis (CMA) provides evidence that the failures assumed to be independent are truly independent. In practice, this analysis is extremely complex because of the large number of common mode failures that may be related to the different common mode types, such as design, operation, manufacturing, installation, and others.
Common Mode Analysis, which is the subset of CCA, is performed to verify that the events are truly independent. The effects of design, manufacturing, maintenance errors, and failures of system components which defeat the failsafe design should be analyzed as part of CMA. Consideration should be given to the independence of functions and their respective monitors.
The following are some examples of Common Mode Faults:
Fault tree analysis (FTA) is a deductive failure analysis which focuses on one particular undesired event and provides a method for determining the causes of this event. In other words, a fault tree analysis is a "top‐down" (vertical) system evaluation procedure in which a qualitative model for a particular undesired event is formed and then evaluated. The analyst begins with an undesired top‐level hazard event (Failure Approach) and systematically determines all credible single faults and failure combinations of the system functional blocks at the next lower level which could cause this event. The analysis proceeds down through successively more detailed lower levels of the design, until a primary event is uncovered or until the top‐level hazard event requirement has been satisfied. A primary event is defined as an event which, for one reason or another, has not been further developed. That is, the event need not be broken down to a finer level of detail in order to show that the system under analysis complies with applicable safety requirements. A primary event may be internal or external to the system under analysis and can be attributed to hardware failures/errors, software, or human errors.
The graphical representation of an FTA is a triangular, branching tree, from which the analysis takes its name. This format makes the analysis a visibility tool for both engineering and the regulatory agencies. It is concerned with ensuring that design safety aspects are identified and controlled.
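Quantitatively, once the minimal cut sets of the tree are known, the top‐event probability can be computed from the basic‐event probabilities. The sketch below does this by inclusion‐exclusion for a small, hypothetical tree; the event names, probabilities, and cut sets are assumptions and are not taken from the source.

# Minimal FTA quantification sketch: top-event probability from minimal cut
# sets of basic events, assuming event independence.
from itertools import combinations

basic_events = {"laser_fail": 1e-4, "backup_laser_fail": 1e-4,
                "power_bus_fail": 5e-5, "controller_fail": 2e-5}

# assumed minimal cut sets for a hypothetical top event "loss of optical output"
cut_sets = [frozenset({"laser_fail", "backup_laser_fail"}),
            frozenset({"power_bus_fail"}),
            frozenset({"controller_fail"})]

def event_product(events):
    """Probability that all listed basic events occur (independence assumed)."""
    p = 1.0
    for e in events:
        p *= basic_events[e]
    return p

# inclusion-exclusion over the (small) set of minimal cut sets
top = 0.0
for r in range(1, len(cut_sets) + 1):
    for combo in combinations(cut_sets, r):
        union = frozenset().union(*combo)
        top += ((-1) ** (r + 1)) * event_product(union)
print(f"Top-event probability ~ {top:.3e}")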
A Binary Decision Diagram (BDD), also known as a function graph or directed acyclic graph (DAG), is one of the techniques used to find the unreliability/unavailability of an event. The sequential propagation/occurrence of all possible events is captured in the BDD. Binary logic is employed, whereby the onset of any event, conditioned on the occurrence of the other events, is developed as an interlinked tree structure. The BDD approach is very useful in phased‐mission reliability analyses, because the system configuration, component behaviour, success criteria, and time vary from phase to phase. It is simpler to analyse the system if it is modelled as a logical reliability graph. The outcome of an FTA could also be obtained using the BDD technique, apart from the Exact method and the Esary–Proschan method. A BDD could be employed where sensitivity analysis, minimal cut sets, minimal path sets, and unreliability need to be computed.
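The calculation underlying a BDD is Shannon decomposition, f = x·f|x=1 + (1−x)·f|x=0, applied one variable at a time. The sketch below evaluates an assumed 2‐out‐of‐3 failure structure this way; it is an exhaustive illustration of the decomposition rather than a reduced ordered BDD implementation, and the structure function and probabilities are assumptions.

# Sketch of BDD-style evaluation by Shannon decomposition: the system
# unreliability is computed by conditioning on one variable at a time.
def structure(x1, x2, x3):
    """Returns 1 if the (assumed) system has failed: 2-out-of-3 failure logic."""
    return 1 if (x1 + x2 + x3) >= 2 else 0

def shannon(prob, assignment=()):
    """Recursively apply Shannon decomposition over the remaining variables."""
    i = len(assignment)
    if i == len(prob):
        return structure(*assignment)
    p = prob[i]
    return p * shannon(prob, assignment + (1,)) + (1 - p) * shannon(prob, assignment + (0,))

failure_probs = (0.01, 0.02, 0.015)  # assumed per-component failure probabilities
print(f"System unreliability = {shannon(failure_probs):.6f}")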
An accident most often occurs as the result of a sequence of causes. A hazard is defined as a condition, event, or circumstance that could lead to or contribute to an unplanned or undesirable event. A hazard analysis considers system states and failures or malfunctions. While in some cases risk can be eliminated, in most cases a certain degree of risk must be accepted, considering the cost, effects, and resources. The risk can be quantified by establishing the severity (consequence) and the probability of occurrence.
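One simple way to operationalize this quantification is a risk matrix that maps severity and probability categories to an acceptability decision. The categories, matrix entries, and example classifications in the sketch below are assumptions for illustration only and do not represent any particular certification standard.

# Illustrative hazard risk index sketch: each hazard is classified by severity
# and probability and mapped to an acceptability level via an assumed matrix.
probability_levels = ["probable", "remote", "extremely_remote", "extremely_improbable"]

# assumed acceptability matrix indexed [severity][probability]
matrix = {
    "catastrophic": ["unacceptable", "unacceptable", "review", "acceptable"],
    "hazardous":    ["unacceptable", "review",       "acceptable", "acceptable"],
    "major":        ["review",       "acceptable",   "acceptable", "acceptable"],
    "minor":        ["acceptable",   "acceptable",   "acceptable", "acceptable"],
}

def classify(severity, probability):
    """Look up the acceptability decision for a severity/probability pair."""
    return matrix[severity][probability_levels.index(probability)]

print(classify("catastrophic", "remote"))        # -> unacceptable
print(classify("major", "extremely_remote"))     # -> acceptable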
FHA is a tool to identify and evaluate hazards by rigorous examination and evaluation of the system and subsystem configuration and functionalities, including software. The system functionalities are broken down into sub‐levels of functionality and the hazards are then identified. FHA is an inductive approach, unlike the deductive FTA. The main focus of FHA is on the functions of the systems/subsystems. FHA is more productive and supportive of design if carried out during the very early phases. System hazards are identified by evaluating the safety impact of a function failing to operate, operating incorrectly, or operating at the wrong time. FHA may be carried out for a single system, a subsystem, or integrated systems. A basic understanding of the system concepts and functionalities, and experience with similar systems, is essential to generate the list of potential hazards. Once the functional hazards are identified, further analysis of the hazards is required based on the severity and assumed probability of occurrence. The following are the factors involved in FHA:
Hazard and Operability Studies (HAZOP) is a methodology for identifying potential hazards in a system and operability issues, where an operation is conceivable but not catered for in the design. This technique is well suited to process‐related domains, examining the system as built and the risk associated with it in the event of deviations from the design intent. The outcome of the studies is used for risk management. HAZOP is carried out for health‐ and safety‐related systems and subsystems, in order to assess the safety features built into the design against inputs/processes that are not within the design scope of the item.
The HAZOP procedure involves taking a full description of a process and systematically evaluating/questioning every part of it, to establish how deviations from the design intent can arise. Once identified, an assessment is made as to whether such deviations and their consequences can have a negative effect on the safe and efficient operation of the system. If considered necessary, suggestions for the actions to be taken to remedy the situation are also presented. Risk Management, i.e., Risk Assessment, Control, Review, and Communication, is also considered part of HAZOP. HAZOP could typically be categorized as follows: Process HAZOP, Procedure HAZOP, Human HAZOP, Software HAZOP.
Zonal safety analysis is carried out to ensure that every zone of the aircraft is safe and free from possible hazards. The objective of the analysis is to provide confidence that the installation of the equipment meets the safety requirements in terms of installation and interference. The effects of equipment failures should be considered with respect to their impact on other systems and structures falling within their physical sphere of influence. When an effect which may affect safety is identified, it is to be highlighted and the installation/design of the equipment revisited appropriately. The various zones are to be studied carefully, by clearly observing the systems in each zone and the constructive/destructive interference among them. The identified interference, with an initiator present, may lead to catastrophic consequences. Event Tree Analysis can also be used to study the end effects of the adverse events identified.
During the design and development of a new system, it is essential to carry out a Particular Risk Assessment, which addresses threats to the aircraft/system/subsystem from the outside environment (bird strike, lightning, hail) and threats to the systems from events originating in other systems. PRA is carried out as part of the System Safety Analyses, under CCA. Accordingly, the PRA could be either external or internal to the aircraft. These assessments are carried out to ensure the robustness of the design against the potential threats identified. All possible potential threats to the system under consideration must be compiled so that each can be evaluated. The results should be collated into a single point of reference: the ability to survive all known external threats. If PRAs have been accomplished on previous programs, they can be used as a starting point for the new assessment; the systems are then re-evaluated against the new design, and differences created by new design features are added to the list. The following are typical risks: fire, high energy devices, leaking fluids, hail, ice, snow, bird strike, tread separation from tire, wheel rim release, lightning, high intensity radiated fields, failing shafts, etc.
Each of the identified risks should be studied appropriately with respect to the design under consideration, and the simultaneous, or cascading, effect(s) of each risk, if any, are documented.
Software risk assessment plays a vital role in obtaining certification of safety‐critical installations/equipment/setups, as most of them are controlled by processor‐based applications. In addition, COTS components are nowadays also used in the military/automotive industries owing to the non‐availability of MIL‐grade components in the market. Hence the associated risk is multiplied many‐fold, along with that of the ported software.
Software is used in every application, from day‐to‐day commercial applications to communications, safety‐critical nuclear applications, and aerospace applications. As the safety of humans and the environment is the most important consideration, such control applications are required to meet the stipulated safety constraints/norms. In order to meet the safety norms, the risk associated with a software‐based application should be at an acceptably low level. To demonstrate this, risk assessment of the software needs to be carried out in quantifiable terms.
An event tree is a graphical, inductive, analytical representation of a Boolean logic model that identifies and quantifies the possible outcomes following an initiating event, by taking into account whether the safety barriers function or not. Forward logic is employed in event tree analysis and it is constructed through an inductive approach, unlike fault trees, which use a deductive approach. FTAs are constructed by defining top events and then using backward logic to define causes. However, ETA and FTA are closely linked: FTAs are used to quantify the failure probability/unavailability of the system events that appear in the event tree. The logical processes employed in FTA and ETA are the same.
The main objective of ETA is to identify design and procedural weaknesses and to determine the probabilities of the various outcomes of an anticipated accidental event. The secondary aim is to identify possible improvements in the protection systems and safety barriers, to reduce the probability of occurrence of the end event. The initiating event is usually the first significant deviation that may lead to unwanted consequences; it could be due to equipment failure, human error, or process upset. The consequences of a single occurrence of the initiating event are studied in relation to the other safety barriers.
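The arithmetic of an event tree is straightforward: the frequency of each end state is the initiating‐event frequency multiplied along the success/failure probabilities of the barriers on that branch. The sketch below enumerates the end states for an assumed initiating event and two assumed barriers; all names and numbers are illustrative.

# Event tree sketch: end-state frequencies from an initiating-event frequency
# and the success probabilities of the safety barriers (all values assumed).
initiating_freq = 1.0e-3          # hypothetical initiating events per operating hour
barriers = [("power monitor trips", 0.99),
            ("automatic laser shutdown", 0.995)]

def end_states(freq, barriers):
    """Enumerate end states as (path description, frequency) over barrier outcomes."""
    states = [("", freq)]
    for name, p_success in barriers:
        nxt = []
        for path, f in states:
            nxt.append((path + f"{name}: works | ", f * p_success))
            nxt.append((path + f"{name}: fails | ", f * (1.0 - p_success)))
        states = nxt
    return states

for path, f in end_states(initiating_freq, barriers):
    print(f"{f:.3e}  {path}")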
It is known that, due to the sensitivity of an optical signal to any imperfections or impurities, careful monitoring and control of the process parameters are crucial. In this respect, the critical parameters of optical components are to be measured and controlled. Statistical Process Control (SPC) techniques shall be employed to ensure that deviations remain well within the acceptable limits.
For this purpose, the critical physical parameters or components shall be drawn from the FMECA. Critical failure modes identified in the FMECA worksheets need careful attention. Variabilities are to be minimized to the extent technologically possible. Employment of SPC in optical systems is essential to reap the projected benefits.
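A typical SPC tool for such a critical parameter is an individuals control chart with 3‐sigma limits derived from the average moving range. The sketch below applies it to hypothetical insertion‐loss measurements; the values and the choice of parameter are assumptions.

# Sketch of an individuals (X) control chart for a critical optical parameter,
# e.g. connector insertion loss in dB (hypothetical measurements).
measurements = [0.31, 0.29, 0.33, 0.30, 0.32, 0.28, 0.35, 0.30, 0.31, 0.34]  # dB

mean = sum(measurements) / len(measurements)
moving_ranges = [abs(b - a) for a, b in zip(measurements, measurements[1:])]
mr_bar = sum(moving_ranges) / len(moving_ranges)

# standard individuals-chart estimate: sigma ~ MRbar / d2, with d2 = 1.128 for n = 2
sigma_hat = mr_bar / 1.128
ucl, lcl = mean + 3 * sigma_hat, mean - 3 * sigma_hat

out_of_control = [x for x in measurements if not (lcl <= x <= ucl)]
print(f"Mean = {mean:.3f} dB, UCL = {ucl:.3f} dB, LCL = {lcl:.3f} dB, "
      f"out-of-control points: {out_of_control}")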
Although the modulation and multiplexing, and the corresponding demodulation and demultiplexing, are predetermined and selected based on the application, the need to control and manage the characteristics of the communication medium on the fly remains. This is made possible by the increasing use of software or firmware components as part of the network or switch elements. The use of software/firmware enables flexibility in the switching and routing of data. It also results in miniaturization and in the optimization of resources, volume, power, and weight, and allows easier configuration and management of hardware resources without any downtime.
In order to achieve configuration management and control on the fly, there shall be enough provision to effectively utilize the bandwidth and efficiency of the optical network while optimally allocating the resources. At the same time, it is imperative to ensure that the RAMS aspects are also adhered to. Based on the area of application and the criticality of the functionality, the interactions between the hardware and software require significant consideration and characterisation.
Sufficient studies have been carried out to quantify the percentage of failures triggered by software alone; it is found that about 35% of errors are converted into faults, and eventually into failures, attributable to errors in pure software elements. Of the remainder, the portion of failures due to hardware–software interactions is expected to be greater in recent times, owing to the increasing functional dependency on software/firmware. From field experience, purely hardware‐related faults are expected to be minimal. The underlying reason for the smaller number of hardware‐related faults is that hardware components are often designed and tested for varying environmental applications. Inherent faults are screened out, and infant mortality and design‐related flaws are eliminated, by the qualification testing process. After qualification testing, every deliverable component is screened prior to shipping to customers.
In most cases, the application and other specific software components are loaded onto the hardware platform and then a hardware–software integration test is carried out. The involvement of more and more sophisticated firmware/software requires the certification norms to be fully complied with. In the aerospace domain, the Federal Aviation Administration (FAA) and the Federal Aviation Regulations (FAR) dictate the procedures and processes to be adopted to ensure that functionality is achieved without compromising safety. In addition, SAE ARP4761 and ARP4754 elaborate the detailed requirements for the acceptance of software‐intensive items.
There are numerous activities relating to RAMS, which are to be carried out during the early stages of design. Those activities during various phases of the product/project life cycle are enumerated in the preceding sections. Prior to the phase‐wise classification of RAMS activities, the hierarchical levels of activities are defined below.
From the aspect of realization of an optical system, all RAMS activities could broadly be categorized under two major headings: system‐level RAMS activities and item‐level RAMS activities.
The system‐level RAMS activities are highlighted in Figure 15.14. Upon completion of the system‐level RAMS activities, further RAMS aspects can be applied to the lower indenture levels, as depicted in Figure 15.15.
Taking an optical interconnect, network, or switch as a system, further indenture levels can be derived using a deductive, top‐down approach. The activities/analyses listed and described in the previous sections should be given the utmost importance, and the outcome of each analysis has to be fed back to the design team for design modification/improvement/mitigation/tolerance, as the case may be. The flow charts in Figures 15.14 and 15.15 give an insight into the typical activities of product development, indicating the relevant RAMS tasks alongside at the system level. It is to be noted that system realization will fall into, and overlap with, the item development stage, and accordingly the RAMS activities will require detailed design consolidation.
A large number of reviews will be carried out to ensure that the proposed design meets all the requirements stated explicitly by the user as well as any other implicit requirements.
Having ensured that the architectural concept addresses all the observations raised at the system level, compliance with those attributes at each item level should also be ensured. To carry out these tasks, clarity on the items is assumed to be available only when the system description is complete and precise. Activities could be initiated on an item‐by‐item basis once the item‐level design details are finalized. More detailed design analyses, simulation, test, evaluation, verification, and validation are required at the end component or item level. The applicable RAMS tasks at the unit/item/component level are enumerated in the form of a flow chart for easy visualisation. Any omission or delay in the execution of any of these tasks may delay the completion of the project, in terms of certification or handover to the customer.
In terms of functional and bandwidth capabilities, an optical network‐based system outweighs a conventional electrical and electronics‐based system. In addition, voluminous amounts of data are easily handled by the optical system. Careful attention must be given to the optical system, as the physical medium used for communication does not tolerate impurities.
Hence the usage environment plays a major role in determining the efficacy of optical systems for their intended purpose. Although the need for maintenance arises less often than with other conventional communication media, once it does arise, the downtime is relatively higher and requires specialized tools, techniques, and experienced personnel. The following points need to be addressed:
If all the RAMS attributes are addressed adequately, an optical system will be far superior in terms of performance and life cycle cost.
From the perspective of RAMS within the optical domain, the following aspects still need to be addressed:
The RAMS activities are essential in establishing the dependability of the system, from conceptualization through design and development to phase‐out. Requirement capturing is a vital task in any project. Unless the requirements are stated clearly, realization is impossible, meaning that the time, cost, and effort invested are wasted in realizing a product that will not meet the expectations of the customers. The capabilities of an optical network‐based system are realizable only if the RAMS features are built in. In addition to the functionalities, the RAMS parameters are also to be specified during conceptualization, failing which the item developed may be frequently down for maintenance, functionally fit but not user‐friendly, difficult to access, harder to troubleshoot, difficult to repair and maintain, or have an increased life cycle cost. These activities are to be carried out, depending upon applicability, at various stages and based on the program/project directives. At every stage of the RAMS activities, compliance with user requirements needs to be verified. Sequential integration of design changes into the RAMS analyses is essential to ascertain the RAMS parameters achieved during the evolution of the product. The applicable guideline documents are to be referred to when carrying out every analytical activity identified, and the procedural and systematic execution of the activities as per the applicable guidelines/handbooks is to be adhered to, to meet industrial standards. Timely and proper execution of the listed RAMS activities will have a greater impact and usefulness in realizing a dependable system, by ensuring that design feedback emanating from the RAMS activities is provided on time, with the broad objective of minimizing the life cycle cost and improving/maximizing system dependability. The advantages of an optical system will be sustainable if, and only if, the RAMS attributes are addressed and incorporated as part of the design.