In accelerated stress testing (AST), the product undergoes higher‐than‐normal operating conditions in an effort to extrapolate the product reliability in use conditions by precipitating faults in a compressed period. Typical accelerating stresses include environmental, electrical, mechanical, and chemical factors. The choice of accelerating stress levels and the allocation of available test units to these levels are the key considerations in executing accelerated test experiments. This chapter begins by introducing the concept of AST, which includes the highly accelerated life test (HALT) and highly accelerated stress screening (HASS). The reliability extrapolation is often justified on the basis of physical and chemical laws, or on a combination of such laws with statistical models derived from the lifetime data. The chapter describes the one‐stress Arrhenius law and multistress Eyring models, both of which are physics‐experimental‐based models. We also present three types of statistics‐based AST models: a scale and use rate acceleration model, a non‐parametric model, and semi‐parametric models that include the well‐known proportional hazard model (PHM). Both HALT and HASS are costly to implement due to the consumption of equipment, materials, and labor. Economic models are developed to guide the product manufacturer in realizing cost savings in the deployment of HASS. This chapter concludes by presenting a case study of implementing HASS and environmental stress screening (ESS) to improve the product reliability in a subcontract manufacturing facility.
Accelerated stress testing, or AST, can be defined simply as applying high levels of stress to a device under test (DUT) for a short period of time in order to extrapolate the lifetime of the device in the use condition, assuming it exhibits the same failure mechanisms. Also known as accelerating variables, such stresses include temperature, humidity, voltage, power, speed, mechanical force, and torque, among others. The key here is to understand the root causes of failures and their relation to the applied stresses, whether environmental, mechanical, or electrical. The main purpose of AST is to accelerate the reliability growth in the product development stage such that the reliability performance at the time of shipment meets the design requirement. Depending on the stress levels and testing time, two types of AST are generally adopted in industry: HASS and HALT. We elaborate on both techniques below.
As shown in Figure 5.1, the overall operating range of a product can be divided into three areas: product specification, operating margin, and the destructive margin. The product specifications represent the normal stress levels that the product is expected to use in the field. The operating margin formed between the lower (upper) operating limit and the lower (upper) specification limit is the safety area where the product is still able to operate normally, but with a higher failure rate. This is also the operating area of implementing the HASS process, a technique that uses stresses beyond product specifications in order to detect infant mortality failures and shorten the time to corrective actions in mass production.
The destructive margin formed between the lower (upper) destructive limit and the lower (upper) operating limit represents the area where the product operates under a much higher than usual stress, and will fail quickly. This is the testing domain of the HALT process that aims to precipitate the failure modes in a compressed time window. Sometimes HALT is also referred to as the test, analyze, and fix (TAAF) during the product design and development stage.
If a product operates beyond the lower or upper destructive limits, failure will occur immediately because the product suffers from failure mechanisms that may not happen if operating within the destructive margins. Although Figure 5.1 depicts a two‐sided (i.e. upper and lower) specification, many products like electronic devices are more susceptible to failures under high temperature and voltage than under low stresses. Mechanical components like bearings and gears are prone to failure only if they are subject to high rotating speeds or torques. Hence both HASS and HALT processes in these circumstances are single‐sided (right) tests rather than two‐sided experiments.
Both HALT and HASS are discovery testing as compared to compliance testing, like the reliability demonstration test. However, the following distinctions should be made between HASS and HALT. First, HASS is a reliability growth technique incorporated in the production phase to identify manufacturing defects that could cause early infant mortality, while the purpose of HALT is to identify the design flaws or weakness in the early development and prototyping phase. The result of HALT can be applied to the robust product design, like the design of the integrated chip wire‐bonding process that is less susceptible to environmental stresses (Yang and Yang 1999). Second, HASS is a quality control process and applicable to all finished products, while HALT is implemented on sampled products, usually in prototype forms. Third, HASS in general is more economical than HALT because the use of equipment and materials, such as chambers, shakers, instruments, power, and cooling facility (e.g. liquid nitrogen), is less intensive in HASS.
In certain applications where the operating limits are already extreme (i.e. the operating margin and the product specification overlap), ESS is often preferred over the HASS process. The key difference is that the stresses applied by ESS are within a product's non‐destructive operating range used in the field. Like HASS, the ESS program is also implemented in post‐production, in which 100% of produced units are subjected to more severe stresses than in normal service. However, each method achieves the goal differently, depending on application‐specific timelines, costs, and stress levels.
Other types of filtering processes exist to precipitate early infant mortality, such as burn‐in, thermal cycling, power cycling, and thermal shocks. For instance, Yan and English (1997) proposed a modified bathtub curve that integrates the concept of latent failures and obsolescence for microdevice manufacturing. The idea is to construct an integrated cost model used to determine both optimal burn‐in and ESS times. Ye et al. (2013) designed an optimal burn‐in process to minimize the warranty cost of a new product. Like ESS, the stress levels of these techniques are typically set at the boundary between the operating margin and the product specification. Table 5.1 summarizes the applications, stress levels, and impacted units of different accelerated testing and screening techniques.
Table 5.1 Summary of HALT, HASS, ESS, and other screening techniques.
Test type | Application | Stress level, timeframe | Impacts | Purpose | Cost ($/item) |
HALT | Design, development, and prototyping | Destructive margin, very short time | Small sample | Identify design defect and weakness | High |
HASS | Production stage | Operating margin | 100% products | Remove latent failure or infant mortality | Medium to low |
ESS | Production stage | Extreme limits of product specification | 100% products | Remove latent failure, infant mortality, and manufacturing defects | Low |
Burn‐in, thermal cycling, thermal shock, power cycling, voltage margining | Production stage | Extreme limits of product specification | 100% products | Remove latent failure, infant mortality, and manufacturing defects | Low |
In an effective accelerated test, the reliability expert chooses one or more stress types that cause the product to fail under normal operating conditions. The stresses are then applied at various accelerated levels and the time‐to‐failure or time‐to‐degradation of the units under test is recorded. For example, a product normally operates at 30 °C ambient temperature with 40% relative humidity (RH). If high temperature and humidity cause the product to fail more quickly, the product can be tested under 60 °C with RH = 70% or 100 °C with RH = 90% in order to accelerate the units to fail more rapidly. In this example, the stress types are temperature and humidity, and the accelerated stress levels are 60 and 100 °C for temperature and 70% and 90% for RH. Depending on the nature of product materials and the operating conditions, stresses used to accelerate the failure can be classified into four categories, as shown in Table 5.2: environmental, electrical, mechanical, and chemical stresses.
Table 5.2 Classification of stress types and their factors.
Stress type | Stress factors |
Environmental | Temperature, thermal cycle, humidity, thermal shock, vibration, sand and dust, nuclear and cosmos radiation, altitude |
Electrical | Voltage, current, power, frequency, electrical field, power cycling |
Mechanical | Force, friction, torque, fatigue, vibration, pressure |
Chemical | Corrosions, diffusions |
Environmental stresses are the factors that are closely related to the surroundings of the operating product. Temperature, humidity, and thermal cycling are the typical environmental stresses. It is important to determine the critical environmental stresses and assign appropriate stress levels that do not induce failure modes other than the ones at the use condition. For instance, the life of semiconductor devices is sensitive to the operating temperature and RH. However, if the same devices are used in satellites or space stations orbiting outside the atmosphere, beta radiation and gamma rays have the ability to ionize semiconductor materials, which results in new failure effects: (i) producing additional electron‐hole pairs and (ii) creating high‐energy charges to be injected into silicon dioxide regions, causing the degradation and failure of transistors. While humidity emerges as a key environmental stress for electronic devices used in vapor‐intensive tropical areas or rainforest regions, cosmic radiation becomes one of the primary environmental stresses when they are used in space engineering systems.
Electrical stress is applied to exercise a product near or at its electrical limits. Examples of electrical stress tests include simulating junction temperatures on semiconductors and testing the insulation of circuit breakers of high‐voltage transmission. Two basic types of electrical stress tests are available: voltage margining and power cycling. Voltage margining pertains to varying input current or voltage above and below nominal operating limits. A subset of voltage margining is frequency margining which is often used in stressing the speed (or the clock cycle) of microprocessors like the central processing unit. Other types of voltage margining include electric field. Power cycling consists of turning a machine's power on and off at specified levels with predetermined time intervals. It is often used to induce the solder joint failure by creating thermal fatigue when the temperature of solder joints increases and decreases cyclically with the on–off power. Electrical stress alone is not able to expose the number of defects commonly found under the vibration test or temperature cycling. However, it is often economical to implement the electrical stress along with other stresses to increase the overall effectiveness of ALT or HASS programs. This is because it is often required to supply electric power to products under test in order to stimulate soft or hard failures induced by mechanical or environmental factors.
Mechanical stress can be induced by force, torque, vibration, and thermal shocks, among others. The stress may be caused by internal or external factors. For instance, a solder joint crack in a circuit board is often induced by repetitive thermal cycling, which is an internal factor. The disconnection of a universal serial bus connector in a computer could be caused by frequent insertion and extraction operations, which is an external factor. The breakage of wind turbine blades is induced by material fatigue due to vibration and wind shocks repeatedly applied to the blade. Failures of mechanical systems are thus largely associated with external factors.
Corrosion and diffusion are two basic failure mechanisms induced by chemical stresses. Corrosion is a natural process converting a refined metal to a more chemically stable form, such as its oxide, hydroxide, or sulfide. It is the gradual destruction of materials by chemical and/or electrochemical reaction with their environment. Diffusion is the net movement of molecules or atoms from a high concentration region to a low concentration region as a result of random motion of tiny particles. Diffusion is driven by a gradient in the chemical potential of the diffusing species.
Different types of load profile are available in AST and their application depends on the stress type, availability of the test bed, and the product's operating condition. Based on the stress amplitude and its variation frequency, stress loading profiles can be classified into five categories: (i) constant stress, (ii) sinusoid stress, (iii) step stress, (iv) ramp‐up and dwell, and (v) zigzag and cyclic stress.
Figure 5.2 depicts the loading profile of a constant stress. As the name implies, the stress level remains constant during the entire test period. For ALT and HASS processes subject to a constant stress level, the decision variables are the level of the stress A1 and the duration of the test period t1. For instance, Yang (1994) proposed an optimal design of four‐level constant‐stress ALT plans that chose the stress levels, test units of each stress, and censoring times to minimize the asymptotic variance of the maximum likelihood estimators of the mean life. Stresses like temperature, humidity, voltage, or power can be set at a fixed level during the accelerated testing period.
A sinusoid stress profile is commonly used in fatigue testing of mechanical systems. As shown in Figure 5.3, the stress profile is characterized by two parameters: the amplitude A1 and the cycle period T. The mathematical expression of the sinusoid stress is
$$s(t) = A_1 \sin(2\pi f t)$$
where f is the frequency of stress and f = 1/T is in units of Hertz. Obviously, the frequency and the amplitude are the key factors that determine the severity of the stresses imposed on the testing units.
Figure 5.4 shows a step‐stress loading profile where a constant stress is applied for a period of time and then the stress is escalated to a new level for another period of time. This process is repeated until it reaches the end of the test time or all the stress levels have been applied. The decision variables in a step‐stress test include the number of stress steps and the amplitude and the duration of each step stress. For a test with k steps, where stress Ai is applied between t(i−1) and ti, the mathematical formula is
$$s(t) = \sum_{i=1}^{k} A_i \left[ u(t - t_{i-1}) - u(t - t_i) \right]$$
where t0 = 0 and u(t) is a standard step function with u(t) = 1 for t ≥ 0 or u(t) = 0 for t < 0. For example, for the three‐step stress test in Figure 5.4, the values of A1, A2, A3, t1, t2, and t3 need to be determined prior to the execution of the ALT or HASS test. As an example, Miller and Nelson (1983) designed an optimum plan for two‐step stress tests where all units are run to failure and the goal is to minimize the asymptotic variance of the maximum likelihood estimator of the mean life. In electrical stress testing, it is relatively easy for stresses like voltage, power, and electric field to be transitioned from one level to another instantaneously. Hence the step stress loading profile is commonly adopted in an electrical test.
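A step‐stress profile can be evaluated with a simple lookup. The sketch below assumes that stress Ai is applied on the interval from t(i−1) up to (but not including) ti, with t0 = 0, and that no stress is applied outside the test window. The function name `step_stress` and the three‐step voltage values are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right


def step_stress(t, amplitudes, end_times):
    """Return the stress at time t for a step-stress profile.

    amplitudes -- [A1, A2, ..., Ak], the stress level of each step
    end_times  -- [t1, t2, ..., tk], the end time of each step (t0 = 0 is implied)
    """
    if len(amplitudes) != len(end_times):
        raise ValueError("need exactly one end time per stress level")
    if t < 0 or t >= end_times[-1]:
        return 0.0  # assumption: no stress outside the test window
    # bisect_right counts the step boundaries that are <= t,
    # which is the 0-based index of the step active at time t.
    return float(amplitudes[bisect_right(end_times, t)])


# Hypothetical three-step voltage test:
# 10 V for 0-100 h, 15 V for 100-150 h, 20 V for 150-180 h.
levels = [10.0, 15.0, 20.0]
ends = [100.0, 150.0, 180.0]
profile = [step_stress(t, levels, ends) for t in (0, 99.9, 100, 149, 150, 179, 180)]
```

Each step boundary belongs to the step that starts there, which mirrors the convention u(t) = 1 for t ≥ 0.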
Figure 5.5 depicts the loading profile of a dwell and ramp‐up stress where the transition from the low stress level to the upper level is not instantaneous, but it takes an amount of time, i.e. t2 − t1 and t4 − t3, before reaching level A2 and A3, respectively. Hence, dwell and ramp‐up profiles are usually applied in environmental tests where different levels of stresses like temperature and humidity cannot be reached immediately due to the limitation of the test equipment.
There exist other types of stress profiles such as the zigzag stress in Figure 5.6 and the cyclic stress in Figure 5.7. These can be treated as variations of the dwell and ramp‐up stress in Figure 5.5. For instance, in the zigzag stress, there is no dwell period: the stress ramps up from zero to A1 over a period of t1 and then ramps down to level A2 between t1 and t2. This process is repeated until it reaches the end of the test time. The pattern of a cyclic stress test is similar to a zigzag stress profile, except that there exist dwelling periods once the stress reaches the peak or drops to the valley. Cyclic stress profiles are widely used in power cycling of electronics equipment, in which the products under test undergo a sequence of phases: power‐on at t = 0, warm‐up between 0 and t1, normal operation between t1 and t2, power‐off between t2 and t3, and complete shut‐down between t3 and t4. While A1 > 0, in a power cycling test A2 is usually set to 0.
A scale accelerated failure time (SAFT) model belongs to the statistics‐based reliability modeling approach. In an SAFT model, the lifetime T(s) at stress s is scaled by a deterministic factor that often depends on the applied stress s and certain unknown parameters. In the statistical literature, the SAFT model is also referred to as the accelerated failure time (AFT) model. Let Tn and Ts be, respectively, the lifetime under the normal use condition and the accelerated stress condition. Their relation is governed by the acceleration factor (AF) as follows:
$$T_n = A_f T_s \tag{5.3.1}$$
where Af is the acceleration factor. Lifetime is accelerated when Af > 1 and decelerated if Af < 1. For example, if the lifetime at the stressed level is exponentially distributed with failure rate λs, then the reliability at stress s is
$$R_s(t) = \exp(-\lambda_s t) \tag{5.3.2}$$
Let Rn(t) be the reliability at the normal stress sn. Then it can be extrapolated as
$$R_n(t) = R_s\!\left(\frac{t}{A_f}\right) = \exp\!\left(-\frac{\lambda_s t}{A_f}\right) \tag{5.3.3}$$
Furthermore, the probability density function (PDF) and the hazard rate function at sn can also be obtained as
$$f_n(t) = \frac{\lambda_s}{A_f} \exp\!\left(-\frac{\lambda_s t}{A_f}\right) \tag{5.3.4}$$
$$h_n(t) = \frac{\lambda_s}{A_f} \tag{5.3.5}$$
Equation (5.3.5) shows that the hazard rate (or failure rate) at the normal condition is scaled down by a factor of Af, which echoes the definition of the acceleration factor in Eq. (5.3.1).
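As a numerical sketch of the exponential case, the snippet below extrapolates from the stressed condition to normal use, assuming Tn = Af Ts so that Rn(t) = exp(−λs t/Af) and the use‐level hazard is λs/Af. The function names and the values of λs and Af are hypothetical.

```python
import math


def reliability_use_exponential(t, lambda_s, af):
    """Reliability at normal use when life at stress is exponential with rate lambda_s."""
    return math.exp(-lambda_s * t / af)


def hazard_use_exponential(lambda_s, af):
    """Constant hazard rate at normal use: the stressed rate scaled down by af."""
    return lambda_s / af


lambda_s = 0.002  # hypothetical failure rate at stress, failures per hour
af = 25.0         # hypothetical acceleration factor

r_1000 = reliability_use_exponential(1000.0, lambda_s, af)  # exp(-0.08)
mttf_use = 1.0 / hazard_use_exponential(lambda_s, af)       # mean life at normal use, hours
```

With these inputs the mean life at normal use is Af times the mean life at stress (500 h), which is the scaling stated by the acceleration factor definition.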
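In this section, we extend the exponential AFT model to the two‐parameter Weibull lifetime distribution. Let θs and βs be the Weibull scale and shape parameters under the accelerated condition. Then the reliability function under the stress condition is
$$R_s(t) = \exp\!\left[-\left(\frac{t}{\theta_s}\right)^{\beta_s}\right] \tag{5.3.6}$$
Given Rs(t), we can derive the reliability, PDF, and hazard rate in normal use; they are given as follows:
$$R_n(t) = R_s\!\left(\frac{t}{A_f}\right) = \exp\!\left[-\left(\frac{t}{A_f \theta_s}\right)^{\beta_s}\right] \tag{5.3.7}$$
$$f_n(t) = \frac{\beta_s}{A_f \theta_s}\left(\frac{t}{A_f \theta_s}\right)^{\beta_s - 1} \exp\!\left[-\left(\frac{t}{A_f \theta_s}\right)^{\beta_s}\right] \tag{5.3.8}$$
$$h_n(t) = \frac{\beta_s}{A_f \theta_s}\left(\frac{t}{A_f \theta_s}\right)^{\beta_s - 1} \tag{5.3.9}$$
Let θn and βn be, respectively, the scale and shape parameters in normal use. Based on Eqs. (5.3.7–5.3.9), it is also easy to realize that
$$\theta_n = A_f \theta_s, \qquad \beta_n = \beta_s \tag{5.3.10}$$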
The above results are based on a common assumption that the applied stress only influences the scale parameter, but not the shape parameter of the distribution (Escobar and Meeker 2006).
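The Weibull scaling can be checked numerically. This sketch assumes, as stated above, that the stress changes only the scale parameter, so that θn = Af θs and βn = βs. The parameter values and function names are hypothetical.

```python
import math


def weibull_reliability(t, theta, beta):
    """Two-parameter Weibull reliability exp(-(t/theta)^beta)."""
    return math.exp(-((t / theta) ** beta))


def weibull_use_parameters(theta_s, beta_s, af):
    """Map stressed-condition Weibull parameters to normal use under the AFT assumption."""
    return af * theta_s, beta_s


theta_s, beta_s, af = 500.0, 1.8, 12.0  # hypothetical test results
theta_n, beta_n = weibull_use_parameters(theta_s, beta_s, af)

# Reliability at 2000 h in normal use, computed two equivalent ways.
r_use = weibull_reliability(2000.0, theta_n, beta_n)
r_check = weibull_reliability(2000.0 / af, theta_s, beta_s)
```

The two results agree because evaluating the use‐level distribution at t is the same as evaluating the stressed distribution at t/Af.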
The lognormal distribution is widely used for modeling and estimating the failure times of electronics components subject to thermal or electric stresses. These include temperature, voltage, power, and the electromagnetic field. Another application of the lognormal distribution is to model the fracture of the substrate of integrated circuits. The root cause of this failure mechanism is power cycling, which makes the device junction temperature fluctuate and induces mechanical stress because the device packaging materials have different coefficients of thermal expansion. Let μs and σs be the parameters of the lognormal distribution under the stressed condition; then the PDF is
$$f_s(t) = \frac{1}{\sigma_s t \sqrt{2\pi}} \exp\!\left[-\frac{(\ln t - \mu_s)^2}{2\sigma_s^2}\right] \tag{5.3.11}$$
The AF condition in Eq. (5.3.1) can be expressed in logarithmic scale as
$$\ln T_n = \ln A_f + \ln T_s \tag{5.3.12}$$
By substituting Eq. (5.3.12) into Eq. (5.3.11), the PDF at the normal use condition is
$$f_n(t) = \frac{1}{\sigma_s t \sqrt{2\pi}} \exp\!\left[-\frac{(\ln t - \mu_s - \ln A_f)^2}{2\sigma_s^2}\right] = \frac{1}{\sigma_n t \sqrt{2\pi}} \exp\!\left[-\frac{(\ln t - \mu_n)^2}{2\sigma_n^2}\right] \tag{5.3.13}$$
where μn and σn are the parameters of the lognormal distribution under normal use conditions. The following relations hold for the lognormal AFT model:
$$\mu_n = \mu_s + \ln A_f \tag{5.3.14}$$
$$\sigma_n = \sigma_s \tag{5.3.15}$$
The lognormal distribution belongs to the location‐scale distribution family where μn (μs) are called the location parameters that are dependent on stress level s, and σn (σs) are called the scale parameters that are independent of s. Other location‐scale distributions include normal, uniform, and Cauchy distributions.
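For the lognormal case, the shift in the location parameter can be illustrated as follows. The sketch assumes μn = μs + ln Af and σn = σs, and uses the standard normal CDF written with `math.erf`. All numerical values and function names are hypothetical.

```python
import math


def lognormal_cdf(t, mu, sigma):
    """CDF of a lognormal lifetime with log-mean mu and log-standard-deviation sigma."""
    z = (math.log(t) - mu) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


mu_s, sigma_s, af = 5.0, 0.6, 20.0  # hypothetical stressed-condition estimates

mu_n = mu_s + math.log(af)  # location shifts by ln(af)
sigma_n = sigma_s           # scale is unchanged by the stress

median_s = math.exp(mu_s)   # median life at stress
median_n = math.exp(mu_n)   # median life at normal use
```

The median life at normal use is Af times the median life at stress, while the spread of ln T is the same at both conditions.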
While the product is placed in the normal operating environment, increasing the usage can be an effective approach to accelerating the life as well. This differs from HALT and HASS techniques where operating conditions such as temperature, voltage, and vibrations are escalated to higher than the normal operating condition. Usage acceleration can be applied to products subject to intermittent or non‐continuous operations, such as relays, switches, bearings, motors, gearboxes, vehicle tires, washing machines, and air conditioners. The basic assumption of usage acceleration models is that the product useful life should not be affected by the increased rate or cycles of operations during the short time period. This is important because cycling simulates the actual use and, if the cycling frequency is low enough, the test units can return to the steady state prior to the start of the next cycle. As such, the time‐to‐failure distribution is independent of the cycling rate, that is, there is no reciprocity effect. This implies that there exists a linear relation between the accelerated lifetime and the lifetime under normal use, namely
$$T_n = A_f(u_s)\, T_s, \qquad A_f(u_s) = \frac{u_s}{u_n} \tag{5.3.16}$$
where us and un denote the usage (cycling) rates in the accelerated test and in normal use, respectively.
The model in Eq. 5.3.16 is also called the SAFT because the lifetime is proportional to the usage rate. For example, Johnston et al. (1979) observed that the cycles‐to‐failure of a type of insulation material was shortened with the increased alternating current (AC) frequency. The acceleration factor can be estimated as Af(412) = 412/60 ≈ 6.87 when the AC frequency in voltage endurance tests was increased from 60 to 412 Hz by keeping the voltage at the same level.
Ideally, increasing the usage rate should not significantly change the actual use condition of the product. Hence in accelerated usage rate tests, other relevant factors should be identified and controlled to reflect the actual use environment. If the cycling rate is too high, it can cause reciprocity breakdown (Escobar and Meeker 2006). For example, in a power cycling test excessive heat may build up on test units (e.g. microprocessor chips) if the time interval between two consecutive power cycles is too short. This induces the reciprocity breakdown because the cycles‐to‐failure distribution then depends on the cycling rate. Thus, it is necessary to let the test units “cool down” between consecutive operating cycles.
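The usage‐rate acceleration factor from the insulation example can be reproduced in a few lines. The sketch assumes that reciprocity holds, so the factor is simply the ratio of the test usage rate to the normal usage rate. The function name and the 1500‐hour test life are hypothetical.

```python
def usage_rate_af(test_rate, use_rate):
    """Acceleration factor for usage-rate acceleration, assuming no reciprocity breakdown."""
    if test_rate <= 0 or use_rate <= 0:
        raise ValueError("usage rates must be positive")
    return test_rate / use_rate


# AC frequency raised from 60 Hz to 412 Hz at the same voltage.
af = usage_rate_af(412.0, 60.0)

# Hypothetical: a unit that survives 1500 h in the accelerated test
# corresponds to af * 1500 h of life at the normal usage rate.
life_use_hours = af * 1500.0
```

If heat build‐up or another rate‐dependent effect is present, this ratio no longer applies and the cycling rate has to be treated as a stress in its own right.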
Most mechanical systems endure repeated loads, and the magnitude of the loads may exceed the upper or lower limits of the material strength. The stresses above or below these limits are called critical stresses. Typical examples include wind turbine blades, suspension springs, aircraft wings, and gearboxes, among others. Fatigue is generated by cyclic stresses beyond the critical values, especially the upper limit. The system eventually breaks down as the result of accumulated fatigue damage.
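Figure 5.10 shows a typical cyclic load profile that fluctuates between the upper and lower limits. It is assumed that there is no fatigue effect as long as the stress does not exceed the critical levels. The fatigue lifetime of an item subject to varying stress can be estimated using Miner's rule, according to which failure occurs when the accumulated damage reaches one. This is expressed as
$$\sum_{i=1}^{k} \frac{n_i}{N_i} = 1$$
where
ni = the number of cycles applied at stress level i
Ni = the number of cycles to failure at stress level i
k = the number of stress levels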
The values of Ni for i = 1, 2, …, k can be obtained from the so‐called stress‐cycle curve, or S–N curve. Figure 5.11 depicts the empirical relationship between stress and cycles to fatigue. An important assumption is that the life is infinite below the fatigue limit stress Lc, while the system fails immediately if the maximum stress Lmax is imposed. When the applied stress varies between Lc and Lmax, it induces material fatigue, leading to a failure as a result of cumulative damages.
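Miner's rule lends itself to a short calculation. The sketch below assumes the linear damage form in which the damage fractions ni/Ni are summed over the stress levels and failure is predicted when the sum reaches one. The cycle counts are hypothetical and are not read from any real S–N curve.

```python
def miner_damage(cycles_applied, cycles_to_failure):
    """Cumulative damage sum(n_i / N_i) under Miner's linear damage rule."""
    if len(cycles_applied) != len(cycles_to_failure):
        raise ValueError("need one N_i for each n_i")
    return sum(n / big_n for n, big_n in zip(cycles_applied, cycles_to_failure))


# Hypothetical duty block with three stress levels between Lc and Lmax.
n_applied = [2.0e4, 5.0e3, 1.0e3]   # cycles applied per block at each level
n_failure = [1.0e5, 2.0e4, 5.0e3]   # cycles to failure at each level (from an S-N curve)

damage_per_block = miner_damage(n_applied, n_failure)  # 0.20 + 0.25 + 0.20
blocks_to_failure = 1.0 / damage_per_block             # blocks until damage reaches 1
```

Stress levels below the fatigue limit Lc would contribute no damage and can be left out of the sum.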
The Arrhenius acceleration model belongs to physics‐experimental‐based models. Proposed by Svante Arrhenius in 1889, the Arrhenius equation is an empirical formula for describing the temperature dependence of reaction rates (Arrhenius 1889). It has been recognized as one of the earliest and most effective acceleration models to predict how the time‐to‐failure changes with the imposed temperature stress. The model has been widely used for failure mechanisms that depend on diffusion processes, migration processes, or chemical reactions. Hence, it covers many of the non‐mechanical (or non‐material fatigue) failure modes that are responsible for the dysfunction of electronic circuits or devices. The Arrhenius model takes the following form:
$$t_f(T) = A \exp\!\left(\frac{E_a}{kT}\right) \tag{5.4.1}$$
where
tf(T) = the time‐to‐failure at temperature T
T = the temperature measured in kelvin (K)
Ea = the activation energy in units of electronvolts (eV)
k = Boltzmann's constant with k = 8.617 × 10⁻⁵ eV/K
Parameter A is a non‐thermal constant called the scaling factor. The value of Ea depends on the materials of the product and the failure mechanism. Typically it is in a range between 0.3 and 0.4 eV, but can go up to 1.5 eV or higher. The Arrhenius model argues that for reactants to transform into products in a chemical process, they must first acquire a minimum amount of energy, called the activation energy Ea. The concept of activation energy explains the exponential relationship between the reaction rate and the fraction of molecules having kinetic energy larger than Ea. The latter can be calculated from statistical mechanics.
Under different A and Ea values, Figure 5.12 plots three cases to show how the time‐to‐failure decreases with the elevation of temperature. Note that the vertical axis represents tf(T) in logarithmic scale with base 10. Assume Case 1 as the baseline with A = 0.01 and Ea = 0.3 eV. Comparing Cases 1 and 2 shows that the activation energy plays a critical role in determining the time‐to‐failure (i.e. reliability). If Ea increases from 0.3 to 0.4 eV with the same A = 0.01, the value of tf(T) in Case 2 is over 300 times that of Case 1 at T = 200 K and about 20 times at T = 400 K.
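The ratios quoted for Cases 1 and 2 can be checked numerically. The sketch assumes the Arrhenius form tf(T) = A exp(Ea/(kT)) with Ea in eV and k = 8.617 × 10⁻⁵ eV/K. The function name is hypothetical.

```python
import math

K_BOLTZMANN_EV = 8.617e-5  # Boltzmann's constant in eV/K


def arrhenius_ttf(temp_k, scale_a, ea_ev):
    """Time-to-failure under the assumed Arrhenius form A * exp(Ea / (k * T))."""
    return scale_a * math.exp(ea_ev / (K_BOLTZMANN_EV * temp_k))


# Case 2 (Ea = 0.4 eV) relative to Case 1 (Ea = 0.3 eV), both with A = 0.01.
ratio_200 = arrhenius_ttf(200.0, 0.01, 0.4) / arrhenius_ttf(200.0, 0.01, 0.3)
ratio_400 = arrhenius_ttf(400.0, 0.01, 0.4) / arrhenius_ttf(400.0, 0.01, 0.3)
```

Because the scaling factor A cancels in the ratio, the comparison depends only on the difference in activation energy and on the temperature, and the effect of Ea weakens as the temperature rises.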
When dealing with the degradation of organic materials, such as food, plastics, pharmaceuticals, paints, and coatings, it is more common to replace the Boltzmann constant k with the universal gas constant R. Then the Arrhenius model becomes
$$t_f(T) = A \exp\!\left(\frac{E_a}{RT}\right)$$
The gas constant is commonly given as R = 8.31446 J/(mol K), in units of joules per mole kelvin. In this formula, Ea is the activation energy in units of joules per mole (J/mol). According to Escobar and Meeker (2006), the corresponding Arrhenius acceleration factor is
$$A_f = \frac{t_f(T_n)}{t_f(T_s)} = \exp\!\left[\frac{E_a}{R}\left(\frac{1}{T_n} - \frac{1}{T_s}\right)\right]$$
with k in place of R when Ea is expressed in electronvolts, and
where Tn and Ts are the temperature in Kelvin in the use and the stressed condition, respectively.
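A short calculation shows how the acceleration factor responds to temperature. The sketch assumes the acceleration factor is the ratio of Arrhenius times‐to‐failure at the use and stressed temperatures, Af = exp[(Ea/k)(1/Tn − 1/Ts)], with Ea in eV. The activation energy and the two temperatures are hypothetical.

```python
import math

K_BOLTZMANN_EV = 8.617e-5  # Boltzmann's constant in eV/K


def arrhenius_af(ea_ev, temp_use_k, temp_stress_k):
    """Arrhenius acceleration factor between a use and a stressed temperature (kelvin)."""
    if temp_use_k <= 0 or temp_stress_k <= 0:
        raise ValueError("temperatures must be absolute (kelvin) and positive")
    return math.exp((ea_ev / K_BOLTZMANN_EV) * (1.0 / temp_use_k - 1.0 / temp_stress_k))


# Hypothetical test: use at 30 degC (303.15 K), stress at 85 degC (358.15 K), Ea = 0.7 eV.
af = arrhenius_af(0.7, 303.15, 358.15)

# No acceleration when the test runs at the use temperature.
af_same = arrhenius_af(0.7, 303.15, 303.15)
```

Temperatures must be converted to kelvin before use; entering degrees Celsius directly would overstate the acceleration considerably.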
It is worth mentioning that the Arrhenius model is not applicable to all thermal acceleration problems. A main reason is that both the activation energy Ea and the rate constant k are experimentally determined. They represent macroscopic parameters that are not truly related to threshold energy and the success of individual molecular collisions. Rather, macroscopic measurements are the result of a group of molecular collisions with different parametric scenarios. Nevertheless, the Arrhenius model has proven to be satisfactory in many different applications.
The Eyring model has its theoretical basis in chemical reactions and quantum mechanics. It is a physics‐experimental‐based acceleration model. If an aging process, such as chemical reaction, corrosion, diffusion, or metal migration, is driving the reliability degradation toward failure, the Eyring model is able to capture how the rate of degradation varies with the imposed stress. Unlike the Arrhenius model, the Eyring model can be used to model life acceleration under multiple stresses, including temperature, voltage, current, and other relevant stresses. A two‐stress Eyring model accommodating temperature and one non‐thermal stress takes the following form (NIST 2017):
where
As with the Arrhenius model, k is the Boltzmann constant and Ea is the activation energy. Parameters A, B, and m determine the acceleration effect between stressed and use conditions, and C determines the combined effect of stresses T and S. Applications in the literature have typically used a fixed value of m ranging from m = 0 (Boccaletti et al. 1989), m = 0.5 (Klinger 1991a,b), to m = 1 (Mann et al. 1974; Nelson 1990).
If one compares Eq. (5.5.1) with Eq. (5.4.1), the temperature terms are very similar between them, explaining why the Arrhenius model has been so effective in establishing the connection between the parameter Ea and the quantum theory concept of “activation energy needed to cross an energy barrier and initiate a reaction” (NIST 2017). The general Eyring model accommodating temperature and n non‐thermal stress factors can be expressed as
where S = {S1, S2, …, Sn} is the non‐thermal stress vector and Bi and Ci are the parameters associated with stress Si for i = 1, 2, …, n. In the general Eyring model there exist terms characterizing the interactions between the temperature and non‐thermal stresses. This means that the effect of changing temperature on the lifetime depends on the levels of other stresses. In models with no stress interaction, the acceleration factors can be computed separately for each stress and then multiplied together. This would not be the case if the interaction terms are required by the underlying physical mechanism, as with the temperature and other stresses in the general Eyring model.
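The inverse power law (IPL) model is commonly used for non‐thermal accelerated stress tests and possesses the following form (Yang 2007; Elsayed 2012):
$$L(V) = \frac{1}{K V^{n}} \tag{5.5.3}$$
where
L(V) = the life of the product at stress level V
V = the applied stress level
K, n = the model parameters to be estimated, with K > 0
The IPL in Eq. (5.5.3) can also be expressed in logarithmic scale as follows:
$$\ln L = -\ln K - n \ln V$$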
The linear relation between ln V and ln L is appealing because −ln (K) now becomes the intercept and −n is the slope. Both ln (K) and n can be directly estimated from the log‐scale plot. In Figure 5.14, the IPL model is plotted in log‐scale with base 10 in three cases: {K = 0.001, n = 1}, {K = 0.0001, n = 1}, and {K = 0.0001, n = 1.7}, respectively. A common observation is that lifetime decreases with the increased stress level. Given the same initial life at normal condition, a larger n implies that the product life decreases faster (see Cases 2 and 3). The value of K determines the product lifetime at the normal condition (see Cases 1 and 2) and a smaller K implies a longer life.
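The log‐scale estimation can be sketched with an ordinary least‐squares line through the points (ln V, ln L). The code assumes the IPL form L(V) = 1/(K V^n), so that the fitted slope equals −n and the intercept equals −ln K. The data are synthetic, generated from the Case 3 parameters, and the function names are hypothetical.

```python
import math


def fit_ipl(stresses, lives):
    """Estimate (K, n) of L = 1 / (K * V**n) by least squares on ln L versus ln V."""
    x = [math.log(v) for v in stresses]
    y = [math.log(life) for life in lives]
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return math.exp(-intercept), -slope  # K = exp(-intercept), n = -slope


def ipl_af(v_use, v_stress, n):
    """Acceleration factor L(v_use) / L(v_stress) implied by the IPL."""
    return (v_stress / v_use) ** n


# Synthetic, noise-free lives generated from K = 0.0001 and n = 1.7.
true_k, true_n = 1.0e-4, 1.7
volts = [10.0, 20.0, 40.0, 80.0]
lives = [1.0 / (true_k * v ** true_n) for v in volts]

k_hat, n_hat = fit_ipl(volts, lives)
af_10_to_40 = ipl_af(10.0, 40.0, n_hat)
```

With real test data the points scatter around the line, and the least‐squares estimates serve mainly as starting values for a likelihood‐based fit.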
Let Vn and Vs be the stresses in normal and accelerated conditions, respectively. The acceleration factor for the IPL is
$$A_f = \frac{L(V_n)}{L(V_s)} = \left(\frac{V_s}{V_n}\right)^{n}$$
The PDF of the IPL‐exponential model takes the following form:
$$f(t; V) = K V^{n} \exp\!\left(-K V^{n} t\right)$$
Note that this is a two‐parameter {K, n} distribution model. The reliability function and the failure rate of the model are also obtained as follows:
$$R(t; V) = \exp\!\left(-K V^{n} t\right), \qquad h(t; V) = K V^{n}$$
The failure rate is indeed constant (time‐invariant). This result is consistent with the expected lifetime model in Eq. (5.5.3) because h(t) = 1/L(V) under the exponential distribution. Both K and n can be estimated through the maximum likelihood estimation (MLE) method. The log‐likelihood function is given as follows:
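$$\Lambda = \sum_{i=1}^{F_e} N_i \ln\!\left[K V_i^{n} \exp\!\left(-K V_i^{n} T_i\right)\right] - \sum_{i=1}^{S} N_i' K (V_i')^{n} T_i' + \sum_{i=1}^{F_I} N_i'' \ln\!\left[\exp\!\left(-K (V_i'')^{n} T_{Li}''\right) - \exp\!\left(-K (V_i'')^{n} T_{Ri}''\right)\right]$$
where
Fe = the number of groups of exact time‐to‐failure data
Ni = the number of times‐to‐failure in the ith time‐to‐failure group
Vi = the stress level of the ith group
Ti = the exact failure time of the ith group
S = the number of groups with suspended data points
N′i = the number of suspensions in the ith group of suspension data points
V′i = the stress level of the ith suspension data group
T′i = the testing time of the ith suspension data group
FI = the number of interval data groups
N″i = the number of intervals in the ith group of data intervals
V″i = the stress level of the ith interval data group
T″Li = the beginning of the ith interval
T″Ri = the ending of the ith interval
The solution to K and n is found by solving the following equation system:
$$\frac{\partial \Lambda}{\partial K} = 0, \qquad \frac{\partial \Lambda}{\partial n} = 0$$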
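For a data set with exact failure times only (no suspensions or intervals), the estimation can be sketched without a numerical library. The code assumes the IPL‐exponential density f(t; V) = K V^n exp(−K V^n t). Instead of solving the two score equations simultaneously, it uses a profile‐likelihood shortcut: for a fixed n the maximizing K is available in closed form, and the remaining one‐dimensional problem in n is solved by golden‐section search. The data and all function names are hypothetical.

```python
import math


def log_likelihood(k, n, data):
    """Log-likelihood of exact failures; data is a list of (stress, failure_time)."""
    return sum(math.log(k) + n * math.log(v) - k * v ** n * t for v, t in data)


def k_given_n(n, data):
    """Closed-form maximizer of the log-likelihood in K for a fixed n."""
    return len(data) / sum(v ** n * t for v, t in data)


def fit_ipl_exponential(data, n_lo=0.0, n_hi=5.0, tol=1e-8):
    """Maximize the profile log-likelihood in n by golden-section search."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0

    def profile(n):
        return log_likelihood(k_given_n(n, data), n, data)

    lo, hi = n_lo, n_hi
    while hi - lo > tol:
        left = hi - ratio * (hi - lo)
        right = lo + ratio * (hi - lo)
        if profile(left) > profile(right):
            hi = right  # the maximizer lies in [lo, right]
        else:
            lo = left   # the maximizer lies in [left, hi]
    n_hat = 0.5 * (lo + hi)
    return k_given_n(n_hat, data), n_hat


# Hypothetical test: three units at 100 V (mean life 1000 h), three at 200 V (mean life 250 h).
failures = [(100.0, 900.0), (100.0, 1100.0), (100.0, 1000.0),
            (200.0, 240.0), (200.0, 260.0), (200.0, 250.0)]
k_hat, n_hat = fit_ipl_exponential(failures)
```

With only two stress levels the fitted failure rates match the reciprocal of the mean life at each level, so doubling the voltage and cutting the mean life by a factor of four gives n close to 2. The search interval for n is an assumption and should be widened if the estimate lands on a boundary.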
Cox (1972) proposed the proportional hazard model (PHM). It is a semiparametric model and has been widely accepted for the analysis of failures with covariates. In statistics, a semiparametric model is a statistical model that has parametric and non‐parametric components. PHM is built upon the hazard rate function, assuming that the hazard rate h(t; s) under the covariate s is the product of an unspecified baseline hazard rate h0(t) and a relative weight, exp(βTs), where β is the regression coefficient vector. A covariate in accelerated life testing represents the stresses imposed, such as temperature, voltage, and force. The PHM model can be expressed as:
$$h(t; \mathbf{s}) = h_0(t) \exp\!\left(\boldsymbol{\beta}^{T} \mathbf{s}\right) \tag{5.6.1}$$
where
s = [s1, s2, …, sn] is the covariate vector with n being the number of covariates
β = [β1, β2, …, βn] is the regression coefficient vector
The model has been successfully used for survival analysis in medical areas (O'Quigley 2008) and for reliability forecasting in accelerated life testing (Elsayed and Chan 1990; Elsayed and Jiao 2002). If we are interested in the time‐to‐failure T, the cumulative distribution function (CDF) is given by

$$F(t; \mathbf{s}) = 1 - \exp\left[-\exp\left(\boldsymbol{\beta}^{T}\mathbf{s}\right)\int_0^t h_0(u)\,du\right]$$
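In this semiparametric model, the parameter has both a finite‐dimensional component and an infinite‐dimensional component. Note that {β, h0(u)} are the model parameters, where β is finite‐dimensional and is of interest, and h0(u) is an unknown non‐negative function of time. The collection of possible candidates for h0(u) is infinite‐dimensional. If the analytical form of h0(t) is known, conventional MLE can be employed to estimate the regression coefficient vector β. What makes the PHM so attractive, however, is that β can also be estimated without specifying h0(t), by maximizing the corresponding partial likelihood function. Cox (1972) developed this non‐parametric method, called partial likelihood estimation, to estimate the covariate parameters. With k observed failure times t(1) < t(2) < … < t(k) and R(t(i)) denoting the set of units still at risk just before t(i), the partial likelihood is given as

$$L(\boldsymbol{\beta}) = \prod_{i=1}^{k} \frac{\exp\left(\boldsymbol{\beta}^{T}\mathbf{s}_{(i)}\right)}{\sum_{r \in R(t_{(i)})} \exp\left(\boldsymbol{\beta}^{T}\mathbf{s}_{r}\right)}$$

or, expressed in the logarithmic form,

$$\ln L(\boldsymbol{\beta}) = \sum_{i=1}^{k}\left[\boldsymbol{\beta}^{T}\mathbf{s}_{(i)} - \ln \sum_{r \in R(t_{(i)})} \exp\left(\boldsymbol{\beta}^{T}\mathbf{s}_{r}\right)\right]$$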
where s(i) is the regressor variable associated with the testing samples or items that failed at t(i). The index r refers to the units under test at t(i). Estimation of the parameter values is then obtained by use of the maximum partial likelihood estimation (MPLE). The partial likelihood function is derived by taking the product of the conditional probability of a failure at time ti, given the number of items that are at risk of failure at that time. Below we use a two‐level stress experiment to illustrate how to estimate the regression coefficients based on Eq. (5.6.3).
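A two‐level stress experiment of the kind mentioned above can be sketched as follows. The failure times are hypothetical, the covariate is coded s = 0 for the low stress and s = 1 for the high stress, and ties and censoring are assumed absent. Because the log partial likelihood is concave in β, its derivative (the score) is decreasing, so a simple bisection on the score is used here in place of a general optimizer.

```python
import math

def log_partial_likelihood(beta, data):
    """Log partial likelihood for one covariate; data = [(failure_time, s), ...]."""
    ordered = sorted(data)
    total = 0.0
    for i, (_, s_i) in enumerate(ordered):
        risk_set = ordered[i:]  # units still on test just before the i-th failure
        total += beta * s_i - math.log(sum(math.exp(beta * s) for _, s in risk_set))
    return total

def score(beta, data):
    """Derivative of the log partial likelihood with respect to beta."""
    ordered = sorted(data)
    total = 0.0
    for i, (_, s_i) in enumerate(ordered):
        risk_set = ordered[i:]
        weights = [math.exp(beta * s) for _, s in risk_set]
        weighted_s = sum(w * s for w, (_, s) in zip(weights, risk_set))
        total += s_i - weighted_s / sum(weights)
    return total

def fit_beta(data, lo=-10.0, hi=10.0, iterations=100):
    """Maximum partial likelihood estimate of beta by bisection on the score."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if score(mid, data) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical failure times (hours): four units per stress level
data = [(55, 0), (80, 0), (120, 0), (160, 0),
        (20, 1), (35, 1), (60, 1), (90, 1)]

beta_hat = fit_beta(data)
print(f"beta_hat = {beta_hat:.3f}, hazard ratio = {math.exp(beta_hat):.2f}")
```

A positive estimate of β indicates that the high stress level multiplies the baseline hazard by exp(β) > 1, without any assumption on the shape of h0(t).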
In certain semi‐parametric acceleration models, the form of degradation path or lifetime distribution is specified or at least partially specified (Kobbacy et al. 1997, Wang and Kececioglu 2000). Depending on the actual distributions, these models can be classified as: (i) Weibull PHM; (ii) logistic regression model; and (iii) log‐logistic regression model. These models will be elaborated in the next three sections.
When h0(t) in Eq. (5.6.1) assumes a Weibull baseline hazard function, the PHM is referred to as the Weibull proportional hazard model (WPHM). The mathematical form is as follows (Jardine et al. 1987):
By maximizing this likelihood function, the regression coefficients and the baseline hazard rate parameters {β, η} in the model are estimated. Gorjian et al. (2009) discussed the advantages and limitations of the Weibull PH model. The key advantages include:
According to Gorjian et al. (2009), two main limitations of the WPHM are:
The logistic regression model, also called a logit model, is an ordinal regression model that is a special case of a proportional odds model (POM). It is a regression model for ordinal dependent variables and was first considered by McCullagh (1980). A logistic regression model is usually adopted to relate the probability of an event to a set of covariates. This concept can be used in degradation analysis. If the current degradation features are s(t), the odds ratio between the reliability function R(t; s(t)) and the cumulative distribution function can be defined as follows (Liao et al. 2006):
where both α > 0 and β are the model parameters to be estimated. Therefore, the reliability function can be expressed as
The maximum likelihood estimates of the model parameters can be obtained by maximizing the log‐likelihood function, for example with the Nelder–Mead algorithm.
The advantages and limits of the logistic PH model are summarized as given by Gorjian et al. (2009). The key advantage is that compared with the PHM model, it requires less computation effort to estimate the parameter of the logistic regression model based on its likelihood function. There are two limitations of the model: (i) unlike POM, the model in Eq. (5.6.14) assumes a specific distribution; and (ii) to estimate parameters and evaluate the reliability function, this model takes into account only the current covariates, whereas PHM incorporates both current and historical covariates.
The log‐logistic regression model is a special case of POM in which a log‐logistic distribution is assumed for the failure times. It describes situations in which the hazards of separate samples converge over time, and it therefore provides a linear model for the log odds on survival by any chosen time. The model was developed to overcome some shortcomings of the Weibull distribution in modeling failure time data, in particular its inability to represent time‐varying failure rates with an up‐and‐down profile.
The distribution used frequently in the modeling of survival and failure time data is the Weibull distribution. However, its application is limited by the fact that its hazard must be monotonic (either increasing or decreasing), whatever the values of its parameters. Bennett (1983) claims that the Weibull distribution may be inappropriate where the course of the failure (e.g. disease in individuals) is such that mortality reaches a peak after some finite period and then slowly declines. To characterize this type of non‐monotonic pattern, the following hazard function of a log‐logistic regression model is proposed by Bennett (1983):
where δ is a measure of precision and β is a measure of location. The hazard first increases and then decreases, attaining its maximum value at
The parameters of the log‐logistic regression model can be estimated by maximizing the likelihood function. The ratio of the hazards when a covariate s takes the two values s1 and s2 is given by Bennett (1983):
As t increases, this ratio converges to unity. Gorjian et al. (2009) summarize the advantages and limitations of the log‐logistic PH model. The key advantages are: (i) it is more suitable than a lognormal distribution for the analysis of survival data; (ii) the model is well suited to applications where the hazard reaches a peak after some finite period and then slowly declines; and (iii) unlike the Weibull PH model, the hazards for different samples are not proportional in a log‐logistic regression model; rather, their ratio tends to unity for large t, as shown in Eq. (5.6.18). This property implies that the survival probabilities of different groups of assets or components become more similar as t increases. The main limitation of the model is that it assumes a specified distribution.
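The peaked hazard and the convergence of the hazard ratio to unity can be illustrated numerically. The sketch below does not use Bennett's notation: it assumes a standard log‐logistic baseline with scale alpha and shape delta, and a proportional‐odds multiplier psi = exp(βs) applied to the failure odds, F/(1 − F) = psi·(t/alpha)^delta. All parameter values are hypothetical.

```python
def loglogistic_hazard(t, delta, alpha, psi=1.0):
    """Hazard when the failure odds are F/(1-F) = psi * (t/alpha)**delta."""
    x = psi * (t / alpha) ** delta
    return (delta / t) * x / (1.0 + x)

def peak_time(delta, alpha, psi=1.0):
    """Time at which the hazard is largest; a peak exists only for delta > 1."""
    return alpha * ((delta - 1.0) / psi) ** (1.0 / delta)

def hazard_ratio(t, delta, alpha, psi_1, psi_2):
    """Ratio of the hazards of two groups with odds multipliers psi_1 and psi_2."""
    return loglogistic_hazard(t, delta, alpha, psi_1) / loglogistic_hazard(t, delta, alpha, psi_2)

delta, alpha = 2.0, 100.0   # hypothetical shape and scale
psi_1, psi_2 = 2.0, 1.0     # hypothetical odds multipliers of two stress groups

print("baseline hazard peaks at t =", peak_time(delta, alpha))
for t in (1.0, 10.0, 100.0, 1000.0, 10000.0):
    print(f"t = {t:>8.0f}  hazard ratio = {hazard_ratio(t, delta, alpha, psi_1, psi_2):.4f}")
```

Under these assumptions the ratio starts near psi_1/psi_2 for small t and approaches one for large t, which is the non‐proportional behaviour described in advantage (iii).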
HASS stands for highly accelerated stress screening and is treated as the production equivalent of HALT. It is a post‐production test activity that covers 100% of finished units. The stress intensity is typically half that used in accelerated life testing. The stress levels in HASS are near the operating limits, creating enough fatigue to precipitate latent or hidden failures caused by poor workmanship. HASS thus prevents faulty units from reaching higher levels of assembly or the end customer. HASS can also be treated as an aggressive version of ESS in which the thermal and vibration stresses are typically set near the limit of the product specification (see Figure 5.1).
Manufacturers dealing with “fast‐to‐market” products are often under tremendous pressure to gain market share, ward off competition, and meet reliability requirements. In the product development and prototype stage, an intensive and prolonged reliability growth test (e.g. HALT) is often infeasible because of a compressed design schedule. HASS can be considered as an alternative approach to improving reliability under fast‐to‐market pressure. To implement HASS, it is important to convince the management team that the return on investment of HASS, though it may extend product delivery time, could exceed that of shipping products without HASS. Below we present a framework to estimate the benefit of HASS in terms of reliability growth and financial savings for a new product introduction.
Figure 5.15 shows a typical service flow diagram for repairable systems during the product introduction period. New products are produced in the factory, shipped, and installed at customer sites. To assess the effectiveness of HASS, all shipped products are divided into two groups: one that is subjected to the HASS process and one that is not (referred to as “Non‐HASS” hereinafter). Products of both groups are continuously released into the market. Further assume that the manufacturer negotiates with the customers to achieve a desired mean‐time‐between‐failure (MTBF) over a promised period of time (e.g. six months or one year). Failed units are returned and repaired in the factory's repair shop, while a good unit from the inventory is delivered to replace the field failure. Let the suffixes “H” and “NH” indicate the HASSed and non‐HASSed groups. Assuming that the failure intensity rate follows the Crow/AMSAA model, we have

$$u_{H}(t) = \lambda_1 \beta_1 t^{\beta_1 - 1}$$

$$u_{NH}(t) = \lambda_2 \beta_2 t^{\beta_2 - 1}$$
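where {λ1, β1} and {λ2, β2} are the model parameters of the HASSed and non‐HASSed groups, respectively. Based on the respective failure intensity rates of the product groups with and without HASS, the rate difference is obtained as follows:

$$\Delta u(t) = u_{NH}(t) - u_{H}(t) = \lambda_2 \beta_2 t^{\beta_2 - 1} - \lambda_1 \beta_1 t^{\beta_1 - 1} \tag{5.7.3}$$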
Considering that HASS eliminates infant mortality, we would expect the difference between HASSed and non‐HASSed groups to be positive, i.e. Δu(t) > 0 in Eq. (5.7.3).
Table 5.4 Failure intensity of HASSed and non‐HASSed products.
Time (t) | Single PCB, HASSed | Single PCB, non‐HASSed | Group PCB, HASSed | Group PCB, non‐HASSed
50 | 0.0153 | 0.0262 | 3.83 | 2.62
100 | 0.0159 | 0.0301 | 3.97 | 3.01
200 | 0.0164 | 0.0346 | 4.11 | 3.46
300 | 0.0168 | 0.0375 | 4.19 | 3.75
400 | 0.0170 | 0.0398 | 4.25 | 3.98
500 | 0.0172 | 0.0416 | 4.30 | 4.16
600 | 0.0173 | 0.0431 | 4.34 | 4.31
700 | 0.0175 | 0.0445 | 4.37 | 4.45
800 | 0.0176 | 0.0457 | 4.40 | 4.57
900 | 0.0177 | 0.0468 | 4.43 | 4.68
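The single‐PCB columns of Table 5.4 can be reproduced with the Crow/AMSAA intensity u(t) = λβt^(β−1). The parameter values in the sketch below are not stated in the text; they were back‐fitted from the tabulated values and should be read as an assumption, as should the function name.

```python
def crow_amsaa_intensity(t, lam, beta):
    """Crow/AMSAA failure intensity u(t) = lam * beta * t**(beta - 1)."""
    return lam * beta * t ** (beta - 1.0)

# Assumed parameters, back-fitted from the single-PCB columns of Table 5.4
HASSED = (0.012, 1.05)       # (lambda_1, beta_1)
NON_HASSED = (0.010, 1.20)   # (lambda_2, beta_2)

print("t    u_H     u_NH    delta_u")
for t in (50, 100, 200, 400, 500, 900):
    u_h = crow_amsaa_intensity(t, *HASSED)
    u_nh = crow_amsaa_intensity(t, *NON_HASSED)
    # Rate difference between non-HASSed and HASSed units
    print(f"{t:<4d} {u_h:.4f}  {u_nh:.4f}  {u_nh - u_h:.4f}")
```

Under these assumed parameters the per‐board rate difference is positive at every tabulated time and widens with t, because the non‐HASSed intensity grows faster.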
In general, all business activities are profit‐driven, and implementing HASS is no different from any other business decision: it must be financially justifiable from the management team's perspective. It is relatively easy to estimate the cost of the HASS process, and HASSed and non‐HASSed products can then be compared in terms of cost savings. Assumptions for developing the financial model are made as follows:
There are four major cost items associated with HASS implementation: labor, material, facility, and opportunity costs. Material costs refer to the consumables (e.g. liquid nitrogen, electricity), replacement components, and other incidental materials used. Investment in test beds such as environmental chambers, together with their depreciation (or lease or rental), is assumed to be part of the facility costs. Opportunity cost refers to the loss of goodwill because a HASS test is likely to postpone the product delivery time. Other expenses such as training and documentation costs may also arise, but for model simplicity they are assumed to be negligible and are not included. Therefore, we have
where
The manufacturer incurs the cost of field failures that take place within the warranty period. There are three major cost components of field failures: repair costs (Cr), shipping/logistics costs (Cs), and inventory holding costs (Ch). Repair costs typically consist of material cost, labor cost, and overhead cost at the repair shop. Shipping costs cover the delivery of spare parts to the customer site for replacement and the return of defective items. Inventory holding costs include the cost of holding the spare parts in the stockroom and the defective items in the repair pipeline. During the time interval [t, t + τ], the cost savings CSAV on account of HASS can be expressed as
where
Notice that k = 1 is for HASSed and k = 2 is for non‐HASSed products. If, during the interval [t, t + τ], a duration τ0 (τ0 ≤ τ) is used for HASS, then to justify the continuation of HASS the factory repair cost must be higher than the HASS cost for the relevant period. By comparing Eqs. (5.7.4) and (5.7.5), we have
It then follows that, if Cf and Cl are incurred as fixed costs rather than variable costs, they have to be accounted for accordingly and the use of τ0 (τ0 ≤ τ) has no meaning. In other words, the idle time of resources set aside for repairing field failures needs to be absorbed in the factory repair costs. In effect, we would use τ instead of τ0 as the factor to compute the variable cost component of repairs, while accounting for the prorated fixed costs over the relevant period τ as follows:
Company ABC designs and markets high‐end testing equipment for wafer probing and device test in the semiconductor manufacturing sector. A system is usually configured with 30–40 PCB modules depending on the functional requirements. While ABC designs the PCBs, the manufacturing of these boards is subcontracted to external suppliers. Upon receipt at ABC's assembly factory, each board undergoes optional testing, system configuration, and system test before customer shipment and installation. If a PCB fails in any one of the in‐house processes, it is returned to the repair shop for root‐cause analysis. When fixed, the board is routed to “Incoming Stock” again. After the system is shipped and installed in the field, it is tested one more time. If a PCB fails in the field test, it is returned to the repair shop. When fixed, it is routed to “Incoming Stock” as well. The flowchart in Figure 5.16 shows the subcontract manufacturing process, system configuration, in‐house testing, field installation, and final test of new systems.
Jin et al. (2011) proposed a closed‐loop HASS program with the objective of implementing reliability growth initiatives in a distributed manufacturing and service chain. As shown in Figure 5.17, the proposed program consists of six functional blocks that form a Six Sigma reliability control scheme through DMAIC: Define, Measure, Analyse, Improve, and Control. DMAIC is a data‐driven improvement strategy to optimize and stabilize the business processes of a new product introduction, and it is the core tool used to orchestrate and guide Six Sigma projects in manufacturing industries. Tang et al. (2007) emphasize the importance of incorporating operations research and management science techniques to enhance the effectiveness of the DMAIC methodology.
The Six Sigma program was motivated by customer satisfaction, but the potential cost savings resulting from the improved process are the actual incentive. The resources involved in a repair process include shipping logistics, labor, materials, and testing facilities. To assess the cost benefit of the Six Sigma project, a Monte Carlo simulation program was developed to estimate the cost savings, assuming that the infant mortality rate (IMR) could be reduced from the current 14% to 7% or 3%. The results are summarized in Table 5.5.
Table 5.5 Potential cost savings of reducing infant mortality rates (IMR) (unit: $).
Source: Reproduced with permission of Elsevier.
Target rate | Statistic | T = 3 months | T = 6 months | T = 9 months | T = 12 months
IMRa = 7% | Average | 358 209 | 716 613 | 1 074 189 | 1 429 820
IMRa = 7% | Standard deviation | 63 692 | 90 500 | 110 737 | 129 480
IMRa = 7% | Optimistic | 462 972 | 865 473 | 1 256 336 | 1 642 796
IMRa = 7% | Pessimistic | 253 445 | 567 754 | 892 043 | 1 216 845
IMRa = 3% | Average | 564 734 | 1 128 324 | 1 696 137 | 2 259 437
IMRa = 3% | Standard deviation | 102 049 | 147 199 | 180 477 | 206 932
IMRa = 3% | Optimistic | 732 590 | 1 370 445 | 1 992 996 | 2 599 810
IMRa = 3% | Pessimistic | 396 878 | 886 202 | 1 399 278 | 1 919 064
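For example, if the IMR is reduced to 7%, the average saving would be $358 209 over three months and $1 429 820 over 12 months. The simulation data allow us to estimate the mean μ and the standard deviation σ of the savings. At the 90% confidence level, the optimistic savings (OS) and the pessimistic savings (PS) over the three‐month period are given as

$$OS = \mu + z_{0.05}\,\sigma, \qquad PS = \mu - z_{0.05}\,\sigma$$

With z0.05 ≈ 1.645, μ = 358 209, and σ = 63 692, this yields, up to rounding, the optimistic and pessimistic values of 462 972 and 253 445 listed in Table 5.5.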
The upper and lower cost savings are very important because they capture the uncertainties of the Six Sigma program in actual implementations. Similar interpretations can be applied to the potential cost savings under IMRa = 3%.
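The optimistic and pessimistic bounds can be recomputed from the simulated mean and standard deviation. The sketch below assumes that the savings are approximately normally distributed and that the 90% level is two‐sided; the function name is an illustrative choice.

```python
from statistics import NormalDist

def savings_interval(mean, std, confidence=0.90):
    """Two-sided normal interval (pessimistic, optimistic) for the cost savings."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # about 1.645 for 90%
    return mean - z * std, mean + z * std

# Three-month figures for IMRa = 7% from Table 5.5
pessimistic, optimistic = savings_interval(358_209, 63_692)
print(f"pessimistic = {pessimistic:,.0f}, optimistic = {optimistic:,.0f}")
```

The results agree with the tabulated bounds to within about one dollar, which suggests that the table was generated in this way from unrounded simulation statistics.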
Effective implementation of Six Sigma programs requires the formation of a cross‐functional team spanning engineering, operations, supply, marketing, and field services. Figure 5.18 shows the hierarchical structure in which members from different departments are selected both vertically and horizontally to form the Six Sigma team. For example, the members from engineering are responsible for eliminating design weaknesses, correcting software bugs, and updating the software versions. Operations engineers are responsible for manufacturing, handling, and installation issues. Eliminating no‐fault‐found (NFF) failures is challenging because NFF returns have many different causes. One approach is for field engineers to work closely with the customer to collect the onsite failure signature and then send it to the repair center along with the defective part. Finally, the marketing engineer designs the financial metrics used to gauge the cost benefit of implementing the Six Sigma program.
Based on the system installation dates and the defective PCB return time, Pareto charts are generated to visualize the early PCB failures, and the results are shown in Figure 5.19. The Pareto chart shows that among all the failures, 34% occurred within 30 days of installation, and 18% of the failures occurred within 10 days of installation. The failure trend strongly supports the hypothesis that the current PCB manufacturing and testing process needs to be re‐examined so as to reduce the IMR.
In this phase, the team's objective is to concentrate on the process map and narrow the process input variables down to the vital few variables, because they have the greatest impact on the early failure rate. Two hypotheses are created as the objective for the controlled experiment:
Two batches of 100 PCBs each were used in the controlled experiment. The first batch underwent thermal/vibration screening and power cycling before being delivered to system configuration. The second batch, after being received from the subcontract manufacturers, went directly into system integration without going through HASS and power cycling tests. To minimize the effects of customer usage, the boards were mixed randomly and configured into new systems shipped to different customers. The reliability of these PCBs can be easily tracked by each board's unique serial number.
After the 30‐day field operation, failure reports of both batches revealed that there were eleven failures among 100 non‐HASS and non‐Power Cycle boards (11%). Only three failures were observed from 100 HASSed and Power Cycled boards (3%). A two‐proportion hypothesis test (Gupta 2004) was performed comparing the two batches of PCB populations. The hypothesis of the test is given below:
The test statistics are
Here p1 and p2 represent the failure rates of the two batches, and p̂1 = x1/n1 and p̂2 = x2/n2 are the corresponding estimates. In our example, x1 = 11, x2 = 3, and n1 = n2 = 100. The two‐proportion hypothesis test showed, with 98% confidence, that the HASSed and power‐cycled population performed better than the population without HASS and power cycling during the first 30 days of operation.
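The test can be reproduced with a few lines of code. The sketch below assumes a pooled‐variance z statistic and a one‐sided alternative (p1 > p2); the text does not state which variant of the two‐proportion test was used, so both choices are assumptions.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-proportion z test of H0: p1 = p2 against H1: p1 > p2."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = (p1_hat - p2_hat) / std_err
    p_value = 1.0 - NormalDist().cdf(z)  # one-sided p-value
    return z, p_value

# 11 failures in 100 non-HASSed boards versus 3 failures in 100 HASSed boards
z, p_value = two_proportion_z_test(11, 100, 3, 100)
print(f"z = {z:.3f}, one-sided p-value = {p_value:.4f}")
```

Under these assumptions the statistic is a little above 2.2 and the one‐sided p‐value is between 0.01 and 0.02, in line with the roughly 98% confidence quoted above.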
Based on the result of the hypothesis test, the team recommended implementing a one‐day HASS process followed by automated power cycling tests at the subcontract manufacturers. Specifically, the improvement plan highlights the following action items:
A control plan is developed to monitor and respond to any issues arising from the key inputs and outputs of the new process. The plan included two core components:
State the difference between testing specifications of HALT and HASS.
where λ is the scale parameter and θ is the shape parameter. Assuming that the FPGA is subject to the HALT test and the relation between the normal and stressed levels are governed by E[Tn] = AfE[Ts], where Tn and Ts are the normal and stressed life and Af (for Af > 1) is the acceleration factor, do the following:
where Ts and Tn are the temperature at the stressed and normal conditions, respectively. Do the following:
Stress level (×106 N/m2) | 6.5 | 8 | 10 |
Mean cycles to failure (×105) | 13 | 11.5 | 7.4 |
Re‐plot subproblems (1) and (2) in logarithmic scale.
Stress (V) | Times to failure |
20 | 11.1, 16.3, 18.5, 24.1 |
36 | 2.2, 3.1, 3.6, 4.7, 4.9, 5.8, 6.1, 6.8, 7.4, 8.2, 9.1, 10.4, 10.7, 12.2, 14.3, 15.2 |
Using maximum likelihood estimation, fit the Weibull inverse power law model to the data and estimate its parameters (also see http://reliawiki.com/index.php/Inverse_Power_Law_Relationship).
The table shows the reliability data of 20 samples in the first three‐month field operation.
Days | 0.5 | 0.75 | 1 | 1.25 | 1.5 | 1.75 | 1.75 | 2 | 2.25 | 2.5 |
Pass | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
Days | 2.75 | 3 | 3.25 | 3.5 | 4 | 4.25 | 4.5 | 4.75 | 5 | 5.5 |
Pass | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
Three HASSed products: 1050, 3700, 7900
Seven non‐HASSed products: 330, 540, 890, 1200, 3400, 5700, 8200
Do the following: