11
Design for Maintainability

11.1 WHAT TO EXPECT FROM THIS CHAPTER

Once a system’s maintainability requirements are known, the system’s properties need to be arranged so that the requirements will be satisfied. Deliberate action is needed to bring the system to a state in which fulfillment of its maintainability requirements is likely rather than merely hoped for. This chapter reviews design-for-maintainability techniques, including

  • Quantitative maintainability modeling,
  • Level of repair analysis (LoRA),
  • Preventive maintenance,
  • Reliability-centered maintenance (RCM).

Each of these is intended to add features, properties, and characteristics to the system’s design that will enhance its ability to be repaired quickly, inexpensively, and with few errors.

11.2 SYSTEM OR SERVICE MAINTENANCE CONCEPT

Try as we may to design for reliability to prevent failures, it is rare that we are completely successful. So when planning a new system, product, or service, it is a good idea to pay attention to how the system, product, or service will be repaired and restored to operation when it fails (corrective maintenance) and to procedures needed for preventing failures after the system is in operation (preventive maintenance). In more formal terms, when we begin to design a system, we also create the beginnings of a plan for how that system will be maintained. This plan is called the system maintenance concept.

As noted in Section 10.2.2, the system maintenance concept addresses

  • what parts of the system will be maintained, and how will this maintenance be accomplished,
  • how many levels of maintenance are anticipated before any formal planning is carried out (see Section 11.4.2),
  • what types of repairs and other functions are anticipated to be performed at each level,
  • what maintainability requirements (Section 10.3) will be instituted to meet the particular needs of this system,
  • what design features should be incorporated to simplify system repair, speed it up, and make it less error-prone,
  • preliminary ideas on other maintenance elements such as type of testing and diagnostic procedures to be employed, staff skills needed, etc., and
  • relevant environmental requirements (e.g., special precautions to be taken in case maintenance is performed in deleterious environments such as in a desert, on shipboard, in polluted atmospheres).

At the very early stages of design, when the maintenance concept is first explored, one should not expect firm answers for all these issues. However, it is in keeping with the spirit of prevention and quality engineering to begin thinking about these issues as soon as is practical. The maintenance concept should be continually updated and become more precisely specified as greater understanding and specificity of the design are attained.

Some parts of the maintenance concept shade over into the system support concept, and it could be argued that some relevant activities could be reasonably placed in either category. Some of these include spares inventory planning, logistics planning for the transport of failed units, spares, and repaired units, provision of online or off-line test procedures and equipment, planning (layout, staff sizing, etc.) of a maintenance facility, etc. Rather than laboring over semantics, it is best to make sure that important activities are covered. A good way to do this is to integrate maintenance staff and support staff into the design team so that should one or the other side inadvertently omit a needed activity, the chances that this omission will be caught and rectified are increased.

The maintenance concept for a service requires an understanding of the service delivery infrastructure and how the maintainability of each of its elements contributes to the maintainability of the service. Chapter 8 showed some examples of how service reliability is driven by the reliability of elements of the service delivery infrastructure [20]. The same reasoning applies to maintainability: the duration of service outages is influenced by the durations of outages in the elements of the service delivery infrastructure (as well as other factors, including the service delivery infrastructure architecture and backup provisions). Most often, at least a monotone relationship can be asserted: the longer outages persist in the service delivery infrastructure, the longer the service outages will be. Quantitative modeling that relates the duration of service outages to the duration of outages in the service delivery infrastructure is needed so that outage duration requirements (i.e., maintainability and support requirements) for elements of the service delivery infrastructure are developed on a rational basis.

11.3 MAINTAINABILITY ASSESSMENT

11.3.1 Maintenance Functional Decomposition and Maintainability Block Diagram

Section 3.4.1.2 introduced the system functional decomposition, a systematic description of how the various elements of the system work together to carry out each system function. The system reliability block diagram is an important by-product of the system functional decomposition. The reliability block diagram indicates how a system failure (violation of one or more requirements) is caused by the failure of an element of the diagram. We have called this the “reliability logic” of the system. Thought of in this way, the reliability block diagram is like the inverse of the system functional decomposition: while the system functional decomposition tells how an element of the system contributes to the system’s functioning, the reliability block diagram tells how the failure of an element of the system contributes to failure of the system.

At times, failure of an element of the system does not cause a system failure. This is the case, for instance, when a system element is backed up by a redundant element (“spare”) so that when the system element fails, the spare element takes over the function(s) of that element, and the system continues to function without (or with only a brief¹) interruption. Maintenance planning requires that these events be taken into account because

  • some cost is incurred each time this happens,
  • an action to replace the failed unit may be needed,
  • if not attended to, some such events could leave the system in an undesirable brink-of-failure state.²

The maintenance functional decomposition facilitates this accounting. The maintenance functional decomposition is a systematic description of whether the failure of a system element requires that a maintenance action be performed (e.g., replacement of the failed element). We may derive a maintainability block diagram from the maintenance functional decomposition in the same way the reliability block diagram is derived from the system functional decomposition. The maintainability block diagram expresses in pictorial form the way in which a maintenance action may follow from the failure of an element of the diagram. The simplest maintainability block diagram is a series system in which each element of the series arrangement causes a maintenance action when it fails.

If an element that is a single point of failure fails, then both a system failure and a maintenance action to replace the failed unit need to be recorded. The reliability block diagram scores a system failure, and the maintainability block diagram scores a maintenance action. If an element fails that is backed up by a redundant unit, a maintenance action may or may not be called for. For instance, if the element that fails has a hot standby spare, it may be decided, as part of the system maintenance concept, to leave the failed unit in place until the second unit also fails, at which time both are replaced in a single maintenance action. Also, when the system maintenance concept calls for the repair of certain subsystem failures to be deferred until some later time (Section 10.2.2.1), the maintainability block diagram does not include this subsystem.

Whenever a unit or ensemble failure entails a maintenance action, we place that unit or ensemble in a series configuration in the maintainability block diagram, even if that unit has redundant backup in the reliability block diagram. When a unit does not require a maintenance action when it fails, that unit enters the maintainability block diagram in a parallel configuration with the number of “spare” or “backup” units determined by the number of unit failures that have to take place before a maintenance action is required. For instance, a two-unit hot standby ensemble that is not maintained until the second unit fails (so the entire ensemble fails at that time) enters the maintainability block diagram as a two-unit parallel system. If the system maintenance plan calls for each unit to be replaced when it fails, even though a spare unit is in place and enables system operation to continue, then the two-unit hot standby system enters the reliability block diagram as a two-unit parallel system but enters the maintainability block diagram as a series system of two units (because the failure of each unit increases the maintenance action counter by one).

Example: Consider the single server rack in a server farm example shown in Section 4.4.5. All units in the rack are single points of failure except for the redundant power supply. There are two choices concerning the replacement of the power supply when a failure occurs: we may either replace each power supply unit whenever it fails, regardless of the status of the other supply unit, or we may wait until both power supply units have failed and then replace the ensemble of the two units together. In the first case, a maintenance action is called for every time a power supply unit fails, in effect treating the two power supplies as nonredundant from the maintenance action point of view. The maintainability block diagram for this case is a series system of all 16 units comprising the rack. In the second case, no maintenance action is called for until both power supply units have failed, and replacement of the ensemble of the two redundant units requires only one maintenance action. The maintainability block diagram for this case is a series system of 15 elements: the 14 single units 1–12, 15, and 16, plus one element representing the parallel ensemble of units 13 and 14.

11.3.2 Quantitative Maintainability Modeling

11.3.2.1 Frequency of maintenance actions

Once a maintainability block diagram is in place, projections about the number of maintenance actions, the times between maintenance actions, etc., may be made using the same techniques that were used in Chapters 3 and 4 for reliability block diagrams. The key is to prepare a block diagram that reflects the number of maintenance actions instead of the reliability of the system. As a rule, a system will undergo more maintenance actions than it will undergo failures, mainly because redundant units, while useful for preventing outages, may require attention when they fail in order not to leave the system in a brink-of-failure state.

This section discusses the use of the separate maintenance model (Section 4.4.5) as a model for the number of maintenance actions over a given time period. To implement the separate maintenance model in this application, begin with a maintenance functional decomposition (Section 11.3.1) in which every replaceable unit in the system is individually identified as an element of the decomposition. A maintenance functional decomposition is like a system functional decomposition, but it is constructed so that every maintenance action is recorded, even if it does not cause, or result from, a system failure. There may be other elements in the maintenance functional decomposition, but every subassembly or LRU that is designated as replaceable in the system maintenance plan should appear in the decomposition. A maintainability block diagram is a reliability block diagram based on the maintenance functional decomposition. Separate the diagram into two parts so that one part contains all the replaceable units. Let φM(X1, . . ., Xn) denote the structure function of that part of the diagram containing the replaceable units (numbered 1, . . ., n).³ Finally, denote by Z1(t), . . ., Zn(t) the reliability processes (Section 4.3.2) of the n replaceable units.⁴ The separate maintenance model is the reliability process ZM(t) = φM(Z1(t), . . ., Zn(t)) of the ensemble of the replaceable units. We use ZM(t) to obtain the number of maintenance actions for the system.
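As an illustration only, the structure function of the replaceable-unit part of the diagram can be sketched for the two power supply policies of the server rack example in Section 11.3.1. The function names are hypothetical, and the sketch assumes the usual indicator convention (1 = unit working, 0 = unit failed), with a value of 0 read as "a maintenance action is called for."

```python
from typing import Sequence


def series(states: Sequence[int]) -> int:
    """Return 1 only if every element is working."""
    return int(all(states))


def parallel(states: Sequence[int]) -> int:
    """Return 1 if at least one element is working."""
    return int(any(states))


def phi_replace_each(x: Sequence[int]) -> int:
    """Policy 1: all 16 units in series (every unit failure needs an action)."""
    return series(x)


def phi_replace_ensemble(x: Sequence[int]) -> int:
    """Policy 2: units 1-12, 15, 16 in series with the parallel pair 13, 14."""
    supply_idx = (12, 13)  # zero-based positions of units 13 and 14
    singles = [x[i] for i in range(16) if i not in supply_idx]
    ensemble = parallel([x[i] for i in supply_idx])
    return series(singles + [ensemble])


all_up = [1] * 16
one_supply_down = list(all_up)
one_supply_down[12] = 0  # unit 13 (one power supply) has failed

print(phi_replace_each(one_supply_down))      # 0: action required
print(phi_replace_ensemble(one_supply_down))  # 1: no action yet
```

Under policy 2 the function drops to 0 only when both power supplies are down or any single-point unit fails, which mirrors the series-of-15-elements diagram described in the example.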

Example: Continue the server rack example from Section 11.3.1. Let N1(t), . . ., N16(t) denote the number of unit failures in the time period [0, t] for units 1, . . ., 16, respectively. We consider two cases:

  1. Each power supply module is replaced when it fails. In this case, the number of maintenance actions is N1(t) + ⋅ ⋅ ⋅ + N16(t) because every unit failure, including each power module, causes a maintenance action. When the operating time and outage time distributions for each of the system’s components are as shown in Table 4.1, the expected number of maintenance actions over 5 years is 49.363 if the units are replaced by new ones when they fail [21, 23]. If each unit is revived (Section 4.4.3.1) when it fails, the expected number of maintenance actions over 5 years is 79.579 (Exercise 2). The number of failures with revival is greater than the number of failures with renewal because of the increasing hazard rate nature of the life distributions for the server and the power supply.
  2. When the power supply module in service fails, the backup power supply (if it is not already failed) is put into service; the ensemble of two hot standby power supply modules is replaced at the time of the second power module failure. In this case, the number of maintenance actions is N1(t) + ⋅ ⋅ ⋅ + N12(t) + N15(t) + N16(t) + J(t), where J(t) denotes the number of replacements of the ensemble of two hot standby power modules. The expected value of J(t) may be determined from equation (4.10) of Ref. 22 if an alternating renewal model for the modules is acceptable and the on- and off-time distributions in that model are assigned.
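The two counting rules above can be sketched as follows. The failure counts are hypothetical placeholders, not values from Table 4.1, and the sketch assumes that under the second policy each power module fails at most once per ensemble cycle and that replacement time is negligible, so that the number of completed ensemble replacements J(t) equals the smaller of the two module failure counts.

```python
def actions_replace_each(failures: dict) -> int:
    """Case 1: every unit failure is one maintenance action."""
    return sum(failures.values())


def actions_replace_ensemble(failures: dict, supply_units=(13, 14)) -> int:
    """Case 2: the power supply pair is replaced only after both have failed."""
    others = sum(n for unit, n in failures.items() if unit not in supply_units)
    # Assumption: one failure per module per ensemble cycle, so completed
    # ensemble replacements J(t) = min of the two module failure counts.
    j = min(failures[u] for u in supply_units)
    return others + j


# Hypothetical failure counts N_i(t) for units 1..16 over some period [0, t].
counts = {unit: 1 for unit in range(1, 17)}
counts[13] = 3
counts[14] = 2

print(actions_replace_each(counts))      # 19
print(actions_replace_ensemble(counts))  # 16
```

In practice the failure counts themselves differ between the two policies, because the replacement policy changes the failure processes of the power modules; the same dictionary is reused here only to keep the sketch short.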

11.3.2.2 Duration of maintenance actions

Maintenance action durations of interest include

  • duration of an individual operation at a single workstation and
  • total time needed for a single maintenance job to transit a facility.

These may pertain to preventive or corrective maintenance.

The sojourn time of a job at a single workstation may be estimated from historical data or may be measured using a time-and-motion study [9]. In a maintainability context, these studies are also known as maintenance task analysis [3].

Information about the total time a job spends in a maintenance facility may be gained from a precedence diagram (critical path method) or activity network model [9] for the facility. For purposes of this analysis, a maintenance facility may be conceptualized as a network of workstations with jobs flowing around the network in a pattern determined by the type of equipment being serviced, its service needs, and the types of tasks that can be performed at each of the workstations in the facility. Discussion of this variable is postponed until Section 13.4.1 in which we describe a stochastic network flow model for performance analysis and optimization of a maintenance facility.
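A minimal sketch of the critical path calculation mentioned above is given here. The workstation names, durations (in hours), and precedence relations are hypothetical, and the sketch assumes deterministic task durations with no queueing at any workstation, which is a much cruder picture than a stochastic network flow model.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical task durations in hours.
durations = {
    "inspect": 1.0,
    "diagnose": 2.0,
    "order_part": 0.5,
    "disassemble": 1.5,
    "repair": 3.0,
    "test": 1.0,
}

# Hypothetical precedence diagram: task -> set of tasks that must finish first.
predecessors = {
    "inspect": set(),
    "diagnose": {"inspect"},
    "order_part": {"diagnose"},
    "disassemble": {"diagnose"},
    "repair": {"order_part", "disassemble"},
    "test": {"repair"},
}


def transit_time(durations: dict, predecessors: dict) -> float:
    """Length of the critical (longest) path through the precedence diagram."""
    finish = {}
    for task in TopologicalSorter(predecessors).static_order():
        # A task starts when its slowest predecessor has finished.
        start = max((finish[p] for p in predecessors[task]), default=0.0)
        finish[task] = start + durations[task]
    return max(finish.values())


print(transit_time(durations, predecessors))  # 8.5
```

Here the critical path runs through inspect, diagnose, disassemble, repair, and test; the part order is off the critical path because it finishes before disassembly does.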

11.4 DESIGN FOR MAINTAINABILITY TECHNIQUES

11.4.1 System Maintenance Concept

The system maintenance concept (Section 11.2) serves as a foundation for maintenance planning. While it begins at the early stages of system design and so at that time is necessarily incomplete and lacking in detail, good practice encompasses continual updating of the maintenance concept as the design progresses. There is a natural link between the maintenance concept and system reliability modeling: reliability modeling of a maintainable system (e.g., as in Section 4.4) is driven by the system maintenance concept because the locations and types of maintenance performed are the raw material for system reliability models such as a separate maintenance model (Section 4.4.5) or a state diagram reliability model (Section 4.4.7).

A strategy for assigning certain maintenance actions to different locations is a vital part of the maintenance concept. The locations at which repairs are performed are referred to as “levels,” a terminology deriving from early use of this procedure in defense systems. Options for maintenance levels include

  • online maintenance at the site where the system is being used,
  • off-line maintenance at a location near the site where the system is being used,
  • off-line maintenance at a location distant from the site where the system is being used, and/or
  • off-line maintenance at a manufacturer’s plant or supplier’s facility.

A system maintenance concept need not include all these levels. The choice of which levels to employ is accomplished by a LoRA (Section 11.4.2), an economic exploration and optimization of maintenance operation through implementation of options from this menu. Some additional factors that influence design of maintenance and assignment of specific maintenance procedures to each level include

  • what repairs need to be done, including an assessment of how complicated each repair type may be,
  • what spare parts, tools, documentation, and other materials need to be stored to enable maintenance at that level,
  • what skills will be required of the repair staff, including an assessment of how much repair can be accomplished by the system operators who may not necessarily be trained to carry out maintenance tasks, and
  • how often maintenance (corrective and preventive) may be needed.

Details pertaining to each level of maintenance are considered in Section 11.4.2.

11.4.2 Level of Repair Analysis

A LoRA is an economic optimization that determines the least expensive assignment of repair operations to one or more of the four levels of maintenance commonly considered. Division of maintenance activities into levels comes originally from the defense industry in which systems may be deployed in far-flung locations and rapid repairs are usually required, so that the option of repairing a failed system on-site was developed.

11.4.2.1 Online maintenance

Online maintenance refers to preventive or corrective maintenance actions that take place where the system is deployed. Usually, online maintenance comprises simpler, shorter-duration, or more frequently occurring tasks, such as cleaning, minor adjustments, periodic condition monitoring, and the like, that may be accomplished by system operators without taking the system out of service. Online maintenance may also be preferred for rapid replacement of critical items to minimize system outage time.

Maintenance training for system operators incorporates

  • ability to initiate and interpret diagnostic routines for fault location,
  • procedures for the maintenance tasks (preventive or corrective) that are determined by the LoRA to be performed on-site, and
  • ability to discern when a repair is outside the scope of on-site maintenance and needs to be referred to a higher level for completion.

Planning for environmental influences on repair procedures is more important at the online level because systems may be deployed in a variety of differing environments: shipboard, aircraft, automotive, poor air quality, arctic or equatorial, etc.

11.4.2.2 Off-line maintenance on-site or at a nearby site

Certain preventive or corrective maintenance actions may require that a system be taken out of service before they can be performed. The system is then said to be off-line, and the maintenance may be undertaken on-site or at a nearby fixed or mobile location. For example, some line-replaceable units (LRUs) may require the system to be powered down before they can be safely removed or replaced. This level, together with off-line maintenance at a remote location (Section 11.4.2.3), is referred to as “intermediate maintenance.”

11.4.2.3 Off-line maintenance at a remote location

This is the second type of intermediate maintenance, sometimes called depot-level maintenance. It should be considered for repairs that

  • may be more complicated,
  • may require specialized test equipment and/or tools,
  • may require more specialized expertise to accomplish, or
  • may occur less frequently (so that on-site stocking of spare part(s) required for this repair may be costly).

Turnaround times will be greater than for online maintenance or for off-line maintenance on site or nearby, and transportation costs are also incurred.

This is also the first level of maintenance where it is reasonable to consider repair of LRUs. Many LRUs are valuable enough that they are not discarded when they fail but are instead repaired and placed into a spares inventory for use in future system corrective maintenance. Repair of an LRU usually entails replacement of some component(s) on the unit, so solder rework stations and other specialized tools may be required. This is also usually exacting work, and it is not reasonable to expect that it could be performed under field conditions. Even an environmentally controlled location such as a telephone central office, while offering a benign environment (stable temperature and humidity, low vibration, etc.), would not be equipped with the workspaces and workstations needed for these repairs. When a system contains valuable LRUs that are repaired and not discarded when they fail, depot- and/or manufacturer-level maintenance is almost mandatory. Figure 5.1 shows an example of the flows of material and information supporting a repair scheme in which LRUs are repaired at an independent facility under contract to the system manufacturer. That is, the system manufacturer has outsourced the repair of LRUs that it might have performed itself. Note that Figure 5.1 does not allude in any way to the potential political difficulties that may arise in seeking to share reliability data across unrelated organizations. While this is an important consideration, it is beyond the scope of this book.

11.4.2.4 Off-line maintenance at a manufacturer’s or supplier’s facility

For systems or products using a multilevel maintenance concept, this is the last resort for repairs. Consider reserving this level of maintenance for

  • particularly intractable failures that are difficult to diagnose,
  • situations where long turnaround times can be tolerated, or
  • tasks that are beyond the capability of on-site or intermediate maintenance staff.

A multilevel maintenance scheme also offers the possibility of spilling over to the next higher level repairs that have been attempted but were unsuccessful. It is reasonable to expect (but should be verified) that the manufacturer of the system has the specialized expertise, tools, and diagnostic systems to handle almost every type of failure. For some types of products (e.g., consumer entertainment products), this may be the only option offered by the manufacturer (even though the owner may be able to perform repairs himself or may contract repair to be conducted by an independent shop).

As with intermediate maintenance, costs will be incurred for transportation of materials both to and from the facility.

11.4.2.5 Analysis and optimization

The LoRA described in this section helps choose the least-cost repair scheme to fit the particular needs of your system. From the four levels of repair and the possibility of discarding the item, there are at most 31 combinations (ranging from using on-site repair only up to using all four levels plus discard). Some combinations may be ruled out by other conditions prevailing in system use or operation, so the number of choices is usually limited to a small number. Choice of the least-cost option is readily accomplished through use of a spreadsheet-based accounting procedure. LoRA is described in detail in MIL-STD-1390D [24], which, while no longer supported by the Department of Defense, contains a wealth of information and procedures that help in practical LoRA studies. LoRA is also used in the automotive and aerospace industries [19]. Software to allow rapid completion of LoRA has been described [7, 11]. As with all off-the-shelf software, users should verify that the assumptions used by the software developers are appropriate for the study being undertaken before relying on the answers generated by the software.
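The figure of at most 31 combinations quoted above is the number of nonempty subsets of five options (four repair levels plus discard), that is, 2^5 − 1. A short enumeration, using illustrative option labels that are not taken from any standard, confirms the count:

```python
from itertools import combinations

# Illustrative labels for the four repair levels plus the discard option.
options = ["on-site", "nearby off-line", "remote off-line", "manufacturer", "discard"]

# Every nonempty subset of the options is a candidate repair scheme.
schemes = [
    subset
    for size in range(1, len(options) + 1)
    for subset in combinations(options, size)
]

print(len(schemes))  # 31
```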

To begin a LoRA, choose a time horizon for the decision process. Use a time horizon that reasonably reflects the time over which you expect that the system will be supported by the maintenance-level scheme. Use one spreadsheet worksheet for each item type to be maintained in the multilevel scheme. Use one column for each level-of-repair option (e.g., a three-level maintenance scheme uses four columns, one for each level and one for the “discard” option). Use one row for each of the following costs:

  • Acquisition cost: the first cost of purchasing the item,
  • Expected maintenance labor cost: labor cost per hour multiplied by the expected total duration of all maintenance tasks on that item over the chosen time horizon,
  • Maintenance staff training cost allocated to the item,
  • Maintenance facility costs (rent, utilities, janitorial costs, capital costs, etc.) allocated to the item (if necessary, separate capital costs from expenses),
  • Inventory acquisition cost for the item,
  • Inventory carrying cost for the item,
  • Repair parts inventory acquisition cost for the item,
  • Repair parts inventory carrying cost for the item,
  • Cost of test equipment, tools, documentation, software, etc., for the item,
  • Transportation costs for the item,
  • Recycling and disposal costs allocated to the item.

Each cell is populated with the cost associated with its row for the level associated with its column. Use the spreadsheet to add up the costs down each column. The LoRA procedure chooses the option corresponding to the column with the lowest total. If you were making a decision for that one item only, you could do it now based on the spreadsheet results. If there is more than one item in the plan and all items are to be repaired using the same multilevel plan, make a weighted average of all the total costs, weighted by the proportion of the total population of items represented by each individual item. For instance, if there are two items A and B, and item A represents 30% of the total number of both items and item B represents 70% of the total, weight the total costs of items A and B by 0.3 and 0.7, respectively. Make separate worksheets for the two items and average the total costs over the two worksheets for A and B using these weights. The lowest weighted average total cost over the options considered is the LoRA solution. If there is more than one item in the plan but each item may have a separate repair scheme, the items may be treated individually with an optimal level of repair selected for each, and the weighted-average procedure is not needed.
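The spreadsheet procedure just described can be sketched in a few lines. All cost figures below are hypothetical and chosen only to show that the best single scheme for two items taken together may differ from the best scheme for one item alone; the function names are likewise illustrative.

```python
def column_totals(worksheet: dict) -> dict:
    """Sum each option's column; worksheet maps cost row -> {option: cost}."""
    totals = {}
    for row in worksheet.values():
        for option, cost in row.items():
            totals[option] = totals.get(option, 0.0) + cost
    return totals


def weighted_totals(totals_by_item: dict, weights: dict) -> dict:
    """Population-weighted average of the column totals across items."""
    options = next(iter(totals_by_item.values())).keys()
    return {
        option: sum(weights[item] * totals_by_item[item][option] for item in totals_by_item)
        for option in options
    }


def best_option(totals: dict) -> str:
    """The LoRA choice: the option with the lowest total cost."""
    return min(totals, key=totals.get)


# Hypothetical worksheets (two cost rows each) for items A and B.
sheet_a = {
    "labor": {"intermediate": 100, "depot": 150, "discard": 0},
    "spares": {"intermediate": 400, "depot": 200, "discard": 900},
}
sheet_b = {
    "labor": {"intermediate": 200, "depot": 400, "discard": 0},
    "spares": {"intermediate": 300, "depot": 300, "discard": 600},
}

totals = {"A": column_totals(sheet_a), "B": column_totals(sheet_b)}
combined = weighted_totals(totals, {"A": 0.3, "B": 0.7})

print(best_option(totals["A"]))  # depot
print(best_option(totals["B"]))  # intermediate
print(best_option(combined))     # intermediate
```

With these made-up numbers, item A alone would go to depot repair, but a single scheme for both items favors intermediate repair because item B makes up 70% of the population.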

The remainder of this section is devoted to a small example of a LoRA, not so much as a generic example to be followed to the letter in particular applications but more as an illustration of the procedure and the reasoning process that makes LoRA useful.

Example: This example concerns two LRUs, A and B, that are themselves repairable. The options considered include intermediate-level repair, depot-level repair of the LRUs (a failed LRU is replaced by a working one from a spares inventory so that the system is restored to service; the failed LRU undergoes repair itself using the scheme to be decided by the LoRA), and discard (when an LRU fails, it is not repaired but is discarded or recycled). The units are not repairable on-site. We will illustrate a LoRA for these units using a 10-year time horizon. To fill in the spreadsheet, some facts about units A and B are required.

  1. A system containing units A and B is installed in 10 submarines. Each submarine contains two systems. Each system contains three A units and seven B units. The systems are in service 12 hours a day, and the submarines run on a 6-months-on, 2-months-off schedule, so over the 10-year study horizon, each system accrues a total of 32,400 hours of operation.⁵ The type A units accrue a total of 1,944,000 unit-hours, and the type B units accrue a total of 4,536,000 unit-hours.
  2. Type A units cost $18,000 each if designed to be repairable and $15,000 each if designed to be discarded. The corresponding costs for unit B are $3,500 and $2,750, respectively.
  3. The labor rate, including all overhead, is $35 per hour at the intermediate level and $55 per hour at the depot level. Unit A takes an average of 4 hours to repair, while unit B takes an average of 3 hours to repair.
  4. The estimated failure intensity of unit A is 6 × 10⁻⁵ failures per hour and that of unit B is 2 × 10⁻⁵ failures per hour.⁶
  5. Training repair personnel costs $60 per hour for intermediate-level staff and $80 per hour for depot-level staff.
  6. Facility costs allocated to units A and B are $1.50 per maintenance hour at the intermediate repair level and $2.50 per maintenance hour at the depot repair level.
  7. The spares inventory size is 2 type A spares and 5 type B spares per system at the intermediate level and 10 type A spares and 25 type B spares (covering all systems) at the depot level when the units are repaired. If the units are discarded, one spare is needed for every unit in the field. Inventory carrying costs are approximately 7% of the inventory value per year.
  8. Parts consumed in the repair of unit A amount to $78 per repair and for unit B amount to $24 per repair.
  9. Costs of test equipment, tools, etc., allocated to units A and B together are $12,000 per installation for intermediate-level repair and $15,000 per installation for depot-level repair.
  10. Transportation costs $500 per shipment from the submarine to either the intermediate or depot repair facility, regardless of the number of units being shipped.
  11. Disposal costs $25 for unit A and $15 for unit B. Fifty percent of those disposed are recycled, and recycling unit A (B, respectively) brings in $80 ($10, respectively) in revenue.
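As a check on the arithmetic, the facts above can be turned into several of the unit A cost entries. This is a sketch under stated assumptions: the variable names are illustrative, training and test equipment are omitted because their allocation is not fully specified in the list, and item 11 is read as half of the discarded units being disposed of at cost and the other half recycled for revenue, the reading that reproduces the recycling/disposal entry in Table 11.1.

```python
SYSTEMS = 10 * 2            # 10 submarines, 2 systems each
HOURS_PER_SYSTEM = 32_400   # operating hours per system over the 10 years
UNITS_PER_SYSTEM = 3        # type A units per system
FAILURE_INTENSITY = 6e-5    # failures per unit-hour for unit A
REPAIR_HOURS = 4            # average repair time for unit A, hours
YEARS = 10                  # study horizon

unit_hours = SYSTEMS * UNITS_PER_SYSTEM * HOURS_PER_SYSTEM
expected_failures = unit_hours * FAILURE_INTENSITY
maintenance_hours = expected_failures * REPAIR_HOURS

labor = {"intermediate": 35 * maintenance_hours, "depot": 55 * maintenance_hours}
facility = {"intermediate": 1.50 * maintenance_hours, "depot": 2.50 * maintenance_hours}
spares = {
    "intermediate": 2 * SYSTEMS * 18_000,            # 2 spares per system
    "depot": 10 * 18_000,                            # 10 spares in total
    "discard": SYSTEMS * UNITS_PER_SYSTEM * 15_000,  # 1 spare per fielded unit
}
carrying = {option: 0.07 * YEARS * value for option, value in spares.items()}
repair_parts = 78 * expected_failures

# Assumed reading of item 11: half disposed at $25, half recycled for $80.
disposal_net = expected_failures * (0.5 * 25 - 0.5 * 80)

print(unit_hours)                    # 1944000
print(round(expected_failures, 2))   # 116.64
print(round(labor["intermediate"]), round(labor["depot"]))
print(round(repair_parts), round(disposal_net))
```

The same pattern applies to unit B with seven units per system and its own failure intensity, repair time, and prices.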

The spreadsheet for unit A is shown in Table 11.1.

Table 11.1 Unit A LoRA Spreadsheet

| Cost Description | Intermediate Repair ($) | Depot Repair ($) | Discard Option ($) | Notes |
|---|---|---|---|---|
| Acquisition | 1,080,000 | 1,080,000 | 900,000 | 20 systems, 3 type A units per system |
| Labor | 16,330 | 25,661 | 0 | Expected number of failures over 10 years is 116.64 |
| Training | 11,520 | 2,880 | 0 | Eight students for 3 days at intermediate, two students for 3 days at depot |
| Facility | 700 | 1,166 | 0 | |
| Spares inventory acquisition | 720,000 | 180,000 | 900,000 | |
| Spares inventory carrying | 504,000 | 126,000 | 630,000 | |
| Repair parts consumption | 9,098 | 9,098 | 0 | |
| Test equipment, tools, etc. | 3,600 | 4,500 | 0 | |
| Transportation | 0 | 0 | 0 | Not included because it is the same in all scenarios |
| Recycling/disposal | 0 | 0 | −3,208 | |
| Total | 2,345,248 | 1,429,305 | 2,426,792 | |

The spreadsheet for unit B is shown in Table 11.2.

Table 11.2 Unit B LoRA Spreadsheet

| Cost Description | Intermediate Repair ($) | Depot Repair ($) | Discard Option ($) | Notes |
|---|---|---|---|---|
| Labor | 22,226 | 34,927 | 0 | Expected number of failures over 10 years is 90.72 |
| Training | 11,520 | 2,880 | 0 | Eight students for 3 days at intermediate, two students for 3 days at depot |
| Facility | 408 | 680 | 0 | |
| Spares inventory acquisition | 350,000 | 87,500 | 385,000 | |
| Spares inventory carrying | 245,000 | 61,250 | 269,500 | |
| Repair parts consumption | 2,177 | 2,177 | 0 | |
| Test equipment, tools, etc. | 8,400 | 10,500 | 0 | |
| Transportation | 0 | 0 | 0 | Not included because it is the same regardless of the number of units |
| Recycling/disposal | 0 | 0 | 227 | |
| Total | 639,731 | 199,914 | 654,727 | |
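The final LoRA step, picking the column with the lowest total, can be sketched directly from the "Total" rows of Tables 11.1 and 11.2. The dictionary layout and names are illustrative only.

```python
# "Total" rows copied from Tables 11.1 (unit A) and 11.2 (unit B), in dollars.
table_totals = {
    "A": {"intermediate": 2_345_248, "depot": 1_429_305, "discard": 2_426_792},
    "B": {"intermediate": 639_731, "depot": 199_914, "discard": 654_727},
}

# For each unit, select the option whose column total is smallest.
least_cost = {
    unit: min(columns, key=columns.get)
    for unit, columns in table_totals.items()
}

print(least_cost)  # {'A': 'depot', 'B': 'depot'}
```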

In both cases A and B, the analysis indicates that depot repair is preferred because it has the lowest total cost in each table. Note that training costs are the same for both units A and B, so these could have been left out of the analysis, and the conclusion would not change. If we needed to consider both units A and B together (i.e., only a single level of repair strategy was to be selected for both units), the additional step of taking weighted averages of the results for A and B would be required if the two analyses pointed to two different choices for A and for B. In this example, this is not needed because the conclusion is the same for A and B: use depot-level repair.

This example is oversimplified and not intended to provide a template for any particular LoRA. Rather, it is intended to show in general terms how LoRA is carried out and what the underlying reasoning process is. Practical LoRA is facilitated by standards such as Refs. 19, 24 and off-the-shelf software such as Refs. 7, 11.

11.4.3 Preventive Maintenance

Preventive maintenance is the application of occasional interventions intended to forestall possible system failures. Preventive maintenance is most effective when two conditions hold: the system has one or more wearout failure modes for which suitable design countermeasures were not implemented (either because none could be identified or because those identified were considered too expensive), and maintenance measures are known that can forestall those wearout failure modes. An example that clearly illustrates preventive maintenance, because the wearout failure mode is readily discernible, is lubrication of bearings. In the absence of lubrication, metal-to-metal contact in rolling or sliding bearings would cause rapid wear and failure of the system of which they are a part. Therefore, lubrication is specified as a part of most, if not all, bearing applications to postpone the time at which this wearout failure mode may activate. An additional preventive maintenance aspect of this example is that, in some cases, such as an internal combustion engine, the lubricant may itself wear and need to be replaced from time to time to maintain its effectiveness. So a schedule of lubricant replacement is recommended: change the engine oil every 7500 miles or once a year, whichever comes first. This is an example of a fixed schedule in which “age” is measured both by elapsed time and by elapsed mileage.
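A "whichever comes first" rule with two age measures can be written down very compactly. The following Python sketch is only an illustration of the engine oil example: the function name oil_change_due is hypothetical, and "once a year" is taken to mean 365 days, which is an assumption.

```python
from datetime import date

MILES_LIMIT = 7500  # mileage limit from the engine oil example
DAYS_LIMIT = 365    # "once a year", assumed here to mean 365 days


def oil_change_due(miles_since_change, last_change, today):
    """Return True when either age measure has reached its limit."""
    days_since_change = (today - last_change).days
    return miles_since_change >= MILES_LIMIT or days_since_change >= DAYS_LIMIT


if __name__ == "__main__":
    # High mileage after two months: due because of mileage.
    print(oil_change_due(8000, date(2024, 1, 1), date(2024, 3, 1)))
    # Low mileage after five months: not yet due on either measure.
    print(oil_change_due(3000, date(2024, 1, 1), date(2024, 6, 1)))
```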

Fixed preventive maintenance schedules may not be optimal. For instance, in the internal combustion engine example, lubricant wear also depends on other factors, such as style of driving and environmental conditions. Highway driving at a more-or-less constant speed is less wearing on lubricants than stop-and-go city driving. Driving in dusty or sandy environments causes faster lubricant wear. But a fixed schedule of lubricant replacement does not account for these variables and may cause

  • premature replacement of lubricant that may have many miles of safe use remaining, or
  • tardy replacement of lubricant that may have already worn past the point of effectiveness.

As industries began to recognize the economic implications of these (we might call them) type-1 and type-2 errors, a search for better preventive maintenance schemes began. One result was RCM.

11.4.4 Reliability-Centered Maintenance (RCM)

When initially conceived, preventive maintenance was envisioned as a regularly scheduled activity that would take place regardless of other conditions prevailing in the system. It soon became apparent that a fixed schedule of preventive maintenance was not optimal: preventive maintenance could be carried out long before it was necessary, or the failure mode it was intended to forestall could occur before the preventive maintenance was applied. So it became sensible to look for ways to incorporate knowledge of the system’s current condition and past failure behavior into a preventive maintenance scheme. Techniques of this type fall under the category of RCM [5, 17]. We will describe two types of RCM: predictive maintenance and condition-based maintenance.

Language tip: “Predictive maintenance,” “condition-based maintenance,” and “RCM” are not used consistently in the community. We have chosen in this book a usage that we hope aligns the term with the process so that it will be easier to remember which term applies to which concept. Thus, we use RCM as the general term for all preventive maintenance schemes that involve using information about the reliability of the system to plan the next, or the next series of, preventive maintenance intervention(s). We reserve predictive maintenance for plans based on the knowledge of stochastic properties of times to failure or operating times of the system, and condition-based maintenance for plans based on following the progress of some degradation process active in the system or on the results of some specific testing applied to the system. Be aware that usage varies and take pains to verify which is being used for which when clarity is important.

11.4.4.1 Predictive maintenance

There are two broad classes of RCM schemes: ones based on the knowledge of the pattern of system failures in time (or other age-measuring variable), and ones based on an understanding of some degradation process at play in the system. We will describe the first class in this section on predictive maintenance and the second class as condition-based maintenance in Section 11.4.4.2.

We have previously conceptualized the times at which failures of a maintainable system occur as a point process (Section 4.3.3). If it is possible to characterize the operating times in this point process thoroughly, what we know about the distributions of the operating times in the process may be used to construct a preventive maintenance schedule. For example, suppose n − 1 failures have occurred in the system so far. At the end of the current outage, the system is repaired and returned to service, and the next operating time Un begins. If we know the distribution of Un, we may choose a time at which we are, say, 90% certain that the next failure will occur beyond this time (this would be the 10th percentile of the distribution of Un). Of course, this choice is not going to be arbitrary (although even an arbitrary choice has a chance of improving on the fixed schedule scheme) but will be determined through an optimization balancing the cost of carrying out the preventive maintenance too early against the costs of a failure. Some examples of optimization of predictive maintenance schemes can be found in Refs. 6, 14.
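To make the percentile idea concrete, the following Python sketch assumes that the operating time Un has a Weibull distribution with hypothetical shape and scale parameters; the text does not specify any particular distribution. It computes the 10th percentile and then, as a stand-in for the optimization mentioned above, minimizes the classical age-replacement cost rate (expected cost per renewal cycle divided by expected cycle length) over a grid of candidate ages. That criterion, the assumption that preventive maintenance restores the system to as-good-as-new, and the cost values are illustrative choices, not the models of Refs. 6, 14.

```python
import math


def weibull_survival(t, shape, scale):
    """P(U_n > t) for a Weibull-distributed operating time."""
    return math.exp(-((t / scale) ** shape))


def weibull_percentile(p, shape, scale):
    """Time t_p such that P(U_n <= t_p) = p."""
    return scale * (-math.log(1.0 - p)) ** (1.0 / shape)


def cost_rate(age, shape, scale, cost_pm, cost_failure, steps=500):
    """Long-run cost per unit time when the system is renewed at failure
    or at the preventive maintenance age, whichever comes first."""
    h = age / steps
    interior = sum(weibull_survival(i * h, shape, scale) for i in range(1, steps))
    survive = weibull_survival(age, shape, scale)
    # Trapezoidal rule for the mean cycle length, the integral of the
    # survival function from 0 to age (the survival function is 1 at 0).
    mean_cycle = h * (0.5 + interior + 0.5 * survive)
    mean_cost = cost_pm * survive + cost_failure * (1.0 - survive)
    return mean_cost / mean_cycle


def best_pm_age(shape, scale, cost_pm, cost_failure, grid=200):
    """Grid search over ages in (0, 2 * scale] for the smallest cost rate."""
    candidates = [2.0 * scale * k / grid for k in range(1, grid + 1)]
    return min(
        candidates,
        key=lambda a: cost_rate(a, shape, scale, cost_pm, cost_failure),
    )


if __name__ == "__main__":
    SHAPE, SCALE = 2.0, 1000.0        # hypothetical Weibull parameters (hours)
    COST_PM, COST_FAILURE = 1.0, 10.0  # hypothetical relative costs
    print("10th percentile:", weibull_percentile(0.10, SHAPE, SCALE))
    print("cost-minimizing age:", best_pm_age(SHAPE, SCALE, COST_PM, COST_FAILURE))
```

A grid search is used only to keep the sketch short and free of external libraries; any one-dimensional minimizer would serve. The search has an interior minimum here because the assumed shape parameter exceeds 1 (a wearout failure mode) and a failure is assumed to cost more than a preventive intervention.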

11.4.4.2 Condition-based maintenance

As an alternative to acquiring stochastic characterization of the system’s operating times, some physical characteristic(s) of the system may be measured and tracked over time⁷ in an attempt to predict when a failure may be imminent. The measurement may be passive (i.e., no special stimulus is applied to the system but rather some existing operating characteristic or sensor reading is followed) or active (i.e., some stimulus is applied to the system, and a response is measured).

In the first case, the measured characteristic is usually conceptualized as a stochastic process {X(t) : t ≥ 0} (X(t) is the value of the measurement at time t), and the connection with maintenance is that the system fails at the first time τ that this process crosses some stated threshold x0. That is,

\[ \tau = \inf\{\, t \ge 0 : X(t) \ge x_0 \,\} \]

if we think of the process X(t) as nondecreasing. For instance, X(t) may represent the percentage of oxidation in a steel structure; when that percentage reaches a predetermined threshold x0, some remedial action is taken to forestall collapse of the structure. Or X(t) may represent the level of vibration measured in some rotating machinery; when the measured vibration is too great, preventive action is taken to forestall a failure that may take place if bearing wear (or some other failure mechanism) were allowed to increase unchecked. A schedule of inspections is needed so that maintenance personnel will know when the system should be monitored. A static schedule calls for measurements at fixed, predetermined times. If the system is continuously monitored, this procedure is called condition monitoring [8]. A dynamic schedule may be developed based on an understanding of how rapidly X(t) changes; that is, slowly changing phenomena need not be inspected as often. A dynamic schedule may be created by estimating X′(t) at the same time X(t) is measured so that the time until the next inspection will be longer if X′(t) is small and shorter if X′(t) is large.
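One simple way to turn the dynamic schedule idea into a rule is sketched below in Python. It estimates X′(t) by a finite difference of the two most recent measurements, extrapolates linearly to the threshold x0, and schedules the next inspection after a fixed fraction of the projected time remaining. The linear extrapolation, the safety fraction, and the minimum and maximum intervals are all hypothetical choices made for illustration; they are not prescribed by the text.

```python
def estimate_rate(t_prev, x_prev, t_now, x_now):
    """Finite-difference estimate of X'(t) from two successive inspections."""
    return (x_now - x_prev) / (t_now - t_prev)


def next_inspection_interval(x_now, rate, threshold,
                             safety_fraction=0.5,
                             min_interval=1.0,
                             max_interval=90.0):
    """Time until the next inspection under a linear-extrapolation rule.

    All default parameter values are hypothetical.
    """
    if x_now >= threshold:
        return 0.0  # threshold already reached: act now
    if rate <= 0.0:
        return max_interval  # no measurable growth: wait the longest allowed time
    time_to_threshold = (threshold - x_now) / rate
    interval = safety_fraction * time_to_threshold
    # Keep the interval within the allowed bounds.
    return max(min_interval, min(max_interval, interval))


if __name__ == "__main__":
    # Hypothetical readings: X rose from 1.0 to 2.0 over 10 days; x0 = 10.0.
    rate = estimate_rate(0.0, 1.0, 10.0, 2.0)
    print(next_inspection_interval(2.0, rate, 10.0))
```

With this rule, a small X′(t) yields a long interval (capped at the maximum) and a large X′(t) yields a short one (never below the minimum), which is the qualitative behavior described above.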

The measurements (longitudinal data) obtained from these inspections are values of X(t) at the inspection times t1, t2, . . . . Statistical treatment of these data in an engineering context was introduced by Carey and Tortorella [4] and developed to a high standard by Lu and Meeker [13] and others. Degradation analysis is now a standard part of statistical analysis of reliability data [15] and is widely used in many condition-based maintenance analyses [2, 12, 13, 14, 25].
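As a very reduced illustration of working with such longitudinal data, the following Python sketch fits a straight line to the inspection values by ordinary least squares and projects the time at which the fitted line reaches the threshold x0. The straight-line degradation path is an assumption made only for this sketch, and the function names are hypothetical; the methods of Refs. 4 and 13 are considerably more general and also account for unit-to-unit variability.

```python
def fit_line(times, values):
    """Ordinary least-squares fit of values = intercept + slope * times."""
    n = len(times)
    t_bar = sum(times) / n
    x_bar = sum(values) / n
    sxx = sum((t - t_bar) ** 2 for t in times)
    sxy = sum((t - t_bar) * (x - x_bar) for t, x in zip(times, values))
    slope = sxy / sxx
    intercept = x_bar - slope * t_bar
    return intercept, slope


def projected_crossing_time(times, values, threshold):
    """Time at which the fitted line reaches the threshold, or None if the
    fitted degradation is not increasing."""
    intercept, slope = fit_line(times, values)
    if slope <= 0.0:
        return None
    return (threshold - intercept) / slope


if __name__ == "__main__":
    # Hypothetical inspection data: X(t) measured at t = 0, 1, 2, 3.
    inspection_times = [0.0, 1.0, 2.0, 3.0]
    measurements = [1.0, 3.0, 5.0, 7.0]
    print(projected_crossing_time(inspection_times, measurements, 11.0))
```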

In the second case, inspection entails measuring the response of the system to a defined stimulus. For example, inverse acoustic scattering may be used to detect cracks in structural materials [1], enabling early detection of deterioration that, if left unchecked, could lead to catastrophic failure [10]. In other respects, this procedure is like passive condition-based maintenance with the added feature that periodic inspection of elements not ordinarily visible may be accomplished.

11.5 CURRENT BEST PRACTICES IN DESIGN FOR MAINTAINABILITY

11.5.1 Make a Deliberate Maintainability Plan

The degree of maintainability to be achieved by a system or service is determined by customer needs for rapid, low-cost, and error-free repair or restoration, and the business case for the system or service. It may be possible to develop an optimization model to guide the proper balance or determine just how much maintainability customers will be willing to pay for. Even if this is done only informally, through discussions between marketing and systems engineering, greater understanding of the trade-offs involved forms a more rational basis for action. Some deliberate action must be taken, or the result will be a system or service that has some degree of maintainability that is achieved essentially at random. That is, without deliberate planning and attention to design for maintainability, the system or service maintainability is what it is because of actions or omissions that were unguided, and a good outcome would be the result of good luck rather than a solid plan. So the first best practice is to determine just how much maintainability is appropriate for the system or service. This should be undertaken before a maintenance concept (Section 11.2) is considered.

11.5.2 Determine Which Design for Maintainability Techniques to Use

Not every system or service will require a high degree of maintainability, but every system or service should have a maintenance concept, for it is here that the fundamental decisions about maintenance are made and documented. From these decisions, maintainability requirements should be constructed. Once the requirements are known, you can decide how much effort in design for maintainability will be needed. This chapter discusses three relevant design for maintainability techniques: the system or service maintenance concept, LoRA, and preventive maintenance.

Every system or service needs a maintenance concept, even if the concept is “do no maintenance” (and if this is the decision, it must be deliberate). LoRA is not needed for systems that do not use the replace-and-reuse concept for subassemblies. If a system is deployed in widely separated geographic locations and is repaired using a replace-and-reuse concept, a LoRA is needed so that the costs of the repair operation can be identified and a rational, minimum-cost repair scheme can be selected. Even legacy repair infrastructures should be examined for cost-saving opportunities that may arise from batching strategies, repair-on-demand possibilities, etc.

Investigate possible preventive maintenance schemes. Determine whether the system harbors any wearout failure mechanisms and whether these are likely to activate before the end of the system’s intended service life. If so, consider developing preventive maintenance, either scheduled or predictive, as appropriate, to forestall the large number of failures that may occur because all deployed systems have the same wearout failure mechanism(s). This is analogous to the treatment of design flaws in reliability engineering: if a failure mode is due to a design flaw, then every copy of the system contains the same flaw, and (depending on the environment in which the system operates) the flaw will cause a failure sooner or later in every copy of the system. As always, the preventive maintenance decision is an economic one. It may not be economically sensible to spend large sums on prevention for systems that have short useful lives, are of low value, or do not generate large external failure costs.

For high-consequence systems, the same principle used in design for reliability (Chapter 7) applies here too: require justification for any elimination of design for maintainability interventions. The contrast is with “ordinary” (not high-consequence) systems in which economic considerations usually require justification of inclusion of design for maintainability interventions. The balance between prevention costs and external failure costs should always be considered.

11.5.3 Integration

Maintainability (and supportability) and reliability are connected through availability requirements (see the example in Section 10.2.3). Therefore, reliability modeling and maintainability modeling (and supportability modeling, Chapter 12) should be linked so that

  • availability implications of maintenance and support policies can be discerned, and
  • maintainability requirements can be set on a rational basis.

While the frequency of failures is determined by reliability, the duration of outages is driven by supportability and maintainability. If there are downstream effects of failure frequency,⁸ these should be considered when developing requirements for reliability. Downstream effects of outages in elements of a service delivery infrastructure should factor into the development of maintainability (and supportability) requirements.
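A small calculation shows how an availability requirement ties the two together. The following Python sketch uses the familiar steady-state relation, availability = mean operating time / (mean operating time + mean outage duration), to find the longest mean outage duration (support plus repair) compatible with a stated availability requirement and a given mean time between failures. The function names and the numerical values are hypothetical and are not taken from the example in Section 10.2.3.

```python
def availability(mean_uptime, mean_downtime):
    """Steady-state availability of an alternating up/down process."""
    return mean_uptime / (mean_uptime + mean_downtime)


def max_mean_downtime(mean_uptime, required_availability):
    """Largest mean outage duration that still meets the availability
    requirement, for a given mean operating time between failures."""
    return mean_uptime * (1.0 - required_availability) / required_availability


if __name__ == "__main__":
    MEAN_UPTIME_HOURS = 999.0     # hypothetical mean time between failures
    REQUIRED_AVAILABILITY = 0.999  # hypothetical requirement
    limit = max_mean_downtime(MEAN_UPTIME_HOURS, REQUIRED_AVAILABILITY)
    print("maximum mean outage duration (hours):", limit)
    print("resulting availability:", availability(MEAN_UPTIME_HOURS, limit))
```

Read in the other direction, the same relation shows that for a fixed availability requirement, a less reliable system (shorter mean operating time) leaves less room for outage duration, so its maintainability and supportability requirements must be correspondingly tighter.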

11.5.4 Organizational Factors

Integrate maintenance team members with design team members as soon as possible. We have seen examples of how design features may promote or inhibit maintainability. It is wise to begin interdisciplinary communication as early as possible in the design process so that decisions that affect maintainability are reviewed and rationalized to avoid adverse effects. Design reviews are a natural forum for these discussions.

11.6 CHAPTER SUMMARY

As usual in this book for systems engineers, this chapter offers many recommendations for design for maintainability, but only some of these are covered in detail. Most often, these actions will be carried out not by the systems engineer but by other maintainability specialist engineers on the design team. Some of the resources that can be used to fill in the details of these procedures include Refs. 3 and 16 and, for insight into early thinking on the subject, Ref. 18.

11.7 EXERCISES

  1. Continue the server rack example from Sections 11.3.1 to 11.3.2. Suppose that replacement units are new and that unit i has an exponential life distribution with parameter λi, i = 1, . . ., 16, where λ1 = ⋅ ⋅ ⋅ = λ12 = 2, λ13 = λ14 = 8, λ15 = 1, and λ16 = 0.01 failures per year. Suppose that the downtime for unit i (including support and replacement time) has a uniform distribution on [1, 4] for every i = 1, . . ., 16.
    1. In case every unit is replaced when it fails, what is the expected number of unit replacements in the rack over the first year of operation?
    2. In the case where the power supply ensemble is replaced only after both power supplies have failed, what is the expected number of ensemble replacements in the first year of operation? (Hint: see Section 4.2 of Ref. 22.)
  2. Complete the example in Section 11.3.2.1 using the method found in Section 4.4.3.1.
  3. List the 31 possible combinations of repair levels noted in the beginning of Section 11.4.2.5. Do all 31 combinations make sense? Are there any situations in which it makes sense not to have an on-site repair option?
  4. Cite and discuss two examples of preventive maintenance. Identify the wearout failure mode(s) the preventive maintenance is intended to forestall. How is “age” measured in your examples?

REFERENCES

  1. Baltazar A, Wang L, Xie B, Rokhlin SI. Inverse ultrasonic determination of imperfect interfaces and bulk properties of a layer between two solids. J Acoust Soc Am 2003;114 (3):1424–1434.
  2. Besnard F, Bertling L. An approach for condition-based maintenance optimization applied to wind turbine blades. IEEE Trans Sustain Energy 2010;1 (2):77–83.
  3. Blanchard BS, Verma D, Peterson EL. Maintainability: A Key to Effective Serviceability and Maintenance Management. Volume 13, New York: John Wiley & Sons, Inc; 1995.
  4. Carey MB, Tortorella M. Analysis of degradation data applied to MOS devices. Sixth International Conference on Reliability and Maintainability; Strasbourg, France. 1988.
  5. Carter AB. Reliability centered maintenance. 2011. United States Department of Defense Manual 4151.22-M. Washington, DC: US Department of Defense.
  6. Chu C, Proth JM, Wolff P. Predictive maintenance: the one-unit replacement model. Int J Prod Econ 1998;54 (3):285–295.
  7. Elliott-Brown JA, McPherson SW. NAVSEA level of repair analysis (LORA) software. Naval Eng J 1995;107:59–66.
  8. Elsayed EA. Reliability Engineering. 2nd ed. Hoboken: John Wiley & Sons, Inc; 2012.
  9. Freivalds A. Niebel's Methods, Standards, and Work Design. Volume 700, Boston: McGraw-Hill Higher Education; 2009.
  10. https://en.wikipedia.org/wiki/Mianus_River_Bridge
  11. Ituarte-Villarreal CM, Espiritu JF. A decision support system for the level of repair analysis problem. Proceedings of the 41st International Conference on Computers & Industrial Engineering. October 23–25; Los Angeles, CA; 2011. p 666–671.
  12. Jardine AKS, Banjevic D, Makis V. Optimal replacement policy and the structure of software for condition-based maintenance. J Qual Maint Eng 1997;3 (2):109–119.
  13. Lu CJ, Meeker WQ. Using degradation measures to estimate a time-to-failure distribution. Technometrics 1993;35 (2):161–174.
  14. Lu S, Tu YC, Lu H. Predictive condition–based maintenance for continuously deteriorating systems. Qual Reliab Eng Int 2007;23 (1):71–81.
  15. Meeker WQ, Escobar LA. Statistical Methods for Reliability Data. New York: John Wiley & Sons, Inc; 1998.
  16. Okogbaa OG, Otieno W. Design for maintainability. In: Kutz M, editor. Environmentally Conscious Mechanical Design. New York: John Wiley & Sons, Inc.; 2007. p 185–248.
  17. Rausand M. Reliability centered maintenance. Reliab Eng Syst Saf 1998;60:121–132.
  18. Rigby LV, Cooper JI, Spickard WA. Guide to integrated system design for maintainability. Defense Technical Information Center document AD-0271477. 1961.
  19. Society of Automotive Engineers. Level of Repair Analysis standard AS-1390. 2014.
  20. Tortorella M. Cutoff Calls and Telephone Equipment Reliability. Bell Syst Tech J 1981;60 (8):1861–1889.
  21. Tortorella M. Numerical solutions of renewal-type integral equations. INFORMS J Comput 2005;17 (1):66–74.
  22. Tortorella M. On cumulative jump random variables. Annal Oper Res 2013;206 (1):485–500.
  23. Tortorella M, Frakes WB. A computer implementation of the separate maintenance model for complex-system reliability. Qual Reliab Eng Int 2006;22 (7):757–770.
  24. US Department of Defense. Military Standardization Document 1390D. Washington, DC: US Department of Defense; 1993.
  25. Williams JH, Davies A, Drake PR, editors. Condition-Based Maintenance and Machine Diagnostics. New York: Springer-Verlag; 1994.

NOTES
