4.1. Introduction

4.1.1. Overview

With a plethora of natural and man-made disasters that could affect telecommunication links and nodes, protection and restoration of telecommunication services are necessary not only to satisfy regulatory requirements and/or service level agreements (SLAs), but also to provide service differentiation. Before utilizing any particular form of protection or restoration, the time frame for the corrective action and the implications of failing to restore service within that time frame should be considered. In [Sosnosky94], Sosnosky reviewed the impact on public switched telephone network (PSTN) customers of progressively longer restoration times. This work is summarized in Figure 4-1.

Figure 4-1. Restoration Impact on PSTN Customers circa 1994 [Sosnosky94]


Although this work reflects services common at the time of its publication, the service mix in today's networks is dramatically different. With the increase in volume and importance of IP traffic, a new set of threshold targets has to be considered. These include application timeouts based on lack of connectivity at the TCP layer, and link failures and route removals detected by IP routing protocols. More importantly, the loss of revenue and other impacts on businesses could determine the tolerance for disruption in services.

4.1.2. Protection and Restoration Techniques and Trade-offs

The general idea behind protection and restoration techniques is to utilize redundant resources as backup when failures affect “primary” resources in use. The term protection is used to denote the paradigm whereby a dedicated backup resource is deployed to recover the services provided by a failed resource. In contrast, the term restoration is used to denote the paradigm whereby backup resources are not dedicated to specific primary resources, but a pool of resources is kept available for overall recovery purposes. Failure of a primary resource then results in the dynamic allocation of a backup resource from this pool. This may result in significantly longer recovery times as compared with the protection case, but with possible improvements in network utilization. As we shall see in this chapter, there is in fact a continuum of recovery techniques, each with specific advantages and disadvantages. Thus, the terms protection and restoration are used rather loosely in this chapter.
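To make the utilization argument concrete, the following minimal sketch compares the backup capacity needed by the two paradigms under strongly simplified assumptions: every connection needs one unit of capacity, only single-link failures occur, and any pooled unit can back up any connection. The function names and the sample data are hypothetical and serve only as an illustration.

```python
from collections import Counter


def dedicated_backup_units(primaries):
    """Protection: one dedicated backup unit per primary connection."""
    return len(primaries)


def shared_pool_units(primaries):
    """Restoration: pool size that covers the worst single-link failure.

    primaries maps a connection name to the list of links its primary
    path uses. The pool must cover the largest number of connections
    that any one link failure can take down at the same time.
    """
    hits = Counter(link for links in primaries.values() for link in set(links))
    return max(hits.values(), default=0)


# Hypothetical example: three connections, two of which share link "L2".
example = {"c1": ["L1", "L2"], "c2": ["L2", "L3"], "c3": ["L4"]}
print(dedicated_backup_units(example), shared_pool_units(example))
```

In this toy case the pool is smaller than the dedicated backup because no single link carries all three connections. The price, as noted above, is that a pooled unit must be allocated after the failure rather than before it.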

There are three fundamental qualities of any protection/restoration scheme: robustness, bandwidth efficiency, and recovery time. In addition, there are a number of practical aspects such as complexity, inter-vendor interoperability, and interoperability with control plane automation techniques, which have to be considered when evaluating protection and restoration schemes.

4.1.2.1. RELIABILITY/ROBUSTNESS

The main factors affecting the reliability and robustness of a restoration method are

  • Failure types that are restorable

  • Implementation complexity

  • Distributed vs. centralized restoration

The types of failures that are recoverable range from single line failures to arbitrary network failures. The trade-off in considering schemes that handle limited failure scenarios and those that deal with more general failure is between flexibility and the increase in recovery time or in the variance of the recovery time (more general restoration methods will have larger variances in the time to recover).

The implementation complexity of a restoration scheme affects not only its ability to correctly perform its task when invoked, but also its ability to do no harm when not invoked. The possibility that a scheme misses a fault or fails to restore when invoked is typically known as imperfect fault coverage or imperfect fail-over. If the probability of missing a fault or failing to restore is reasonably small (it does not have to be nearly as small as the failure probability for the connection or line), then the statistics for mean time to failure (for a connection) are still very good [Trivedi82]. If, on the other hand, the implementation complexity leads one astray from the protection maxim “do no harm,” then the probability of this occurring directly impacts the overall connection reliability.
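The effect of imperfect coverage can be illustrated with a small Markov model. The sketch below is one common textbook formulation and is not necessarily the model used in [Trivedi82]. It assumes two identical units, each failing at a constant rate lam, a failed unit being repaired at rate mu, and a first failure being covered (detected and switched over successfully) with probability coverage. An uncovered first failure, or a second failure during repair, ends the connection. All names and the sample rates are assumptions made for illustration.

```python
def mttf_two_unit(lam, mu, coverage):
    """Mean time to failure of a two-unit redundant system.

    States: both units up -> one unit up (rate 2*lam*coverage),
            both units up -> failed      (rate 2*lam*(1 - coverage)),
            one unit up   -> both up     (rate mu, repair),
            one unit up   -> failed      (rate lam).
    Solving the two first-passage equations gives the closed form below.
    """
    if lam <= 0 or mu < 0 or not 0.0 <= coverage <= 1.0:
        raise ValueError("need lam > 0, mu >= 0 and 0 <= coverage <= 1")
    numerator = lam + mu + 2.0 * lam * coverage
    denominator = 2.0 * lam * (lam + mu * (1.0 - coverage))
    return numerator / denominator


# Hypothetical rates: one failure per 10,000 hours, a 10-hour mean repair time.
lam, mu = 1e-4, 0.1
for coverage in (1.0, 0.999, 0.99, 0.9):
    ratio = mttf_two_unit(lam, mu, coverage) / (1.0 / lam)
    print(f"coverage={coverage}: MTTF is {ratio:.1f}x that of a single unit")
```

Running the loop shows how quickly the benefit of redundancy shrinks as coverage drops, while still remaining well above the unprotected case for the rates assumed here.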

4.1.2.2. BANDWIDTH EFFICIENCY

The main factors affecting the bandwidth efficiency of a restoration method are

  • Network structure

  • Bandwidth granularity

  • Distributed vs. centralized restoration

  • Impact on existing connections

The restoration techniques and concepts are applicable to different layers in the communications hierarchy, for example, the optical layer, the electrical TDM hierarchy, or in virtual circuit packet switched technologies such as MPLS, Frame Relay, and ATM. In general, the coarser the granularity of restoration relative to the average span capacity, the fewer the circuits to be restored (repaired). There is, however, a trade-off between restoration granularity and bandwidth efficiency since fewer routing options may be available with coarser granularity, that is, it is more difficult to pack bigger circuits tightly than smaller ones in a given network.

It is clear that as the control of restoration is centralized, the vulnerability of restoration control increases and its robustness decreases. On the other hand, it is also true that distributed restoration cannot globally optimize the use of bandwidth over an entire network the way a centralized mechanism can. For instance, the global network optimization problem can be cast as a multicommodity flow problem, which is amenable to solution via linear programming techniques (see Chapter 11). Such techniques can be applied to those connections affected by a failure in the case of centralized control. Between distributed and centralized control, there is a spectrum of schemes that involve varying degrees of coordination.
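As an illustration of the kind of formulation meant here, one standard path-based statement of the problem is given below. The notation is ours and is an assumption, not the notation of Chapter 11: K is the set of connections affected by the failure, d_k is the demand of connection k, P_k is a set of candidate restoration paths for k, E is the set of surviving links, c_e is the spare capacity on link e, and x_p is the amount of flow restored over path p.

```latex
\begin{align*}
\text{maximize}\quad   & \sum_{k \in K} \sum_{p \in P_k} x_p \\
\text{subject to}\quad & \sum_{k \in K} \; \sum_{p \in P_k :\, e \in p} x_p \;\le\; c_e
                         && \forall e \in E \\
                       & \sum_{p \in P_k} x_p \;\le\; d_k
                         && \forall k \in K \\
                       & x_p \;\ge\; 0
                         && \forall k \in K,\; p \in P_k
\end{align*}
```

The linear program allows the flow of a connection to be split across several paths. If each connection must follow a single path, integrality constraints have to be added, and the problem is no longer a pure linear program.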

Related to the centralized vs. distributed issue is whether one allows the disruption of some services, in order to restore some or all of the services affected by a failure. In other words, to restore the connections affected by the outage, other connections may need to be rerouted or preempted. In the case of centralized control, a global rerouting of connections in the network may be considered. In the case of distributed control, preemption mechanisms may be employed. Both approaches might result in a network running close to its capacity, but negatively impact a significant number of other connections not directly affected by the failure event. This also violates the “do no harm” principle.

4.1.2.3. RECOVERY TIME (SPEED AND DETERMINISM)

The main factors affecting recovery time, in terms of both speed and determinism, are

  • Network size and geographic extent

  • Local versus end-to-end restoration

  • Method used to find alternate routes

  • Bandwidth reservation versus coordination time

There is nothing that a restoration algorithm can do to increase the speed of light. Hence, the geographic extent of the network sets a fundamental limit on how fast a restoration mechanism can react. As an example, the 4-fiber Bidirectional Line-Switched Ring (4F-BLSR) specification [ITU-T95a] gives a recovery time performance that scales with the physical dimensions of the ring. The size of the network, in terms of the number of nodes and links, can also affect the performance of a restoration algorithm.
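The propagation limit is easy to quantify. The sketch below estimates the time a signal needs to cross a given length of fiber. The group index of 1.468 is an assumed typical value for silica fiber, and the parameter traversals is a hypothetical stand-in for the number of times a given protection protocol must send a message across the fiber before switching completes. Neither value is taken from [ITU-T95a].

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458  # in vacuum
FIBER_GROUP_INDEX = 1.468              # assumed typical value for silica fiber


def one_way_delay_ms(fiber_km, group_index=FIBER_GROUP_INDEX):
    """Propagation delay in milliseconds over fiber_km kilometers of fiber."""
    if fiber_km < 0:
        raise ValueError("fiber length must be non-negative")
    return fiber_km * group_index / SPEED_OF_LIGHT_KM_PER_S * 1000.0


def signaling_floor_ms(fiber_km, traversals=2):
    """Lower bound on recovery time when the protocol needs `traversals`
    message passes over the fiber (processing time is ignored)."""
    return traversals * one_way_delay_ms(fiber_km)


for ring_km in (100, 1000, 5000):
    print(ring_km, "km:", round(signaling_floor_ms(ring_km), 2), "ms")
```

With these assumptions the delay is a little under 5 microseconds per kilometer, so the propagation component grows linearly with the fiber length and is independent of how fast the nodes themselves can process messages.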

The manner in which alternate paths are determined reflects a trade-off between robustness and recovery speed. Preconfiguration of alternate paths results in less path computation and set-up latency after a failure event. Preconfigured paths, however, can only deal with a limited set of failure scenarios. Hence, preconfiguration is not as robust as computing alternate paths based on the current network information available after a failure event.
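The trade-off can be sketched in a few lines. In the hypothetical example below, a preconfigured backup path is used when it survives the failure. Otherwise a new path is computed on the current topology with a breadth-first search. The topology, the node names, and the function names are all assumptions made for illustration.

```python
from collections import deque


def shortest_path(adjacency, src, dst, failed_links=()):
    """Breadth-first search for a minimum-hop path avoiding failed links."""
    failed = {frozenset(link) for link in failed_links}
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:  # walk back to the source
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in adjacency[node]:
            if nxt not in parent and frozenset((node, nxt)) not in failed:
                parent[nxt] = node
                queue.append(nxt)
    return None  # destination unreachable


def path_survives(path, failed_links):
    """True if no hop of the path uses a failed link."""
    failed = {frozenset(link) for link in failed_links}
    return all(frozenset(hop) not in failed for hop in zip(path, path[1:]))


def recover(adjacency, src, dst, preconfigured_backup, failed_links):
    """Prefer the preconfigured backup; compute a new path only if needed."""
    if path_survives(preconfigured_backup, failed_links):
        return "preconfigured", preconfigured_backup
    return "computed", shortest_path(adjacency, src, dst, failed_links)


# Hypothetical topology: primary A-B-D, preconfigured backup A-C-D, spare route A-E-D.
TOPOLOGY = {
    "A": ["B", "C", "E"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["A", "D"],
}
BACKUP = ["A", "C", "D"]

print(recover(TOPOLOGY, "A", "D", BACKUP, [("A", "B")]))              # single failure
print(recover(TOPOLOGY, "A", "D", BACKUP, [("A", "B"), ("C", "D")]))  # double failure
```

The single failure is handled by a lookup, with no path computation on the critical path. The double failure defeats the preconfigured backup and is only recoverable because a path is computed on the post-failure topology, at the cost of the extra computation and set-up time.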

A trade-off exists between the amount of bandwidth reserved for restoration (and hence unavailable for other use) and the amount of time spent in adjudicating between connections that compete for that bandwidth. Bandwidth reservations are effective only in a limited number of failure scenarios. Additional coordination mechanisms can expand the failure scenarios that may be handled, but these increase the recovery time. Typical mechanisms used for coordination include priorities (which connection gets access to a bandwidth resource), preemption (can one connection disrupt another and under what circumstances), and crank-back (the ability to start the recovery procedure over and look for an alternate path). We will see examples of these in the protection mechanisms discussed in this chapter.
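The three coordination mechanisms can be shown together in a toy model. The sketch below is a simplified illustration under stated assumptions and does not reproduce any particular protocol: each link has a number of spare units, each connection needs one unit per link, lower priority numbers are more important, and each connection carries an ordered list of candidate backup paths. The class and function names are hypothetical.

```python
class Link:
    """Spare capacity on one link; each backup path uses one unit per link."""

    def __init__(self, spare_units):
        self.spare_units = spare_units
        self.holders = {}  # connection name -> priority (0 is most important)

    def is_full(self):
        return len(self.holders) >= self.spare_units


def try_reserve(links, path, name, priority, allow_preemption):
    """Reserve one unit on every link of path, or change nothing.

    Returns the list of preempted connections on success, None on refusal.
    """
    victims = set()
    for link_id in path:
        link = links[link_id]
        if not link.is_full():
            continue
        if not allow_preemption or not link.holders:
            return None
        victim = max(link.holders, key=link.holders.get)  # least important holder
        if link.holders[victim] <= priority:
            return None  # never preempt equal or higher priority
        victims.add(victim)
    for victim in victims:  # a preempted connection releases its whole path
        for link in links.values():
            link.holders.pop(victim, None)
    for link_id in path:
        links[link_id].holders[name] = priority
    return sorted(victims)


def restore(links, requests, allow_preemption=True):
    """requests: iterable of (name, priority, candidate_paths)."""
    outcome = {}
    for name, priority, candidates in sorted(requests, key=lambda r: r[1]):
        outcome[name] = None
        for path in candidates:  # crank-back: on refusal, try the next path
            preempted = try_reserve(links, path, name, priority, allow_preemption)
            if preempted is not None:
                outcome[name] = path
                outcome.update((victim, None) for victim in preempted)
                break
    return outcome
```

As a usage example, consider three links X, Y, and Z with one spare unit each, where Z already carries a low-priority connection. If connections with priorities 0, 1, and 2 request the paths [X], then [X] or [Y], then [Y] or [Z], the first obtains X, the second is refused on X and cranks back to Y, and the third is refused on Y and obtains Z only by preempting the low-priority connection. With preemption disabled, the third connection is not restored. Each additional attempt adds to the recovery time, which is the trade-off described above.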

4.1.2.4. INTEROPERABILITY

There are three interoperability issues:

  • Interoperability with other protection mechanisms

  • Interoperability with automated provisioning

  • Multivendor interoperability
