Fault-Tolerant WAN Concepts

Because SAP services are mission-critical for the business, high-availability considerations and failure-tolerant network architectures that provide maximum uptime become even more crucial when wide area network solutions for SAP are implemented. For the users in a remote office, plant, subsidiary, or affiliate, the wide area network connection is truly a lifeline. When this line is cut, no work can be done at all. When SAP services are outsourced to an external data center, this is true for the whole enterprise. However, this lifeline is seldom under your total control. In most cases, long-distance connections that cross the borders of the company's property have to be leased from telecommunication providers. These network providers, called carriers, achieve high availability via their highly meshed network core. After decades of upgrades and fine-tuning, the carriers' public switched telephone network (PSTN) has achieved an extremely high degree of network availability. In spite of that, WAN links are typically the least reliable components in a network infrastructure, usually because of problems in the local loop.

The local loop is the connection between the customer's property and the carrier's point of presence (POP), where the access switches to the core network are installed. For many reasons, this local loop is called the dirty last mile. Often it is the only connection to the carrier and therefore a single point of failure (SPOF) (see Figure 11-12). In addition to being relatively unreliable, these links are an order of magnitude slower than the LANs they connect. The combination of potentially suspect reliability, lack of speed, and high importance makes the local loop a good candidate for redundancy.

Figure 11-12. Local Loop as a SPOF in a Typical WAN


Numerous methods exist to enhance WAN availability, including load balancing and link redundancy, where separate physical connections are used for the first hop from the enterprise property to the carrier. Managing data traffic flow between multiple paths is what routers were invented for. From a router's standpoint, any media failure can be bypassed as long as alternative paths are available. Using various detection techniques, routers sense media-level problems. If routing updates or routing keep-alive messages have not been received from devices that would normally be reached through a particular router port, the router will soon declare that route down and look for alternative routes. The routing software then recalculates its routes and begins sending all traffic through another link. This allows applications to proceed in the event of a WAN link failure, improving application availability. If the routers support load balancing, the available link bandwidth can also be increased dynamically, lowering response times for users.
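The failover behavior described above can be sketched in a few lines of Python. This is only an illustration of the principle, not the way any particular router implements it; the Route type, the link names, and the cost values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Route:
    name: str        # hypothetical link label
    cost: int        # lower cost is preferred
    up: bool = True  # set to False once keep-alives stop arriving


def best_route(routes: list[Route]) -> Optional[Route]:
    """Return the lowest-cost route that is still up, or None if all are down."""
    candidates = [r for r in routes if r.up]
    return min(candidates, key=lambda r: r.cost) if candidates else None


routes = [Route("leased-line", cost=10), Route("frame-relay", cost=20)]
primary = best_route(routes)    # leased-line while both links are up
routes[0].up = False            # keep-alives missed: route declared down
fallback = best_route(routes)   # traffic now moves to frame-relay
```

As long as at least one route remains up, the selection always returns a usable path, which is why the applications can continue to run.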

The time a network needs to detect a link failure and reroute traffic is called convergence time. The convergence time is determined by the size and topology of the network as well as by the routing protocol and the parameters implemented. Routing protocols have default update intervals and hold-down values. These parameters determine the time that elapses before a route is considered down and the traffic is diverted via an alternative path. Convergence time has a significant effect on session timeouts for various applications. Although SAP applications will tolerate timeouts, users will still experience noticeable delays during rerouting conditions. Most routing protocols allow these parameters to be adjusted from their default values. For example, reducing the hello interval from 30 to 5 seconds and the dead-interval timer from 90 to 15 seconds cuts the time needed to bypass a route that has become unreachable from 90 seconds to 15 seconds.
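The arithmetic behind these timer values can be written down as a small Python sketch. It assumes that a neighbor is declared down after three missed hellos, which is inferred from the 30/90 and 5/15 ratios quoted above. It covers only the detection phase; the time needed to recalculate routes comes on top.

```python
def detection_time(hello_interval_s: float, missed_hellos: int = 3) -> float:
    """Worst-case seconds before a silent neighbor is declared down.

    missed_hellos=3 is an assumption taken from the ratio of the
    dead interval to the hello interval in the example values.
    """
    return hello_interval_s * missed_hellos


default_s = detection_time(30)  # 90 seconds with a 30-second hello interval
tuned_s = detection_time(5)     # 15 seconds with a 5-second hello interval
```

Shorter intervals detect failures faster, but they also mean more hello traffic on what is often the slowest link in the network.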

Hardware Redundancy

Like all complex devices, routers are exposed to hardware problems. Therefore, many networks are designed with multiple routers to provide redundancy. If a router fails and each network connected to it has an alternative path out of the local area, complete connectivity will still be possible. The effectiveness of this design is limited by the speed at which the hosts on those LANs detect a topology update and change routers. In most cases, however, IP hosts are configured with a default gateway or are configured to use Proxy ARP in order to find a router on their LAN. Convincing an IP host to change its router usually requires manual intervention to clear the ARP cache or to change the default gateway. The Hot Standby Router Protocol (HSRP) is a solution that makes network topology changes transparent to the hosts. HSRP typically allows hosts to reroute in approximately 10 seconds, and the hosts do not need to perform any active recovery.
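The idea behind a hot standby group can be illustrated with the following Python sketch. It is a deliberately simplified model, not the HSRP message exchange itself, and it ignores details such as preemption. The router names, priorities, and the virtual gateway address are hypothetical, and the 10-second hold time is taken from the figure quoted above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Router:
    name: str
    priority: int        # higher priority wins the active role
    last_hello_s: float  # time at which the last hello from this router was heard


def active_router(group: list[Router], now_s: float,
                  hold_time_s: float = 10.0) -> Optional[Router]:
    """Pick the highest-priority router whose hellos are still fresh."""
    alive = [r for r in group if now_s - r.last_hello_s <= hold_time_s]
    return max(alive, key=lambda r: r.priority, default=None)


VIRTUAL_GATEWAY = "192.0.2.1"  # hosts keep this default gateway throughout

group = [Router("rtr-a", 110, last_hello_s=0.0),
         Router("rtr-b", 100, last_hello_s=0.0)]
before = active_router(group, now_s=5.0)   # rtr-a answers for the virtual gateway
group[1].last_hello_s = 12.0               # rtr-b keeps sending hellos, rtr-a is silent
after = active_router(group, now_s=15.0)   # rtr-b has taken over
```

The hosts never change their configuration; only the router that answers for the virtual gateway address changes.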

Power loss is a common fault in large-scale networks. From the standpoint of internetworking devices, dual power supplies can prevent otherwise debilitating failures. However, power instabilities due to lightning striking a public utility line may cause the routers to reboot, causing network outages. To secure the availability of the WAN infrastructure, power stabilization by uninterruptible power supplies (UPS) is highly recommended. Power outages are generally more common than failures in a router's power system.

Connection Redundancy

Link redundancy can be implemented either permanently or on demand. In cases where different carriers offer connectivity within an area, bandwidth demand can be shared between multiple permanent leased lines as well as Frame Relay links using load balancing technologies. This approach provides full end-to-end redundancy only when the infrastructures of both providers use separate paths and devices all the way from end to end.
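The benefit of a second, truly separate link can be estimated with a short Python calculation. The sketch assumes that the links fail independently of each other, which holds only if they share no conduit, POP, or device. The availability figure of 99.5 percent is purely illustrative and not a measured value.

```python
def parallel_availability(*link_availabilities: float) -> float:
    """Availability of redundant links, assuming independent failures."""
    all_down = 1.0
    for availability in link_availabilities:
        all_down *= 1.0 - availability   # probability that this link is down too
    return 1.0 - all_down


HOURS_PER_YEAR = 8760

single = 0.995                                       # illustrative value only
dual = parallel_availability(single, single)         # 0.999975
downtime_single_h = (1 - single) * HOURS_PER_YEAR    # about 43.8 hours per year
downtime_dual_h = (1 - dual) * HOURS_PER_YEAR        # about 0.2 hours per year
```

If both links run through the same physical path, the independence assumption fails and the real availability is much closer to that of a single link.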

The primary disadvantage of duplicating WAN links is cost. In addition, independent carriers are seldom available outside of the United States, at least for the local loop. An essential high-availability feature addressing costs is dial-up backup. A router detecting a lost carrier signal from the primary line device can initiate an immediate dial-up of a backup connection. Typically, a primary link might be a Frame Relay connection; the dial backup might be ISDN. Dialed backup links provide cost-effective protection against WAN downtime by allowing a permanent line connection to be backed up via a circuit-switched connection. To configure dial backup, connections to the public telephone network on both sides as well as additional router interfaces (ISDN or asynchronous interface) are necessary (see Figure 11-13). Once configured, the dial backup software keeps the secondary line inactive until the primary line goes down. If the leased line fails, the router automatically activates the backup link. The switchover is transparent to the application, and users will not even notice it. In addition, the backup link can be used for bandwidth on demand when the traffic load on the primary line exceeds a defined limit. Be sure to configure automatic restoration so that traffic returns to the primary link once it has been repaired.
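The decision logic of dial backup, including bandwidth on demand and automatic restoration, can be summarized in the following Python sketch. It is a simplified model rather than actual router software; the function name and the 80 percent overflow threshold are assumptions made for illustration.

```python
def backup_should_be_up(primary_up: bool, load_ratio: float,
                        overflow_threshold: float = 0.8) -> bool:
    """Decide whether the dial backup link should be active.

    primary_up         -- carrier signal present on the primary line
    load_ratio         -- utilization of the primary line, 0.0 to 1.0
    overflow_threshold -- hypothetical limit for bandwidth on demand
    """
    if not primary_up:
        return True                          # failover: primary line is down
    return load_ratio > overflow_threshold   # dial only to absorb overflow


failover = backup_should_be_up(primary_up=False, load_ratio=0.1)   # True
overflow = backup_should_be_up(primary_up=True, load_ratio=0.9)    # True
restored = backup_should_be_up(primary_up=True, load_ratio=0.3)    # False
```

The last case corresponds to automatic restoration: as soon as the primary line is back and not overloaded, the dialed connection is dropped and no further call charges accrue.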

Figure 11-13. Dial Backup Using the Public Telephone Network


“The El Niño storms damaged a large, heavily traveled bridge over a river in central California. Fiber from at least two carriers was contained in a single conduit that runs through the bridge. When the river crested, both carriers' connectivity was lost. Some customers, believing they had purchased multiple services that ran on separate fiber paths, were surprised to discover that these services were actually run across the same fiber.” (Network Magazine, August 1998)


Failures of the local loop due to excavation, natural disaster, and so forth often affect the backup lines as well. Outside the United States it is common that only one carrier is available for the local loop, because the former government telecommunication agency still owns the local cabling. To make the situation even worse, the local access switch at the telecommunication provider's site is a SPOF as well. Being elaborate computer systems, these switch nodes must be updated and rebooted from time to time, sometimes without notice. In the worst case, the node is destroyed by fire, thunderstorm, or another disaster. Therefore, high-availability concepts for wide area networks should take the challenge of resolving this SPOF into account.

One way to resolve the local loop as a single point of failure is to connect your property to a second POP of the carrier (see Figure 11-14). This way, redundancy is achieved for the local loop as well as for the POP. However, this can be a very expensive solution when the second POP is in another town and new cables have to be laid to reach it. As presented in Chapter 10, microwave links can be deployed across distances of up to 20 km when there is a free line of sight.

Figure 11-14. Dial Backup Using Redundant Access Paths to the Public Telephone Network


In some countries, mobile telephone providers offer such microwave links as an alternative to the cabling of the local carrier. They simply use their antenna infrastructure already linked by fiber or microwave links to step into the data transmission market.

TIP

Check the Status of Dial Backup Links on a Regular Basis

When the dial backup link has the same bandwidth as the primary link, the switchover to backup is so transparent that not even the operators are aware of it. In some countries, the local carriers do not monitor the status of leased lines and fix failures only on request. Consequently, the dial backup lines can be up for weeks until an extraordinarily high telephone bill causes action. Using a network management system to monitor WAN links is therefore highly recommended.
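A check of this kind can be sketched in Python as follows. The sketch assumes that the network management system can report the time at which the backup link became active; the function name, the four-hour limit, and the wording of the warning are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def backup_alert(backup_up_since: Optional[datetime], now: datetime,
                 max_duration: timedelta = timedelta(hours=4)) -> Optional[str]:
    """Return a warning if the dial backup has been active too long, else None."""
    if backup_up_since is None:   # backup link is idle
        return None
    active_for = now - backup_up_since
    if active_for > max_duration:
        return f"Dial backup active for {active_for}; check the primary link"
    return None


now = datetime(2024, 1, 2, 12, 0)
idle = backup_alert(None, now)                           # None: nothing to report
short = backup_alert(now - timedelta(hours=1), now)      # None: within the limit
overdue = backup_alert(now - timedelta(days=3), now)     # warning text
```

Run at regular intervals, such a check turns a silent switchover into a trouble ticket long before the telephone bill arrives.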


To eliminate any single point of failure with terrestrial connections, VSAT links can be deployed for backup. This way the backup path is of a different nature than the primary link. A satellite dish with a diameter of up to 1.2 meters can be installed easily on a rooftop with a direct connection to the backup router (see Figure 11-15). Therefore, no extra cabling outside the property is necessary. Satellite links are nearly independent of terrestrial influences, with the exception of hurricanes or freezing rain damaging the satellite dish. Twice a year there is interference when the sun passes directly behind the satellite as seen from the dish. These dates are well tabulated, however. Bi-directional VSAT connections are available worldwide for a base rate of some hundred dollars a month per satellite dish. As mentioned before, the drawback of satellite links is their long latency. Therefore, a switchover to the backup line is not fully transparent to the users. Still, when the terrestrial line is cut, bad response times are better than no response at all.
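The order of magnitude of this latency follows directly from the distance involved. The Python sketch below assumes that the VSAT service uses a geostationary satellite and that the satellite is directly overhead, so the result is a lower bound; slant range and processing in the equipment add to it.

```python
GEO_ALTITUDE_KM = 35_786           # altitude of the geostationary orbit
SPEED_OF_LIGHT_KM_S = 299_792.458  # propagation speed in free space


def min_satellite_delay_ms(hops: int = 1) -> float:
    """Lower bound on propagation delay; one hop is ground -> satellite -> ground."""
    one_hop_s = 2 * GEO_ALTITUDE_KM / SPEED_OF_LIGHT_KM_S
    return hops * one_hop_s * 1000


one_way_ms = min_satellite_delay_ms(1)     # roughly 239 ms
round_trip_ms = min_satellite_delay_ms(2)  # roughly 477 ms
```

Every request and its response therefore spend close to half a second in transit, regardless of the bandwidth of the link.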

Figure 11-15. Satellite Backup

