High-Availability Introduction

The availability of an SAP system is the fraction of time the SAP application can be used for its intended purpose: performing business transactions. Any time the system is not available is considered downtime. Many factors affect a system's availability, and preventative measures can be taken against most of them.

Several levels of components or functions make up a complete mySAP.com environment, as listed in Table 5-1. From a user's point of view, availability means that the application is usable at the expected service level. From an IT perspective, availability is the combined result of all the functions provided at the different levels of the entire SAP application environment.

Table 5-1. Hierarchy of the SAP Application Environment
Availability Level | Functions or Components
Applications | mySAP.com Middleware (Internet Web Servers, ITS, Workplace); mySAP.com Business Applications (SAP R/3, BW, APO, CRM, etc.)
Data | Business Data and the Database Management System
OS | Operating System and Clustering or Failover Software
Hardware and Infrastructure | Servers, Storage, and Local Networks, etc.
Environment | Power, Temperature, Fire Protection, Access Security Control, WANs, Geographic Region, etc.

Downtime

Downtime of a mission-critical SAP system can be costly and should be avoided as much as possible. How much is invested in the SAP system's availability should be proportional to the costs of downtime. These investments should consider the hardware and system availability as well as the application and the data availability. Many factors can cause a system to be unavailable. These are generally divided into two categories: planned and unplanned downtime.

Planned Downtime

Planned downtime is typically scheduled in advance during periods of lower business processing. Planned downtime can be caused by factors such as those listed in Table 5-2.

Table 5-2. Planned Downtime for SAP Systems
Availability Level | Typical Causes of Planned Downtime
Applications | Configuration changes, installation of hot packages or legal changes, SAP R/3-put or upgrade, kernel patches, profile management, etc.
Databases | Configuration changes, installation of patches, DB version upgrade, reorganization, offline backups, etc.
Operating System | OS and failover software configuration changes and upgrades, installation of patches, failover and recovery tests, etc.
Hardware and Infrastructure | Upgrades of CPU, memory, critical disks, I/O cards, etc. Firmware or BIOS upgrades, or other HW extensions
Environment | Maintenance, construction work, functional tests, etc.

In the Internet age, the planned downtime windows are shrinking, posing a challenge for maintaining and upgrading the SAP system and underlying infrastructure.

HA Strategy for Planned Downtime

To minimize the downtime caused by changes to the software (application, database, OS, and failover software), use a well-structured and documented configuration. Plan ahead for patch installations and kernel upgrades, and implement a Change Management Scheme. Always size and configure the system with enough performance headroom, and test upgrades before performing them on the production system. These are a few important elements of the IT processes needed to reach higher levels of availability.

For an SAP system, patches and kernel upgrades can be applied on a regular basis as preventative measures whenever they are approved by SAP, but they should be applied only when actually needed, so that unnecessary changes are kept to a minimum. Release upgrades (such as from R/3 4.0B to 4.6C) happen less often and depend on the business requirements.

Specifically for the database, consider implementing a Zero Downtime Backup solution to keep the online hours at their maximum. With a failover configuration, rolling upgrades can also be implemented for OS and failover software changes. Hewlett-Packard is developing solutions to help reduce the planned downtime even for the application SAP R/3, introduced later in this chapter (RACS and HP Somersault rolling kernel upgrades).

Hardware upgrades or expansions may be needed when the existing hardware can no longer deliver the required performance level. Sizing the system generously from the start keeps such upgrades to a minimum. For hardware and environmental changes, planning, scheduling, and communicating the activity are sometimes the only ways to minimize the impact of the downtime.

Unplanned Downtime

Unplanned downtime is much more critical because it may happen during business-critical hours, which has a more direct impact on the continuity of business operations. Unplanned downtime can be caused by many factors, some of which are listed in Table 5-3.

Table 5-3. Unplanned Downtime for SAP Systems
Availability Level | Typical Causes of Unplanned Downtime
Applications | Configuration limits exceeded, software and configuration problems, performance degradation.
Databases | Configuration limits exceeded, software and configuration problems, performance degradation, database corruption, data loss or inadvertent change of data due to user or administrator actions (drop table).
Operating System | OS limits exceeded, panics, driver or patch problems, and other system failures due to software configuration. User-caused problems, including planned configuration changes not properly applied.
Hardware and Infrastructure | Server hardware, including failures of CPU, memory, disks or disk systems, network cards, fans, power supplies, or other critical system board components. Local network infrastructure failures of cables, hubs, switches, etc. Configuration and user-caused problems.
Environment | Malfunction or outage of power or temperature controls, WAN service loss, etc. Disaster due to fire, floods, earthquakes, theft, etc.

The frequency of hardware failures is related to the mean-time-between-failure (MTBF) ratings of the individual components as well as to the complete system (sum of components). How frequently software failures occur is more difficult to estimate, but a typical approach is to wait for stable releases, versions, or patch levels (service packs) before using a particular software product for production. Environmental failures may happen more frequently in unstable or outlying regions but can happen to any organization.

HA Strategy for Unplanned Downtime

A strategy to minimize the unplanned downtime caused by software (application, database, OS, and failover software) problems includes implementing important IT processes and tasks. These include carefully analyzing the required parameters; proactively analyzing known problems, perhaps through SAP's EarlyWatch service; installing bug fixes and approved patches; and monitoring the condition of the SAP application and the database, including the log files and traces.

Having a well-trained IT support staff whose critical members communicate and practice the recovery procedures, together with good support partnerships, also contributes to faster recovery in cases of unplanned downtime. The support partnerships can be, for example, service level agreements for performance and time-to-repair agreements for hardware failures.

Making regular copies of the system, data, and log files to a tape, disk, recovery system, or equivalent, is a critical measure that should be taken to protect the database against unplanned downtime, regardless of which other technical solutions are employed. In addition, it is possible to use shadow or remote database mirror copies to protect against data inconsistency or data loss due to user errors and other disasters. These technical solutions are discussed in more detail in this chapter.

An effective way to prevent downtime due to hardware component failures is to make these components redundant and repair them online. For disks, it is common to use RAID 1 or 5 along with hot swap functionality. The disk and network I/O adapters can also be duplicated with redundant paths and also support hot swap functionality. Unfortunately, CPUs and memory may be deallocated but are not commonly redundant or repairable online and neither are certain elements of a server's system board, so some hardware single points of failure (SPOFs) remain.
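
The effect of redundancy, and of the single points of failure that remain, can be sketched with basic availability arithmetic. The example below assumes independent failures: redundant (parallel) components are unavailable only when all copies are down, while components in series must all be up at once. The availability figures are hypothetical and chosen only for illustration.

```python
def parallel_availability(availability, copies):
    """Availability of `copies` redundant components, assuming
    independent failures: the group is down only if every copy is down."""
    return 1.0 - (1.0 - availability) ** copies


def series_availability(availabilities):
    """Availability of components that must all be up at once."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


# Hypothetical figures: a single disk at 99%, mirrored as in RAID 1.
mirrored_disks = parallel_availability(0.99, copies=2)

# A non-redundant system board at 99.9% remains a single point of failure.
system_board = 0.999
whole_server = series_availability([mirrored_disks, system_board])

print(f"Mirrored disks: {mirrored_disks:.4%}")
print(f"Whole server:   {whole_server:.4%}")
```

In this sketch the mirrored disk pair reaches a higher availability than the system board, so the non-redundant board limits the availability of the server as a whole.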

Failover solutions can be used to protect against the hardware and software SPOFs found in an SAP system. A low cost method is to use a manual failover configuration. The more automated solution employs a cluster configuration. In addition, process mirroring of the SAP application SPOF can be used in combination with clustering for higher levels of SAP application availability.

Having a well-managed data center environment also helps keep the system up and running. This can include adequate and redundant power and temperature controls, redundant and stable network services, raised floors for cable routing, sprinklers and firewalls, high levels of security and access control, and many other environment-related items. Most hardware system vendors provide site preparation services to help in this area.

Costs of Downtime: How Much HA Is Needed?

A good question to ask is “What is the financial and business exposure if the SAP environment becomes unavailable—how many orders cannot be honored, how many customers will be dissatisfied, what is the workforce productivity loss, et cetera?” This important concept is known as the cost of downtime.

The costs of downtime of the SAP system within an organization vary depending on the applications or modules deployed. These costs can be determined by performing an impact analysis of the critical business processes.

  • Some organizations can survive if the SAP system is down a few days because other activities can be performed. In addition, if data is lost it can be recovered manually without an impact to the business. Conventional levels of availability apply here.

  • More commonly, however, the organization's operational efficiency is impacted, and many individuals are idle or unproductive. Manual methods must be used to process transactions, which may require additional resources. For these organizations, the business functions may be briefly interrupted, but the integrity of the data must be ensured, requiring highly reliable systems.

  • Other organizations can only tolerate minimal interruptions of their business processes during critical timeframes. If such failures occur often, the result may be customer or vendor dissatisfaction, loss of future sales or market share, and employee dissatisfaction. Typical high-availability solutions apply here.

  • In a few select businesses, no interruptions during critical timeframes can be tolerated, although restarting transactions and reduced levels of performance are acceptable. This requires achieving even higher levels of availability, considered fault-resilience.

  • In even fewer businesses, continuous operations are demanded at all times, with no loss in computing performance. For these organizations, even minutes of downtime can result in serious financial loss, and a day of downtime threatens the existence of the company, or at minimum, dramatically crashes its stock price. This represents fault-tolerant or nonstop levels of availability.

Once the cost of SAP system and business process downtime is known for an organization, then the appropriate investments in high-availability and disaster recovery solutions can be made.
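
As a rough illustration of how the cost of downtime can be weighed against an investment, the sketch below compares the expected yearly downtime cost with and without a high-availability solution. Every figure in it (the cost per hour, the downtime hours, and the yearly cost of the solution) is a hypothetical placeholder; real values would come from the impact analysis of the critical business processes.

```python
def yearly_downtime_cost(downtime_hours, cost_per_hour):
    """Expected yearly cost of downtime for a given number of hours."""
    return downtime_hours * cost_per_hour


# Hypothetical inputs (placeholders, not measured values).
COST_PER_HOUR = 10_000          # business cost of one hour of downtime
DOWNTIME_WITHOUT_HA = 40        # expected downtime hours per year today
DOWNTIME_WITH_HA = 4            # expected downtime hours per year with HA
YEARLY_HA_INVESTMENT = 150_000  # yearly cost of the HA solution

cost_without = yearly_downtime_cost(DOWNTIME_WITHOUT_HA, COST_PER_HOUR)
cost_with = yearly_downtime_cost(DOWNTIME_WITH_HA, COST_PER_HOUR)
avoided_cost = cost_without - cost_with
net_benefit = avoided_cost - YEARLY_HA_INVESTMENT

print(f"Downtime cost avoided per year: {avoided_cost:,}")
print(f"Net yearly benefit of the HA investment: {net_benefit:,}")
```

A positive net benefit suggests the investment is in proportion to the cost of downtime; a negative one suggests a less expensive level of availability may be sufficient.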

Specifying Uptime

Vendors of server and storage systems tend to measure the availability of their solutions in terms of a percentage of available uptime. The goal is to have the system be available 100% of the time needed.

The time a system needs to be available could be 24 hours per day, 7 days per week, 52 weeks per year, or less, depending on the business requirements. The typical SAP R/3 customer has a relatively normal availability requirement during three-quarters of the month, but very high availability requirements for the month-end processing. Therefore, this could be expressed as a 24×7-availability requirement during one week each month and as a 20×5 requirement for the rest of the time. With the introduction of e-commerce solutions with mySAP.com, the 24×7 requirement is becoming more common, even among firms who previously didn't have this requirement for their ERP application.
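
The mixed requirement described above can be turned into a number of required service hours per year. The sketch below assumes that 20×5 means 20 hours per day on 5 days per week, that month-end processing takes 12 full weeks out of a 52-week year, and that every remaining week follows the 20×5 pattern. These interpretations are assumptions made for the example.

```python
HOURS_PER_WEEK = 7 * 24
WEEKS_PER_YEAR = 52


def required_hours_per_year(month_end_weeks=12,
                            normal_hours_per_day=20,
                            normal_days_per_week=5):
    """Service hours required per year for a mixed 24x7 / 20x5 schedule."""
    month_end_hours = month_end_weeks * HOURS_PER_WEEK
    normal_weeks = WEEKS_PER_YEAR - month_end_weeks
    normal_hours = normal_weeks * normal_hours_per_day * normal_days_per_week
    return month_end_hours + normal_hours


required = required_hours_per_year()
continuous = WEEKS_PER_YEAR * HOURS_PER_WEEK
share = required / continuous

print(f"Required service hours per year: {required}")
print(f"Share of continuous 24x7 operation: {share:.1%}")
```

Under these assumptions the system must be available for well under the full 24×7 year, which leaves regular windows for planned downtime outside the month-end weeks.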

Availability is usually expressed as a percentage of continuous uptime over a year. The following examples are based on a year of 52 weeks of 24×7 operation (8,736 hours):

  • 99% uptime = 87.4 hours of downtime per year

  • 99.5% uptime = 43.7 hours of downtime per year

  • 99.9% uptime = 8.7 hours of downtime per year

  • 99.95% uptime = 4.4 hours of downtime per year

  • 99.99% uptime = 52 minutes of downtime per year

  • 99.999% uptime = 5 minutes of downtime per year

  • 99.9999% uptime = 30 seconds of downtime per year

  • 100% uptime = fault tolerance
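
The figures in this list follow from simple arithmetic, as the sketch below shows. It assumes a year of 52 weeks of 24×7 operation (8,736 hours), which reproduces the listed values to within rounding; a 365-day year (8,760 hours) gives slightly larger numbers. The function and constant names are illustrative.

```python
HOURS_PER_YEAR = 52 * 7 * 24  # 8,736 hours; assumes a 52-week year


def downtime_hours_per_year(uptime_percent):
    """Hours of downtime per year allowed by a given uptime percentage."""
    return (100.0 - uptime_percent) / 100.0 * HOURS_PER_YEAR


def describe(hours):
    """Format a duration in the largest unit that gives a value >= 1."""
    if hours >= 1:
        return f"{hours:.1f} hours"
    minutes = hours * 60
    if minutes >= 1:
        return f"{minutes:.0f} minutes"
    return f"{minutes * 60:.0f} seconds"


for pct in (99, 99.5, 99.9, 99.95, 99.99, 99.999, 99.9999):
    allowed = downtime_hours_per_year(pct)
    print(f"{pct}% uptime = {describe(allowed)} of downtime per year")
```

Each additional "nine" cuts the allowed downtime by a factor of ten, which is why the step from 99.9% to 99.99% is so much harder to achieve than the step from 99% to 99.9%.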

When the uptime of a system is quoted by a hardware vendor, usually only the base levels of a system are considered: the hardware plus the operating system and failover software (not including downtime due to configuration change errors). The uptime or availability of the database and application software components is much more dependent on IT processes and application usage, and thus needs to be treated differently.

TIP

System Availability Ratings

Depending on the HA solution offered, system uptime may only refer to the server, storage, network, OS, and clustering components, not to the entire application stack.


A single server with standard storage can typically achieve 99% uptime per year for the hardware. Most hardware solutions based on clustering are able to reach 99.5% to 99.9% uptime levels. To have the entire system, including the database and SAP application, reach these or higher levels requires more than typical hardware solutions can provide.
