High Availability Architecting Differences

One of the biggest misconceptions regarding failover software, also sometimes referred to as clustering or high availability, is that it guarantees availability. Failover software cannot eliminate all outages or problems, but it can provide additional availability when it comes to hardware failures.

There are two main aspects to architecting a high availability solution:

MTBF—Mean Time Between Failures— How much time elapses on average between each failure.

MTTR—Mean Time To Repair (or Recover)— How quickly the system comes back up and is available for users after a failure occurs.

Unfortunately, many people put too much emphasis on the MTBF figure at the expense of the MTTR figure. Take for example two scenarios where the MTBF is three months, meaning the system is likely to experience a failure four times each year. In the first scenario the MTTR is one hour, while in the second scenario the MTTR is eight hours. In scenario one, the total downtime for the year is four hours, while the second scenario results in 32 hours of downtime. Increasing the MTBF in the second scenario to 12 months still results in eight hours of downtime per year, which is more than in the first scenario. The key to availability is addressing both MTBF and MTTR, and in practice sometimes focusing more on MTTR than on MTBF.
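The arithmetic in this example can be checked with a short sketch. It assumes failures are spread evenly over a 12-month year and that every failure costs exactly one MTTR; the function names are illustrative and do not come from any product.

```python
def annual_downtime_hours(mtbf_months: float, mttr_hours: float) -> float:
    """Expected downtime per year: failures per year times hours per failure."""
    failures_per_year = 12 / mtbf_months
    return failures_per_year * mttr_hours


def availability_percent(downtime_hours: float, hours_per_year: float = 8760.0) -> float:
    """Share of the year the service is up, as a percentage."""
    return 100.0 * (1 - downtime_hours / hours_per_year)


# (MTBF in months, MTTR in hours) for the scenarios discussed in the text
scenarios = {
    "scenario one": (3, 1),
    "scenario two": (3, 8),
    "scenario two with a 12-month MTBF": (12, 8),
}

for name, (mtbf, mttr) in scenarios.items():
    downtime = annual_downtime_hours(mtbf, mttr)
    print(f"{name}: {downtime:.0f} h/year down, "
          f"{availability_percent(downtime):.3f}% available")
```

Quadrupling the MTBF in the second scenario still leaves it with twice the downtime of the first, which is why the recovery time deserves at least as much attention as the failure rate.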

What does failover software guard against? The failover software addresses server hardware-related issues such as failed network adapters, CPUs, and so on. Some failover software goes further and protects against hung or non-responsive software processes (for example, the LDAP daemon) by querying the process every now and then. Should it be non-responsive, the failover software then restarts the daemon. If this fails a set number of times, a complete failover is then triggered. Both VERITAS and Sun Cluster software perform the process restart attempt.
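The probe, restart, and failover sequence described above can be sketched as follows. This is only an illustration of the control flow, not how Sun Cluster or VERITAS implement it; `probe`, `restart`, `failover`, and `max_restarts` are hypothetical names supplied by the caller.

```python
from typing import Callable


def monitor_step(
    probe: Callable[[], bool],
    restart: Callable[[], None],
    failover: Callable[[], None],
    max_restarts: int = 3,
) -> str:
    """Run one monitoring pass.

    Returns "healthy" if the process answered the probe, "restarted" if a
    local restart fixed it, or "failed_over" if every restart attempt failed.
    """
    if probe():
        return "healthy"
    for _ in range(max_restarts):
        restart()
        if probe():
            return "restarted"
    # Local restarts did not help: give up on this node.
    failover()
    return "failed_over"


if __name__ == "__main__":
    # Stand-in callables that simulate a daemon that never recovers.
    log = []
    result = monitor_step(
        probe=lambda: False,
        restart=lambda: log.append("restart"),
        failover=lambda: log.append("failover"),
        max_restarts=2,
    )
    print(result, log)
```

In a real agent the probe would be something like a timed LDAP query, and the pass would run on a fixed interval.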

What does failover software not protect against? Failover software does not protect against everything, and specifically does not address:

  • Operator error (for example, rm -rf *)

  • Software problems (for example, bugs)

  • Storage failures (for example, drive failures or controller failures)

You can put a cluster in place and actually have more downtime due to operator error if you do not adequately train system administrators on the clustering software. You can have downtime due to defective software. You can have a cluster that will not fail over because the storage system, which is a shared resource, fails catastrophically or because it was not protected (for example, not on a UPS like the server was; yes, this has happened to customers). Failures can still occur even after you address operator errors through training and formal procedures, software problems through an internal testing process, and storage failures by protecting the storage (for example, with RAID 5e and a UPS).

What then? It is really a matter of planning to fail, that is, deciding how you will handle a failure, even in a clustered environment. As the old commercials for American Express Traveler's Checks said, “What will you do, what will you do?” By closely examining the restoration process for your messaging environment, you can develop specific steps that result in the fastest restoration of service. These steps may include everything from the basics of re-indexing the mail contents to a complete restore, including the Solaris OE.

Questions that must be asked in your environment are:

  • What is the procedure for doing this?

  • How can it be improved?

Each environment is slightly different, but there are some basic techniques, such as using JumpStart and Flash Archive for rapid restoration of the operating system and software, and taking backups that range from periodic snapshots of the database and directory to complete mail content backups. For more details, refer to Chapter 15, “Managing Messaging Services and Preventive Maintenance,” on page 209.

High Availability Architectures

The iPlanet Messaging Server Installation Guide for UNIX outlines several HA architectures and discusses a few of the pros and cons of each. The installation guide can be found at:

http://docs.sun.com/source/816-6014-10/.

This manual lists the following HA architectures:

  • Asymmetric (hot standby)

  • Symmetric (Active-Active)

  • N + 1 (N servers + one standby)

However, there are other HA architectures to be considered once you understand all of the parts of the architecture and how (or whether) HA affects them. As discussed earlier in the book, sometimes there are advantages to keeping things simple.

The Parts

  • Directory— can be protected by using failover or multiple master replication

  • Mail Store— stateful and requires failover

  • MEM— stateless, requires multiple physical servers, no failover agent available

  • MMP— stateless, requires multiple physical servers, no failover agent available

  • MTA— stateless, can be made available by either failover or multiple physical servers. The MTA is considered stateless because, due to the nature of store-and-forward, nothing stateful is held in memory; it is all written to disk. Appropriate storage protection (for example, RAID 5e or RAID 0+1) is therefore a good idea.

Directory

With the advent of Multiple Master replication technology in the Directory Server 5.1 and higher, customers have the option of making their directory server highly available. They can use the tried-and-true method of using Sun Cluster or VERITAS Cluster software. Or, they can use Multiple Master replication that is now built into the Sun ONE Directory Server.

Mailstore

The mailstore provides the basic storage of messages as well as the native HTTP, IMAP, and POP services. Due to the stateful storage of the header information in a database, it becomes necessary to use failover software such as Sun Cluster or VERITAS to obtain high availability.

MEM and MMP

The MEM and MMP function as proxy servers. So long as the configurations and files are the same on all systems, you can have multiple servers performing the same function. This does require a network-based load balancer such as Resonate, Cisco LocalDirector, F5, or Alteon to work.
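To illustrate why stateless proxies need no failover agent, here is a minimal sketch of what a load balancer does for them: rotate across identical servers and skip any that a health check has marked down. It is not the algorithm of any of the products named above, and the host names are hypothetical.

```python
class RoundRobinBalancer:
    """Rotate across identical backends, skipping ones marked unhealthy."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._next = 0  # index of the next backend to consider

    def mark_down(self, backend):
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self):
        """Return the next healthy backend, or raise if none is left."""
        for _ in range(len(self.backends)):
            backend = self.backends[self._next % len(self.backends)]
            self._next += 1
            if backend in self.healthy:
                return backend
        raise RuntimeError("no healthy backends")


if __name__ == "__main__":
    # Hypothetical MMP hosts with identical configuration
    lb = RoundRobinBalancer(["mmp1.example.com", "mmp2.example.com", "mmp3.example.com"])
    print([lb.pick() for _ in range(4)])
    lb.mark_down("mmp2.example.com")   # health check failed
    print([lb.pick() for _ in range(3)])
```

Because any server can handle any connection, losing one only reduces capacity; new connections simply land on the remaining servers.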

MTA

Since messaging by nature is a store-and-forward architecture, it allows for some flexibility regarding availability. That is, should an MTA be unavailable, other parties will hold their messages for some period of time, periodically retrying. Most environments can easily configure multiple MTAs and appropriate DNS entries to provide for redundancy at the MTA level. The failover time, however, is not instantaneous, so many organizations also provide a virtual IP and network-level failover as you would for MEM or MMP. So while the MTA does hold some information (the message queue on disk), it can generally start up and continue where it left off without many issues, forwarding the mail it has in the queue, albeit somewhat delayed.
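The DNS-based redundancy mentioned above relies on sending MTAs trying a domain's MX hosts in order of preference, lowest number first. The following sketch shows that behavior from the sender's side under simplifying assumptions: the MX records are passed in as plain tuples rather than looked up, hosts with equal preference are tried in name order rather than at random, and `try_deliver` is a hypothetical callback standing in for an SMTP delivery attempt.

```python
from typing import Callable, Iterable, Optional, Tuple


def deliver_via_mx(
    mx_records: Iterable[Tuple[int, str]],
    try_deliver: Callable[[str], bool],
) -> Optional[str]:
    """Try each MX host in ascending preference order.

    Returns the host that accepted the message, or None if every host
    was unreachable (the sender then keeps the message queued and
    retries later).
    """
    for _preference, host in sorted(mx_records):
        if try_deliver(host):
            return host
    return None


if __name__ == "__main__":
    # Hypothetical MX records for example.com: (preference, host)
    records = [(20, "mta2.example.com"), (10, "mta1.example.com")]
    down = {"mta1.example.com"}  # simulate the primary MTA being unavailable
    print(deliver_via_mx(records, lambda host: host not in down))
```

The fallback only happens after the attempt against the preferred host fails or times out, which is the delay the text refers to and the reason some sites add a virtual IP in front of the MTAs.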

Other Architectures

When you consider which items within the messaging architecture require failover or can take advantage of failover, plus any additional services (such as the Calendar Server) that are often integrated into such an environment, the possible number of architectures increases.

Alternative No. 1

In any environment, having to provide a server for a hot standby architecture is a wasteful use of computing resources. The alternative configuration (FIGURE 14-1) that some customers have implemented has the Sun ONE Messaging Server environment (mailstore and MTA) running on one node and the main LDAP server running on the other node. This configuration provides high availability for both messaging and directory, allows independent failover of each, and uses both nodes. Additional directory replicas, called consumers, can be configured to replicate from the main LDAP server.

Figure 14-1. High Availability Configuration Failover


Alternative No. 2

Customers often implement the Sun ONE Calendar Server in addition to the Sun ONE Messaging Server, since they can be purchased in a money-saving package called the Web Communication Bundle. This alternative configuration (FIGURE 14-2) provides for a highly available calendar system in addition to the messaging system. As in Alternative No. 1, the directory server is made highly available on the second server, but now the calendar server is added to the system. This configuration provides for high availability for messaging, calendar, and directory while allowing independent failover of each, plus it utilizes both nodes. As in Alternative No. 1, additional directory replicas can be configured off the main LDAP server.

Figure 14-2. Failover Using Both Nodes in a High Availability Configuration


Differences in Planning for High Availability Messaging

Planning for use of failover software involves obtaining and managing additional IP addresses and hostnames.

Differences in Installing HA Messaging

The obvious difference is that you have to install, configure, and manage the failover software such as Sun Cluster or VERITAS. Beyond that, the largest differences in installing messaging on a clustered system involve dealing with the logical host. Always use the fully qualified logical host name and IP address. Do not use the physical host. You must also use the logical storage devices. The same rule applies to references to other services, such as the LDAP server, when installing messaging: do not refer to the physical host of the LDAP server, but rather to the logical host. There are also some edits to configuration files that must be performed, as you will see.

Best Practices and Caveats

Caveat: While everything works well with Sun Cluster for failover on an ACTIVE-ACTIVE cluster configuration, there is one slight issue. Specifically, the Simple Network Management Protocol (SNMP) monitoring daemon is not able to understand that you now have two message servers running on the same physical host, and it stops running, so you no longer have a monitoring daemon.

Installation Procedure and Notes

For complete details, see Chapter 4, “High Availability” in the iPlanet Messaging Server Installation Guide for UNIX located at:

http://docs.sun.com/source/816-6014-10/ha.htm#11284.

This section came about from a situation where one of our customers was having significant difficulty getting the Sun ONE Messaging Server installed with Sun Cluster 3.0 software and the EMC storage units. The customer we were doing this lab work for spent about four weeks dealing with hardware installation issues that were related to their EMC storage system. So do not underestimate the time it takes to install and physically configure the hardware.

Note

When installing the Solaris OE you must select the Entire Distribution, and you really should select NONE for naming service and manually configure DNS. For example, edit the /etc/nsswitch.conf file and configure the /etc/resolv.conf file.
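As a rough illustration of the manual DNS configuration, the two files might end up containing entries like the following. The domain name and name server address are placeholders (assumptions, not values from this environment), and your files will likely contain other entries as well.

In /etc/nsswitch.conf, the hosts entry tells the system to consult local files first and then DNS:

```
hosts: files dns
```

In /etc/resolv.conf, the resolver is pointed at your domain and name server:

```
domain example.com
nameserver 192.0.2.53
```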

