Availability Variables

The primary variables that help you determine which high availability path you should go down are

  • Uptime requirement— The goal (from 0% to 100%) for how much of its planned hours of operation your application must actually be available. I would expect this to be above 95% for a typical highly available application. (See the downtime-budget sketch following this list for what such percentages mean in practice.)

  • Time to recover— A general indication (from long to short) of the amount of time needed to recover an application and put it back online. This could be expressed in minutes or hours, or simply as a long, medium, or short recovery time; the more precise, the better. As an example, a typical time to recover for an OLTP (Online Transaction Processing) application might be 5 minutes. This is fairly short but can be achieved with various techniques.

  • Tolerance of recovery time— A description of the impact (from high to low tolerance) of the extended recovery times needed to resynchronize data, restore transactions, and so on. This is closely tied to the time-to-recover variable but can vary widely depending on who the end-users of the system are. As an example, internal company users of a self-service HR application may have a high tolerance for downtime (because the application doesn't affect their primary work). However, those same end-users might have a very low tolerance for downtime of the conference room scheduling/meeting system.

  • Data resiliency— A description of how much data you are willing to lose and whether that data must be kept intact (that is, retain complete data integrity, even in a failure), often described on a scale from low to high data resiliency. Both hardware and software solutions address this variable: mirrored disks, RAID levels, database backup/recovery options, and so on.

  • Application resiliency— An application-oriented description of the behavior you are seeking (from low to high application resiliency). In other words, should your applications (programs) be able to be restarted or switched to other machines without the end-user having to reconnect? The term application clustering is often used to describe applications that have been written and designed to fail over to another machine without the end-user realizing they have been switched. The .NET default of using “optimistic concurrency” combined with SQL clustering often yields this type of end-user experience quite easily. (See the retry sketch following this list.)

  • Degree of distributed access/synchronization— For systems that are geographically distributed or partitioned (as many global applications are), it is critical to understand how distributed and tightly coupled they must be at all times (indicated from low to high degree of distributed access and synchronization required). A low setting for this variable indicates that the application and data are very loosely coupled, can stand on their own for periods of time, and can be resynchronized later.

  • Scheduled maintenance frequency— An indication of the anticipated (or current) rate of scheduled maintenance required for the box, OS, network, application software, and other components in the system stack (from often to never). This varies greatly; some applications undergo upgrades, point releases, or patches very frequently (SAP and Oracle applications come to mind).

  • Performance/scalability— A firm requirement for the overall system performance and scalability needed by this application (from low to high performance need). This variable will drive many of the high availability solutions you end up with, because high-performance systems often sacrifice many of the other variables mentioned here (such as data resiliency).

  • Cost of downtime ($ lost/hr)— An estimate or calculation of the dollar (or euro, yen, and so forth) cost of every minute of downtime (from low to high cost). You will usually find that this cost is not a single number such as an average cost per minute. In reality, short downtimes have lower costs, and the costs (losses) grow disproportionately, far faster than linearly, for longer downtimes. In addition, I usually like to measure the “good will” cost (or loss) for B2C (business-to-consumer) applications, so this variable might have a subvariable for you to specify. (See the cost-curve sketch following this list.)

  • Cost to build and maintain the high availability solution ($)— This last variable may not be known initially. However, as you near the design and implementation of a high availability system, the costs come barreling in rapidly and often trump certain decisions (such as throwing out that RAID 10 idea because of the excessive cost of a large number of mirrored disks). This variable will also be used in the cost justification of the high availability solution, so it must be specified or estimated as early as possible.
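
To make the uptime requirement concrete, it helps to translate a percentage goal into an actual downtime budget. The following Python sketch (illustrative only; the function name and output format are my own) does that arithmetic for a few common uptime goals:

HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years

def downtime_budget(uptime_pct):
    """Return the maximum allowed downtime for a given uptime goal."""
    down_fraction = 1.0 - uptime_pct / 100.0
    return {
        "hours_per_year": down_fraction * HOURS_PER_YEAR,
        "minutes_per_week": down_fraction * HOURS_PER_YEAR * 60 / 52,
    }

for goal in (95.0, 99.0, 99.9, 99.99):
    b = downtime_budget(goal)
    print(f"{goal:6.2f}% uptime allows {b['hours_per_year']:8.2f} "
          f"hours/year ({b['minutes_per_week']:7.1f} minutes/week)")

At 95%, roughly 438 hours of downtime per year are still within budget; at 99.99%, the budget shrinks to under an hour per year. This is why each additional “nine” drives the cost of the solution up so sharply.
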
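The application resiliency variable is easiest to picture in code. This minimal Python sketch (the connect() and run_query() calls are hypothetical placeholders, not any particular driver's API) shows an application-level retry wrapper that reconnects and retries after a connection failure, so that a cluster fail-over can be invisible to the end-user:

import time

MAX_RETRIES = 3
RETRY_DELAY_SECONDS = 5  # give the cluster time to fail over

def run_with_failover(connect, query):
    """Execute query, reconnecting and retrying if the connection drops."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            conn = connect()  # resolves to whichever node is currently live
            return conn.run_query(query)
        except ConnectionError:
            if attempt == MAX_RETRIES:
                raise  # surface the outage to the caller
            time.sleep(RETRY_DELAY_SECONDS)

An application written this way rates high on the resiliency scale; one that simply presents the error to the end-user and forces a manual reconnect rates low.
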
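Finally, the cost-of-downtime variable can be modeled rather than guessed at. The sketch below uses a simple superlinear cost curve; the base rate and growth exponent are invented for illustration and must be replaced with figures from your own business data:

BASE_COST_PER_MINUTE = 1_000.0  # direct loss rate in $/minute (assumed)
GROWTH_EXPONENT = 1.5           # >1 makes long outages disproportionately costly

def downtime_cost(minutes):
    """Estimated dollar loss for an outage of the given length."""
    return BASE_COST_PER_MINUTE * minutes ** GROWTH_EXPONENT

for outage in (5, 30, 60, 240):
    print(f"{outage:4d}-minute outage -> ${downtime_cost(outage):>12,.0f}")

A model like this makes the point of the bullet above: a 4-hour outage costs far more than 48 times what a 5-minute outage costs.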

As you can see in Figure 1.8, you can think of each of these variables as an oil gauge or temperature gauge. In your early depiction of your high availability requirements, simply place an arrow along each variable's gauge estimating the approximate “temperature,” or level, of that variable. In the figure, I have specified the variable settings for a system that falls squarely into the highly available category. This one is fairly tame, as highly available systems go, because it has a high tolerance for recovery time and its application resiliency requirement is moderately low. Later in this chapter, four business scenarios are fully described, including a full specification of these primary variables. In addition, starting in Chapter 3, “Choosing High Availability,” a return on investment (ROI) calculation will be included that provides the full cost justification of a particular HA solution.

Figure 1.8. Primary variables for understanding availability needs.

