What’s the Point?

Whenever you look at any technology, or any feature within a technology, it’s worth asking why you might want to use it—what’s the benefit? If you can’t describe what you hope to achieve, then you can’t decide how best to use the technology and how to measure your degree of success.

There are two arguments put forward for RAC—scalability and high availability, and the scalability argument comes in two flavors: throughput and response time. So, before you start chasing RAC, you need to decide which of these two arguments matters to you, and how best to implement RAC on your site to ensure that you turn the theory into reality.

High Availability

The theory of high availability says that if one of the nodes in your RAC system fails, it will be only moments before one of the other nodes recovers the redo from the failed node, the nodes rebalance, and everything keeps on running. The failover should be virtually transparent to the front end, although transparency doesn’t extend to transactions that were in flight at the moment of failure unless you recode the application. There are four obvious questions to ask in response to this argument, as follows:

  • How important is it to restart, say, within 10 seconds? Can we live with an alternative that might be just a little slower?
  • How often does a machine fail compared to a disk or a network component? (We’re assuming here that we’re not going to cover every possible single point of failure, or SPOF.) How much benefit do we really get from RAC?
  • Do we have the human resources to deal with the added complexity of the RAC software stack and the level of skill that is still needed to handle patches and upgrades?
  • How many nodes do we need to run before the workload for N nodes can run on N-1 nodes without having a performance impact that would make us uncomfortable?
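The last question is really just arithmetic. As a minimal illustration, here’s a Python sketch (the utilization figure is hypothetical, not a recommendation) of what happens to the load on each surviving node when one node of an N-node cluster fails and the work redistributes evenly:

```python
def load_after_failure(n_nodes, utilization):
    """Per-node utilization after one node fails, assuming the
    workload redistributes evenly across the surviving nodes."""
    if n_nodes < 2:
        raise ValueError("need at least two nodes to survive a failure")
    return utilization * n_nodes / (n_nodes - 1)

# Hypothetical figures: each node initially running at 60% of capacity.
for n in (2, 3, 4, 8):
    print(n, round(load_after_failure(n, 0.60), 3))
# 2 nodes -> 1.2 (overloaded), 3 -> 0.9, 4 -> 0.8, 8 -> 0.686
```

The two-node case is the painful one: losing a node doubles the load on the survivor, so a two-node cluster has to run each node below 50 percent capacity if it is to survive a failure comfortably.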

Alternative technologies include things like basic cluster failover as supplied by the operating system vendor, or a simple physical standby database if we want to stick to pure Oracle technology—options where another machine can be available to take on the role of the database server. The drawbacks are that failover will take longer, and the second machine is either sitting idle, or being used for something else but with spare capacity held in reserve, or is going to give you degraded response times while it’s acting as the failover machine. It’s quite easy to set up the database server to fail over fairly quickly—but don’t forget that every little piece of your application has to find the new machine as well.

Looking at SPOFs—how many options are you going to cover? Doubling up network cards is easy, the same goes for switches, but doubling up the storage and the actual network cabling is harder—and what’s the strategy for a power outage in the data centre? You might also look at the complexity of the whole RAC stack and ask how much time you’re going to lose on patching and upgrading. Oracle Corp. is constantly working towards “rolling upgrades”—but it still has a way to go.

Scalability

People tend to have one of two different concepts in mind when they think about scalability. These are as follows:

  • Get the same job done more quickly—improved response time
  • Get more copies of the same job done at the same time—improved throughput

It’s quite helpful to think of the first option in terms of individual big jobs, and the second in terms of a large number of small jobs. If you have a batch process or report that takes 40 minutes to complete, then sharing it across two instances may allow it to complete in 20 minutes, and sharing it across four instances may allow it to complete in 10 minutes. This image probably carries faint echoes of parallel execution—and the association is a fair one: if you hope to get a shorter completion time without rewriting the job, you’re probably going to have to take advantage of the extra nodes through parallel execution. If parallel execution does come to your aid, the threat is the extra cost of messaging. There are overheads in passing messages between layers of parallel execution slaves, and the overheads are even greater if the slaves are running in different instances. If you want to make big jobs faster (and can’t improve the code), maybe all you need is more memory, more CPUs, or faster CPUs before you move to greater complexity.
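The halving arithmetic above only works while the messaging cost stays small. As a rough illustration, here’s a toy Python model (all figures are hypothetical, not measurements of any real system) of a 40-minute job whose elapsed time is a fixed serial component, plus the parallelizable work divided across slaves, plus a messaging overhead that grows with the number of slaves:

```python
def elapsed_minutes(work, n_slaves, serial=0.0, msg_cost=0.0):
    """Toy model of parallel execution: perfectly divisible work,
    a fixed serial component, and a messaging overhead that grows
    with the number of slaves exchanging data."""
    return serial + work / n_slaves + msg_cost * n_slaves

# Hypothetical 40-minute job: 38 min parallelizable, 2 min serial,
# 0.25 min of messaging overhead per slave.
for n in (1, 2, 4, 8):
    print(n, round(elapsed_minutes(38, n, serial=2, msg_cost=0.25), 2))
# 1 -> 40.25, 2 -> 21.5, 4 -> 12.5, 8 -> 8.75 minutes
```

With these made-up numbers the job never gets close to the ideal 40/N minutes: doubling from four slaves to eight saves less than 4 minutes, and beyond some point adding slaves would start to make the job slower.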

If your aim is to allow more jobs to run in the same time—let’s say you’re growing a business and simply have more employees doing the same sort of thing on the system—then adding more instances allows more employees to work concurrently. If you can run 50 employees at a time on one instance, then maybe you can run 100 at a time on two, and 200 at a time on four. You simply add instances as you add employees.

In favor of this strategy is the fact that each instance has its own log writer (lgwr) and set of redo log files—and the rate at which an instance can handle redo generation is the ultimate bottleneck in an Oracle system. On the downside, if you have more processes (spread across more instances) doing the same sort of work, you are more likely to have hot spots in the data. In RAC, a hot spot means more traffic moving between instances—and that’s the specific performance problem you always have to be aware of.
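Both sides of this argument can be captured in a crude model. The sketch below (in Python, with entirely hypothetical numbers) scales throughput linearly with the number of instances, then knocks off a fraction of capacity for each extra instance to represent cross-instance block traffic on hot spots:

```python
def throughput(n_instances, per_instance=50, contention=0.0):
    """Toy throughput model: each extra instance adds capacity but
    also adds cross-instance traffic on shared (hot) data blocks.
    'contention' is the hypothetical fraction of total capacity
    lost per extra instance due to interconnect traffic."""
    penalty = 1.0 - contention * (n_instances - 1)
    return max(0.0, n_instances * per_instance * penalty)

# Hypothetical: 50 concurrent users per instance, 5% lost per extra instance.
for n in (1, 2, 4):
    print(n, round(throughput(n, contention=0.05), 1))
```

The point of the model is not the numbers but the shape: if the contention term is small, scaling is nearly linear; if your application has fierce hot spots, each extra instance buys you less, and eventually an extra instance can cost you throughput.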

Again you might ask why you don’t simply increase the size of a single machine as your number of users grows. In this case, there’s an obvious response: it’s not that easy to “grow” a machine, especially when compared to buying another “commodity” machine and hanging it off the network. Indeed, one of the marketing points for RAC was that it’s easier to plan for growth—you don’t have to buy a big machine on day one and have it running at very low capacity (but high cost) for a couple of years. You can start cheap and grow the cluster with the user base.

■ Note One of the unfortunate side effects of the “start big” strategy that I’ve seen a couple of times is that a big machine with a small user base can have so much spare capacity that it hides the worst performance issues for a long time—until the user base grows large enough to make fixing the performance issues an emergency.

The Grid

There is another facet to using a cluster of commodity machines rather than relying on individual large machines, and that’s where the grid concept came from—and it’s where some of the enhancements in Oracle 11.2 are finally taking it. If you have a large number of small machines in a cluster (not yet running any Oracle instances), you can set up a number of different databases on your disk farm, choose how many instances should run on each machine for each application, and redistribute the workload dynamically. Figure 8-2 shows such a cluster with two possible configurations: one for standard weekday processing, where every machine runs just a single database instance (ignoring the ASM instances), and one for the weekend, when two of the GL machines also start up a WEB instance because the load on the general ledger system is likely to decrease while the load on the web application is likely to increase.


Figure 8-2. Matching number of instances to workload

If you look closely at this picture you’ll notice a little hint that I like to follow the “Sane SAN” strategy (proposed by James Morle) for dealing with multiple databases on a single SAN—try to break the SAN down into areas dedicated to individual databases.

In this scenario you could also imagine starting up an extra GL instance on the HR machine at the month end; and for reasons of high availability you might even have two HR instances running (with the second instance running on one of the GL machines) in Oracle’s special two-instance active/passive mode (see note) in case the single HR instance failed for some reason.
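The weekday/weekend reshuffle of Figure 8-2 is really just a mapping from machines to instances. Here’s a small Python sketch (the machine and instance names are invented for illustration, echoing the GL/WEB/HR databases in the figure) that records two such placements and counts the running instances per database:

```python
from collections import Counter

# Hypothetical placements echoing Figure 8-2: machine -> instances.
weekday = {
    "gl1": ["GL1"], "gl2": ["GL2"],
    "web1": ["WEB1"], "web2": ["WEB2"],
    "hr1": ["HR1"],
}
weekend = {
    "gl1": ["GL1", "WEB3"], "gl2": ["GL2", "WEB4"],  # GL machines pick up WEB work
    "web1": ["WEB1"], "web2": ["WEB2"],
    "hr1": ["HR1"],
}

def instances_per_db(placement):
    """Count running instances per database (prefix before the digits)."""
    counts = Counter()
    for insts in placement.values():
        for inst in insts:
            counts[inst.rstrip("0123456789")] += 1
    return dict(counts)

print(instances_per_db(weekday))   # {'GL': 2, 'WEB': 2, 'HR': 1}
print(instances_per_db(weekend))   # {'GL': 2, 'WEB': 4, 'HR': 1}
```

Adding the month-end GL instance, or the passive HR instance, is just another entry in the mapping—which is the attraction of the grid idea: the placement is policy, not hardware.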

■ Note I have mentioned the communications between instances as an important aspect of how RAC works. There is a special case implementation for two-node RAC that eliminates this cost by using one instance to do the work and the other instance to do nothing (in principle) but wait for the first instance to fail. This is the option known as the active/passive setup, and is an interesting strategy if you have a need to run two or more databases on a small cluster with rapid failover.

I have seen a couple of sites that have taken advantage of this type of multi-instance cluster, but I can’t help thinking that you would be hard pushed to find many sites with a realistic requirement to move entire instances in and out of play as the workload changes. Nevertheless, Oracle is trying to make exactly that type of movement easier, and automatic, in its latest releases. (I have to say that one of my client sites using this type of setup was very happy with what they had achieved—but they had a cluster of eight machines, with a standby cluster of eight machines, and it had taken a visiting consultant a few weeks to get the whole thing working.)
