Chapter 2

Tuning the Environment

IN THIS CHAPTER

Bullet Seeing how to survive system failures

Bullet Tuning your recovery and operating systems

Bullet Getting the most from your hardware

Bullet Introducing new hardware

Bullet Understanding multiprocessor environments

Computer systems are subject to all kinds of failures: power failures, hardware failures, operating system failures, application program failures, and even database management system (DBMS) failures. These things happen; you can count on it. The important question you should ask yourself is “What happens when a failure occurs somewhere in my environment?” In critical applications, you don’t want to lose any data, and you don’t want to lose any more time than absolutely necessary.

Failures aside, you want your system to run as efficiently as possible. Inefficiencies can hamper productivity in many ways, sometimes slowing it to a crawl. In dealing with both failures and inefficiencies, the key to optimizing performance is tuning each part of your system. Your recovery system is designed to handle failures gracefully, but it must be tuned to operate at top efficiency. Your operating system and storage subsystem need to be tuned for the work that you’re doing. Settings that would be appropriate for one job mix might be terrible for another. If you have multiple computers connected on a network, the way that they’re connected can have a tremendous impact on performance. All these considerations are important, and they tend to interact, which makes optimizing them even more challenging.

In this chapter, I talk about some of the things you can do to ensure that you’re getting the best performance out of your system that you can reasonably expect, given the budget and staff time that you can devote to maintaining an acceptable level of performance.

Remember The performance of a database system can vary dramatically — anywhere from lightning-fast to molasses-slow. Where your system sits on that spectrum is largely up to the decisions you make as a developer, the decisions the database administrator (DBA) makes, and the decisions the system administrator makes about how the operating system will run and what hardware it will run on. All these levels affect the ultimate performance of a database application.

Surviving Failures with Minimum Data Loss

Sooner or later, failures occur. That being the case, planning for those failures can make a big difference. Your plans should be aimed at minimizing the disruption that would be caused by downtime and loss of data. At any given time, the current copy of your database resides in two places: in stable storage in the form of hard disk drives and in a volatile solid-state memory called the page buffer. The page buffer contains the working set — the data currently being actively read and written by transactions that have been initiated by the executing program. The pages in the working set are destined to be written to disk, but that hasn’t happened yet. If a failure occurs before a transaction commits or aborts, the recovery system must be able to return the database to a consistent state: either the state the database would have been in if the transaction had committed successfully or (in the case of an abort) the state the database was in before the transaction started.

What happens to transactions when no failure occurs?

In the absence of a failure, a transaction can end up in one of two ways: It will either commit or abort. When a transaction commits, it has completed successfully. This does not mean that the changes made by the transaction have been written to stable storage; it means that the changes are recoverable in the event of a failure, thanks to the logging feature (which I discuss later in this section). After the recovery operation is performed, the database is once again in a consistent state.

The other thing that a transaction can do is abort. A resource conflict, for example, may prevent a transaction from committing. This conflict isn’t a failure, because nothing is broken; it’s just a case of trying to do something at the wrong time. When a transaction aborts, a rollback is initiated. The database transaction log records every operation that takes place in the course of a transaction. The rollback operation consults the log and performs in reverse all the operations listed in it, restoring the database to the state it was in before the transaction started. This state is consistent. The transaction can be attempted again when conditions are more favorable. In many cases, conditions are favorable just as soon as the rollback completes. Whatever concurrent transaction caused the conflict has either committed or aborted, and now the contested resource is available.
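
To make the rollback idea concrete, here's a toy sketch in Python of how a recovery component might walk a transaction's log records in reverse and restore the before-images. The record format and the little in-memory database are invented for illustration; a real DBMS's log is far more elaborate.

# Toy illustration: roll back a transaction by undoing its log records in reverse.
# The log format and the in-memory "database" are hypothetical.

database = {"balance_101": 500, "balance_202": 300}

# Each log entry records the transaction, the item changed, and its before/after images.
transaction_log = [
    {"txn": 1, "item": "balance_101", "before": 500, "after": 450},
    {"txn": 1, "item": "balance_202", "before": 300, "after": 350},
]

def rollback(txn_id, log, db):
    """Restore the before-image of every operation of txn_id, last operation first."""
    for entry in reversed(log):
        if entry["txn"] == txn_id:
            db[entry["item"]] = entry["before"]

rollback(1, transaction_log, database)
print(database)    # back to {'balance_101': 500, 'balance_202': 300}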

What happens when a failure occurs and a transaction is still active?

If a failure occurs before a committed or aborted transaction has been written to disk, the database may be left in an inconsistent state. It’s up to the recovery system to restore consistency. To do so, it makes use of the transaction log, which contains a record of what was done up to the point of the failure. This information can be used to restore consistency, either by reconstructing the committed transaction and writing it to disk or by erasing all record of the aborted transaction so that it has no effect on the database that exists on disk.

Tuning the Recovery System

Because it’s a sure thing that failures will occur, it makes sense to plan ahead for them. A failure that you haven’t planned for could result in a data loss severe enough to jeopardize the continued survival of your organization. Effective preparation requires that you understand what kinds of failures could happen, what the impact of such failures would be, and what you can do to mitigate that impact.

The most damaging failure is one that causes irretrievable loss of data. Data can be lost irretrievably in a variety of ways, and which failures you need to worry about depends on how the data is stored.

Volatile and nonvolatile memory

Two kinds of storage devices in common use today are subject to different failure modes:

  • Volatile memory: Volatile memory forgets everything if its power is interrupted, even for a fraction of a second. Thus, any kind of power failure can cause all data held in volatile memory to be lost.
  • Nonvolatile memory: Nonvolatile memory, as I’m sure you’ve guessed by now, doesn’t lose its contents in the event of a power outage.

    Another name for nonvolatile memory is persistent storage. The data persists even in the event of a three-day blackout after a major storm. When power is restored, the data is still there, just waiting for you.

You may wonder why computers have volatile memory at all. Because volatile memory forgets everything after a power interruption or processor reboot, why not use nonvolatile persistent storage for all the memory? The answer has to do with performance. Volatile main memory consists of semiconductor RAM chips, and nonvolatile persistent storage is implemented with hard disk drives. Semiconductor RAM can be as much as a million times faster than hard disks in terms of exchanging data with the processor. With a million-to-one speed advantage, volatile semiconductor RAM earns a place in every computer in which speed is a consideration.

Now you may wonder why, if volatile RAM is a million times faster than nonvolatile hard disk storage and remedies have been developed for recovering from failures, you would ever want to use hard disk storage. The answer is cost. The cost of storing a bit on hard disk is much lower than the cost of storing the same bit in semiconductor RAM. A personal computer may have a 2TB hard disk but only 8GB of RAM. The difference in capacity reflects the difference in cost per bit.

Because both volatile and nonvolatile memory are subject to failure, you need strategies to cope with those failures. Because the failure modes are so different, the strategies you use also need to be different. How do you protect data against failures in a volatile memory system? For this discussion, I assume that the nonvolatile memory system is operating properly because it’s unlikely that both the volatile and nonvolatile memory systems would fail at the same time. Such a failure is possible in the event of a building collapse or other major disaster, of course. There are ways of coping with even those eventualities, but those methods are beyond the scope of this book.

Memory system hierarchy

Modern memory systems are hierarchical in nature, with the fastest memory devices at the top of the memory hierarchy and the slowest at the bottom. The fastest devices are also the most expensive in terms of cost per bit stored. For this reason, it isn’t feasible to populate a memory system completely with the fastest memory available.

Here’s the hierarchy for a typical database server:

  • CPU registers: CPU registers are fastest because they’re fabricated with the same process as the CPU itself, on the same chip. They’re closest to the processing unit, so transit delays between the two are minimized.
  • Level 1 cache: Level 1 cache is a little slower than CPU registers.
  • Level 2 cache: Level 2 cache is slower yet.
  • Level 3 cache (if present): Level 3 cache may or may not be present. If it is, it’s a little slower than Level 2 cache.
  • Main memory: Main memory is much slower than cache.
  • Solid state disk (SSD): Slower than main memory, but much faster than hard disk.
  • Hard disk: Hard disk is much, much, much slower than main memory.
  • Magnetic tape: Magnetic tape is so slow that it’s used only for backup and archiving. You would never keep active data on magnetic tape.

CPU registers, cache, and main memory are all examples of volatile memory. When you lose power, you also lose all the data stored in those devices. Solid-state disks, hard disks, and magnetic tape are persistent storage devices. Data stored in such devices survives power outages. This isn't to say that persistent storage devices are invulnerable to failure; they just have different failure modes from those of volatile memory devices.

Hard disks and magnetic tape drives are machines that contain moving parts. Such devices fail catastrophically from time to time. If you use them long enough, they’re sure to fail. When a hard disk fails, you lose all the data that was stored on it. A regular backup schedule, faithfully followed, can minimize your losses in such an event.

Putting logs and transactions on different disks

Volatile memory failures are inconvenient but not catastrophic if the information you need to recover to resume operation is stored on some nonvolatile medium, such as a hard disk. Clearly, you can’t store the results of your computations directly on hard disk as you perform them; if you did, your processing would be reduced to hard disk speeds rather than the million-times-faster RAM speeds. You can do a few things, however, to minimize the amount of time that results spend in RAM before being written to disk. These methods involve transactions and logging. (I discuss transactions and logging in detail in Book 4, Chapter 2.)

The log is a record of every operation that alters the database in memory. It resides primarily on disk, but the most recent entries accumulate in an area of RAM called the log buffer. Log entries are written to disk before the results of the corresponding transactions are written to disk. If a system failure or crash occurs, you can reconstruct any transactions that were in progress when the crash occurred. If a transaction had committed or aborted but its changes had not yet been flushed to disk, the log on disk holds the information that enables the system to commit or roll back the transaction, as appropriate. Keeping the log on a disk separate from the disk that holds the database maximizes performance during normal operation because writes to the log are sequential and no time is lost doing disk head seeks. Consequently, frequent writes to the log don't exact the same performance penalty that buffer flushes to the database do.
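
The ordering rule (the log entry goes to disk before the corresponding data does) is easy to see in miniature. The following Python sketch illustrates only the write-ahead idea; the file names and record format are made up, and a real DBMS manages its own log format and buffering.

import json
import os

def apply_change(change, log_path="transaction.log", data_path="data.page"):
    """Write-ahead discipline: force the log entry to disk before touching the data file."""
    # 1. Append the log record and force it to stable storage.
    with open(log_path, "a") as log:
        log.write(json.dumps(change) + "\n")
        log.flush()
        os.fsync(log.fileno())    # the change is now recoverable from the log
    # 2. Only now is it safe to write (eventually) the data itself.
    with open(data_path, "w") as data:
        data.write(json.dumps(change["after"]))

apply_change({"txn": 7, "item": "qty_in_stock", "before": 10, "after": 8})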

To understand why you should put a database’s log on its own hard disk drive, you should understand how a hard disk drive is constructed and how it operates.

Hard disk drive construction

A hard disk drive contains a sealed enclosure that’s protected from particulates in the air by a very effective air filter. Inside the enclosure, a spindle connected to a motor rotates at speeds on the order of 7,200 revolutions per minute (rpm). Attached to the spindle are several circular platters stacked one atop another within the enclosure. The platters are aluminum disks coated with a magnetic material. On each platter, a set of more than 1,000 concentric tracks are magnetically recorded. These tracks hold the data in the form of magnetic domains that are magnetized in one direction or the other. If a magnetic north pole is up, it represents a binary 1; if a south pole is up, it represents a 0.

An actuator moves an arm so that its tip, which holds a magnetic read/write head, can hover over any of the tracks. The read/write head, when reading, can sense whether it is flying over a north pole or a south pole and, thus, reads either 1 or 0. When writing, the head, which is an electromagnet, can flip the polarity of the domains it flies over, thus laying down a pattern of 1s and 0s. The top and bottom surfaces of every platter each have an actuator arm and an associated read/write head. The set of corresponding tracks on all the platters (all the track 0s, for example, which are the outermost tracks) constitutes a cylinder. All the heads in the stack are flying over the same cylinder at any given time.

Hard disk drive performance considerations

Because of the way that hard disk drives are constructed, and because mechanical movement is involved, there are limits to how fast a given byte or word can be read or written. The delay involved in a read or write has several components:

  • Disk drive controller delay: The time it takes the controller to determine whether it’s being asked to do a read or a write and to determine the location that is to be read or written.
  • Seek time: The time it takes for the read/write head to move from whatever track it’s currently on to the track that it needs to go to next.
  • Settling time: The time it takes for the jiggling to stop after the read/write head has arrived at the track it will operate on next.
  • Rotational latency: Even at 7,200 rpm, it takes a significant amount of time for a word of data to rotate all the way around the disk and come under the read/write head. If the desired word has just gone by when the head settles over a track, there will be a delay of up to about 8.3 milliseconds (the time for one full revolution at 7,200 rpm) before it comes around again.
  • Read/write time: The time it takes to do the read or write operation.

When you add up all the noted delays, it becomes clear why hard disk storage is so much slower than semiconductor RAM. The natural question that arises is “What can I do to minimize the delays inherent in the operation of hard disks?”
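
Before looking at remedies, it's worth adding up the pieces with some representative numbers. The figures in this little calculation are typical ballpark values for a 7,200 rpm drive, not the specifications of any particular product.

# Ballpark delay components for a 7,200 rpm drive (assumed values, not specs).
controller_delay_ms   = 0.1
seek_ms               = 4.0                        # average seek
settling_ms           = 0.5
rotational_latency_ms = 0.5 * (60_000 / 7_200)     # half a revolution on average
transfer_ms           = 0.05                       # reading a few kilobytes

total_ms = (controller_delay_ms + seek_ms + settling_ms
            + rotational_latency_ms + transfer_ms)
cycles_at_3ghz = total_ms / 1_000 * 3e9

print(f"Average access time: about {total_ms:.1f} ms")             # roughly 9 ms
print(f"That's about {cycles_at_3ghz / 1e6:.0f} million CPU clock cycles at 3 GHz")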

As it happens, a major part of the total delay involved in hard disk operation is due to the combination of seek time, settling time, and rotational latency. Minimizing these factors dramatically improves performance. The best way to minimize seek time and settling time is to do all your writes on the same cylinder so that you never have to do a seek. That’s not possible, but what you can do is make all your writes to the same cylinder until it fills and then move to an adjacent cylinder. This practice minimizes both seek time and settling time. Furthermore, if you’re writing sequentially along a track, rotational latency isn’t an issue. You’re always writing on the sector that immediately follows the one you’ve just written.

Because it’s critical that log entries be written to disk sooner rather than later, dedicating an entire disk drive to the log is an obvious choice. Sequential writes to the log are as fast as possible because seek time, settling time, and rotational latency are all minimized.

Tip For performance reasons, unless you’re dealing with a read-only database, always locate a database’s log on its own disk, not sharing the disk with any other function or any other database. This way, log writes are always to sequential sectors on the same disk cylinder, minimizing access time.

It’s a given that any reasonable system employs transactions and logging. Transactions and logging are forms of insurance. They protect you from the worst effects of a failure, but you must pay a premium for that protection. Clearly, you want the protection, but you don’t want to pay a higher premium for it than you have to. This is where tuning comes in. You make a trade-off between the time it takes to recover from a failure and performance during normal operation.

To minimize the time to recover from a failure, you should flush log entries from the log buffer in RAM to hard disk frequently. The more often you make disk accesses, however, the greater the impact on performance. Somehow, you must find the sweet spot where the time to recover from a failure is tolerable and normal operating performance is acceptable. Finding that sweet spot is the DBA's responsibility.

The DBA is the person with the power to determine how often the page buffer is flushed to disk. The specifics of how to do this vary from one DBMS to another. Consult the system administrator documentation for whichever product you are using for details on how to control the timing of buffer flushes.

Tip Keep a detailed history of every time your system goes down for any reason, and use that knowledge to estimate the frequency of future failures. Combine that information with the results of the data you’ve recorded on the amount of time it takes to recover from a failure to decide how to set the buffer-flush timing parameters.
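
As a rough illustration of how such a history can be turned into a flush interval, the following sketch weighs the overhead of frequent flushes against the expected cost of redoing work after a failure. The cost model and every number in it are assumptions for illustration; real tuning decisions should come from measurements on your own system.

# Toy cost model; every figure here is an assumption for illustration only.
failures_per_day = 0.05     # estimated from your downtime history
flush_cost_sec   = 0.2      # stall caused by one flush during normal operation
redo_per_minute  = 1.5      # seconds of recovery work per minute of unflushed activity

def cost_per_day(flush_interval_min):
    flushes_per_day   = 24 * 60 / flush_interval_min
    normal_overhead   = flushes_per_day * flush_cost_sec
    expected_recovery = failures_per_day * redo_per_minute * flush_interval_min
    return normal_overhead + expected_recovery

best = min(range(1, 121), key=cost_per_day)
print(f"Cheapest interval in this model: flush about every {best} minutes")

In this toy model, the minimum lands near an hour; with your own failure history and costs, it could come out to seconds or to days, which is exactly why keeping the history matters.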

Tuning write operations

With a performance difference on the order of a million to one between semiconductor RAM and hard disk storage, the less frequently you write to disk, the better your performance is. Balancing that consideration is the fact that if you don’t write to stable storage, you’ll lose all your data in the event of a failure.

Before a transaction commits, it writes the “after” image of the change to the log disk. It may be quite a while after the change is made before the change is transferred from the buffer, which is in solid-state memory, to the stable storage of the data disk. The length of that interval affects performance.

If a failure occurs after the log has been updated with the result of a transaction but before the change has been made to the data disk, recovery isn’t a problem. Thus, you can wait until a convenient time to store the changed data to disk. A convenient time would be when the disk read/write head happened to be located over the track you want to write to. In such a case, there would be no delay due to seek time and settling time.

Different DBMS products have different procedures for determining when it becomes advantageous to start looking for opportunities to make convenient writes. The buffer in solid-state memory is divided into pages. Whenever a change is made to a page, the page is marked as dirty. You never need to write a clean page out to disk, because its contents already match the contents of the corresponding disk location. Dirty pages, on the other hand, differ from their corresponding disk locations and eventually have to be written to disk, incurring the performance penalty that such an operation entails. Generally, when the percentage of dirty pages in the page buffer exceeds a certain threshold, the system starts looking for opportunities to make convenient writes.
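
The bookkeeping behind that threshold is simple; what differs among products is when and how the writes are triggered. Here's a minimal sketch of the idea. The class, the 30 percent threshold, and the page contents are all invented for illustration.

class PageBuffer:
    """Toy page buffer that tracks dirty pages and flags when flushing should start."""

    def __init__(self, capacity, dirty_threshold=0.30):
        self.capacity = capacity
        self.dirty_threshold = dirty_threshold
        self.pages = {}       # page_id -> page contents
        self.dirty = set()    # page_ids whose contents differ from the disk copy

    def write(self, page_id, data):
        self.pages[page_id] = data
        self.dirty.add(page_id)    # mark dirty; the disk copy is now stale

    def should_start_flushing(self):
        return len(self.dirty) / self.capacity > self.dirty_threshold

buf = PageBuffer(capacity=100)
for i in range(35):
    buf.write(i, f"row data {i}")
print(buf.should_start_flushing())    # True: more than 30 percent of the pages are dirty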

You gain a considerable advantage in delaying writes to the data disk — an advantage that goes beyond the difference in speed between a write to buffer and a write to disk. If a memory location is hot, in the sense that it’s being updated frequently, many of those updates may not need to be written to disk. Suppose that a memory location in the page buffer has a value of 4. In quick succession, it may be updated to 7, 17, 34, and 54. If the page that location is on isn’t flushed to disk until the last of those writes takes place, the updates to 7, 17, and 34 are never written to disk, and the time that such writes would have taken is never consumed. This consideration strengthens the case for extending the amount of time before dirty pages are written to disk as long as possible without making recovery from failure unduly onerous.

Performing database dumps

In the preceding sections, I refer to hard disk storage as stable storage. This type of storage earns that description because it’s nonvolatile. It doesn’t lose its contents when the power goes out. Stable storage isn’t immune to destruction, however. Hard disks do fail, and when they do, all the data on them is lost. To address this problem, system administrators perform periodic database dumps, in which the entire database is copied to offline media and stored in a safe place.

Database dumps are expensive because it takes a significant amount of time to copy a large database to an offline medium. In addition, the offline medium itself has a cost, as does the space taken up by it. For these reasons, you don’t want to do dumps too often. On the other hand, you want to do them often enough. If a hard disk dies, all the changes that have been made since the last dump are gone. Can those lost transactions be reentered? If those transactions are important, you’d better find a way. How much of a hassle would it be to reenter the data? If it would be more than you’re comfortable with, you may need to reevaluate your dump interval.

If you never dump your database, when your hard disk fails, you’ll lose all your data — clearly, not an acceptable outcome. At the other extreme, if you perform dumps too frequently, you won’t get any work done because your system is spending all its time doing database dumps. That’s not acceptable either. Somewhere in the middle is an optimal dump interval.

No dump interval is best for everybody in every situation. Consider several points:

  • How hot is the database? How many changes are being made to it per second, per hour, per day, or per week?
  • How painful would it be to reenter updates that have been made since the last dump?
  • How long does a dump take, and how much does it affect productivity?
  • Am I using dumps for anything besides insurance against failure? Am I data mining dumps, for example? If so, how recent does the information need to be to be useful?
  • How much room for storing dumps do I have?
  • Can I recycle dump media from several generations back and use it again?

In many cases, doing a dump once per day is sufficient; you can schedule it at a time when activity is light. In other cases, doing a dump more or less frequently is appropriate. Taking all the preceding points into consideration, decide on the best dump interval for each of the databases for which you’re responsible.

Setting checkpoints

Hard disk failures are almost always hard failures. In a hard failure, something physical is permanently and catastrophically damaged. In the case of a hard disk, this failure is the infamous disk crash. The term is descriptive, because when a disk crash occurs, you often hear a horrible scraping sound as one of the flying heads slams into its platter at speeds of up to 75 miles per hour.

Luckily, hard failures are relatively rare. Far more common are soft failures. In a soft failure, something unexpected has happened, and processing has stopped. Perhaps the dreaded Blue Screen of Death appears, featuring an indecipherable error message. Maybe the system just freezes and refuses to respond to anything you do. In cases such as these, rebooting the machine often clears the problem. Unfortunately, it also clears all your volatile, solid-state memory. Anything that hasn’t already been flushed to disk is lost. Ideally, the log on disk has been updated recently, and you can reconstruct some of what was lost. The flushing of data from the page buffer to disk is called checkpointing.

You have two methods of writing data in the page buffer to hard disk stable storage:

  • Write the buffer contents to disk one page at a time. In this case, when it turns out that a page not currently in the buffer is needed and, as a consequence, is read from disk, it displaces a page that’s already in the buffer. If that page is dirty, it must be written to disk before the new page can be read.
  • Write the entire contents of the buffer to disk at the same time. This operation is done at regular intervals called checkpoints. The more frequent your checkpoints are, the less data you lose in case of a soft failure or failure of any part of the system other than the hard disk subsystem. For this reason, setting frequent checkpoints is a good idea. Too-frequent checkpoints are bad, however, because a write to disk of the complete contents of the page buffer takes time that isn’t being used productively. Here again, you have a trade-off between normal operating performance and the time and effort it would take to recover from a failure.

If you have good statistics on the frequency of failures that cause the loss of the contents of solid-state memory and the cost — both in lost productivity and extra expense — of recovery from those failures, you can make an informed decision about how frequently you should schedule checkpoints. Checkpoints have a cost in lost productivity, but not as great a cost as a dump. Checkpoints should be performed more frequently than dumps, but not so frequently that multiple checkpoints are likely to occur within the execution time of a typical transaction. Having multiple checkpoints while a transaction is active consumes time but doesn’t deliver a corresponding reduction in recovery time.

Optimizing batch transactions

A transaction that causes multiple updates is called a batch transaction. Batch transactions can be problematic if they’re long. If the page buffer fills while a batch transaction is in progress, it could initiate a rollback, and rolling back a lengthy transaction and then rerunning it can have a major impact on productivity. To address this concern, one option is to break the batch into smaller minibatches — but you must do this carefully. If a batch transaction is rolled back, it undoes every change that was made by the transaction up to the point of the rollback. If a batch is broken into, say, two minibatches, and a rollback occurs while the second minibatch is executing, the changes made by the first minibatch won’t be rolled back, and the database is left in an inconsistent state.

One solution to this problem is to allow rollbacks only during the first minibatch of a series of minibatches. Doing so, however, severely limits what can be done in the subsequent minibatches. If the minibatches are truly independent, no problem exists, but such independence must be guaranteed to preserve the integrity of the database.
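
One common way to keep minibatches independent is to make each chunk a self-contained transaction with its own commit, so that a failure in one chunk rolls back only that chunk. The sketch below uses Python's built-in sqlite3 module purely as an illustration; the table, the price update, and the chunk size are invented, and you'd adapt the idea to your own DBMS and workload.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (item_id INTEGER PRIMARY KEY, price REAL)")
conn.executemany("INSERT INTO prices VALUES (?, ?)",
                 [(i, 10.0) for i in range(10_000)])
conn.commit()

CHUNK = 1_000    # size of one minibatch (an assumption; tune it to your system)
ids = [row[0] for row in conn.execute("SELECT item_id FROM prices")]

for start in range(0, len(ids), CHUNK):
    chunk = ids[start:start + CHUNK]
    try:
        conn.executemany("UPDATE prices SET price = price * 1.05 WHERE item_id = ?",
                         [(i,) for i in chunk])
        conn.commit()        # each minibatch commits on its own
    except sqlite3.Error:
        conn.rollback()      # only this chunk is undone; earlier chunks stand
        raise

If a chunk fails partway through, rerunning just that chunk and the ones after it completes the job without redoing the chunks that already committed, provided the chunks really are independent.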

Tuning the Operating System

Because your operating system controls an application's access to input/output (I/O) devices and to hard disk memory, it can affect database application performance in many ways. There are at least as many things you can do, however, to tune the operating system and improve that performance. In the following sections, I touch on some of the major areas in which you can make a performance difference by acting on the operating system.

Scheduling threads

Practically all operating systems these days are multithreaded operating systems. Microsoft Windows is an example of one such operating system. Multithreading is what makes it possible for you to type text in a Microsoft Word document while your Microsoft Excel spreadsheet is recalculating values and your Microsoft Access database is performing a query. Meanwhile, your media player is piping your favorite song into your headphones. Each one of those tasks is performed by a thread of execution.

Multiple threads of execution don’t operate simultaneously, although they seem to. Instead, they operate concurrently. First, a little bit of one task is executed; then a context switch suspends that thread of execution and activates another thread. This swapping goes on so rapidly that to a human observer, the threads appear to be executing at the same time. Operating systems have moved from single-threaded to multithreaded operation because of the great performance boost you get.

You may wonder why multithreading improves performance. After all, whether you interleave the execution of five tasks or have them run sequentially, you still have to execute all the instructions involved in those five tasks. In fact, it seems that multithreading should be slower than single threading because nonproductive overhead is involved with every task-switch operation.

Despite task switching overhead, multithreading substantially improves throughput — the total amount of work that gets done per unit time. Suppose that you have a typical organizational database system, with a database server holding the database and multiple client machines making queries or updates to that database. On the server, some of the database is in high-speed cache memory, some is in slower dynamic RAM, and some is stored on very-much-slower hard disk.

If the currently active thread needs to access a data item that is neither in cache nor RAM, a long delay occurs before it can proceed. There’s no point in stalling everybody while that one thread waits for a response from the hard disk subsystem. As soon as the operating system is asked to go to hard disk for data, it can initiate that operation and then immediately suspend execution of the active thread and activate another thread that’s ready to go. When the needed data becomes available from the hard disk subsystem, the original thread can be reactivated and run at full speed again.

Any single application that's operating in a multithreaded environment alongside, say, four other applications isn't sped up by multithreading. All five applications, however, will finish sooner than they would have if they'd been run sequentially, because while one application is waiting for a slow operation to complete, the processor can be used productively by another application that's performing fast operations.

The scheduling of threads is a ripe area for tuning. One area in which tuning can make a big difference is context switching.

Context switching

At any given instant, the data that’s being acted on in the processor is the context at that instant. All the data in all the processor’s user-accessible registers make up the context. An instant later, a computation has been performed, and the context is different.

Before a context switch occurs, the context of an application must be saved so that when the application’s thread is reactivated, the context can be restored and execution can proceed as though nothing had happened.

A context switch takes time. It takes time to save the context of the thread that’s being replaced, and it takes time to restore the context of the thread that’s replacing it. Some operating systems are more efficient at context switching than others. That efficiency rating could be a factor in choosing an operating system, depending on the mix of applications that will be running and on the number of applications that typically will be running concurrently.

Round-robin scheduling

Round-robin scheduling is the simplest algorithm for selecting which one of all the threads that are ready to run should run next. The operating system kernel maintains a queue of pointers to threads that are ready to run. It grabs the pointer to the next thread from the head of the queue and places the pointer to the currently executing thread at the tail of the queue. New threads are placed at the tail of the queue when they become ready to run.

Round-robin scheduling treats all threads as though they have equal importance. In a database environment, this assumption is likely to be valid, so round-robin scheduling is appropriate.
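
In outline, the scheduler's bookkeeping is nothing more than a queue. Here's a stripped-down sketch; the thread names and the fixed number of time slices are invented, and a real kernel scheduler is, of course, far more involved.

from collections import deque

ready_queue = deque(["query thread", "report thread", "backup thread"])

def run_one_time_slice(thread_name):
    print(f"running {thread_name} for one time slice")

# Round-robin: take a thread from the head, run it for a slice, put it back at the tail.
for _ in range(6):
    thread = ready_queue.popleft()
    run_one_time_slice(thread)
    ready_queue.append(thread)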

Priority-based scheduling

In priority-based scheduling, each thread is given a priority number, with higher-priority threads receiving higher numbers. Higher-priority threads are given preference by a part of the operating system called the scheduler. They run more often and for longer periods than other threads. The priority assigned to an application, as well as the level of service that a given priority level confers, are quantities subject to tuning.

Priority-based scheduling has a potential pitfall, called priority inversion.

Priority inversion

Suppose that you have three applications running on three different threads. One has high priority, one has medium priority, and one has low priority. Suppose further that the high-priority thread and the low-priority thread require the same resource. Here’s what could happen:

  1. The low-priority thread starts running and acquires an exclusive lock on the key resource.
  2. The high-priority thread starts running and tries to acquire the key resource, but it fails because the resource is locked by the low-priority thread. This situation is a priority inversion, but it isn’t too bad. It’s called a bounded priority inversion because it lasts no longer than the critical section of the low-priority thread, after which the lock is released.
  3. The medium-priority thread preempts the low-priority thread during the inversion. Now both the high-priority thread and the low-priority thread are idle. The medium-priority thread may even be preempted by another, higher-priority thread, which could delay the high-priority thread for an unacceptably long time, causing it to fail.

    Technical stuff This kind of unbounded priority inversion happened to the Mars Pathfinder spacecraft in 1997, halting the exploration of the red planet by the Sojourner rover until NASA engineers could figure out what had happened and upload a fix to the code.

Here are two possible solutions to the priority inversion:

  • Priority Inheritance Protocol (PIP): One solution to the priority inversion problem is to institute the PIP. When a high-priority thread attempts to lock a resource that’s already locked by a lower-priority thread, the priority of the lower-priority thread is automatically raised to match the priority of the high-priority thread. Thus, it can’t be preempted by a medium-priority thread, and the priority inversion is only a bounded one.
  • Priority Ceiling Protocol (PCP): Another solution is provided by the PCP. Under this protocol, every resource is assigned a priority ceiling equal to the priority of the highest-priority thread that ever uses it. When a thread locks the resource, regardless of its own priority, it's immediately promoted to that ceiling, so no thread that might contend for the resource can preempt it while it holds the lock. Thus, when it exits its critical section and releases its lock on the resource, the ready thread with the highest priority can acquire that resource. This case, too, is a bounded priority inversion.

Clearly, if you use priority-based scheduling, you should also use some scheme such as PIP or PCP.
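
The heart of the PIP is a lock that remembers which thread holds it and temporarily raises that holder's priority when a more important thread starts waiting. The sketch below models threads as plain objects rather than real operating-system threads, and all the names and priority numbers are invented.

class Thread:
    def __init__(self, name, priority):
        self.name = name
        self.priority = priority
        self.base_priority = priority

class InheritanceLock:
    """Lock that applies the priority-inheritance idea."""

    def __init__(self):
        self.holder = None

    def acquire(self, thread):
        if self.holder is None:
            self.holder = thread
            return True
        # Contention: boost the holder so a medium-priority thread can't preempt it.
        if thread.priority > self.holder.priority:
            self.holder.priority = thread.priority
        return False    # the caller must wait for the holder to finish

    def release(self):
        self.holder.priority = self.holder.base_priority    # demote back to normal
        self.holder = None

low = Thread("logging thread", priority=1)
high = Thread("query thread", priority=9)
lock = InheritanceLock()
lock.acquire(low)
lock.acquire(high)      # fails, but boosts the logging thread
print(low.priority)     # 9 while it finishes its critical section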

Deadlock

Deadlock is another problem related to resource acquisition. Suppose that two threads both need the same two resources, A and B:

  1. Thread 1 acquires an exclusive lock on resource A.
  2. Thread 2 acquires an exclusive lock on resource B.
  3. Thread 1 attempts to acquire a lock on resource B but can’t, so it waits, pending the availability of resource B.
  4. Thread 2 attempts to acquire a lock on resource A but can’t, so it waits, pending the availability of resource A.
  5. Neither resource A nor resource B ever becomes available, and threads 1 and 2 are deadlocked.

A common solution to this problem is for the operating system to notice that neither thread 1 nor thread 2 has made any progress after an interval during which progress should have been made. The operating system drops all locks held by both threads and delays them for different intervals before allowing them to run again. The delay intervals are tunable quantities. The best intervals are successful at breaking deadlocks practically all the time.
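
Detection and restart isn't the only defense. Application code can often sidestep deadlock entirely by always acquiring locks in one agreed-upon order, as this short sketch shows. It uses real Python threading locks, but the resource names and the worker function are invented.

import threading

lock_a = threading.Lock()    # resource A
lock_b = threading.Lock()    # resource B

def do_work(name):
    # Both threads acquire resource A before resource B, so the circular
    # wait described in the numbered steps above can never form.
    with lock_a:
        with lock_b:
            print(f"{name} holds both resources")

t1 = threading.Thread(target=do_work, args=("thread 1",))
t2 = threading.Thread(target=do_work, args=("thread 2",))
t1.start()
t2.start()
t1.join()
t2.join()

When a consistent ordering isn't practical, the timeout-and-retry approach described above is the usual fallback.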

Determining database buffer size

Earlier in this chapter, I mention that the storage in a computer that runs database applications comes in two varieties: volatile and nonvolatile. Volatile memory is considered to be unstable storage because a power interruption or other failure that causes a machine reboot erases everything in it. Nonvolatile memory, by contrast, retains its information when such problems occur. The reason nobody puts all their code and data in nonvolatile storage is that it's about a million times slower than the slowest form of volatile storage.

Clearly, the less often you have to go out to the nonvolatile storage on hard disk, the better. You enjoy a million-to-one performance advantage if you operate out of semiconductor RAM. For many applications, it’s not feasible to retain your entire database in RAM, but you can afford to keep some of it in RAM (ideally, that portion of the database that you’re most likely to need to access often). The portion of RAM that holds that heavily used portion of the database is called the database page buffer. Your DBA’s tuning option is to decide on the size of the page buffer. If the page buffer is too small, you’ll be going out to disk more often than you need to and will suffer serious performance degradation. If the page buffer is larger than it needs to be, you’ll be paying for expensive RAM when you could be getting the same performance out of dirt-cheap disks. Somewhere in the middle is the optimum size.

If your hit rate on the page buffer is between 90 percent and 95 percent, you’re probably doing about as well as you can expect. That result means that 9 times out of 10 or 19 times out of 20, when you need to access a data item, that item is in the buffer. If you’re not getting a hit rate in that range, perhaps you could do better with a larger buffer — or perhaps not. Depending on your database, a 70 percent hit rate may be the best that you can do.

You can test the situation by gradually adding more RAM to the buffer until your hit rate plateaus. At that point, you know that adding any additional RAM to the buffer won’t improve your performance and will only add to your cost.
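
Most DBMSs expose counters from which you can compute the hit rate yourself; the arithmetic is just buffer hits divided by total accesses. Here's the calculation in miniature, with made-up counter values and the 90 percent rule of thumb from above.

# Hypothetical counters read from your DBMS's monitoring views.
buffer_reads = 1_940_000    # requests satisfied from the page buffer
disk_reads   =   105_000    # requests that had to go out to disk

hit_rate = buffer_reads / (buffer_reads + disk_reads)
print(f"Buffer hit rate: {hit_rate:.1%}")    # about 94.9%

if hit_rate < 0.90:
    print("Consider testing with a larger page buffer.")
else:
    print("Adding more RAM to the buffer probably won't help much.")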

Tuning the page usage factor

Another system parameter under the DBA's control is the proportion of space on a page that holds data. This parameter is called the page's usage factor. The higher the usage factor, the more data you can store in fast RAM compared with the amount that must remain on slow hard disk, which is a vote for high usage factors. If the usage factor is too high, however, a problem arises when you make several insertions or update records by replacing NULL values with data: the page overflows, and overflowing a page causes a major performance hit.

Tuning is important here. You need to be aware of the kinds of operations that are typically performed on the database. Are insertions common? If so, a lower usage factor is called for. If not, you’ll get better performance by raising the usage factor.

Page usage factor is one of the many places where tuning means trying a setting, taking data, and then trying something else. After you have a good understanding of how your workload performs under the various usage factors, you can pick the one that will serve you the best most of the time.

Maximizing the Hardware You Have

In addition to the tweaks your DBA can make to the recovery system and the operating system, she can improve performance by making better use of the hardware. In this section, I look at just a few ways to maximize your hardware.

Optimizing the placement of code and data on hard disks

You already know from the discussion of hard disk drive construction earlier in this chapter that you can improve performance by locating data that will be read or written in sequential locations on the same track on your hard disk. Keeping the read/write head on the same track eliminates the delay due to seek time and settling time. For data transfers larger than a single track, staying on the same cylinder on the disk maintains the performance advantage.

When a hard disk is new or has recently been reformatted, you can enjoy good performance by carefully choosing the way in which you copy data to it. As time goes on and updates are made, that beautiful organization is gradually degraded, and your performance is degraded with it. One thing you can do to combat this degradation is defragment your disks regularly.

Another thing your DBA can do to improve performance is locate your most frequently used data on the cylinders in the middle of your hard disk. If your disk has cylinders numbered 0 through 1023, you should put the most heavily used data on cylinder 511, or at least near it. This practice is due to a statistical consideration. If the most heavily used data is clustered in the center of the disk, seeks tend to be short, decreasing the time it takes for the heads to settle over a track. Furthermore, on those occasions when the heads are over a high-numbered track or a low-numbered track, a high probability exists that they’ll next need to go to one of the heavily used tracks, which is only half the radius of the disk away. Long seeks from, say, track 5 to track 1020 are rare.

Tuning the page replacement algorithm

The page replacement algorithm is the code that decides which page in the database page buffer to flush to disk when the buffer is full and a new page is needed. You want to flush out a page that is highly unlikely to be needed again soon. The best page to flush can be predicted with a high degree of accuracy because most applications have a key property called temporal locality. Temporal locality means that a page in the buffer that has been used recently is likely to be needed again soon. The flip side of this coin is that a page that hasn't been accessed in a long time probably won't be needed any time soon. Such a page is a prime candidate for flushing out to disk.

One page replacement algorithm that follows this line of reasoning is the least recently used (LRU) algorithm. Whenever a page must be replaced, the LRU algorithm flushes the page that has been in the buffer the longest time without being either read or written to. The LRU algorithm works very well in most cases. Depending on what the application is doing, however, the LRU algorithm may be the worst possible option from a performance standpoint. If you monitor performance and notice excessive buffer page swapping, changing the page replacement algorithm may give you a substantial performance enhancement.
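
Python's OrderedDict makes the LRU policy easy to sketch: every access moves a page to the recently used end, and the victim is whatever has drifted to the other end. The buffer size, page numbers, and loader function below are placeholders.

from collections import OrderedDict

class LRUBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()    # page_id -> page contents, oldest first

    def access(self, page_id, loader):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)    # freshly used: now most recent
        else:
            if len(self.pages) >= self.capacity:
                victim, _ = self.pages.popitem(last=False)    # least recently used
                print(f"flushing page {victim} to disk")
            self.pages[page_id] = loader(page_id)    # simulate reading the page from disk
        return self.pages[page_id]

buf = LRUBuffer(capacity=3)
for page in [1, 2, 3, 1, 4]:    # page 2 is the LRU victim when page 4 arrives
    buf.access(page, loader=lambda p: f"contents of page {p}")

If monitoring shows this policy evicting pages your workload is about to reread (a long sequential scan that cycles through a large table is the classic case), that's the signal to try a different replacement algorithm.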

Tuning the disk controller cache

The disk controller cache is another area that the DBA can tune. In addition to the page buffer in the system's main memory, there is a cache in the hard disk subsystem, and how that cache is used can affect performance.

Cache usage is regulated by two distinct protocols. As it happens, the performance of read operations isn't affected by which of the two protocols you use, but write performance definitely can be affected. The two protocols are

  • Write-through: When the write-through protocol is in effect, writes to disk are simultaneously written to both the cache and the disk. This means that every write operation is as slow as a disk write operation rather than as fast as a cache write operation. When you’re operating under this protocol, the cache gives no advantage to write operations, but reads of data in the cache are fast.
  • Write-back: When the write-back protocol is in effect, writes to the disk subsystem go only to the cache, and a dirty bit is set to indicate that the contents of the cache differ from the contents of the corresponding locations on disk. Dirty pages are flushed to disk when convenient or when the page replacement algorithm replaces a dirty page with a new page loaded from disk.

For a lightly loaded system, the write-back protocol usually gives better performance because disk accesses are rare. For a heavily loaded system with frequent page swapping and more reading than writing, the write-through protocol may be better. Depending on your job mix, it may pay you to try both protocols, taking statistics for both. After you analyze your statistical data, choose the protocol that performs better.
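
The difference between the two protocols comes down to whether a write goes to the disk immediately or merely marks the cached copy dirty. Here's the contrast in a toy model; the cache class and the repeated writes to one block are invented for illustration.

class ControllerCache:
    """Toy model contrasting write-through and write-back behavior."""

    def __init__(self, write_back=True):
        self.write_back = write_back
        self.cache = {}    # block_id -> (data, dirty flag)

    def write(self, block_id, data):
        if self.write_back:
            self.cache[block_id] = (data, True)     # fast: cache only, marked dirty
        else:
            self.cache[block_id] = (data, False)    # write-through: cache copy...
            self.write_to_disk(block_id, data)      # ...plus a slow disk write, every time

    def flush(self):
        for block_id, (data, dirty) in self.cache.items():
            if dirty:
                self.write_to_disk(block_id, data)

    def write_to_disk(self, block_id, data):
        print(f"disk write: block {block_id}")

cache = ControllerCache(write_back=True)
for version in range(3):
    cache.write(42, f"version {version}")    # three fast writes, no disk I/O yet
cache.flush()                                # one disk write covers all three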

Adding Hardware

If you didn’t read the material in the preceding sections of this chapter, probably the first thought to come to mind when you determine that your system isn’t performing the way it should is “I need new hardware.” Perhaps that’s true, but new hardware shouldn’t be your first option. As demonstrated in the preceding sections, you can try a great many things besides adding new hardware.

When you’ve exhausted all the possibilities for improvement of the hardware you have, consider adding hardware. If you add hardware before performing the optimizations discussed in the preceding sections, you could easily be out the money without having addressed the real problem. Performance may not improve because your system’s lagging performance wasn’t due to a deficiency in the hardware after all. Nevertheless, if your system is optimized to the point at which it’s doing the best that it can with the hardware it has, and you’re still not getting the performance you need, perhaps upgrading your hardware will help.

Faster processor

One obvious choice is moving to a faster processor. This choice can be an expensive one, however, because you can’t just plug a faster processor chip into your existing motherboard and expect a speedup. The support chips on the motherboard need to match the new processor, which means that you probably need to replace the motherboard and may have to move to faster RAM at the same time. You may as well buy a whole new box and give your existing box to your kid to play computer games on. (Wait — that won’t work! Today’s fast-action computer games with hyper-realistic graphics require the fastest processors on the planet. Your offspring will probably feel dissed if you try to palm off your obsolete processor on him.) At any rate, expensive or not, moving to a faster, more capable CPU may give you a significant performance boost.

More RAM

A less drastic upgrade than a computer switch-out, which may nonetheless make a big difference in performance, is adding RAM to your existing system. Adding RAM may enable you to support a bigger page buffer than you currently have, enabling you to keep more of your data in fast semiconductor memory. If this addition improves your page buffer hit rate, it could be a very economical way to improve performance.

Faster hard disks

Hard disk drives don’t all have the same performance parameters. Seek time, settling time, rotational latency, controller cache size, and disk interface bandwidth are all things to look at (refer to “Hard disk drive performance considerations,” earlier in this chapter). If the disks you currently have aren’t up to the current state of the art, you might consider replacing them. Think carefully, however, before spending a lot of money on this idea. Although processor performance, RAM densities, and hard disk capacities have been improving at an exponential rate in accordance with Moore’s Law, hard disk performance specifications haven’t scaled nearly as rapidly. Although this year’s hard disks have a lot more capacity than last year’s, there may be little or no improvement in the speed at which you’re able to read from and write to them.

More hard disks

Although trading up to faster hard disk drives may not give you the performance boost you're looking for, adding disk drives that are no faster than the ones you're using now may do the job. The advantage of having multiple disks is that while one disk is busy performing one operation, a second disk can be performing another. Because the processor operates so much faster than the disk drives, an operation that entails a lot of disk accesses can keep several disks busy at the same time. This parallelism could translate into a significant improvement in overall performance.

Solid State Disk (SSD)

For applications that demand the ultimate in performance, such as high-speed gaming, having one or two hundred gigabytes of SSD sitting between your RAM and disk storage can make a tremendous difference. It will also speed things up for business applications that have a large working set. If that working set is all on electronic media instead of having some of it on rotating media, you can expect a dramatic improvement in throughput.

RAID arrays

If you’re going to spread your database across multiple disks anyway, you may as well configure those disks as a RAID array. RAID is an acronym for Redundant Array of Independent Disks, although it previously stood for Redundant Array of Inexpensive Disks. The disks in a RAID array are inexpensive because at any given time, the hard disk market has a sweet spot where you get the most bang for your buck, which in this case means the most megabytes per dollar. If the sweet spot happens to be 4TB, it’s cheaper to buy five 4TB drives and configure them as a RAID array than it would be to buy a single 20TB drive (if you could even buy a 20TB drive at any price). The disks in the array are redundant in that your database information is recorded in more than one place. This safety feature is important for critical databases. If one of your disk drives were to fail and lose all its data, you could keep on operating by using the remaining disks. I give detailed coverage to RAID in Book 4, Chapter 1.

Working in Multiprocessor Environments

Until now, I’ve been talking primarily about a system that may have multiple client computers engaging in transactions with a database stored on a single database server. For large-enough databases in which performance is an important consideration, a distributed solution may be called for. This solution means not only multiple client computers, but multiple servers too.

Distributed databases are significantly more complex than single-server databases, and I don’t go into that complexity much here. Instead, I briefly mention three main architectural choices for such systems.

The architecture chosen has a major effect on overall performance. For some types of applications, one architecture is clearly superior to the other two; for other types, the advantage goes to one of the other configurations. Your choices are

  • Tightly coupled architecture: Several processors share the same main memory and disks. The processors operate in parallel on the same data, so this architecture is often best for large, highly integrated tasks.
  • Shared-disk architecture: Each processor in the system has its own private main memory, but the hard disks are shared. This architecture is often best when a lot of computation must be done on related tasks that operate on the same data.
  • Shared-nothing architecture: All the processors have their own private main memory and their own private hard disks. This architecture is appropriate when the application being run can be divided into independent subapplications.