Gathering the Evidence

The Scene of the Crime

Very seldom will you have the luxury of investigating a problem where the perpetrators are still on the scene of the crime. Most business systems must continue running whenever possible, and it is rare for the customer/user to be able to hold off on everything else while you are poking around in the database. You'll be lucky to even be able to get access to the system, much less have access to it by yourself. Most often, you'll need to reconstruct what happened from the user's reports, from the logfiles, and from operating system utilities.

Usually, you're really trying to do two things: patch the problem, getting the customer back online, and correct the problem, making sure that it doesn't happen again. These are often at cross purposes. Many times, patching the problem destroys the symptoms you need in order to find the cause. You'll have the choice of trying to correct or trying to patch, all under the constraint that the customer absolutely must be back up and running within an hour.

The point here is, no matter how much pressure you are under to get the customer up and running again, you absolutely must get to the root cause of the problem. If possible, save off the bad environment so that you can come back to it later for further postmortems. If you have the luxury of time, do a database archive before you start any debugging.

We often overlook the operating system's utilities when reconstructing a problem. Especially in UNIX, there's a lot of information available to you. For example, if the user is using the Korn shell (ksh), there'll be a file in his home directory called ".sh_history", containing the last hundred or so commands that the user ran. I've actually used this to solve problems that involved a developer who was "sanitizing" the logfiles to hide his errors.
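As a rough sketch of how that history file can be examined, the Python below returns the last few entries from a shell history file. The function name `recent_commands` and the default count are my own illustrative choices. The history file is not guaranteed to be clean text, so the sketch simply strips NUL and other control bytes instead of trying to parse the file format exactly.

```python
from pathlib import Path


def recent_commands(history_path, count=20):
    """Return up to `count` most recent entries from a shell history file."""
    raw = Path(history_path).read_bytes()
    # The file may contain NUL and other control bytes; treat NUL as a
    # line break and silently drop anything that is not valid text.
    text = raw.replace(b"\x00", b"\n").decode("utf-8", errors="ignore")
    entries = []
    for line in text.splitlines():
        line = line.replace("\t", " ")
        cleaned = "".join(ch for ch in line if ch.isprintable()).strip()
        if cleaned:
            entries.append(cleaned)
    return entries[-count:]


if __name__ == "__main__":
    # Assumed location: the user's home directory, as described above.
    history = Path.home() / ".sh_history"
    if history.exists():
        for command in recent_commands(history):
            print(command)
```

Reading another user's history file normally requires root or that user's permissions, and the file shows only what was typed, not what the commands did.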

Determining the Nature of the Problem

The user will probably have already told you where the problem is. Their view of the problem is: "The computer system's not working. Fix it." That's probably as fine a point as they'll put on it. From the user's view, everything from the terminal to the database engine is part and parcel of the same thing, "the computer system." You'll probably need to refine the problem definition a little further. Working from the top down, you will find your problem lurking at one of these levels.

Application

Problems here include logic problems, code problems, data problems, and operational problems. Check the program's logs for any hints of the problem. Most processes that die will leave some trace or some error messages in the logs.
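A first pass over the program's logs can be automated. The following is a minimal sketch, assuming the application writes plain-text logs. The keyword list is hypothetical, not something any particular program is known to emit, so adjust it to whatever your application actually logs.

```python
import re

# Hypothetical patterns; adjust to whatever the application actually logs.
ERROR_PATTERN = re.compile(r"error|fatal|abort|core dump|killed", re.IGNORECASE)


def find_error_lines(lines):
    """Return (line_number, text) pairs for lines that look like errors."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if ERROR_PATTERN.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_log(path):
    """Scan one log file, tolerating stray non-text bytes."""
    with open(path, "r", errors="replace") as handle:
        return find_error_lines(handle)
```

Keeping the line numbers makes it easy to go back to the log and read the context around each hit, which is usually where the real clue is.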

I'll throw in a common red herring here, since it seems to invariably catch up with us. I don't know what it is about software types, but we're awfully anxious to find problems in others' code. Maybe it's the "not invented here" syndrome, maybe it's just that we've seen lots of lousy code, but we are always quick to accept the premise that it's a code problem and we'll need to make a patch. I think we should be a little more cautious here. In my experience, many database problems will be data-related. Those that aren't strictly data-related will probably be of a conceptual nature: we don't understand exactly how something either does work or is supposed to work. Thinking about going in and modifying code should be the last trick in our arsenal of debugging techniques.

Database and Connectivity Problems

Examples of potential problems here include poor table design, poor table layout, inadequate Informix tuning, inadequate resources for the database operations, and poor or inefficient communication between client and server.

While this book will concentrate on the Informix side, a tuner cannot ignore the items either higher or lower in this hierarchy. Unless the application is efficiently designed and the operating system is set up properly and all of the hardware is working properly, you'll never realize the full potential of the system. If you as a tuner are not familiar with the application, the operating system, and the hardware, find someone who is and cooperate closely with her.

Operating System Problems

The operating system is the underlying platform on which the Informix database depends for CPU cycles and operating system services. If the OS is not functioning properly, the Informix engines will be hamstrung from the beginning.

Common operating system problems are:

  • Inadequate kernel resources to support the Informix engine and applications

  • Swapping in the operating system

  • Random crashes of the OS

  • Load balancing problems due to heavy processing loads from other processes

  • Informix-specific requirements not supported, such as kernel asynchronous I/O

  • Improper or missing patches to the OS

  • Problems with process priorities

When approaching a debugging session on a new system, always check the kernel parameters. Read all of the text files in the $INFORMIXDIR/release directory tree and be sure that all required kernel parameters, operating system patches, and environmental variables have been correctly set.
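Part of that checklist can be scripted. The sketch below reports which environment variables from a checklist are unset and lists the files under the release directory so none get skipped. The three variable names are only an assumed starting checklist and the function names are illustrative. The release notes for your own version remain the authority on what is required.

```python
import os
from pathlib import Path

# Assumed checklist; the release notes for your version are the authority.
REQUIRED_VARIABLES = ("INFORMIXDIR", "INFORMIXSERVER", "ONCONFIG")


def missing_variables(environ, required=REQUIRED_VARIABLES):
    """Return the required variable names that are unset or empty."""
    return [name for name in required if not environ.get(name)]


def release_files(informix_dir):
    """List every regular file under the release directory tree, sorted."""
    release_root = Path(informix_dir) / "release"
    if not release_root.is_dir():
        return []
    return sorted(p for p in release_root.rglob("*") if p.is_file())


if __name__ == "__main__":
    for name in missing_variables(os.environ):
        print(f"not set: {name}")
    for path in release_files(os.environ.get("INFORMIXDIR", "")):
        print(path)
```

Kernel parameters themselves are queried differently on each flavor of UNIX, so that part of the check is left to the tools your OS provides.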

Once you are familiar with the system, it may be possible to become a little more lax in this area, but don't get complacent. If a new problem arises on an older system, look first to see if someone has changed something at the OS level. Pay attention to OS upgrades, patches, and new programs that have been added to the system. If the OS is upgraded, be sure that your version of Informix is certified on that version of the OS. Sometimes an OS upgrade will require a corresponding upgrade of Informix.

Hardware

Everything depends upon the hardware. If it is not working optimally or if it has intermittent problems, the whole system will have problems. Be sure that you know where hardware error messages can be found on your system. Know whom to contact to have the hardware checked out for problems. Have alternative methods of checking out the hardware. For example, on a UNIX system you can verify the readability of a hard disk partition by using the UNIX dd command. It won't tell you whether or not the data is correct, but it will allow you to verify that it is readable.
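With dd, the usual form of this check is to copy the partition to /dev/null and watch for read errors. The same idea can be sketched in Python: read the device from start to end, throw the data away, and report how far the read got. The function name and block size here are illustrative assumptions, and like dd, this says nothing about whether the data is correct.

```python
def check_readable(path, block_size=1024 * 1024):
    """Read `path` from start to end, discarding the data.

    Returns (bytes_read, error); error is None when every block was read.
    """
    total = 0
    try:
        # buffering=0 keeps Python from adding its own read-ahead layer.
        with open(path, "rb", buffering=0) as device:
            while True:
                block = device.read(block_size)
                if not block:
                    break
                total += len(block)
    except OSError as exc:
        # Covers missing files, permission problems, and I/O errors alike.
        return total, exc
    return total, None
```

Calling `check_readable` on a disk partition needs read permission on the device, which usually means root. If an error comes back, the byte count tells you roughly where on the partition the read failed.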
