Look at the History

There are a lot of analogies that you can use in debugging computer systems. One is a trip to the doctor's office. When you first go to a doctor, the first thing that happens is that he takes your medical history. You need to do the same thing with your database system. It is very rare that a problem just comes out of the blue. Something usually has to happen; something needs to change in the system to cause a new problem to occur. It's your job to evaluate whether or not one of these changes has caused your system to begin generating errors.

Keep a System Log

If you want to evaluate the effects of past system changes on your database system, it helps to have one place where all changes to the system are tracked. Various parts of your system will automatically generate logfiles, but they will probably be scattered across the landscape. For example:

  • Informix configuration changes will be logged in your online.log.

  • Operating system changes may be logged in various OS logfiles.

  • Your applications may generate their own logfiles.

  • Hardware components may generate their own logfiles.

Be sure to differentiate between operational logfiles and setup logfiles. An operational logfile is generated as part of the daily operation of the system. This type of logfile will list the normal types of tracking information that your system generates. An example would be the Informix online.log, which notes such things as checkpoints beginning and ending, Informix errors, login errors, and the like. A setup logfile will note changes in the operating parameters of the system. The Informix online.log also serves this purpose, as it logs any changes made in the Informix tuning and configuration parameters.

Good applications should always generate useful logfiles. What's useful in a logfile? First, everything should have a time and date stamp. Most applications do this, but some do not do it very well. The Informix online.log is an example of a logfile that could be more convenient to use: it registers dates in one section of the log and registers times alongside each individual event. If every timestamp included the date, scanning through the logfiles with utilities such as grep or awk would be much easier for the debugger.
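One way around the split date/time problem is a small awk filter that remembers the most recent date header and prefixes it to each timed event. The log layout below is a simplified stand-in for illustration, not the exact online.log format:

```shell
# Sketch: join date header lines with timestamped event lines so that
# grep/awk can filter on a full timestamp. The sample layout is an
# assumption; adjust the patterns to your real log.
cat > /tmp/online.log.sample <<'EOF'
Mon Mar  3 09:00:00 2003
09:00:01  Checkpoint Completed
09:05:12  Logical Log 42 Complete
Tue Mar  4 08:59:58 2003
08:59:59  Checkpoint Completed
EOF

awk '
  /^[A-Z][a-z][a-z] [A-Z][a-z][a-z]/ { date = $1 " " $2 " " $3; next }  # remember the date header
  /^[0-9][0-9]:/                     { print date, $0 }                 # prefix each event with it
' /tmp/online.log.sample
```

With every line carrying a full date and time, a single grep for a day or an hour pulls out exactly the window you care about.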

If you are dealing with a system of programs and cannot change or influence how the logfiles are generated, you can at least try to enforce a policy of keeping a manual operational log. This log should contain changes in setup parameters, operational incidents such as system restarts or crashes, hardware and software additions, deletions, updates, and bug fixes applied or attempted on the system.
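If the log is kept electronically, even a one-line helper keeps entries consistent. The file path and format here are assumptions; adapt them to local convention:

```shell
# Sketch of a minimal manual-log helper: one timestamped line per change,
# recording who, when, and what. The path is a placeholder.
SYSLOG_FILE=${SYSLOG_FILE:-/tmp/system-change.log}

logchange() {
    printf '%s %s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "${USER:-unknown}" "$*" >> "$SYSLOG_FILE"
}

logchange "Increased BUFFERS from 2000 to 4000 in onconfig"
logchange "Applied OS patch cluster; rebooted at 02:00"
tail -2 "$SYSLOG_FILE"
```

The value is not in the tooling but in the habit: every restart, parameter change, and patch gets one line, so "has anything changed?" becomes a grep instead of a guessing game.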

You could probably make a pretty good argument that this system log should be a manual affair kept in a hard-copy logbook. After all, if the entire system were to go down, or if you were to lose the hard disk that holds the electronic copy, you would lose access to an electronic log just when you needed it most.

Keep an Incident Log

Just as it is important to maintain an operational log, it is also important to log error reports and other incidents. Try to have the user log any problems immediately and note the exact times that each problem occurred. Usually a reportable problem will occur while other personnel are running jobs on the computer. Most often, the troubleshooting and debugging experts will be called later, after the problem has been cleared up. It will be your job to find out why it happened and how to keep it from happening in the future.

Sometimes you may be called in to fix a production problem while the problem is actually going on. This can either be an ideal situation for debugging or the world's worst situation for debugging, depending upon the needs and priorities of the users. It'll be the debugger's hell if the system is mission-critical and high-use. Here, the most important thing is clearing out the problem and restoring functionality. Everything else is secondary to getting back online.

The situation can be ideal if the system is either not mission-critical or is low-use. If the users can stand to give you an hour or so to dig around when an error is reported, you'll be able to avoid many of the difficulties associated with trying to duplicate an error. The error will already be staring you in the face.

Unfortunately, we don't usually see the ideal situations. The best we can do is to anticipate the types of investigations that we would do when confronted with various types of errors. Knowing that, have the users record the data that you'll need for those investigations at the time the problem occurs.

The general idea is to be able to take a snapshot of the system at the instant the error occurs. The more information you can get, the better off you'll be when it comes time to duplicate and correct the problem later.
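A snapshot like that is easy to script in advance so that even a non-expert user can run it the moment an error appears. The sketch below captures generic OS state that exists on most Unix systems; the Informix-specific onstat calls are left as comments since they only apply where the engine is installed:

```shell
# Sketch of an "incident snapshot" script. Run it the instant a problem
# is noticed; everything lands in one timestamped directory.
snap_dir=/tmp/incident-$(date +%Y%m%d-%H%M%S)
mkdir -p "$snap_dir"

date        > "$snap_dir/when"        # exact time of the incident
uptime      > "$snap_dir/load"        # load averages
ps -ef      > "$snap_dir/processes"   # what was running at the time
df -k       > "$snap_dir/disk"        # filesystem space
who         > "$snap_dir/users"       # who was logged in

# On an Informix host you would also capture engine state, for example:
# onstat -a > "$snap_dir/onstat-a"    # full engine status
# onstat -m > "$snap_dir/online-log"  # tail of the message log

ls "$snap_dir"
```

Collecting this takes seconds, and when you sit down later to duplicate the problem, you are comparing against the actual state of the machine at the moment of failure instead of someone's recollection of it.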

Has it Ever Worked Before?

This question is a key one for your debugging effort. If it has never worked before or if you are testing something for the first time, it will be relatively simple to determine if the database engine is working satisfactorily. If it is running properly, then the error is probably with the new program or process. If you are the one developing the new program or process, then you will already know how to continue to develop and debug your program. If it's being developed by someone else, let her know what you know about the problem and let her solve it. She is probably better equipped for the task.

Has Anything Changed?

If the system has worked correctly in the past, then something must have changed to make it not work now. Go back through all of your system and configuration logfiles and see if any changes have occurred on the malfunctioning system. Try to include all changes to the system, whether they appear to be database-related or not. Some things that are notorious for breaking Informix systems are:

  • Operating system upgrades

  • Network reconfiguration

  • Adding or removing other software (did it change the services file?)

  • Changing kernel parameters

  • Adding new Informix third-party programs or applications

  • Adding new Informix programs (did you mess up the TEN installation order?)
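Several of the items above, such as the services file, can be checked mechanically if you keep baseline copies of sensitive configuration files from a known-good state. The paths below are typical examples, not a complete list:

```shell
# Sketch: keep baseline copies of files that outside changes tend to
# break, then diff against them when trouble starts.
baseline=/tmp/config-baseline
mkdir -p "$baseline"

# Taken once, while the system is known to be good:
for f in /etc/services /etc/hosts; do
    [ -r "$f" ] && cp "$f" "$baseline/$(basename "$f")"
done

# Later, when something breaks:
for f in /etc/services /etc/hosts; do
    b="$baseline/$(basename "$f")"
    if [ -r "$f" ] && [ -r "$b" ] && diff "$b" "$f" > /dev/null; then
        echo "$f: unchanged"
    else
        echo "$f: CHANGED or unreadable, investigate"
    fi
done
```

A diff that comes back non-empty after an OS upgrade or a new software install is exactly the kind of silent change this section is warning about.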

Has the System Load Changed?

There will be times when you absolutely cannot find any changes that have been made to a system, and yet it has suddenly broken. If you're sure that there have been no overt changes made to the system, you have several areas that you should investigate:

  • Number of users on the system. (Has it recently increased?)

  • Number of Informix jobs running on the system.

  • Number of other jobs running on the system.

  • Jobs running from cron.

  • Change in the types of jobs that run.

  • Change in timing of when jobs are run. You see this a lot when someone decides to run a database archive during the day instead of running it at night as they usually do.

  • Amount of data being processed in batch jobs.

  • Differences in the data.

  • Changes in operational or functional staff. Maybe a new person is doing something differently, or maybe the usual person is on vacation.
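Most of the load factors above reduce to a few numbers you can collect daily and compare against a known-good day. A quick sketch (the oninit process name is the usual Informix engine daemon; on a non-Informix host that count will simply be zero):

```shell
# Sketch: quick load-profile numbers to compare against a baseline day.
echo "users logged in : $(who | wc -l)"
echo "total processes : $(ps -ef | wc -l)"
echo "engine processes: $(ps -ef | grep -c '[o]ninit')"   # [o] avoids matching grep itself
echo "cron entries    : $(crontab -l 2>/dev/null | grep -c '^[^#]')"
```

Run from cron and appended to a file, a profile like this turns "the load feels heavier lately" into a measurable trend.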

Hardware Problems

If it's not a change in configuration and it's not a change in load, look to a hardware problem as a potential culprit. No matter how capricious we think databases are, they usually don't break without reason.

Note that a hardware problem may not be catastrophic. In fact, it may be subtle and not really noticeable except in the context of your error. An example would be a hard disk that is going bad or that has developed media defects. Depending on where the problem is on the disk, it may become evident only when you try to access the one table or one data record that sits in the bad spot. Another example could be a slowdown or other intermittent problem in the network. Failures in certain jobs, perhaps only those that have to pass through a particular network switch or router, may be the only hint of network problems.
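One crude but effective check for the failing-disk case is to read the suspect device end to end and let dd report any I/O errors. The sketch below uses a scratch file as a stand-in; on a real system you would point the read at the suspect chunk or partition, read-only:

```shell
# Sketch: read a device (here, a 1 MB stand-in file) end to end and
# let dd surface media read errors. Never write to a suspect device.
dd if=/dev/zero of=/tmp/demo.dat bs=4096 count=256 2>/dev/null   # create the stand-in

if dd if=/tmp/demo.dat of=/dev/null bs=4096 2>/tmp/dd.err; then
    echo "read pass completed with no I/O errors"
else
    echo "read FAILED, inspect /tmp/dd.err and check the OS error log"
fi
```

A clean pass does not prove the disk is good, but a failed read pinpoints a bad region far faster than waiting for the one query that touches it.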

Software Bugs

If all else fails, you could be running into a latent bug either in Informix or in the operating system. I've met a lot of people who, the instant they have a problem with a product, are on the phone to their product support trying to report a bug in the product. Certainly, there are more than enough software bugs out there that could be causing a problem, and you cannot totally eliminate them from consideration. The odds are, though, that you have changed something else or that you are doing something wrong.

I've mentioned before that we software types are awfully eager to find bugs in other people's work, but in the 15 years I've dealt with Informix products, I've probably reported only three or four new bugs to Informix. True, I've run across other known bugs and shortcomings, but the new bug is a rare find. To be truthful, I've always had the fear that Informix treats bugs like astronomers treat stars. You find it; it's named after you. Actually, this may grant you more immortality than having a new star named after you. The star will probably explode within a couple of million years, but Informix bugs seem to be somewhat more persistent.

One of the things that you should do occasionally is go through the bug reports for your particular hardware and software configuration. The release notes in $INFORMIXDIR/release will contain information about bugs that have been fixed or that are known to be outstanding for your particular version. A much better source is the full bug list found at the Informix home page: http://www.informix.com. You'll need a special password to get to the actual bug list. Get it through your friendly Informix salesperson.

Have you Seen This Before?

If you have gotten this far and still have not zeroed in a little more on the source of the problem, you have a few more possibilities before you have to start tracing your way through code. Check to see if you or anyone else in your organization has ever run across this type of problem or a similar problem before. Hopefully, your system logs will have notations when others have solved problems. If you can't find anyone internally who has seen the problem, try posting a message on the Internet newsgroup favored by Informix cognoscenti: comp.databases.informix.

Just because this is at the bottom of the list does not mean that it is the last thing to do. When you approach a new problem, your mind automatically places this item pretty close to the top of the list. After all, if you've already solved the problem once, why go through all the rigmarole of working your way through it again?

Just beware of the red herring effect. Don't allow yourself to develop the mindset of "knowing" where the problem lies just because you've solved it this way before. Invariably, there will be some twist or permutation that makes it almost like a previous problem but different enough that the same old solution won't work again. Even worse, the old solution might actually make the problem worse.

Track the Logic Flow

Assuming that you've been able to duplicate the problem, either on a development or on a production system, you still need to be able to zero in on a potential cause. If you've been able to spot differences in the system that can generate the error, you've been lucky and you are well on your way to fixing the problem. If not, it's time to go to a deeper level of debugging, which is tracing the logic of the system. Here's where understanding the specifications and flow of the system comes in handy.

Formulate a Hypothesis

By now, you should have developed an idea about the underlying cause of the problem you're trying to fix. Since we're espousing a scientific approach to debugging, we may as well go all the way and give this idea its proper scientific name, a hypothesis.

Your hypothesis should be as specific as possible. Examples of good hypotheses are:

  • The system is too slow because of too-frequent checkpoints.

  • The system is too slow because a table has too many extents.

  • The system is crashing because of a defect on the disk.

  • Archives are failing because the TAPESIZE is set too high.

Your hypothesis should be both specific and testable.
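Take the first hypothesis as an example: "too-frequent checkpoints" is testable because checkpoint messages land in online.log, so you can measure the actual intervals between them. The log lines below are a simplified stand-in for real online.log entries:

```shell
# Sketch: test the "too-frequent checkpoints" hypothesis by measuring
# the gaps between checkpoint messages. Sample data stands in for
# online.log here.
cat > /tmp/ckpt.log <<'EOF'
09:00:05  Checkpoint Completed
09:00:40  Checkpoint Completed
09:01:12  Checkpoint Completed
09:05:30  Checkpoint Completed
EOF

grep 'Checkpoint Completed' /tmp/ckpt.log | awk -F: '
  { secs = $1*3600 + $2*60 + $3 + 0          # convert HH:MM:SS to seconds
    if (NR > 1) print "interval:", secs - prev, "seconds"
    prev = secs }
'
```

If the intervals really are a few tens of seconds apart during the slow period, the hypothesis survives its first test; if they are minutes apart, you can discard it and move on before touching any tuning parameters.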

Test the Hypothesis

With your new hypothesis in hand, develop a way to test it and see if it is correct. You should be able to make a guess at what should happen in certain situations if your hypothesis is correct. It's important to formalize this and to make the guess before you start testing. Otherwise, no matter how right your eventual answer may be, you're really just guessing and taking a haphazard approach to the analysis.

Once you've worked out an organized method of testing the hypothesis, you need to determine some criteria for scoring the results. If you're trying to speed something up, measure the speed before the change, make the change, and then remeasure the performance after making the change. Determine in advance how you will decide if the differences are really significant. Determine how to be sure that your changes are indeed the factor that is affecting the performance.
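The measure-change-remeasure loop can be scripted so that both measurements are taken the same way. In this sketch the workload is a placeholder sleep; on a real system it would be the query or batch job under test, run for example through dbaccess:

```shell
# Sketch: measure before, apply exactly one change, measure after.
# run_workload is a placeholder; substitute the real job under test.
run_workload() { sleep 1; }

measure() {
    start=$(date +%s)
    for run in 1 2 3; do run_workload; done   # average over several runs
    end=$(date +%s)
    echo $(( (end - start) / 3 ))
}

before=$(measure)
echo "seconds per run before change: $before"
# ... apply exactly one change here ...
after=$(measure)
echo "seconds per run after change : $after"
```

Averaging over several runs and deciding in advance how big a difference counts as significant keeps you from declaring victory over ordinary run-to-run noise.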

Try as much as possible to control the user load and system load during the before and after stage of testing. Only by controlling the environment to the greatest possible degree will you be able to attribute any performance changes to your work and not just to the phase of the moon.

Change One Thing at a Time

Inherent to the concept of controlling the environment is the idea of making only one change at a time. If you throw in five or six changes to the system at one time, you are not doing a methodical, scientific job of debugging. True, you may fix your problem, but you can never be sure exactly what you did to fix it.

Sometimes you will have to deviate from the structured, scientific debugging method because of expediency. You may have a system that does not allow much downtime, thus limiting your opportunity to experiment. You may be under the gun to fix all the problems at once. You may just not have enough time to do it right, and "almost right" may be good enough.

In these cases, you may be forced to make multiple changes at once. If possible, phase in the changes so that multiple changes at one time don't interact with one another. I'd feel better having to throw in multiple changes if the individual changes were to different areas of the system that probably don't interact with one another. This is often a hard call, as sometimes things that you don't think affect one another actually do have an effect. Here's where all of your rigorous scientific method in the past can come to your aid. If you follow a method for most of your debugging, in the rare case when you have to shoot from the hip, you'll have more to go on and it won't be random shooting.
