Isolate the Problem

A central principle of debugging is "divide and conquer." You cannot address a problem effectively when its domain is the entire system. Certainly, you need to be able to take a global view at times, but you will seldom if ever find a system that is totally messed up. If you do find such a system, the best recommendation is usually to scrap it and start over. In most systems, the problem will be restricted to one program, one function, or one functional error. Until you can narrow down your search, you'll be searching for the proverbial needle in a haystack.

One time when you do need a global view is when you are trying to visualize the logical layout of your system. You have to understand all of the components and how they fit together. If you don't, you'll never be able to tear the pieces apart and decide exactly where to look for your problem. Once you understand the functions and roles of the various components, you're ready to focus in and isolate which functions are failing and where the problem may be found.
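
One way to put "divide and conquer" into practice is a binary search over the stages of a process. The following is a minimal sketch in Python, not a prescribed method. It assumes that the stages run in a fixed order, that you can check whether the output is still correct after any given stage, and that once the output goes wrong it stays wrong. The stage names and the works_through check are hypothetical.

```python
def first_failing_stage(stages, works_through):
    """Return the earliest stage at which the output stops being correct.

    works_through(i) is a caller-supplied check (hypothetical here) that
    returns True if the output is still correct after stages[0..i] have run.
    Assumes that once the output is wrong, every later check also fails.
    Returns None if every stage passes.
    """
    low, high = 0, len(stages)
    while low < high:
        mid = (low + high) // 2
        if works_through(mid):
            low = mid + 1   # the fault lies after stage mid
        else:
            high = mid      # the fault lies at stage mid or earlier
    return stages[low] if low < len(stages) else None


if __name__ == "__main__":
    # Invented stage names; the fault is simulated at index 2.
    stages = ["load", "parse", "transform", "write"]
    print(first_failing_stage(stages, lambda i: i < 2))  # prints "transform"
```

The number of checks grows with the logarithm of the number of stages, which matters when each check is expensive to run.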

Duplicate the Problem

The first step is always to duplicate the problem. This is often the hardest step, especially if the problem is intermittent or load-related. Unless you can reliably duplicate a problem or error, you are basically shooting in the dark. Often, just the process of duplicating the problem will point you toward the solution.
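
For an intermittent problem, it helps to know how often the failure actually appears before you decide that a change has fixed it. The sketch below, written in Python as an illustration only, runs a reproduction attempt many times and counts the failures. The names failure_rate and make_flaky_job are hypothetical, and make_flaky_job merely simulates an intermittent fault so that the example is self-contained; in practice you would pass in whatever callable triggers the real problem.

```python
def failure_rate(attempt, runs=100):
    """Call attempt() `runs` times and return (failures, runs, first_error).

    Any exception raised by attempt() counts as one failure.
    first_error is the first exception seen, or None if every run passed.
    """
    failures = 0
    first_error = None
    for _ in range(runs):
        try:
            attempt()
        except Exception as exc:
            failures += 1
            if first_error is None:
                first_error = exc
    return failures, runs, first_error


def make_flaky_job(every):
    """Build a stand-in job that raises on every `every`-th call (simulation only)."""
    calls = 0

    def job():
        nonlocal calls
        calls += 1
        if calls % every == 0:
            raise RuntimeError("simulated intermittent failure")

    return job


if __name__ == "__main__":
    failures, runs, first_error = failure_rate(make_flaky_job(7), runs=100)
    print(f"{failures} of {runs} runs failed; first error: {first_error}")
```

Recording the failure count before and after a change gives you something firmer than "it seems to work now."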

Duplicating the problem can be a frustrating exercise, especially if you are working on a system other than the one that is experiencing the problem. It is usually preferable to have a development system and a production system so that you can do your debugging on the development system without fouling up the production system. Even if your development system is identical to the production system, it is sometimes difficult to reproduce on a development system the same user and data load that you see on the production system.

There will be times when you cannot duplicate the error on a development system and have to perform your investigations on the production system. This sometimes leaves the debugger with a difficult decision. When you are finally able to duplicate a problem on the production system, do you try for a quick workaround to get production back up, or do you let the production system stay fouled up or, even worse, down until you can track down the problem?

The ivory tower answer is never to debug on the production system. This is good in theory and can sometimes be the best solution, provided that you can maintain the development environment as an exact duplicate of the production environment. Often, though, this is impossible and you'll find yourself debugging on the production system. In this case, follow the same principle that medical doctors must follow: "First of all, do no harm." If your debugging efforts on a production system make the problem worse or introduce new problems, you can expect a few angry phone calls from users and managers.

Check It in Different Environments

The very fact that you cannot duplicate an error on a development system can allow you to eliminate some potential dead ends in your search for a solution. If an error occurs on one system but not on another, you must determine where the systems differ. Assuming that the systems are supposed to be alike, you will have several possibilities for the differences:

  • Hardware: Are there hardware differences between the systems? Look at things like disk capacity and memory. Are there potential hardware errors? Check operating system error logs to see if you may have a hardware problem.

  • Operating system: Are the OS versions the same? Are the patch levels the same? Has something been changed on one of the systems? Are the swap spaces the same for UNIX? How about paging files on NT? What about temp space?

  • Users: Are you running the jobs under the same user ID? Are their environments the same? Are their permissions the same?

  • Applications: Are you sure that the applications on both systems are exactly the same? Were they compiled the same way, with the same compile options? Are all of the setup variables identical between the two systems?

  • Load factors: Is the production system under an especially heavy load? Are there more users than normal? Is the mix of jobs running on the system different? Is there something new running that hasn't been running before?

  • Database engine: Are you using exactly the same version of Informix on both systems? Are there differences in tuning? Are there differences in PDQ tuning?

  • Network: Are there network slowdowns in the production system? Can you ping all of the parts of the system? Can you connect to Informix from various components?

  • User factors: Are there new users running on the production system? Could they be doing something different? Could you be running the program differently from the way the users are running it?

Each of these potential areas of difference applies not only between a test and a production system but also within the malfunctioning system itself over time. Has any of these items changed on the system since it last worked correctly?

If you find differences, either between test and production or between an earlier version of the production system and the current version, this may very well point you in the direction of the solution.
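
Comparing two systems item by item is tedious by hand, so it can be worth recording each system's settings as simple key/value pairs and comparing them mechanically. The following Python sketch shows the idea. The keys and values are invented for illustration; how you collect them (operating system commands, configuration files, environment variables) depends on your installation and is not shown.

```python
ABSENT = "<absent>"  # marker for a setting recorded on only one system


def diff_snapshots(reference, suspect):
    """Return {key: (reference_value, suspect_value)} for every setting that differs.

    A setting present on only one side is reported with ABSENT on the other.
    """
    differences = {}
    for key in sorted(set(reference) | set(suspect)):
        ref_value = reference.get(key, ABSENT)
        sus_value = suspect.get(key, ABSENT)
        if ref_value != sus_value:
            differences[key] = (ref_value, sus_value)
    return differences


if __name__ == "__main__":
    # Hypothetical snapshots; none of these values come from a real system.
    development = {
        "os_patch_level": "patch-12",
        "swap_mb": 2048,
        "user_id": "appuser",
        "compile_options": "-O2",
    }
    production = {
        "os_patch_level": "patch-14",
        "swap_mb": 2048,
        "user_id": "appuser",
        "pdq_priority": 50,
    }
    for key, (dev, prod) in diff_snapshots(development, production).items():
        print(f"{key}: development={dev!r} production={prod!r}")
```

The same function works for comparing a snapshot of the production system taken while it was healthy against one taken after the problem appeared.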
