Obstacles to Debugging

The Unreliability of Witnesses

In the real world, witnesses have ulterior motives, they forget things and embellish others, they make unwarranted assumptions, and they misidentify suspects. Why should it surprise us that humans who cannot or will not get it straight in a life-or-death murder mystery should automatically turn into paragons of the scientific method when reporting computer problems? Of course they don't.

Anything reported to you needs to receive the same critical treatment that the mystery sleuth gives his prime witnesses. Of course you cross-examine them! You usually don't need bright lights and rubber hoses, although there have been times when I wished for them. Question everything. "What do you mean it's too slow? Too slow compared to what? How are you measuring slow? Was it ever fast?"

Questioning everything can also be a very useful tool in dealing with an irate customer or user. Occasionally reflect back to them what they just said: "So, you believe that the posting process is now taking three times as long as it used to, did I get you right?" Don't be afraid to say, "Slow down, I don't understand." Take your time in getting this information from your customer. Don't rush, and pay attention if the customer starts off down another track. You may gain some good data. In addition to helping you get the complaint straight, this type of questioning lets the customer know that you're listening.

Don't be shy about wanting to see the evidence for yourself. If possible, try to see everything yourself. Don't be happy just getting some logfiles and core dumps. Can the customer replicate the problem? Have them try to show you the whole process. An interesting example of this occurred after a 6 a.m. emergency flight to Tampa to fix a customer's problem.

A business-critical system that had worked for a year was suddenly getting overloaded and crashing in most un-business-critical ways. Since this system costs the customer over $20,000 an hour when it's down, I got a five a.m. call from my CEO telling me to get on the nearest airplane and go solve the problem. When I got there, I was totally stymied. I did my usual snooping around in logfiles. They showed abrupt crashes with no warnings in the logs. I did my usual poking around for about an hour. By then I was certain that nothing "normal" was happening. I asked the users to really crank up their use of the system so that I could see it crash. The users in this case were about 40 keyers keying financial data into the system. Sitting back in a cubicle, I could not see any indications that the system was under stress. Finally, I heard one of the keyers say to the group at large, "This status job is frozen up again. Should I kill it?" The manager was just ready to answer "yes" when I yelled out, "Don't touch it!" and went to find out what was going on.

It turns out that there was a script that gave work status to the keyers. It was home-grown, and was not part of the supported product. This script had just been modified to give more information, and the SQL had some problems. It was doing sequential scans on big tables. What happened was that the majority of the info needed by the keyers popped up in the fast part of the query. Rather than waiting for the query to complete, the users learned that they could "control-c" out of the process. In fact, the first one who learned it sent e-mail to all the other keyers. What was happening was that exiting with a control-c was leaving database backends strewn all about the landscape, eventually exhausting system resources and causing a crash.
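The failure mode in this story, a client interrupted with control-c before it releases its database connection, can be guarded against in the client itself. The following is only a minimal sketch of that cleanup pattern, not the customer's actual script: it uses Python's built-in sqlite3 module as a stand-in (sqlite3 has no server backends, so it cannot reproduce the orphaned-backend problem), and the name status_connection is hypothetical.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def status_connection(path=":memory:"):
    """Open a connection and guarantee it is closed, even on control-c.

    Hypothetical helper: sqlite3 stands in for whatever database the
    real status script talked to.
    """
    conn = sqlite3.connect(path)
    try:
        yield conn
    except KeyboardInterrupt:
        # control-c reaches Python as KeyboardInterrupt. Ask the engine
        # to abandon any query still running, then let the interrupt
        # propagate so the script still exits.
        conn.interrupt()
        raise
    finally:
        # Runs on normal exit and on interrupt alike, so the connection
        # is never left behind.
        conn.close()
```

A status script written this way would run its query inside "with status_connection() as conn:", so a keyer who bails out early still releases the connection. Whether a server-based database frees the matching backend promptly depends on that database and its driver.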

This problem would never have been solved without getting out there with the keyers and watching what they were doing. Pay attention to everything. Sometimes even the most insignificant item will lead to cracking the mystery.

Red Herrings

Red herrings are little false clues that mystery writers love to leave scattered about their work just to throw the readers off the track. The same dark beings that write good mysteries must also be in the bug-generating business because the red herring is one of the most common inhabitants in the bug universe.

Maybe a few examples of red herrings will clarify how they behave. I've had an instance where we replaced a SCSI tape drive several times due to hardware failures. We'd get the drive running, and a day or so later, it would turn itself into a smoking mass of aluminum and plastic. Two replacements, two tape drives dead on arrival. Finally, we got it running and surprise, the next day the database archive failed with a hardware error. We immediately got onto the line with the vendor, arranging to have another one shipped. Once it came in, it had the same problem. After a lot of debugging, we determined that the tape drive was not part of the problem. The archive had just reached a 2-gig size, and the operating system wasn't able to write a file that big. We were so primed to find a hardware problem that we were blind to the real nature of the problem.
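A size limit like the one in the tape drive story is cheap to check for before blaming hardware. The sketch below assumes the limit came from signed 32-bit file offsets, which cap a file at 2**31 - 1 bytes; the story does not say that this was the cause, and the function name is hypothetical.

```python
# Largest file size addressable with a signed 32-bit offset:
# 2,147,483,647 bytes, one byte short of 2 GiB. This is an assumed
# limit for illustration, not a fact taken from the story.
MAX_SIGNED_32BIT_OFFSET = 2**31 - 1


def fits_in_32bit_offset(current_size, bytes_to_append=0):
    """Return True if the file would still fit after the append."""
    return current_size + bytes_to_append <= MAX_SIGNED_32BIT_OFFSET
```

Running a check like this against the archive size before each write would flag the problem as a file size limit rather than as a hardware error.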

Everyone who deals with multiuser systems is probably aware of the next red herring. Have you ever been in the middle of doing something on a UNIX machine, only to have the system either freeze up or crash? What's the first thing you think? Of course, it's "What did I do to crash/freeze up the system?" You start looking at what you were doing just prior to the problem, looking for excuses to use when the sysadmin comes calling. Actually, you probably had nothing to do with the problem. You were a victim, just like all the other users.

There's probably a good psychological reason why computer types are so prone to falling into red herring traps. We're judged by others based upon how fast we can come up with a solution to a problem. As a group, we're pretty sure of ourselves and usually have pretty big egos. If a red herring presents itself as a quick, easy answer to a problem, we tend to seize it as the answer.
