Chapter 2. My System Is Hung!

System hangs can be a great source of frustration for system administrators. At some time or another, every system administrator has looked at a system and wondered if it was alive, dead, or just incredibly slow. After a few long moments, the admin begins to realize that he has been staring at a “hung” system.

What is a system hang?

System hangs come in all sorts of varieties, but they exhibit one common symptom: the system is no longer completely usable. Unlike panics, which instantly render the system completely unusable, a system hang might slowly eat the system resources, finally resulting in a completely useless system.

What conditions cause hangs?

One common source of system hangs is a deadlock: a situation in which one process is waiting for something that is locked up by another process, which is itself waiting for something that the first process owns. Deadlocks can be caused by what are referred to as “race conditions.” A race condition occurs when more than one program tries to use a given resource without agreeing on some sort of locking or control mechanism. For example, if two routines both try to manipulate the same data structure in memory without locking mechanisms, who can predict the results?

Race conditions exist outside of computers. Imagine sharing a bank checking account with ten other people. If you don’t agree ahead of time on some sort of rules on how to manage the account, you could soon find yourself with a financial mess on your hands.
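The shared checking account can be sketched in C with POSIX threads. This is only an illustration under stated assumptions: the names run_deposits, depositor, NUM_OWNERS, and NUM_DEPOSITS are invented for this example, and the mutex stands in for whatever "rules" the account holders agree on.

```c
#include <pthread.h>
#include <stddef.h>

#define NUM_OWNERS   11     /* hypothetical: you plus ten other people */
#define NUM_DEPOSITS 10000  /* hypothetical: deposits made by each owner */

static long balance;        /* the shared resource */
static pthread_mutex_t balance_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each owner deposits one unit at a time, holding the lock around
 * every read-modify-write of the shared balance. */
static void *depositor(void *arg)
{
    (void)arg;
    for (int i = 0; i < NUM_DEPOSITS; i++) {
        pthread_mutex_lock(&balance_lock);
        balance = balance + 1;
        pthread_mutex_unlock(&balance_lock);
    }
    return NULL;
}

/* Run all owners to completion and return the final balance,
 * or -1 if a thread could not be created. */
long run_deposits(void)
{
    pthread_t owners[NUM_OWNERS];
    int started = 0;

    balance = 0;
    for (int i = 0; i < NUM_OWNERS; i++) {
        if (pthread_create(&owners[i], NULL, depositor, NULL) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(owners[i], NULL);

    return (started == NUM_OWNERS) ? balance : -1;
}
```

With the lock in place, run_deposits() should return NUM_OWNERS times NUM_DEPOSITS. If the two mutex calls are removed, updates from different threads can overwrite one another and the total may come up short, which is the financial mess described above. On most systems the sketch would be compiled with the compiler's thread option, for example cc -pthread.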

System hangs can also occur when resources dry up and the system has to sit around waiting for more resources before it can continue doing what was asked of it. In this case, it would make more sense for the software to report the resource problem to the system administrator, but in some cases this may not have been a predicted scenario, so the code may not have been designed to handle it well, if at all.
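At the user-program level, reporting a resource problem rather than silently waiting might look something like the following. This is a hedged sketch, not how any particular kernel or application handles exhaustion; alloc_or_report is a hypothetical helper name, and memory is used only as a convenient example of a resource that can run out.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: try to obtain a buffer. On failure, say what
 * was wanted and why, then return NULL so the caller can decide what
 * to do instead of waiting indefinitely for the resource. */
void *alloc_or_report(size_t nbytes, const char *purpose)
{
    void *buf = malloc(nbytes);

    if (buf == NULL) {
        fprintf(stderr, "cannot allocate %lu bytes for %s\n",
                (unsigned long)nbytes, purpose);
    }
    return buf;
}
```

A caller would check the return value, for example `if ((p = alloc_or_report(len, "request queue")) == NULL)`, and then back off or exit cleanly. Code that was never written with this failure in mind is the kind that ends up waiting forever.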

Occasionally, system hangs can be caused by hardware problems. For example, if a problem develops with the data transfer cable attached to a disk drive (part of the “bus”), the communication between the system and the disk drive could become so unreliable that the two would no longer be able to work together. The result might be a hung bus, or a system so confused that it gets stuck in a loop trying to do nothing else but communicate with the dysfunctional drive.

How do you know if your system is hung?

Depending on the cause of the hang, some users on the system might be able to continue working while others see the system as dead. On some occasions, you might not be able to remotely log in, with rlogin, from another system, but will be able to log in to the system at the console. Some hung systems will respond to low-level network commands such as ping, while others will not. Finally, some systems will slow down, creeping toward the hang, giving you, the observant system administrator, a hint of what is to come, while other hangs will appear to be instantaneous.

If the system is hung, you will not see panic messages on the console. However, if you are lucky, again depending on the reason for the hang, you might see output on the console that will point you to the source of troubles. The problem could simply be that someone powered down the disks or that an Ethernet cable became disconnected.

Unfortunately, system hangs can also be caused by programming problems or bugs at the system or kernel level.

Unless you immediately locate a simple hardware problem or a rather embarrassed programmer who just discovered the side effects of running his simulation program in real time on a heavily used DBMS server, you will need to attempt to force a system panic in order to get an image of memory for analysis.

What is a program hang in comparison?

Let’s say you write a tiny C program that simply loops forever, as this program does.

Example 2-1. loop.c

  int main(void)
  {
    /* Spin forever; the loop body is intentionally empty. */
    while (1)
    {
    }
  }

This program will run, happily circling in a while loop, until interrupted by the user or terminated via the UNIX kill command.

Now, convert the loop program into a subroutine or function. Write thousands of lines of complex application code and have the new code jump into the little loop subroutine at noon on Thursdays.

You’ve now created a program that will hang at noon on Thursdays. The application will have to be kill’ed. Normal execution is now impossible because you are stuck in that simple little loop.
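A minimal sketch of that scenario follows. The function names is_thursday_noon, loop, and periodic_task are hypothetical, and "noon" is taken here to mean any time during the 12 o'clock hour, local time; the thousands of lines of application code are left to the imagination.

```c
#include <time.h>

/* Returns 1 when the broken-down time falls within the noon hour
 * on a Thursday (tm_wday counts days since Sunday, so Thursday is 4). */
int is_thursday_noon(const struct tm *t)
{
    return t->tm_wday == 4 && t->tm_hour == 12;
}

/* The loop from Example 2-1, now a subroutine that never returns. */
void loop(void)
{
    while (1) {
    }
}

/* Hypothetical routine called from somewhere deep in the application. */
void periodic_task(void)
{
    time_t now = time(NULL);
    struct tm *local = localtime(&now);

    if (local != NULL && is_thursday_noon(local))
        loop();   /* the application hangs here until it is killed */
}
```

Every other day of the week, periodic_task returns immediately and the application behaves normally, which is part of what makes this sort of hang hard to track down.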

While the application is hung up in a loop, the rest of the users on the system are doing just fine. A hung program doesn’t affect the rest of the system unless it happens to be eating a disk or two or other kernel resources in the process. So, it is very important to remember that if one user calls you and reports that the system is hung, it may not be true. Dig a bit deeper before you force a system panic!

Capturing system hang information

In most cases, a system crash dump of a hung system can be forced. However, this is not guaranteed to work for all system hang conditions.

To force a dump, you need to drop down to the boot PROM monitor, suspending all current program execution. On Sun systems using Sun monitors for the console, this suspension is done via what is referred to as “L1-A.” L1 was the label on the earlier Sun keyboards for the top left key on the console keyboard. On the newer keyboards, this key is labelled “Stop.” Some keyboards are labelled both ways. While holding down the L1 key, you press the A key. On systems using ASCII terminals for the console, usually the Break key can be used to get to the boot PROM monitor.

Depending on the boot PROM that you have, the boot PROM monitor will respond with:

Type b (boot), c (continue), or n (new command mode) 
> 

or:

Type 'go' to resume 
ok 

or simply:

> 

If you don’t see one of these messages, you were probably not successful in stopping the system.

If you find you are at the > prompt, enter n to get into the new command mode, which will give you the ok prompt. Once at the ok prompt, enter sync. The system will immediately panic. The hang condition has now been converted into a panic, so an image of memory can be collected for later analysis. The system will attempt to reboot after the dump is complete.

If you have an older Sun that doesn’t have the new command mode, enter g0 at the > prompt.

Both the sync and the g0 commands force the computer to illegally use location 0, thus forcing a panic: zero.

Not all hang situations can be interrupted. If L1-A or Break doesn’t work the first time, sometimes repeating it several times in a row will do the trick. Some hangs are even more stubborn and can only be interrupted by physically disconnecting the console keyboard or terminal from the system for a minute.

If all these attempts fail, you will have to power down the system, thus sadly losing the contents of memory. With luck, a subsequent hang will be interruptible.

Let’s move on now to the next step, using the savecore program.
