Chapter 4. Hey! We Got One!

Whether you were expecting it or not, you’ve discovered that your system has panicked. If all went well, savecore did its job and there are now system crash dump files in the savecore directory for you to analyze. However, not every crash goes so well.

Before we move on to analyzing postmortem files, let’s discuss a few other issues regarding system crashes.

What to do when your system has crashed

Depending on the cause of the system crash, the system may not have been able to reboot itself successfully. Cases where this would be true include:

  • Catastrophic hardware failure, such as faulty memory or a crashed disk

  • Major kernel configuration faults, such as a buggy device driver

  • Major kernel tuning errors, such as maxusers set much too high

  • Data corruption, including corruption of the operating system files

  • A need for manual intervention, such as fsck waiting for answers to its queries

Was the system recently tuned?

If you just tuned your system, tried to reboot under the new kernel, and the system panicked, you already have a good idea where to start your search for the cause of the panic. If you named your new, untested kernel /vmunix on your Solaris 1 system or if you directly edited /etc/system on your Solaris 2 system, you will most likely find the system in an endless boot and panic loop. Rebooting the “generic” kernel for Solaris 1 will get the system back up. For a Solaris 2 system in this scenario, you can use boot -a and choose /dev/null as your /etc/system file to return to a generic kernel.

When tuning systems and testing the new kernel changes, it’s a good idea not to use /vmunix or /etc/system until you know the changes are good. Instead, use /vmunix.test or /etc/system.test, for example. That way, should the system panic, at least the system will have a better chance of coming back up under a known good kernel. This is particularly sound advice if you are planning on going on vacation right after tuning a new kernel and booting it up.
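For a Solaris 2 system, the habit described above might be sketched as follows. The function name stage_system_file is our own invention, and the sketch assumes that boot -a accepts /etc/system.test when it asks for the system file, just as it accepts /dev/null.

```shell
# Sketch: keep tuning changes in a test copy so the known good file is untouched.
# $1 is the live file and $2 the test copy; both default to the names in the text.
stage_system_file() {
    live=${1:-/etc/system}
    trial=${2:-/etc/system.test}
    # -p preserves the modification time and permissions of the original.
    cp -p "$live" "$trial" || return 1
    echo "Edit $trial, then name it when boot -a asks for the system file."
}
```

Only after the system has run well under the test file would you copy the changes into /etc/system itself.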

Has anything else changed recently?

If the system had been running beautifully for the past year, suddenly died, and now won’t come back up, you will need to read the messages that appear during the boot attempts. Look for messages that might point to hardware trouble. It would be a good idea to check all of the cables for proper connections. Also, make sure all the disk drives and other peripherals are still getting power. If everything seems to be in order, attempt to run diagnostics on the hardware.

On occasion, systems demonstrate sensitivity to their environment. With a workstation sitting on your desk next to your plants and your coffee mug, it’s sometimes easy to forget that computers are ultrasensitive electronic devices. Always remember:

  • Proper air flow is required for cooling the electronic components.

  • If the environment is much too hot for you, it is probably also too hot for your computer. Power down your computer equipment if you expect the air-cooling systems in your area to be shut down.

  • Unless protected by an Uninterruptible Power Supply (UPS), your system can suffer damage during electrical storms and interruptions of power.

  • Dirt and dust inside some computers can lead to problems over time. Discuss with your vendor whether Preventative Maintenance visits are recommended.

  • Unless a system is designed to ruggedized standards, it can be damaged by high vibration and excessive movement.

  • Power down all components of the system whenever you need to do hardware repairs, replacements, or rearrangements. Don’t, for example, change SCSI devices while the system is running.

  • Electrostatic discharge will easily damage your computer. Never touch or let anyone else touch the internal workings of your system without proper ESD protection.

Is the system still usable?

If the system reboots itself after a panic, chances are good that the system will be usable, if only for a short while. Some panics and crashes will show up once in a blue moon, whereas others, once encountered, will increase in frequency. It all depends on the nature of the crash.

Assuming that the system is usable for now, you can use the system to analyze the savecore files that are awaiting you in the savecore directory.

If your system is one that serves several users, whether directly or indirectly as a data server, you may want to notify your user base that the system may be going down unexpectedly in the near future. Although not the best of news, it does give the users the option of backing up their work more frequently. For the moment, though, assume that the system is usable.

If you have not backed up your file systems recently, now would be a good time! However, just to be extra safe, use a different set of tapes, in case damage has already been done and you need to revert to the prior set of backups.

Turn off savecore? (How many dumps will you need?)

Once an image of a system crash has been captured, you need to again assess how you are doing on disk space. Do you have room for a subsequent set of savecore files should the system crash again? If not, you might want to move the files to another file system for analysis, clearing up space for the next crash. If you don’t plan to analyze the files yourself, archive them to tape as soon as possible and free up the disk space.
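One way to answer the disk space question is to compare the free space in the savecore directory with the size of the dump you just captured. The following is a minimal sketch, not a standard tool: the function names are our own, and it assumes your df accepts the -k and -P options.

```shell
# Sketch: is there room in a directory for another dump of a given size?

# Print the free space, in kilobytes, of the file system holding directory $1.
free_kb() {
    # -k reports 1024-byte units; -P (POSIX format) keeps each entry on one line.
    df -kP "$1" | awk 'NR == 2 { print $4 }'
}

# Print the size of file $1 in kilobytes, rounded up.
file_kb() {
    bytes=$(wc -c < "$1")
    echo $(( (bytes + 1023) / 1024 ))
}

# Succeed if the file system holding directory $1 has at least $2 kilobytes free.
has_room() {
    [ "$(free_kb "$1")" -ge "$2" ]
}
```

From inside the savecore directory, has_room . "$(file_kb vmcore.0)" succeeds only if a second dump about the size of vmcore.0 would still fit.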

At this time, the second question you need to ask yourself is whether you really need another set of postmortem files. To answer this, you need to consider the recent history of the system’s performance. Has it been crashing a lot lately and you’ve just enabled savecore to capture one crash? Have the symptoms of the past crashes been reliably predictable?

For example, if your system crashes only when you boot a certain kernel, you probably need only the one set of savecore files and can disable savecore for the time being. If, however, your system has never crashed before, it would be wise to keep savecore enabled for now. It is often a good idea to have at least two sets of crash files for comparison.

Generally speaking, we feel savecore should always be enabled and ready to go in case the worst happens and your system decides to panic.

If you choose to maintain the savecore files on disk, use the UNIX compress command to squeeze them down to a smaller size. This will gain you some disk space. If you’ve never used compress before, here’s an example that might convince you of its worth. The following savecore files are from a large Sun SPARCcenter 2000 server.

Example 4-1. Compress your savecore files to save disk space

Hiya... ls -l 
total 268154 
-rw-rw-rw-  1 kbrown   15       1272308 Sep  1 12:28 unix.0 
-rw-rw-rw-  1 kbrown   15     135077888 Sep  1 12:29 vmcore.0 
Hiya... compress unix.0 vmcore.0 
Hiya... ls -l 
total 51082 
-rw-rw-rw-  1 kbrown   15        669336 Sep  1 12:28 unix.0.Z 
-rw-rw-rw-  1 kbrown   15      24592643 Sep  1 12:29 vmcore.0.Z 
Hiya... 

The 135-megabyte vmcore.0 file compressed to less than 25 megabytes, a huge saving!
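To put numbers on the saving, the byte counts from Example 4-1 can be fed to a small helper. The function percent_saved is our own illustration, not part of any UNIX distribution.

```shell
# Print the percentage of space saved, given original and compressed byte counts.
percent_saved() {
    awk -v orig="$1" -v comp="$2" 'BEGIN { printf "%.1f\n", 100 * (1 - comp / orig) }'
}

percent_saved 135077888 24592643   # vmcore.0 from Example 4-1
percent_saved 1272308 669336       # unix.0 from Example 4-1
```

For these files the saving works out to about 81.8 percent for vmcore.0 and about 47.4 percent for unix.0. Your own results will vary with the contents of memory at the time of the crash.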

Saving the crash to tape for shipment or archives

When archiving a set of crash dumps to tape, you may wish to first compress the (vm)unix.X and vmcore.X files, again, to use less media space. This also makes life a bit easier for the person who will later read the tape onto his own system to analyze the files, initially allowing him to use less disk space until he is ready to uncompress the files and start the analysis work.

When compressing the files, please use the standard UNIX compress command instead of your favorite public domain or third-party compression utilities. Don’t assume that the person to whom you are sending the tapes uses nonstandard programs.

After writing the files to tape, write-protect the tape and, only then, verify that you can read the tape successfully. Too many potentially valuable system crash files have been lost due to faulty tapes!
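The write, protect, and verify sequence might look like the sketch below. It assumes tar is your archive format of choice, and the two function names are our own. The check compares the tape's table of contents with the list of files you meant to write; it is a listing check, not a byte-for-byte comparison.

```shell
# Sketch: write files to a tape (or archive file) with tar, then confirm that
# the archive can be read back and lists exactly the files that were written.

# Write the files named after $1 to the tape device or archive file $1.
write_archive() {
    dest=$1
    shift
    tar cf "$dest" "$@"
}

# Read the archive $1 back and compare its contents with the remaining arguments.
verify_archive() {
    dest=$1
    shift
    wanted=$(printf '%s\n' "$@" | sort)
    found=$(tar tf "$dest" | sort)
    [ "$wanted" = "$found" ]
}
```

Run both from the savecore directory with plain file names, for example write_archive /dev/rmt/0 unix.0.Z vmcore.0.Z, then write-protect the tape, then verify_archive /dev/rmt/0 unix.0.Z vmcore.0.Z. The device name /dev/rmt/0 is an assumption; substitute the tape device on your own system.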

Finally, label the tape!

Once the savecore files are safely archived, you can remove them from the disk. In general, it is a good idea to maintain the bounds file, which contains the next sequence number to use. Not only does it help provide a history of how many crashes have been captured, but it helps prevent you from ending up with a dozen vmcore.0 files on tapes over time. It also, again, makes life just a bit easier for the person you send your crashes to for analysis. He won’t have to keep shuffling things around to avoid overwriting the previous crash that had the same sequence number and thus the same file names.
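If you are curious which names the next crash will receive, the bounds file can be read directly. The helper below is a sketch of our own; it uses the Solaris 2 name unix.N for the kernel file and assumes that numbering starts at 0 when no bounds file exists.

```shell
# Print the file names the next crash dump would receive, based on the
# bounds file in savecore directory $1.
next_dump_names() {
    dir=$1
    if [ -r "$dir/bounds" ]; then
        n=$(cat "$dir/bounds")
    else
        n=0   # assumption: with no bounds file, numbering starts at 0
    fi
    # On a Solaris 1 system the kernel file would be vmunix.$n instead.
    echo "unix.$n vmcore.$n"
}
```

Remove the bounds file along with the dumps and the next crash becomes unix.0 and vmcore.0 all over again, which is exactly the collision described above.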

If you plan to send the tape to another person for analysis, it is best to provide the following information:

  • System activity as best known at the time of the crash.

  • A brief description of the crash history for this system from the system administrator’s point of view.

  • The system configuration and tuning files. From a Solaris 1 system, provide the kernel configuration file and the param.c file. From Solaris 2 systems, provide the /etc/system file.

  • A list of software modifications and installed patches. From a Solaris 2 system, provide showrev -p output.

  • General system and network information, including:

    • Hardware configuration. From a Solaris 1 system, devinfo -vp output is helpful. From a Solaris 2 system, provide prtconf -vp output.

    • List of third-party drivers and applications.

    • Network-based server and client relationships.

  • The /var/adm/messages* files.

The more information you can provide to the person who will analyze the crash files, the better idea he will have of where to start his search for the cause of the problem.
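Much of the list above can be gathered by a short script. The following sketch is written for a Solaris 2 system; the function name, the output file names, and the NOTES template are all our own inventions. Commands and files that are not present are simply skipped, and the items that only a person can supply are left as headings to fill in by hand.

```shell
# Sketch: gather supporting information for the analyst into directory $1.
collect_crash_info() {
    out=$1
    mkdir -p "$out" || return 1

    # Tuning file and system message logs.
    [ -r /etc/system ] && cp /etc/system "$out/etc_system"
    for f in /var/adm/messages*; do
        [ -r "$f" ] && cp "$f" "$out/"
    done

    # Installed patches and hardware configuration, if the commands exist.
    if command -v showrev >/dev/null 2>&1; then
        showrev -p > "$out/showrev-p.out" 2>&1
    fi
    if command -v prtconf >/dev/null 2>&1; then
        prtconf -vp > "$out/prtconf-vp.out" 2>&1
    fi

    # Headings for the information that no command can provide.
    cat > "$out/NOTES" <<'EOF'
System activity at the time of the crash:
Crash history for this system:
Third-party drivers and applications:
Network-based server and client relationships:
EOF
    return 0
}
```

Running collect_crash_info /var/tmp/crashinfo (the directory name is only an example) leaves a directory that can be written to the tape alongside the compressed crash files.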
