Chapter 1. My System Has Crashed!

What is a system crash?

Since the beginning of time, midnight January 1st, 1970 according to UNIX, computer systems have crashed. The term “system crash” can refer to any of several conditions in which the system has suddenly become useless. These include:

  • System panics & bad traps

  • Watchdog resets

  • Dropping out (to boot PROM or bootstrap level)

In this chapter, we will limit our discussion to panics and bad traps.

What conditions cause panics?

While some folks see panics as horrible things, they really should be seen as system and data integrity safeguards. A good operating system programmer will embed calls to the panic() routine throughout his code when checking the integrity of the system resources he is referencing and manipulating.

For example, if the system programmer’s section of code is about to free up a block of disk that is known to be in use, he might have his program first verify that the disk is still marked as in use. If the block is suddenly found to be marked as free before he freed it, then his code should not be freeing it. But how did the block magically become free? Somehow, somewhere, something went terribly wrong. By calling panic(), the system programmer can bring the system to a sudden stop, thus safeguarding the system and its data from additional corruption until the problem is found.
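To make the idea concrete, here is a minimal user-level sketch in C of that kind of check. Everything in it is hypothetical: the block_in_use table, the block_alloc() and block_free() routines, and toy_panic(), which merely stands in for the kernel’s panic() by printing a message and aborting. It is not taken from any real kernel.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define NBLOCKS 8

/* Hypothetical allocation map: true means the block is in use. */
static bool block_in_use[NBLOCKS];

/* Stand-in for the kernel's panic(): report the problem and stop at once. */
static void toy_panic(const char *msg)
{
    fprintf(stderr, "panic: %s\n", msg);
    abort();
}

/* Mark a block as in use; out-of-range block numbers are ignored. */
void block_alloc(int blk)
{
    if (blk >= 0 && blk < NBLOCKS)
        block_in_use[blk] = true;
}

/* The integrity check: a block may only be freed if it is valid and in use. */
bool block_is_consistent_to_free(int blk)
{
    return blk >= 0 && blk < NBLOCKS && block_in_use[blk];
}

/* Free a block, but refuse to continue if the bookkeeping looks corrupt. */
void block_free(int blk)
{
    if (!block_is_consistent_to_free(blk))
        toy_panic("block_free: block is already free");
    block_in_use[blk] = false;
}
```

Calling block_alloc(3) followed by block_free(3) succeeds quietly. A second block_free(3) would reach toy_panic(), because the block is no longer marked as in use and something has clearly gone wrong.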

panic() can only be called by the operating system while in kernel mode. No user, not even the super-user, whom we will often refer to simply as “root” throughout this book, can actually write an application program that calls panic(). However, any program that exercises a bug in the operating system might trigger a panic. For example, if the user’s program uses a new device driver that is still being debugged, program execution moves into kernel mode whenever the driver is needed. Once in kernel mode, panics are possible. It may appear to the user that his program panic’ed the system, but in reality, his program only triggered the chain of events that led to the panic.

Simply put, if a system panics, it is really because the operating system detected a condition where the integrity of the data was suspect or the data was in danger of being corrupted.

Let’s try this data integrity concept again from a user level programming point of view. If you write a program that opens a file by using the open() system call, you will presumably check the return status of open() to make sure the desired file was indeed opened successfully before going on to the next step. If the open() status shows that an open failure occurred, you will probably have your program report this condition and either exit, prompt the end user for a new filename, or simply take a different course of action. If you open the file and ignore the status returned from the open() system call, you are asking for potential problems further down the line. The integrity of your data is at risk.
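A short sketch in C of checking the status returned by open() might look like the following. The open() call, errno, and strerror() are standard; the wrapper name open_checked() is our own invention for illustration, and what the caller does on failure (exit, prompt again, or take another path) is left open.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Open a file read-only and check the status that open() returns.
 * Returns a usable file descriptor, or -1 after reporting the failure.
 */
int open_checked(const char *path)
{
    int fd = open(path, O_RDONLY);

    if (fd == -1) {
        /* open() failed; errno says why. Report it rather than carry on. */
        fprintf(stderr, "cannot open %s: %s\n", path, strerror(errno));
        return -1;
    }
    return fd;
}
```

A caller would test the result before reading, for example by treating a return of -1 as the cue to ask the end user for a different filename, and would close() the descriptor when finished.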

Does the automobile that you drive have something similar to the panic() routine? If it is equipped with air bags, then the answer is yes. When your car senses that something is very wrong, such as the front bumper suddenly being involved in a high-speed collision, chances are good that the air bag will inflate, thus (hopefully) preserving its driver.

As another example, this time from a chef’s point of view, how many times does he want to add one pinch of salt to the dish he is preparing? If a chef could use panic(), he might use it to prevent someone from adding a second, third, or hundredth pinch of salt that would most certainly ruin the dish. The integrity of his epicurean masterpiece would be protected.

Given a choice, it is better to have more data integrity checks throughout the system code rather than fewer. Your data is safer in an operating environment that has built-in safeguards throughout the code.

Note

Technically speaking, a Solaris 2 system programmer may call panic(), or he may choose to call the cmn_err() routine with a severity code of CE_PANIC. The cmn_err() routine acts as a common, multipurpose error-handling routine, whereas panic() is designed specifically for panic scenarios. In Sun’s Solaris 2, both panic() and cmn_err(), when called with a code of CE_PANIC, call do_panic(), which in turn calls setup_panic() and, finally, complete_panic(). From this point on, for simplicity, when we refer to panic() we are also talking about cmn_err() and other vendors’ routines that help to eventually panic the system.

A word about bad traps

A computer system will also crash if it detects a condition in the hardware that should not happen. On UNIX systems, this type of crash is referred to as a “bad trap.” From the system administrators’ point of view, bad traps and software panics are handled in the same way.

UNIX systems perform millions of traps each day, so please don’t panic when you hear the word trap. However, on rare occasions, you may encounter a bad trap. When your UNIX system sees one, it will invoke panic().

Later on, we will explore traps in great detail: both the expected good traps that occur on your system all day long and the unexpected, unwelcome bad traps.

The panic() routine

Let’s talk about how panic() actually works. The panic() routine abruptly interrupts all normal scheduling of processes. From the user’s point of view, the system is suddenly dead.

panic() copies the contents of the memory in use to a dump device. By default, the dump device is usually the primary swap device. It is rare to see a system that specifically has a separate chunk of disk set aside solely for dumps; however, it can be set up that way. On most UNIX systems, the dump device must be a disk partition. On some, a tape drive may be specified.

The dump image is written to the back end or high end of the dump device, unless a tape drive is in use. The beginning and end of this image contain a duplicated header record that includes a special code, called a magic number. The magic number simply identifies the current contents of the dump device as a system crash dump image. The duplication of the header record is used to identify whether the dump image has been partially overwritten by swap activity.
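The role of the duplicated header can be sketched in C as follows. This is only an illustration of the idea: the toy_dump_header layout, the TOY_DUMP_MAGIC value, and the toy_dump_is_intact() routine are all hypothetical and do not describe the on-disk format used by any real UNIX system.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical magic number ("DUMP" in ASCII); real systems use their own. */
#define TOY_DUMP_MAGIC 0x44554d50u

/* Hypothetical header, written at both the start and the end of the image. */
struct toy_dump_header {
    uint32_t magic;   /* identifies the contents as a crash dump image */
    uint32_t npages;  /* number of memory pages saved in the image */
};

/*
 * Return true if the image starts with a valid header and the copy at the
 * end matches it. A mismatch suggests that part of the image has been
 * overwritten, for example by swap activity.
 */
bool toy_dump_is_intact(const unsigned char *image, size_t len)
{
    struct toy_dump_header head, tail;

    if (len < 2 * sizeof head)
        return false;               /* too small to hold both copies */

    memcpy(&head, image, sizeof head);
    memcpy(&tail, image + len - sizeof tail, sizeof tail);

    return head.magic == TOY_DUMP_MAGIC &&
           memcmp(&head, &tail, sizeof head) == 0;
}
```

The first test answers the question “is this a crash dump at all?”, and the comparison of the two copies answers “is it still complete?”.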

panic() records critical information about the current state of the Central Processing Unit, or CPU, that called panic(). This information includes the CPU registers, the stack pointer, and various state registers. We will be looking at these in greater detail later on.

We will talk about how to retrieve the system crash dump image of memory in Chapter 3, when we discuss using the savecore command. For now, it is important to note that unless you set aside a special dump device, if for no other reason than being prepared for a panic, your primary swap device should be large enough to hold a complete image of memory.

Once panic() has completed its task of dumping memory to the dump device, it initiates a reboot of the system.

How do you know if your system has panic’ed?

If your system panic’ed and rebooted while no one was witness to it, for example at 4 a.m. on Sunday morning, you may notice that the uptime of the system is not what you expect it to be. Also, using the last command on some UNIX systems, you might see entries such as:

kbrown    console        Thu Jan 20 20:03 - crash 
root      /dev/ttya      Wed Jan 19 16:40 - crash 

These entries are a fairly reliable indication that the system crashed while folks were logged in.
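If you wanted to automate this check, a small C sketch along the following lines could scan lines captured from the last command. It assumes, as in the sample above, that a crash entry contains the text “- crash”; the output format of last varies between UNIX systems, so treat the marker string and both routine names as assumptions.

```c
#include <stdbool.h>
#include <string.h>

/* Return true if one line of "last" output records a session ended by a crash. */
bool last_entry_ended_in_crash(const char *line)
{
    /* Assumed marker, taken from the sample entries shown above. */
    return strstr(line, "- crash") != NULL;
}

/* Count the crash entries among nlines lines of "last" output. */
int count_crash_entries(const char *const lines[], int nlines)
{
    int count = 0;

    for (int i = 0; i < nlines; i++)
        if (last_entry_ended_in_crash(lines[i]))
            count++;
    return count;
}
```

Given the two sample entries shown earlier, count_crash_entries() reports two sessions that ended in a crash.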

If you were logged in when the crash occurred, you will find that you are no longer logged in. If you suspect that the system panic’ed and you have set up your system to capture system crash dumps, you’ll find new entries in the savecore directory, assuming everything went well. If the disk was full, you will not find new dump files. Again, we will talk about savecore in greater detail in Chapter 3.

During a panic, the system no longer functions as expected. Those logged in will get no response from the system. Those utilizing data via NFS and other network-based data retrieval systems will no longer have access to that data. Only the person sitting at the system console will see actual evidence as it happens that the system is panic’ing.

During the panic, the system console displays some information about why the system is panic’ing. This information alone, however, is only partially useful. The contents of memory, now being safely stored onto the dump device by the panic() routine, will later be a critical piece of the overall puzzle as to why the panic occurred.

While sitting at the system console during a system panic caused by a bad trap condition, you will see something like the following.

Example 1-1. Example of console messages seen during a panic triggered by a bad trap

BAD TRAP 
sh: Data fault 
kernel read fault at addr=0x0, pme=0x0 
Sync Error Reg 80<INVALID> 
pid=556, pc=0xf000aaa8, sp=0xf0331670, psr=0x4000c4, context=3 
g1-g7: 0, 0, ffffff80, 0, f03319e0, 1, ff467800 
Begin traceback... sp = f0331670 
Called from f0050668, fp=f03317e0, args=f0331844 0 f033184c 0 0 ff35be08 
Called from f0093b68, fp=f0331850, args=0 0 1 0 f03318b4 f00c5b70 
Called from f00245e4, fp=f03318b8, args=f0331e94 f0331920 0 0 4f074 f00b5218 
Called from f0005acc,fp=f0331938, args=f00bc334 f0331eb4 0 f0331e90 fffffffc ffffffff 
Called from 13c24, fp=effff678, args=4f074 effff6d8 3a 2f 1 4dc00 
End traceback... 
panic: Data fault 
syncing file systems... done 
static and sysmap kernel pages 
  56 dynamic kernel data pages 
 168 kernel-pageable pages 
   0 segkmap kernel pages 
   0 segvn kernel pages 
  51 current user process pages 
total pages (1892 chunks) 
dumping to vp ff1e9d84, offset 116888 
rebooting... 

The panic sequence consists of:

  • The actual panic message

  • A stack traceback if a bad trap occurred

  • Dump messages

  • Reboot or reboot attempt

Let’s talk about each of these.

Panic messages

Depending on the system programmer and the operation in progress, some panic messages are quite brief, whereas others provide great detail. Sometimes you will see messages that include the name of the calling program, the variables in use, and even the line number of the source! Others might simply be a cryptic word that only the programmer will easily recognize.

The example above shows that the program sh, the Bourne shell, which was running as process ID#556, generated a bad trap. Specifically, the trap was a data fault, in this case an illegal attempt by the kernel to read memory address 0x0. This illegal action triggered the bad trap and panic.

This is an easy panic to force by altering a critical value in the kernel, rootdir, while the system is running. Later on, we will cause a similar panic and use it as a practice system crash dump for analysis.

Stack traceback

panic() shows the current stack traceback if a bad trap occurred. This is a history of sorts, showing the hexadecimal addresses of the routines that were called by other routines, working from the most recent kernel routine down to the least recently called, usually a system call or an interrupt handler. Shown along with the addresses of the routines will be the calling parameters used, again in hexadecimal. It won’t be until we look at the crash’s savecore files that we will know which routines were at those addresses and thus in use at the time of the crash.

The stack traceback only goes back to the point where the kernel was most recently entered. A stack traceback will not show the routines in use by the application that made the system call. To find out what application was actually in use, we will examine the user area, executing threads, and process structures.
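If you have copied a traceback from the console and want to pull out the addresses for later lookup, a small C sketch such as the following can do it. It assumes the “Called from ..., fp=...” layout shown in Example 1-1, which other systems and releases may not share, and the routine name parse_traceback_line() is our own.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Parse one "Called from" line of a console traceback.
 * On success, store the caller's address in *pc and the frame pointer
 * in *fp and return true. Lines in any other form return false.
 */
bool parse_traceback_line(const char *line, unsigned long *pc, unsigned long *fp)
{
    /*
     * %lx reads a hexadecimal number with or without a leading 0x.
     * The space after the comma in the format matches zero or more
     * spaces in the input, so the spacing there is optional.
     */
    return sscanf(line, "Called from %lx, fp=%lx", pc, fp) == 2;
}
```

The addresses recovered this way are still just numbers. Matching them to routine names requires the savecore files, as described above.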

Dumping messages

When panic() writes the contents of memory to the dump device, you will see several messages that describe how the pages of memory were in use, followed by the total number of pages.

This will be followed by a message telling us where the image of memory is being dumped, giving us the pointer to the vnode structure, which in turn points us to the device. Later on, we will look at the vnode structure in greater detail.

Reboot

Once an image of memory is saved to the dump device, the system will attempt to reboot. Depending on the nature of the panic, the system may reboot without incident and not panic again for hours, months, or years. However, again depending on the problem that initiated the first panic, the system may get in a loop of panic’ing and rebooting until the system administrator intervenes.

Capturing system crash information

It is very important to find out why your system crashed. After all, a system panic means that somewhere something in the system went wrong. There are usually only three ways you can collect system panic or crash information.

First, sometimes you will find information in the /var/adm/messages* log files. Second, if you were sitting at the console at the time of the crash, you can try to record as much of the data as you can on paper. Third, to capture the best, most complete system crash data, you can use the savecore program, which we will cover shortly.

What is a program crash in comparison?

When running a user program, you might see a message that alerts you to a condition and then announces “core dumped.” A program that contains a bug in it might never fail, might stomp on good data, might generate faulty results, or might result in a core dump. It all depends on the nature of the bug in the program.

When the message “core dumped” is displayed, you will usually find a file called core in the directory from which the program was executed. This file contains information about the program, allowing the programmer to debug the program and locate the source of trouble. To anyone but the programmer, core files are generally not of interest and just eat up disk space. C shell users can use the limit command to cap the size of the core files left behind (a limit of 0 suppresses them entirely), as shown below.

Example 1-2. Limiting the core dump file size in the C shell

Hiya 3: limit 
cputime         unlimited 
filesize        unlimited 
datasize        2097148 kbytes 
stacksize       8192 kbytes 
coredumpsize    unlimited 
descriptors     64 
memorysize      unlimited 
Hiya 4: limit coredumpsize 1 megabyte 
Hiya 5: limit 
cputime         unlimited 
filesize        unlimited 
datasize        2097148 kbytes 
stacksize       8192 kbytes 
coredumpsize    1024 kbytes 
descriptors     64 
memorysize      unlimited 
Hiya 6: 

In this example, the user has limited the size of program core dumps to one megabyte.
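A program can apply a similar limit to itself. The sketch below uses the getrlimit() and setrlimit() calls with RLIMIT_CORE, which are the usual programmatic counterpart to the shell’s limit command on systems that provide them. The wrapper name limit_core_size() is our own, and note that this interface counts in bytes rather than kilobytes.

```c
#include <sys/resource.h>

/*
 * Lower this process's soft limit on core file size to max_bytes.
 * Returns 0 on success and -1 on failure.
 */
int limit_core_size(rlim_t max_bytes)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) == -1)
        return -1;

    /* The soft limit may not exceed the hard limit, so clamp the request. */
    if (rl.rlim_max != RLIM_INFINITY && max_bytes > rl.rlim_max)
        max_bytes = rl.rlim_max;

    rl.rlim_cur = max_bytes;
    return setrlimit(RLIMIT_CORE, &rl);
}
```

Calling limit_core_size(1024 * 1024) early in main() corresponds to the one-megabyte limit in the example above, and limit_core_size(0) asks that no core file be written at all.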

Program core dumps and their core files are not to be confused with system crash dumps and their savecore files. A thousand users can all be crashing their own programs and the system will still be running like a champ and your data will still be safe… well, unless you are running one of the buggy programs!
