Chapter 3. The savecore Program

We’ve already mentioned the savecore command and the savecore files quite a few times. Now we will talk about them in much greater detail.

What is savecore?

savecore is the command we use to transfer the system crash dump image, generated for us by the panic() routine, from the dump device to a file system where we can later access it for analysis.

How does savecore work?

During a system panic, whether forced by the system administrator in response to a system hang or caused by the system, the contents of memory in use at that time are written out to the dump device. When the system is rebooted, savecore can be run to retrieve the image from the dump device and archive it to a disk file. Let’s examine this process in more detail.

When panic() executes, it looks into the kernel for the name of the preselected dump device. By default, the dump device is the primary swap device. However, in certain system configurations, such as when the primary swap partition is smaller than memory, a separate, more adequately sized dump device may be chosen.

panic() places the contents of memory in use at the time of the panic toward the back end of the dump device. This little-known fact works to the well-read system administrator’s advantage should savecore need to be run manually shortly after booting, assuming that swapping hasn’t already overwritten any of the image.

During the next boot up of the system, savecore may be called. savecore is invoked with the name of a preexisting directory where the copy of the image currently held on the dump device is to be saved. We’ll refer to that directory as the “savecore directory.”

Note

Should the savecore directory not already exist, savecore will not create it automatically. Therefore, it must be created prior to calling savecore.

savecore examines the image stored on the dumpfile device, testing for two conditions.

  • The two magic numbers stored in the image on the dumpfile device (one at the beginning of the image, the other at the end) must both match the DUMP_MAGIC value defined in /usr/include/sys/dumphdr.h, 0x8fac0102 on Sun systems. This dual, magic-number test is used to verify that any swapping activity hasn’t yet damaged the dump image.

  • The dump must be from the same revision of the operating system that is currently executing.

If these conditions are met, the image of memory (core) will be placed in the savecore directory as a file called vmcore.X, where X is simply a sequence number. A copy of the kernel namelist will be built and also placed in the savecore directory. It will be named unix.X or vmunix.X, depending on the operating system in use.

The sequence number will start with zero. Once savecore has been run, a file called bounds will exist in the savecore directory along with the savecore files. The bounds file simply contains the sequence number to use for the next execution of savecore.
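The sequence-number bookkeeping described above can be sketched in a few lines of Bourne shell. This is only an illustration of the naming scheme, not how savecore itself is implemented. The function name next_dump_names is hypothetical, and the sketch assumes that the bounds file holds a single decimal number. It prints the unix.X form of the namelist name; on a system that uses vmunix.X, the name would change accordingly.

```shell
# Illustrative sketch only: savecore does this bookkeeping itself.
# next_dump_names is a hypothetical helper, not part of any release.
# Assumes bounds holds one decimal number on its first line.
next_dump_names() {
    dir=$1
    seq=0                                # numbering starts at zero
    if [ -f "$dir/bounds" ]; then
        read seq < "$dir/bounds"         # number to use for the next dump
    fi
    echo "$dir/unix.$seq $dir/vmcore.$seq"
}
```

Called on a savecore directory that has no bounds file yet, the function reports the names ending in .0; once bounds contains 3, it reports the names ending in .3.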

If you are interested in learning more about the details of the actual dump file layout on your system, refer to /usr/include/sys/dumphdr.h.

Disk space requirement & locations

At boot time, not all of a system’s file systems may be mounted at first. Some system administrators may choose to mount certain file systems by hand after the system comes up. When selecting a file system for use with savecore, be sure to select a file system that will be mounted when savecore is run.

Depending on the usage of the system, specifically memory, at the time of a panic, the resulting vmcore.X file may be quite large. At most, it will be the size of memory. Therefore, the system administrator of a system experiencing frequent panics will need to keep an eye on the disk usage on the file system where the savecore files are being stored. If a large server with 512 megabytes of memory panics four times under full usage and load, you could easily find yourself with 2 gigabytes of postmortem files! Archiving the savecore files to tape or another disk may be wise. Remember also that the UNIX compress command can be very helpful in managing the disk space.
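The worst-case arithmetic in the example above can be written down as a small shell function. This is a rough planning aid under a stated assumption: every vmcore.X file is taken to be as large as physical memory, and the namelist copies and any savings from compress are ignored. The name worst_case_mb is hypothetical.

```shell
# Worst-case space, in megabytes, needed for a number of saved dumps.
# Assumes each vmcore.X is as large as physical memory (the upper bound).
worst_case_mb() {
    mem_mb=$1
    dumps=$2
    echo $((mem_mb * dumps))
}

worst_case_mb 512 4     # prints 2048, that is, 2 gigabytes
```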

If maintaining a certain availability of disk space on the file system where the savecore directory resides is a concern, a file called minfree can be placed in the savecore directory. This file can be used to specify how much space, in kilobytes, must remain available on the file system once the savecore operation is complete.
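The intent of minfree can be expressed as a simple comparison. The sketch below is not savecore's own logic; it only mirrors the rule as stated above, that the space left after the dump is saved must be at least the minfree value. The function name fits_with_minfree is hypothetical, and all three arguments are assumed to be in kilobytes (for example, as reported by df -k and as stored in the minfree file).

```shell
# Hypothetical check mirroring the documented intent of minfree:
# after the dump is saved, at least minfree kilobytes must remain.
# Arguments: available space, dump size, minfree (all in kilobytes).
fits_with_minfree() {
    avail_kb=$1
    dump_kb=$2
    minfree_kb=$3
    remaining=$((avail_kb - dump_kb))
    [ "$remaining" -ge "$minfree_kb" ]
}

if fits_with_minfree 600000 524288 50000; then
    echo "room for the dump"
else
    echo "not enough room"
fi
```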

Note

Early releases of Solaris 1 (SunOS™ 4.X) and Solaris 2 (SunOS 5.X) contained a bug in that the minimum free value was interpreted as being the amount of space, in kilobytes, required to be free before savecore was run instead of after.

Security issues

By default, savecore creates the savecore files with rather open permissions. This allows any knowledgeable user on the system to read and analyze the files. As you’ll soon learn, even a user with few skills will be able to glean some information from the files.

Since the vmcore.X file provides the contents of memory at the time of the crash, the data contained in the file may include data that was not intended to be viewed by a wide audience. For example, if, at the time of a panic, someone was manipulating classified data, that data will probably be tucked away somewhere within the vmcore.X file. If security is a concern, the system administrator might want to check and tighten the access rights of the savecore directory and files.
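One way to tighten access is sketched below. It is a minimal example, assuming that only the owner of the savecore directory (normally root) should be able to read the files; your site's policy may call for a group-based scheme instead. The function name restrict_crash_dir is hypothetical, and the file patterns simply follow the vmcore.X, unix.X, and vmunix.X naming described earlier.

```shell
# Sketch: restrict a savecore directory and its files to the owner.
# restrict_crash_dir is a hypothetical helper name.
restrict_crash_dir() {
    dir=$1
    chmod 700 "$dir" || return 1
    for f in "$dir"/vmcore.* "$dir"/unix.* "$dir"/vmunix.*; do
        # An unmatched pattern stays literal, so test before chmod.
        [ -f "$f" ] && chmod 600 "$f"
    done
    return 0
}
```

Because savecore creates new files with its default permissions, a step like this would need to be repeated after each new dump is saved.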

Generally speaking, the system administrator is the person on the system who has access to most, if not all, of the data on the system. This includes the system crash dump files. However, if he is not trained in system crash dump analysis and needs to rely on the skills of another, less trusted person, it would be wise for him to closely monitor the analysis work performed by that person.

Note

If the system administrator is not trustworthy, security is already at great risk.

Solaris 1: How to set up savecore

Let’s talk about how to enable the savecore command so that we can capture the image of a system crash. Although the concept is basically the same for both Solaris 1 and 2, there are some subtle differences. Here is the procedure for Solaris 1 systems.

Customizing /etc/rc.local

In Solaris 1, using the BSD-based SunOS 4.X, the savecore command is called from the boot-time, run-command script /etc/rc.local. By default, the savecore commands are commented out, as shown in this partial view of /etc/rc.local.

Example 3-1. Savecore commented out in /etc/rc.local

# 
# Default is to not do a savecore 
# 
# mkdir -p /var/crash/`hostname` 
# echo -n 'checking for crash dump... ' 
# intr savecore /var/crash/`hostname` 
# echo '' 

To enable savecore at boot time, the /etc/rc.local file needs to be modified so that the savecore command is enabled. Be careful not to uncomment the actual comment, which reads “# Default is to not do a savecore”. Once modified, this portion of /etc/rc.local should read:

Example 3-2. Savecore enabled in /etc/rc.local

# 
# Default is to not do a savecore 
# 
mkdir -p /var/crash/`hostname` 
echo -n 'checking for crash dump... ' 
intr savecore /var/crash/`hostname` 
echo '' 

This sequence of commands will create a directory in /var/crash named after your system’s hostname. The backquoted `hostname` portion of the command is interpreted to mean: first, run the UNIX hostname command, then use the output of that command as part of the mkdir command. For example, if your system’s name is “maugrim”, the resulting command will be:

mkdir -p /var/crash/maugrim 

The -p option of mkdir says to create the parent directories if they don’t already exist.
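The combination of backquotes and mkdir -p can be tried safely outside of /var/crash. The sketch below takes the base directory as an argument so that a scratch location can be used. The function name make_crash_dir is hypothetical, and uname -n stands in for hostname only so that the sketch also runs on systems that lack a hostname command.

```shell
# Sketch: build a per-host directory under a caller-supplied base.
# make_crash_dir is a hypothetical helper name.
make_crash_dir() {
    base=$1
    host=`uname -n`              # runs first; its output becomes the name
    # -p creates missing parents and does not complain if the
    # directory already exists, so the call can be repeated.
    mkdir -p "$base/$host" && echo "$base/$host"
}
```

Running make_crash_dir /tmp/crash-demo twice in a row prints the same path both times; the second call succeeds because the directory is already there.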

If you want to use a different savecore directory, modify both the mkdir command and the savecore commands. As an example:

Example 3-3. Savecore and the verbose option enabled in /etc/rc.local

# 
# Default is to not do a savecore 
# 
mkdir -p /opt/spare/crashes 
echo -n 'checking for crash dump... ' 
intr savecore -v /opt/spare/crashes 
echo '' 

In this example, the hostname command is not used at all, as we’ve specified the full directory name instead. Also, here we’ve called the savecore command with its only option, -v, which generates more verbose output when it runs. By default, savecore is called without options. You’ll notice that we are also calling savecore via the intr command. Commands run within the /etc/rc* scripts are not normally interruptible; however, when called with the intr command, they are.

Finally, the two echo commands are there simply to print useful information to the console during the boot-up process.
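The uncommenting step for Example 3-2 can also be expressed as a sed filter, which makes it harder to uncomment the descriptive comment by accident. This is a sketch, not a recommended replacement for editing the file by hand: the name enable_savecore is hypothetical, and the patterns assume that the excerpt looks like Example 3-1. Apply it only to the savecore excerpt, since a full /etc/rc.local may contain other commented-out mkdir or echo lines that should stay disabled.

```shell
# Sketch: uncomment only the command lines of the savecore excerpt.
# Lines such as "# Default is to not do a savecore" do not match any
# of the patterns and pass through unchanged.
enable_savecore() {
    sed -e '/^# *mkdir /s/^# *//' \
        -e '/^# *echo /s/^# *//' \
        -e '/^# *intr savecore /s/^# *//'
}
```

A cautious way to use the filter is to run it on a copy of the excerpt and compare the result with the original by using diff before changing the real file.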

Configuring a special dump device

We’ve already talked about the dump device and know that the primary swap device is usually used as the dump device. On Solaris 1 systems, it is very easy to specify a dump device other than the swap device. This is done through the system configuration file, on the config line, which describes where your root file system and swap partition exist. For complete details, refer to the man page on config(8).

As an example, here are config lines from two system configuration files. The first, shown here, will result in system crash dumps being stored on the swap device,

config vmunix swap on sd1b 

whereas the second, shown below, specifies a dump device different from the swap device.

config vmunix swap on sd1b dumps on sd2f 

If you want to specify a separate dump device, you will need to modify the kernel configuration file, run config and make to build a new kernel, and then boot the new kernel. Please refer to the appropriate System Administrator’s manual for guidance if you aren’t already familiar with the procedure to build a new kernel.

When specifying a special dump device on any UNIX system, do not under any circumstances specify a device that has a file system on it, nor a partition used in raw mode by a database application. Like a swap partition, the dump device knows nothing about file systems, superblocks, inodes, and data. During the panic, the contents of memory are written to the dump device without regard for what is being overwritten. Please choose your system crash dump device carefully!

We hope you understand and appreciate the importance of this.

Solaris 2: How to set up savecore

Here is the method for enabling savecore in Solaris 2 systems. Note the differences from Solaris 1 as we point them out.

Customizing /etc/rc2.d/S20sysetup

On Solaris 2 systems, the savecore command is called by the run-level-2 script /etc/rc2.d/S20sysetup, which is hardlinked to /etc/init.d/sysetup. By default, savecore is commented out, thus disabling it when transitioning to run level 2, as shown in this portion of the script.

Example 3-4. Savecore commented out in /etc/rc2.d/S20sysetup

## 
## Default is to not do a savecore 
## 
#if [ ! -d /var/crash/`uname -n` ] 
#then mkdir -p /var/crash/`uname -n` 
#fi 
#                echo 'checking for crash dump...\c ' 
#savecore /var/crash/`uname -n` 
#                echo '' 

To enable the savecore command, uncomment this area of the script, as shown.

Example 3-5. Savecore enabled in /etc/rc2.d/S20sysetup

# 
# Default is to not do a savecore 
# 
if [ ! -d /var/crash/`uname -n` ] 
then mkdir -p /var/crash/`uname -n` 
fi 
                 echo 'checking for crash dump...\c ' 
savecore /var/crash/`uname -n` 
                 echo '' 

Unlike the Solaris 1 /etc/rc.local script, this script first tests for the existence of the savecore directory and, if the directory is not found, calls mkdir to create it. This is done by an if-then-fi Bourne shell command sequence. Be careful to uncomment or recomment all portions of this sequence or the script will fail.
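A half-uncommented if, then, fi sequence is a syntax error, and the shell can report it without running anything. The sketch below uses the shell's -n option, which reads commands and checks their syntax but does not execute them. The function name check_rc_syntax is hypothetical.

```shell
# Sketch: syntax-check an edited run-command script before rebooting.
# sh -n parses the file and reports syntax errors without executing it.
check_rc_syntax() {
    sh -n "$1"
}
```

For example, check_rc_syntax /etc/rc2.d/S20sysetup returns a nonzero status if the fi line is still commented out while the if line is not. Note that this catches only syntax errors; it will not notice that the savecore line itself was left commented.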

Another difference you may note is that the UNIX command uname -n is being used. This command is the Solaris 2 equivalent of the Solaris 1 hostname command.

Again, if you want to use a different directory for your savecore files, change the if, mkdir, and savecore lines accordingly.

Configuring a special dump device

Solaris 2 supports much larger systems than does Solaris 1, allowing for up to 20 CPU modules and massive amounts of memory. In Solaris 2, we also have newer, more advanced swapping techniques. You’ll read more about this in the advanced chapters later on.

The Solaris 1 informal and rather crude rule of thumb of having twice as much swap as memory doesn’t apply to Solaris 2 systems. Indeed, some of the larger Solaris 2 systems run well with nearly no swap space defined at all!

Solaris 2 systems that have a minimal amount of swap space will need to have some sort of dump device at hand when system crashes occur. As with Solaris 1, you can specify a dump device other than your primary swap device. On the releases of Solaris 2 up to and including Solaris 2.4, this is not quite as easy to do as it was in Solaris 1; however, we will tackle this tricky subject anyway!

Both the panic() routine and the savecore program need to know where the dump device is located. Therefore, we need to define this before either executes. We cannot predict when panic() will run; however, we do know when savecore is executed. We need to redefine the name of the dumpfile, which is how the kernel refers to the dump device in Solaris 2, before /etc/rc2.d/S20sysetup is run. To do this, we will create our own script, /etc/rc2.d/S19dumpfile. The “S” or “Start” run-command scripts are executed in alphabetical order by init during run level transitions. Because this is so, we know our S19dumpfile script will be run before the S20sysetup script, as S19 comes before S20 alphabetically.
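The ordering argument above is easy to confirm with sort. The sketch below only sorts names; it does not run any scripts. The function name start_order is hypothetical, and LC_ALL=C is set so that the comparison is a plain character-by-character one regardless of the locale in use.

```shell
# Sketch: print script names in the order described in the text.
# LC_ALL=C forces plain byte-order sorting, so S19... precedes S20....
start_order() {
    for name in "$@"; do
        echo "$name"
    done | LC_ALL=C sort
}

start_order S20sysetup S19dumpfile
```

The first name printed is S19dumpfile, which is why that script gets its chance to set the dump device before S20sysetup calls savecore.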

Before we can write this script, we need to know where to locate the current dumpfile name in the running kernel. Jumping ahead of ourselves, we are going to take a quick peek at the kernel by using the UNIX adb command. By the end of this book, you’ll be a wizard when it comes to using adb, so don’t get too worried if this seems a bit scary at first.

You need to be the super-user, root, to view and modify the kernel by using adb.

Example 3-6. Displaying the dumpfile kernel variable via adb

# adb -k /dev/ksyms /dev/mem 
physmem  1b24 
dumpfile/20X 
dumpfile: 
dumpfile:  0           0          0          0 
           2f646576    2f64736b   2f633074   33643073 
           31000000    0          0          0 
           0           0          0          0 
           0           0          0          0 
dumpfile+10/X 
dumpfile+0x10:  2f646576 
dumpfile+10/s 
dumpfile+0x10:  /dev/dsk/c0t3d0s1 
$q 
# 

A full 32-bit word of memory can store 4 characters of a string, as a character only requires one byte, 8 bits, of storage. There are 4 bytes per full 32-bit word. Each byte has a unique address in memory; however, using adb we write to memory in full and half-words. Throughout this book, we usually reference full-word, hexadecimal addresses, which end in 0, 4, 8, and c.

In the above adb session, we start at the kernel symbol or variable name dumpfile and display 20 full words of memory in hexadecimal. The first 4 words contain zero. The fifth word contains 2f646576. This is actually the first 4 bytes or characters of the null-terminated string “/dev/dsk/c0t3d0s1”.

The dumpfile string starts at address dumpfile+0x10. The kernel string that we need to modify is actually stored in memory this way:

Full word 
 address      Characters 
------------------------
dumpfile+10 = "/dev" 
dumpfile+14 = "/dsk" 
dumpfile+18 = "/c0t" 
dumpfile+1c = "3d0s" 
dumpfile+20 = "1"       (The last character is followed by three nulls or zeros) 

Of these addresses, only the last three might require changing. The first two, representing the /dev/dsk portion of the device name, will not need to be changed.

Note

We do not specify a raw disk partition name; however, please remember that, in effect, the dump device is treated as such!

The next important thing for you to know is the hexadecimal values for the ASCII characters you might need to use to identify which dump device you want. You can refer to the ascii(5) man page to view the complete ASCII chart.

Character    0  1  2  3  4  5  6  7  8  9 
Hex value   30 31 32 33 34 35 36 37 38 39 

Character    a  b  c  d  e  f  k  s  t  v  / 
Hex value   61 62 63 64 65 66 6b 73 74 76 2f 
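Rather than looking up each character by hand, the conversion from a string to full-word hexadecimal values can be sketched with od and awk. This is an illustration only: the function name str_to_words is hypothetical, and the sketch assumes the byte order seen in Example 3-6, where the first character of each group of four occupies the most significant byte of the word. A terminating null byte is always appended, and the last word is padded with further nulls.

```shell
# Sketch: print the 32-bit words (in hex) that hold a null-terminated
# string, first character in the most significant byte of each word.
str_to_words() {
    { printf '%s' "$1" | od -An -v -tx1 | tr -d '[:space:]'; echo; } |
    awk '{
        s = $0 "00"                              # terminating null byte
        while (length(s) % 8 != 0) s = s "00"    # pad the last word
        for (i = 1; i <= length(s); i += 8) print substr(s, i, 8)
    }'
}

str_to_words /c1t2d3s4
```

For /c1t2d3s4 the function prints 2f633174, 32643373, and 34000000, one per line, which are the three values written by the S19dumpfile script in Example 3-7.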

Let’s have our S19dumpfile script change the dump device to /dev/dsk/c1t2d3s4. As you learn more about adb in later chapters, this will all become clear. Here’s our script:

Example 3-7. S19dumpfile script

:  Automatically executed by the Bourne shell 
# 
#  S19dumpfile - Change dumpfile name 
# 
# 
echo 
echo "Changing dumpfile name to /dev/dsk/c1t2d3s4" 
adb -k -w /dev/ksyms /dev/mem << END 
dumpfile+18/W 2f633174 
dumpfile+1c/W 32643373 
dumpfile+20/W 34000000 
dumpfile+10/s 
END 
echo "Done changing dumpfile name." 
echo 
# 
#  end of S19dumpfile 
# 

When transitioning into run level 2, /etc/rc2.d/S19dumpfile will generate output similar to the following. Your physmem size may differ.

Changing dumpfile name to /dev/dsk/c1t2d3s4 
physmem 1b24 
dumpfile+0x18: 0x2f633174 = 0x2f633174 
dumpfile+0x1c: 0x32643373 = 0x32643373 
dumpfile+0x20: 0x34000000 = 0x34000000 
dumpfile+0x10: /dev/dsk/c1t2d3s4 
Done changing dumpfile name. 

Alternatively, you may have your S19dumpfile script write to locations dumpfile+18 and dumpfile+1c by using commands such as the following.

dumpfile+18/W '/c1t' 
dumpfile+1c/W '2d3s' 

However, take care not to use this method for location dumpfile+20, since a null is required at the end of the string. Replace dumpfile+20 with a hexadecimal value instead, as shown in the earlier example.

In future releases of Solaris 2 and other UNIX systems, the starting location of the dumpfile string may differ. Always check before you modify the string. Also, in future releases of Solaris 2, it may become possible to simply set the dumpfile string or something similar via the /etc/system configuration and tuning file, by specifying, for example:

set dumpfilename = "/dev/dsk/c1t2d3s4" 

and then rebooting the system. However, as of Solaris 2.4, this is not possible.

Shouldn’t I copy the kernel first?

This is a good question! However, as you’ll come to understand later when we talk about adb in greater detail, the S19dumpfile script modifies only the contents of /dev/mem.

The kernel variable dumpfile is initially set to all zeros. During the booting process, the name of the dump device is stored in dumpfile in memory, /dev/mem. Therefore, we use adb to modify /dev/mem after dumpfile is set.

Swapless systems

Finally, here’s one last note for those of you who are administering swapless systems running Solaris 2.0 up through 2.3. Due to a bug, savecore will not work unless you have at least a minimal swap space set up. Create a swap partition of at least 8K in size and a custom dumpfile, make the swap space available to the system so that it is accessible to savecore, and all will work!

Our system is now ready to capture a system crash dump image. Let’s move on.
