The tbstat program gives you instantaneous data about the contents of the shared memory structures. It takes very little CPU time, locks no tables, and creates no latches. As such, you can consider tbstat essentially "free"; that is, it has a minuscule effect on database performance. You can let it run continuously in the background if you wish, collecting performance and usage data at little or no performance cost.
There is only one option that could possibly cause you any problems, and that is the tbstat -o filename option. This option places the entire contents of shared memory into a file specified by filename. If you have a large shared memory, be sure that you have sufficient room in your UNIX file system to save the file. This option will also cause significant I/O in the system if you are saving a large shared memory image to disk.
The tbstat utility is often used as a troubleshooting tool. When you encounter problems in the engine such as unexpected freeze-ups, long transactions, deadlocks, and the like, you will turn to the tbstat utility to tell you what's happening in the engine. Usually, this is an effective technique, but sometimes your problem will be transient.
Since tbstat gives you an instantaneous snapshot, by the time you enter the correct tbstat command, the problem may be gone. If you are looking for solutions to these transient problems, you may need to use the tbstat -o filename option. The trick is to catch the database in the act of doing whatever it is doing and save the shared memory to a file.
If you can get the output file saved before the problem goes away, you can then inspect the output file at your leisure and run any or all of the tbstat options against the file without worrying about the transient problem going away. To do this you would use the following series of tbstat commands:
tbstat -o any_filename              (takes the snapshot)
tbstat -any_options any_filename    (reports from the snapshot file)
The first step in fully understanding the tbstat utility is understanding what all of the options mean. This is the help screen for tbstat (commonly used items are marked with an asterisk):
joe 60>tbstat --
usage: tbstat [-abcdklmpstuzBDFPRX] [-r secs] [-o file] [infile]

*  --      print this help screen
   -       print just the tbstat header (undocumented feature)
   a       print all info (options: bcdklmpstu)
   b       print buffers
   c       print configuration file
*  d       print dbspaces and chunks
*  k       print locks
*  l       print logging
   m       print message log
*  p       print profile
*  s       print latches
*  t       print tblspaces
*  u       print users
   z       zero profile counts
   B       print all buffers
   D       print dbspaces and detailed chunk stats
   F       print page flushers
   P       print profile, including BIGreads
   R       print LRU queues
   X       print entire list of sharers and waiters for buffers
*  r       repeat options every n seconds (default: 5)
   o       put shared memory into file (default: tbstat.out)
   infile  use infile to obtain shared memory information
Using this option displays the help screen for the tbstat command. This is actually the "minus" option, not the "minus minus" option. The first minus sign indicates that a command option follows; the second minus is the actual option.
This convention is followed by most of the OnLine utilities. Just enter commandname -- and you'll see the help text.
Using just the first minus and a null option will show just the header of the tbstat series of commands. This header looks like:
RSAM Version 5.01.UC1 -- On-Line -- Up 7 days 23:54:27 -- 18704 Kbytes
This header includes five important sections. The first section gives the version number of the OnLine engine. The second section shows the current operating mode. Possible values of the mode are:
Offline: | Engine not running.
Quiescent: | No user can start a server process. User informix can use administrative options of tbmonitor. Any user can use tbstat or view options of tbmonitor.
Online: | Fully usable by all users.
Shutdown: | System is in transition from online to quiescent or from quiescent to offline. Cannot be canceled.
Recovery: | Moving from offline to quiescent. Fast recovery is performed here.
The third section is the (checkpoint) indicator. If applicable, this section can contain:
(CKPT REQ) | Some process has requested a checkpoint, and the engine is trying to initiate the checkpoint. This flag is often seen if the database is hung waiting for a checkpoint.
(CKPT INP) | Checkpoint is in progress. Users have read-only access to the database. No changes to data are allowed until the checkpoint ends.
The fourth section shows the real time that the database has been running, that is, the time since the last database startup. The fifth and final section shows the amount of shared memory in use by OnLine in kilobytes. This is roughly the size of the output file generated by the tbstat -o command.
The tbstat -d command provides several important pieces of information, the most important being the disk and chunk usage for the database. The output is divided into two sections, Dbspaces and Chunks.
joe 43>tbstat -d
RSAM Version 5.01.UC1 -- On-Line -- Up 7 days 23:48:38 -- 18704 Kbytes

Dbspaces
address  number  flags  fchunk  nchunks  flags  owner     name
e0016e88 1       1      1       3        N      informix  rootdbs
e0016eb8 2       1      4       2        N      informix  logspace
e0016ee8 3       1      5       2        N      informix  dbspace1
e0016f18 4       1      7       2        N      informix  logspace2
e0016f48 5       1      10      1        N      informix  slowdbspace
5 active, 20 total
Chunks
address  chk/dbs  offset  size    free   bpages  flags  pathname
e0014e78 1   1    0       135000  35718          PO-    /dev/rootdbs3
e0014f10 2   1    0       57000   56997          PO-    /dev/rootdbs1
e0014fa8 3   1    0       75000   74997          PO-    /dev/rootdbs2
e0015040 4   2    0       147500  24542          PO-    /dev/log1
4 active, 27 total
On most terminals, the letters O (online) and D (down) in the chunk flags column are easily confused. It's often hard to tell from the screen whether a chunk is up or down. Versions prior to 5.0 can run with a chunk down, so it's important to watch your chunk status.
If you're like most DBAs, you'll spend a lot of time fighting the database "battle of the bulge." Murphy's Law says that any database will expand to fill the amount of space available. OnLine does not bother to warn you when you're beginning to fill up your disks.
The first sign of a problem will be when users get messages saying that OnLine cannot allocate more extents for a table. Especially when you're using raw devices, there is no easy way to tell how much disk space you are using.
The tbstat -d program is your most powerful tool for monitoring the usage of your disks. This invocation of tbstat is used to develop percentage full statistics for various dbspaces in the status script that we develop later. You'll probably develop various scripts on your own to massage the output of all the utilities. With tbstat -d, you'll find yourself writing scripts to answer such questions as "How much space is there left in dbspace X?"
Here it becomes important to have some naming conventions for your chunks and dbspaces. If, for example, all of the chunks contained in dbspace dbs1 begin with the letters dbs1, it will be a simple matter to grep for dbs1 and pipe the result to the UNIX command awk '{print $6}' to get the free space in dbs1. If you have the luxury of creating your databases from scratch, keep in mind this consistency in naming. Do the same with the names of your dbspaces dedicated to logfiles, and with the names of your tables too, if possible.
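Under that convention, such a script might look like the following sketch. The chunk pathnames and the sample Chunks lines are hypothetical stand-ins for live tbstat -d output; field 6 is the free-page count, as in the Chunks listing shown earlier.

```shell
#!/bin/sh
# Sketch: total free pages in dbspace dbs1, assuming every chunk in dbs1 has
# a pathname containing "dbs1". The sample below stands in for live
# "tbstat -d" output; the device names are hypothetical.
sample='e0014e78 1 1 0 135000 35718 PO- /dev/dbs1_chunk1
e0014f10 2 1 0 57000 56997 PO- /dev/dbs1_chunk2
e0014fa8 3 2 0 75000 74997 PO- /dev/log1'

# Field 6 is the free-page count in the Chunks section.
free=$(echo "$sample" | grep dbs1 | awk '{total += $6} END {print total}')
echo "$free pages free in dbs1"
```

Against a live engine, the pipeline would simply become `tbstat -d | grep dbs1 | awk '{total += $6} END {print total}'`.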
You may notice that the examples in this book do not always exhibit this level of foresight. I don't have the excuse of having inherited the databases. I developed many of these databases before I began developing DBA scripts to manage them. If you follow these recommendations, you'll do a better job. I know that doing so would have made my life as a scriptwriter a lot easier.
If you're not familiar with the capabilities of the UNIX commands grep, sed, and awk and with the operations of UNIX regular expressions, it may make some sense to learn them early in the game. Using these UNIX tools on the outputs of OnLine utilities can allow you to generate just about any sort of report or data extract you need.
This is a useful variant of the tbstat -d command. It provides exactly the same output as tbstat -d, except that in the Chunks section the size and free statistics are replaced by page Rd and page Wr numbers representing the number of pages read from and written to each of the chunks.
This option provides a very useful tool for locating "hotspots" in your databases, areas that have abnormally high I/O activity. Since disk I/O is most often the slowest activity your computer system will perform, the elimination or minimization of these hotspots can go a long way toward improving the performance of your databases.
One minor drawback to tbstat -D is that it reports cumulative data rather than instantaneous data. The statistics shown cover the time period beginning when OnLine was either last started or when the statistics were zeroed out. While it would be easy to write a script that subtracts out the last totals and only shows you the differences from the last time the script was run, there is an even easier way to do this. Write a short c-shell script and name it hotspot:
joe 51> cat hotspot
#!/bin/csh
tbstat -D
tbstat -z
Every time you run the hotspot script, it will zero out the tbstat statistics when it finishes. Each subsequent run will show the actual activity that has occurred since the last run. This is suitable for running hotspot repeatedly in one sitting or over a short period of time. It could be modified to save the date and statistics of the last run, but why get complicated? Just remember that running hotspot will zero out all of the tbstat statistics. If you (or any other user) are running cumulative statistics monitoring, the tbstat -z's will interfere with that statistics gathering.
The most important thing I've discovered by running hotspot is the tremendous amount of activity that occurs to the physical logfile. Conventional wisdom says to locate the logfiles in an area that is optimized for heavy write activity. I had taken this advice to mean the logical logfiles. Study of hotspot output will show that it's the physical logfile that needs to be optimized. It is relatively easy to relocate and/or resize the physical logfile (the database does need to come down however). Watch your hotspots for a while and then experiment with relocating and resizing the physical log. You'll see how much activity it gets. It is also possible to experiment with the size of the physical log buffers as a way to optimize the writing activity.
Since we just used tbstat -z in a script, this may be a good time to talk about this command option. There's not much to say. It just zeroes out all of the statistics. This does not affect anything in the database. You can zero it out all day long if you wish. Don't confuse these statistics with the statistics kept by the OnLine UPDATE STATISTICS command. That's a completely different subject. Just be aware of the fact that you may interfere with other statistics-gathering programs. I discourage my users from using any of the OnLine utilities, preferring to give them access through scripts that I control. That way, they won't be stepping on my toes.
If your OnLine system stays up for long periods of time without zeroing out the tbstat statistics, you'll eventually reach your UNIX limits on integer sizes and some of the numbers will roll over, showing negative numbers. To correct this, just run the tbstat -z program.
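A quick sanity check for rolled-over counters can be scripted. The profile line below is a hypothetical stand-in for tbstat -p output, and the column names are illustrative only:

```shell
#!/bin/sh
# Sketch: scan a tbstat-style counter line for negative values, which mean a
# counter has rolled over and the statistics should be reset with tbstat -z.
# The sample lines (and column names) are hypothetical stand-ins for real
# "tbstat -p" output.
counters='dskreads pagreads bufreads
 1934211  -482113  31538299'

rolled=$(echo "$counters" | awk 'NR == 2 {
    for (i = 1; i <= NF; i++)    # count any field that has gone negative
        if ($i + 0 < 0) n++
} END {print n + 0}')
echo "$rolled rolled-over counter(s) found"
```

A cron job built on this idea could run tbstat -z automatically whenever a rollover is detected.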
This option gives you an overview of the database processes that exist at any particular time. This option gives you data from the user pseudotable in the shared memory. It is useful as a debugging tool when the system has inexplicably frozen up and the DBA is attempting to find out what is going on. In the case where everything is frozen, you will need to compare information from this option with data from other tbstat options to trace down the offending process. This usually takes some time because you have to track the transactions through several invocations of the utility. If the database is hung up, probably nothing in the user table is changing and you can run multiple tbstats against the shared memory and get the same data. If the database is not frozen, you often run into a problem because the data in the user table is changing. In this instance it is often useful to take a snapshot of the shared memory by running the tbstat -o file command and then running your tbstat options against file.
With a hung database in a production environment, it is often tempting to simply restart the database and hope the problem goes away. While this is often a viable solution to the problem, there are instances when the recovery time would be excessive and when it would be better to try to track down the specific problem.
You can expect such excessively long recovery times in recovery from such problems as long transactions interrupted while in rollback or other occasions in which much logfile data has not been released. Whether to try to debug or restart the database is a judgment call. Even if you decide to shut down the database, do a tbstat -o debug.filename so that you can later try to find out what caused the problem.
If you learn nothing else about tbstat, learn how to read the tbstat -u output. OnLine is a black box, and tbstat is the main way of looking into this black box. This data is impossible to decipher from a UNIX ps output; every server process just shows up as sqlturbo. You'll note that many of the flags correspond with the invocation flags for tbstat. Keep working with it. Soon you'll be reading these outputs in your sleep.
The output of tbstat -u changed from OnLine Version 4.11 to Version 5.0. Both 4.11 and 5.0 have the same first section. OnLine 5.0 gives you a second section, Transactions, that applies to the X/A environment and to transactions that are running under INFORMIX-Star.
Here is a greatly abbreviated sample of the output from the tbstat -u utility:
joe 46>tbstat -u
address  flags    pid    user      tty      wait  tout  locks  nreads  nwrites
e0001b70 ------D  555    informix  console  0     0     0      6746    2607
e0001f30 -------  21059  dba       -        0     0     1      31538   3813
address | Address of the user in the user table in shared memory. This is the point of contact between this output and the user addresses found in other tbstat command forms. As you trace a problem using the -s output (latches), the -b, -B, and -X outputs (buffer status), and the -k output (locks held), you will eventually trace it to a user address. Then match the address with a PID or user name.
flags | Probably the most important output of this command. Tells you what the process is doing. Has a four-position alpha output:
Position 1: | B | Waiting for Buffer
 | C | Waiting for a Checkpoint
 | L | Waiting for a Lock
 | S | Waiting for a Latch
 | X | Long trans. Awaiting rollback
 | G | Waiting for loG buffer
 | T* | Waiting for transaction
Position 2: | B | Transaction w/Begin work
 | T | In trans. Logging occurred
 | R | In Rollback
 | A | Archive process
 | C* | Committed or is committing
 | H* | Heuristically aborted or aborting
 | P* | Prepare state. Precommitted
 | X* | X/A prepared or is doing so
Position 3: | R | Reading from the database
 | X | Inside a CRITICAL section
Position 4: | M | Running tbmonitor program
 | D | Running a Daemon
 | C | Corpse. Dead process awaiting cleanup
 | F | Page Flusher daemon
pid | UNIX process identification number.
user | Name of the user running the process.
tty | TTY associated with the process, if any.
wait | Latch or lock ID being waited for, if any.
tout | How many times a lock timeout has occurred for the process.
locks | Number of locks held by this process.
nreads | Number of read calls executed by this process.
nwrites | Number of write calls executed by this process.
Note
Items marked with * are used in OnLine Version 5.0 and above only. Versions 5.0 and above include this Transactions section in the output of the tbstat -u command:
Transactions
address  flags  user      locks  log begin  isolation  retrys  coordinator
e0004a24 A----  e0001bbc  0      0          NOTRANS    0
e0004c04 A----  e0001c28  0      0          NOTRANS    0
e0005384 A----  e0001eb0  1      0          COMMIT     0
e0005564 A----  e0001f1c  1      0          COMMIT     0
This invocation of the tbstat utility provides information from the tblspace pseudotable in shared memory. These may be regular database tables currently in use, temporary tables created by users, or temporary tables created implicitly by the OnLine system for such tasks as SORT or ORDER BY. Each table in use will have only one entry.
joe:/devel/usr/joe> tbstat -t
RSAM Version 5.00.UC3 -- On-Line -- Up 2 days 08:09:41 -- 16080 Kbytes

Tblspaces
n  address  flgs  ucnt  tblnum   physaddr  npages  nused  npdata  nrows  nextns
0  c0095c98 1     1     1000001  10000e    2214    2214   0       0      3
1  c0095da8 1     1     2000301  200004    1285    71     0       0      1
2  c0095eb8 1     1     3000001  300004    245     245    0       0      1
3  c0095fc8 1     1     4000001  400004    115     2      0       0      1
4  c00960d8 1     1     5000001  500004    285     2      0       0      1
5  c00961e8 1     2     6000001  700004    15      2      0       0      1
6  c00962f8 1     2     7000001  800004    15      2      0       0      1
7 active, 1500 total, 512 hash buckets
The final line of the tbstat -t output in the example above is

7 active, 1500 total, 512 hash buckets
The active figure represents the number of tblspaces that are currently active, while the total figure is set by the corresponding configuration parameter found in the $TBCONFIG file.
A hash bucket is part of an algorithm to speed up access to the tblspace table. It's a method of dividing up all of the possible values and cross-referencing them to values in the hash table. The cross referencing is done by an internal mathematical algorithm that takes each entry in the tblspace table, performs some math on it, and relates it to an entry in the hash table. Each entry in the hash table is called a hash bucket.
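As an analogy only (OnLine's actual hashing algorithm is internal and not documented here), a simple modulo hash shows the dividing-and-cross-referencing idea. The tblspace numbers below are hypothetical:

```shell
#!/bin/sh
# Toy illustration of hash buckets: map hypothetical tblspace numbers onto 8
# buckets with a simple modulo hash. OnLine's real algorithm differs; this
# only demonstrates how hashing divides values among a fixed set of buckets.
buckets=8
result=$(for tblnum in 1000001 2000301 3000001; do
    echo "tblspace $tblnum -> bucket $(( tblnum % buckets ))"
done)
echo "$result"
```

A lookup then needs to search only the handful of entries in one bucket instead of the whole tblspace table.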
The tbstat -s invocation provides information on the active latches in the OnLine system memory. An active latch represents a user process that is modifying a shared memory structure. In normal operations, latches are granted and released rapidly. At any particular time, a run of tbstat -s will most often reveal no latches held.
A latch is a type of lock that is placed on shared memory. Most often, latches are placed on the pseudotables in shared memory. When a user needs to access one of these pseudotables, he first must acquire a latch on the table, then make his changes, and then release the latch. If the user cannot obtain the latch, the user process will wait for that latch to be freed. If for some reason it can never obtain the latch, the user process will hang.
When the database is hung, the DBA usually runs tbstat -s to see if someone is frozen on a latch. If anything is found, this is usually a good place to start looking for the bottleneck.
joe 53> tbstat -s
Latches with lock or user set
name      address   lock  wait  user
bbf[0]    e075adb8  1     0     e0002770
bf[2070]  e071af2c  1     0     e0002a10
pt[33]    e067afa0  1     0     e0002590
bh[4749]  e06e9cb0  1     0     e0001e70
Should a persistent latch be found in the tbstat -s output, it is a straightforward job to track down the culprit. A definite sign of a problem is a persistent latch that has many other users waiting for the latch. If the database is frozen, you can expect to see quite a few users stacked up waiting for this latch to release. The information given in the name field will provide detailed information about what kind of latch is giving the problem. Knowing the type of latch that is hung up can point to the underlying problem in the database.
Once the latch is identified and the fact noted that several users are waiting on the latch, you can run tbstat -u and note which users have this latch address in their wait field. If these users and processes are the ones that are hung up, you've found your problem. Going back to the tbstat -s output for the offending latch, you can check its user field and get the address of the user process that is blocking everything. Going back to tbstat -u will identify the process and give you a UNIX process ID number. You can then decide whether or not to kill the offending process.
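The backtracking just described can be sketched in shell. The addresses and sample lines below are hypothetical stand-ins for real tbstat -s and tbstat -u output:

```shell
#!/bin/sh
# Sketch: given the "user" field of a stuck latch (from tbstat -s), find the
# matching user-table entry in tbstat -u output to get the blocking UNIX PID.
# All addresses, names, and sample lines are hypothetical.
latch_user='e0002a10'    # user field of the offending latch in tbstat -s

users='e0002770 ------D 555 informix console 0 0 0 6746 2607
e0002a10 --T---- 21059 joe ttyp3 0 0 12 31538 3813'

# Field 1 is the user-table address, field 3 the PID.
blocker=$(echo "$users" | awk -v a="$latch_user" '$1 == a {print $3}')
echo "blocking process: pid $blocker"
```

Live, you would replace the embedded samples with real `tbstat -s` and `tbstat -u` runs, or better, run both against a saved `tbstat -o` snapshot file so the addresses cannot change between commands.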
This process is the heart of troubleshooting with tbstat. You usually poke around until you find something that is out of place or just doesn't seem right. Then you backtrack using the tbstat -u command along with specific commands that seem to be related to the problem. The eventual goal is to find a user process of a job that can be killed to clear up the problem.
Of course, the DBA should be certain that the user process is neither holding a latch nor in a critical part of the code before killing a sqlturbo process, as either case will cause OnLine to abort, crashing the system.
The tbstat -s output can also alert you to possible bottlenecks and resource limitations in your OnLine system. If you find yourself with an abnormally high percentage of jobs waiting, even briefly, to get latches, you may be able to improve performance by increasing the number of structures controlled by the latch. You cannot alter the number of latches themselves, because that number is set based upon the size of the corresponding resource's parameters.
There are several statistics in the tbstat -p output that can point to problems with inadequate resources. When we get to that section, I'll point out several ratios that are listed and several ratios that you can derive from the statistics that can give you an early warning to some resource inadequacies.
The status script presented in Chapter 6 makes use of several of these ratios and will give you a central place to look to evaluate the performance of your OnLine database engine.
The tbstat -k invocation of tbstat provides information about the state of the locks in the shared memory lock table. The lock table is accessed through a lock hash table in much the same way as the tblspace table. Access to the lock table itself is controlled by a single latch.
Before a process can modify the lock table, it must first acquire the latch. Once the modification is complete, the latch is released. Likewise, each entry in the lock hash table is also guarded by a latch that must be acquired before modifications can be made to any particular hash bucket.
Locks are taken against tablespaces. Their types and scope (table lock, page lock, row lock, key lock, or byte lock) can be deciphered from the output using a few rules.
INFORMIX 206>tbstat -k
RSAM Version 5.01.UC1 -- On-Line -- Up 5 days 04:09:22 -- 18752 Kbytes

Locks
address   wtlist  owner     lklist    type    tblsnum  rowid  size
11056e48  0       110024b0  0         S       1000002  206    0
11056ea8  0       1100243c  0         S       1000002  206    0
11056f08  0       1100209c  11057548  HDR+IX  200035a  0      0
11056f28  0       11002354  110572c8  HDR+IX  200035f  0      0
11056f48  0       11002168  110574c8  HDR+IX  2000354  0      0
11056fc8  0       11002184  110577a8  HDR+IX  200035e  0      0
11057008  0       11002524  0         S       1000002  206    0
110571c8  0       110026f4  0         S       1000002  206    0
110571e8  0       110023c8  110570c8  HDR+IX  2000360  0      0
11057208  0       110027dc  0         S       1000002  206    0
11057288  0       1100243c  11056ea8  HDR+IX  200035c  0      0
110572c8  0       11002354  0         S       1000002  206    0
11057308  0       11002028  0         S       1000002  206    0
11057388  0       11002524  11057008  HDR+IX  2000356  0      0
28 active, 200000 total, 16384 hash buckets
These two options will print information about buffer utilization by the OnLine engine. The difference between the two is that the lowercase tbstat -b will list information only about buffers that are in use by engine processes, while the uppercase tbstat -B will print information about all buffers provided for in the configuration file, whether in use or not. Buffers may, and probably will, contain data even when they are not currently being accessed by an engine process. The fields reported by both commands are identical. An OnLine buffer structure is actually composed of three separate components: the buffer pages themselves, the buffer header table, and the buffer hash table.
Each of the three elements of the buffer has latches that must be acquired before the buffer can be modified. These latches will show up in the tbstat -s output. Each buffer has an individual latch. The buffer header table has a single latch that controls access to the table. Each hash bucket has an individual latch.
joe 31> tbstat -b
RSAM Version 4.10.UD4 -- On-Line -- Up 8 days 17:26:14 -- 18704 Kbytes

Buffers
address  user  flgs  pagenum  memaddr   nslots  pgflgs  xflgs  owner     waitlist
e070976c 0     0     651566   e0937000  29      1       400    e00023b0  0
e06fff6c 0     6     50479c   e0807000  23      1       0      0         0
2526 modified, 4750 total, 8192 hash buckets, 2048 buffer size
This invocation mode is used to display information about OnLine's Least Recently Used (LRU) queue structures. These queues are a means that OnLine uses to distribute the allocation of the buffer pool in as even a method as possible. When OnLine determines that it needs to place a page into the buffer pool from disk, it needs to be able to find a free buffer in which to place the page.
Since OnLine tries to keep pages in shared memory that may be useful, it uses an LRU algorithm to determine which pages to flush to disk. Depending on the OnLine configuration parameters, there are between 3 and 32 pairs of LRU queues. These queues are in pairs, consisting of a clean queue and a dirty queue. When a page is initially brought into shared memory, it is on a clean queue. When it is modified, it is moved to a dirty queue. Every time a buffer is accessed, it is moved to the head of its queue, called the MRU (most recently used) end.
This movement from the LRU end to the MRU end is done in two steps. First, a latch on the queue is acquired and the buffer is removed from the queue. The latch is then released. Now, OnLine attempts to reobtain the latch. If the latch is reobtained, the buffer is placed on the MRU end of the queue. If OnLine cannot obtain the latch for the second time, it tries the next available queue (either clean or dirty as the case may be). It keeps trying to obtain latches on subsequent queues until it gets one, at which time it places the buffer on the MRU end of the queue.
The effect of this queue dance is that buffers may move from queue to queue in an attempt to balance the load between the queues. The tbstat -R command lets the DBA see how often the queues are being accessed and how their cleaning is going on. This is more of a tuning tool than a debugging tool. The DBA will use this command to decide whether or not he has enough queues allocated and to decide threshold points for the page cleaner daemons. See the section on tuning in Chapter 8 for more details.
joe 58> tbstat -R
LRU 0:     6 (  0.9%) modified of  643 total
.... (several lines deleted)
57 dirty, 4750 queued, 4750 total, 8192 hash buckets, 2048 buffer size
start clean at 60% dirty, stop at 50%; first pass search 70%
The first line of the output gives the total number of queues allocated by the LRUS parameter in $TBCONFIG. For each pair, the output gives you a raw number of modified buffers, its percentage of the whole, and the total number of buffers on each queue. In the last line, the start clean is the $TBCONFIG LRU_MAX_DIRTY parameter and the stop at is the LRU_MIN_DIRTY parameter. Notice that the "total" buffer counts are not necessarily equal for each queue pair. This is because of the "queue dance" just discussed in which the buffers move from queue to queue to balance the load.
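A script can watch for queues that have crossed the start-clean threshold. The sample lines below are hypothetical stand-ins for tbstat -R output, and the 60% threshold is assumed to match the "start clean at 60%" value shown above:

```shell
#!/bin/sh
# Sketch: flag any LRU queue whose modified percentage exceeds LRU_MAX_DIRTY.
# The sample lines stand in for real "tbstat -R" output (the second line is
# invented to illustrate a queue over the threshold).
lru='LRU 0:     6 (  0.9%) modified of  643 total
LRU 2:   412 ( 64.1%) modified of  643 total'

flagged=$(echo "$lru" | awk 'match($0, /\( *[0-9.]+%/) {
    pct = substr($0, RSTART + 1, RLENGTH - 2)   # strip "(" and "%"
    if (pct + 0 > 60) print $0                  # 60 = assumed LRU_MAX_DIRTY
}')
echo "$flagged"
```

If this reports queues over the threshold most of the time, the page cleaners are not keeping up.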
The tbstat -F invocation provides information about the page cleaners or page flushers allocated to your OnLine system. It gives you historical data on the different types of writes that have been used to flush your buffers off to disk. It also tells you instantaneous status on the page cleaners currently allocated for your engine.
Different types of writes are more or less efficient than others. By noting the types of writes that your system is performing, you may be able to affect your overall throughput by minimizing some and maximizing others.
joe 59>tbstat -F
Fg Writes  LRU Writes  Idle Writes  Chunk Writes
0          279744      5142         471793

address   flusher  snooze  state  data
e0001c30  0        60      I      0 = 0
e0001c90  1        60      I      0 = 0
states: Exit Idle Chunk Lru
Fg Writes | Writes performed by the server process itself. If your system is doing a lot of these writes, your server is constantly having to interrupt itself and go off and do the page cleaner's job. Either increase the number of page cleaners or start them working earlier by lowering LRU_MAX_DIRTY. |
LRU Writes | Writes performed by the page cleaners when woken up by the engine, not during the normal operations of the page cleaners. The page cleaners can have their sleep time overridden by the engine in two ways. First, if 16 fg writes occur the engine will assume that the page cleaners need to have their sleep interrupted and will wake them up. The other reason is when the percentage of dirty buffers on the LRU queue goes above LRU_MAX_DIRTY before the cleaner's snooze time is up, the engine will awaken the page cleaners and have them begin cleaning the buffers. In this example, this is the case, as LRU_MAX_DIRTY is set at 60%. |
Idle Writes | Writes performed by the page cleaners as part of their normal operations as they wake up at the expiration of their snooze time. |
Chunk Writes | Writes that happen during checkpoints. Chunk writes are the most efficient writes performed by OnLine because they are performed as sorted writes. They are also performed when no other writes are allowed and thus can receive more CPU time. |
address | Memory address of the cleaner in the page cleaner table. Although the number of page cleaners is allocated according to the CLEANERS parameter in $TBCONFIG, the table will always have 32 slots, which is the maximum number of page cleaners.
flusher | Sequential ID number assigned to this page cleaner.
snooze | Current sleep time for this cleaner. This value is set by the page cleaner based upon the amount of activity it is getting, but is a maximum of 60 seconds.
state | Code indicating what the page cleaner is doing at this instant:
 | I | Idle (sleeping).
 | C | Checkpoint (chunk) write in progress.
 | L | LRU write in progress.
 | E | Exiting. Either the system is shutting down or the page cleaner timed out on its last run, often due to a heavy workload. If it timed out, you'll get a note in your online message logfile.
data | Used in conjunction with the state field above, this code is written in decimal, followed by an equals sign, then in hex:
 | If state=C | Data is the chunk number to which buffers are being written.
 | If state=L | Data is the LRU queue from which the page cleaner is writing.
The tbstat -F command is often used in conjunction with tbstat -R to tune the page cleaner system. We'll get into more detail in the section on tuning in Chapter 11, but one point needs to be made now.
Different OnLine applications have different tuning needs, and you have to be sensitive to what your system requires. This is one area where the DBA needs to make tradeoffs. Most of the early information on tuning OnLine emphasized the efficiency of Chunk Writes, and a lot of effort was expended to cause systems to do most of their writes during checkpoints to maximize throughput.
The only problem with that technique is that users started screaming. Even though the throughput may have been much better, the users would have frequent interruptions because writes are prevented during checkpoints. The system would appear to freeze up every few minutes. This would be OK in a batch processing environment, but not in an OLTP environment. Here, the perceived utility of the system was enhanced by forcing the less efficient LRU writes to carry the page-cleaning load while minimizing interruptions to the users when the checkpoints were doing their admittedly more efficient but more intrusive work. The lesson: Know your system. Know your users. Don't follow rules slavishly.
This option will give the DBA information about the physical and logical logging of the OnLine instance. There's a lot of information in this invocation, and it is used for both tuning and troubleshooting. We'll break it up into three sections. The first section contains physical logging statistics.
joe 62>tbstat -l

Physical Logging
Buffer bufused  bufsize  numpages numwrits pages/I/O
  P-1  69       128      713655   5901     120.94
      phybegin physize  phypos   phyused  %used
      117202   6000     2999     69        1.15
Logical Logging
Buffer bufused  bufsize  numrecs   numpages numwrits recs/pages pages/I/O
  L-1  8        64       31030148  668262   21047    46.4       31.8
address  number flags    uniqid begin    size   used   %used
e06d26f0 1      F------  0      70f4f4   30000  0       0.00
e06d270c 2      F------  0      716a24   30000  0       0.00
e06d2744 3      U---C-L  2296   907533   30000  14897  49.65
address | Address of the logfile descriptor. | |
number | Logfile number. | |
flags | Status of each logfile: | |
F | Free, available for use. | |
B | Backed up. | |
C | Current logfile, now receiving transactions. | |
U | In use, still contains active transactions. | |
A | Newly added. Will become available for use after the next archive. | |
L | Contains last completed checkpoint. | |
uniqid | Unique ID number of the logfile. | |
begin | Location of the beginning page of the logfile. | |
size | Size of the logfile in pages. | |
used | Number of pages currently used. | |
%used | Used/Size. How full is this logfile? |
The first two sections of this output see the most use as a tuning tool. There are more specifics in the section on tuning, but the basic concept is to keep the pages-per-I/O statistics close to the actual number of pages in the buffer (bufsize), to ensure that each write does the most possible work. This is logical, since the write has to process the buffer anyway; you may as well have it as full as possible to wring the most work out of each of your writes.
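To make the ratio concrete, the pages/I/O figures are simply numpages divided by numwrits. A quick awk check (a sketch using the numbers from the sample output above, not a live system) confirms the arithmetic:

```shell
# Recompute pages/I/O from the sample "tbstat -l" figures shown above.
# Physical log buffer: numpages=713655, numwrits=5901
awk 'BEGIN { printf "physical pages/I/O: %.2f\n", 713655 / 5901 }'
# Logical log buffer: numpages=668262, numwrits=21047
awk 'BEGIN { printf "logical pages/I/O: %.1f\n", 668262 / 21047 }'
```

Compare each result against the corresponding bufsize (128 and 64 in the sample); the closer the ratio gets to bufsize, the fuller each buffer is when it is flushed.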
The last section will become one of your most familiar friends (or enemies) as a DBA. Since the logfiles are so important to recovery and since the consequences of filling them up are so disastrous, you need to keep a close eye on your logs. First, you need to have continuous backup going on your logs at all times. If you forget and don't turn it on, you'll have problems when they get full. Next, you're looking for incipient long transactions. If you see that you have many logfiles that show a B for backed up and do not show an F for free, you know that some sort of a transaction is still pending. If you have lots of logs, you can probably just let it go and wait for the transaction to complete. If you begin to get short of logs, you need to think about killing the runaway transaction. Do it early enough that you'll have ample logfiles to log all of the activities that accompany a rollback. If you allow the logfiles to reach the percentage full specified in the LTX_HWM parameter, the engine will lock all other users out of the system, causing their jobs to freeze up.
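A small shell sketch can automate that watch. The here-document below stands in for the logfile section of a live tbstat -l run (on a real system you would pipe tbstat output through the same awk filter), and MIN_FREE is a hypothetical threshold you would tune to your own number of logs:

```shell
# Sketch: count free logical logs (flags starting with F) and warn
# when few remain. The here-document reproduces the sample logfile
# section above; MIN_FREE is a hypothetical threshold.
MIN_FREE=2

free=$(awk '$3 ~ /^F/ { n++ } END { print n + 0 }' <<'EOF'
e06d26f0 1 F------ 0    70f4f4 30000 0     0.00
e06d270c 2 F------ 0    716a24 30000 0     0.00
e06d2744 3 U---C-L 2296 907533 30000 14897 49.65
EOF
)

echo "free logical logs: $free"
if [ "$free" -lt "$MIN_FREE" ]; then
    echo "WARNING: only $free free logs; check for a long transaction"
fi
```

Run from cron every few minutes, a check like this gives you time to kill a runaway transaction while there are still enough free logs to absorb the rollback.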
These two invocations of the tbstat utility provide the closest thing to a dashboard for the OnLine engine that Informix provides. It reports on statistics kept in shared memory that relate to the performance of the OnLine engine. This is probably the number one tuning and monitoring tool available to the DBA. The statistics are useful in evaluating the overall efficiency of your database system and in measuring the effects of changes in parameters upon your database operations.
The outputs for both the uppercase and lowercase invocations are the same except that the tbstat -P output includes an additional field, BIGreads, as the first element in the report. Everything else is identical. The output is divided into four lines of data with headings above each line. We'll look at each line individually:
joe 63>tbstat -P

Profile
BIGreads dskreads pagreads bufreads  %cached dskwrits pagwrits bufwrits %cached
235      14970755 22166265 498552631 97.00   543747   2139408  49870693 98.91
isamtot   open    start    read      write    rewrite delete  commit  rollbk
424714409 5813437 23348999 175473799 10989374 5499045 2603325 8699910 101
isamtot | Total number of ISAM (Indexed Sequential Access Method) calls. Does not necessarily correspond to queries, as a single query may make many ISAM calls. |
open | Calls that open a tablespace. |
start | Increments by one when positioning within an index. |
read | Increments when the read function is called. |
write | Increments when the write function is called. |
rewrite | Increments when an update occurs. |
delete | Increments when a row is deleted. |
commit | Increments when a transaction is successfully committed. |
rollbk | Increments when a transaction is rolled back. |
ovtbls ovlock ovuser ovbuff usercpu   syscpu   numckpts flushes
0      0      0      0      387006.97 41611.33 294      587
ovtbls | Number of times the tblspace table has overflowed. For the ovXXXX statistics that follow, a small number of occurrences is not necessarily critical; if the numbers are large or keep increasing, increase the corresponding parameter. |
ovlock | Number of times a process attempted to exceed the LOCKS parameter. |
ovuser | Number of times a process attempted to exceed the USERS parameter. |
ovbuff | Number of times a process attempted to exceed the BUFFERS parameter. |
usercpu | Cumulative total CPU time by OnLine processes, in seconds. |
syscpu | Total system CPU time (UNIX system calls). |
numckpts | Number of checkpoints since last boot. |
flushes | Number of times the buffer pool has been flushed to disk. |
bufwaits lokwaits lockreqs  deadlks dltouts lchwaits ckpwaits compress
21748    8046     476500421 9       0       1594907  1664     2220019
bufwaits | Increments when a process has to wait for a buffer. |
lokwaits | Increments when a process has to wait for a lock. |
lockreqs | Increments when a request for a lock occurs. |
deadlks | Increments when a potential deadlock situation is automatically prevented by the engine. (It kills one of the processes.) |
dltouts | Increments when a distributed deadlock timeout occurs. |
lchwaits | Increments when a process is forced to wait for a latch on a resource. |
ckpwaits | Increments when a process is forced to wait for a checkpoint. |
compress | Increments whenever a data page is compressed. |
The first section concerning buffers and cache percentages is most useful in tuning your OnLine parameters for performance. You are looking mostly at the %cached fields for reading and writing. If these numbers are low, it is possible that you can increase your performance by tuning the buffer parameters. Be cautious about reading these figures immediately after starting the database. Until the database has "warmed up" and has representative data in the buffer pool, these %cached statistics will be low. One good method is to let the database run for a while, and then do a tbstat -z to zero out all of the statistics. The statistics gathered from then on will not be tainted by the initial loading of the caches that occurs when the system is first started up.
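If you sample these figures regularly, a one-liner is enough to pull out the two cache percentages. The echoed line below reproduces the sample profile values above; on a live system you would pipe tbstat -p through the same filter instead:

```shell
# Sketch: extract the read and write %cached fields from the profile
# line of "tbstat -p". The echoed values reproduce the sample output
# above and stand in for live tbstat output.
echo "14970755 22166265 498552631 97.00 543747 2139408 49870693 98.91" |
awk '{ printf "read %%cached: %s   write %%cached: %s\n", $4, $8 }'
```

Logging those two numbers over time (after a tbstat -z once the system has warmed up) gives you a baseline to judge the effect of any BUFFERS change.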
The second section will give you an idea of the types of activities that your system is performing. These statistics are meaningful only if you have something to compare them to. As you gather your performance statistics, you will begin to develop a feel for the patterns you can expect.
The ovXXXX statistics on the next line are an early warning of a need to increase the value of their respective parameters in your $TBCONFIG file. For example, if you are getting high numbers for ovlock, increase the LOCKS parameter in your $TBCONFIG file. If you're regularly exceeding the parameters, your users are probably already complaining about jobs that don't complete.
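A sketch of such an early-warning check, using hypothetical counter values (here ovlock has overflowed three times) in place of live tbstat -p output:

```shell
# Sketch: flag any nonzero overflow counter from the ovXXXX line of
# "tbstat -p". The echoed values are hypothetical, standing in for
# the ovtbls/ovlock/ovuser/ovbuff fields of live output.
echo "0 3 0 0" |
awk '{ split("ovtbls ovlock ovuser ovbuff", name, " ")
       for (i = 1; i <= 4; i++)
           if ($i > 0)
               printf "%s overflowed %d times; raise its parameter\n", name[i], $i }'
```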
The lokwaits and lchwaits statistics on the last line can tip you off to a need to increase the value of some of your $TBCONFIG parameters. They can also tip you off to poorly designed programs that may be hogging resources or causing contention between themselves and other programs.
The status script included in Chapter 6 makes extensive use of these statistics in preparing a more complete "dashboard" for monitoring the activity of your database engine.
The tbstat command can be run with any combination of options. You can stack the options, as in:
tbstat -lsp
You can also use tbstat -r with an optional integer. This option runs tbstat in a loop, pausing the given number of seconds between iterations. It can be quite useful if you are trying to establish trends or are just anxious to find out when something happens. With no number, it runs with a 5-second pause. For example, the following command runs the tablespaces option every 15 seconds:
tbstat -tr 15
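For longer-term trend gathering it helps to timestamp each snapshot, which plain tbstat -r does not do. The following sketch wraps tbstat in a loop; the stub fallback, INTERVAL, and COUNT settings are conveniences for illustration, not tbstat features. On a live system you would keep CMD="tbstat -t" and use a 15-second interval:

```shell
# Sketch: a timestamped variant of "tbstat -tr 15" that appends each
# snapshot to a logfile so trends can be reviewed later. The stub
# fallback lets the sketch run on machines without tbstat installed.
CMD="tbstat -t"
command -v tbstat >/dev/null 2>&1 || CMD="echo stub-snapshot"
INTERVAL=1                  # use 15 for real monitoring
COUNT=2                     # number of snapshots to collect
LOG=$(mktemp)

i=0
while [ "$i" -lt "$COUNT" ]; do
    { echo "=== $(date) ==="; $CMD; } >> "$LOG"
    i=$((i + 1))
    if [ "$i" -lt "$COUNT" ]; then sleep "$INTERVAL"; fi
done
echo "collected $COUNT snapshots in $LOG"
```

Because each entry carries a timestamp, you can later line the snapshots up against your online message log when reconstructing a transient problem.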
The tbstat -a command gives you one output that includes the following options:
b | buffers |
c | configuration |
d | dbspaces |
k | locks |
l | logs |
m | message file |
p | profile |
s | latches |
t | tablespaces |
u | users |
The tbstat -m option runs the UNIX tail command on your online message log.
The tbstat -c option prints your $TBCONFIG file.
Gives you a short combined output of:
u | users |
p | profile |