System process monitors can be a vital tool in determining the health of a running machine. Ensuring that the required processes are running and that the total number of each type of running process is appropriate is a good way to maintain system stability. The downside of these types of monitors is that they let you know only which processes are running and how many there are. They don't give you an indication of the health of each individual process.
This script dives a little deeper into the condition of processes. By using the ps command with a customized format, we'll be able to monitor the age, proportion of CPU usage, virtual-memory consumption, and amount of CPU time consumed by a particular process. If you are monitoring multiple instances of any given process, each instance will be held up to the standard being monitored.
One other feature of this process monitor is that it can be configured not only to warn you of impending peril from processes whose operational values are out of bounds, but also to take action in the form of killing the aberrant process when necessary. The monitor could be modified easily to perform other actions besides killing a process.
Using historical data, you can sometimes predict when a specific application will start to consume too many resources. It was one such application I was working with that prompted me to write this monitor. The monitor helped in characterizing exactly when the application ran out of control and in finding the cause of the behavior. Both were very helpful in fixing the problem.
The syntax for monitor configuration is fairly straightforward, with five colon-separated fields as shown in the following example. The fields are as follows: the process command, the indicator to track, a lower threshold, an upper threshold, and the kill option. You can configure multiple processes by including several records in the configuration string.
kill_plist="dhcpd:pcpu:15:30:1 sshd:pcpu:15:30:1"
The first field is the process command itself. This will be slightly different from, and hopefully simpler than, the traditional ps -ef output. The ps -ef default output (-e for all processes, -f for a full-format listing) includes the commands that are running, as well as any arguments they were passed. The ps -eo comm output is formatted to include only the commands that are running on a system, without any path or argument information. With this switch combination (-eo) you can also format your output to show many other values, such as memory size, process age, process CPU time, and so on. (On some UNIX systems, you may need to define the UNIX95 variable within the script for the ps -eo command to function properly. The UNIX95 variable can be set to anything you'd like; it just needs to be defined.) When specifying the process for our script to monitor, you'll want to use only the command name, as this is what the script will be looking for.
The second field contains the indicator you want to track. The options are cputime, which measures the number of minutes the CPU has allocated to the process; etime, which is the elapsed time in minutes since the process began running; pcpu, which represents the current percentage of the CPU capacity the process is consuming; and vsize, which shows the virtual-memory size in kilobytes for the process.
The third and fourth fields contain the desired lower and upper thresholds for the indicator you're tracking.
The fifth and final field is the kill option. It is a value from 0 to 3:
0: Send a notification when either the low warning or high error threshold has been crossed, but don't kill the process.
1: Send a warning notification when the low threshold has been crossed or an error notification when the high threshold has been crossed, and kill the process.
2: Send only a low-level warning notification when either the low or high threshold has been crossed, and kill the process.
3: Kill the process without any notification at all.
Note that for safety, if the kill option is not set or is set to anything but one of the values outlined here, processes will not be killed. Notice that there are two levels of notification. I have used alphanumeric paging for the high level (error status) and e-mail for the low level (warning status). You may want to implement the notification method as appropriate for your needs.
The first section of the script sets up a few configuration variables, which alternatively could be stored in a separate configuration file and sourced each time the script runs through the loop. This would allow for live configuration changes to the script. The debug value is for testing and the sleeptime value represents the amount of time to delay between each run. The kill_plist variable is the main configuration value that lets the script know what processes and values it should be watching.
#!/bin/sh
debug=1
sleeptime=3
kill_plist="dhcpd:pcpu:15:30:1 sshd:pcpu:15:30:1"
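The idea of keeping these settings in a separate file and re-reading them on every pass could be sketched as follows. This is only an illustration under assumptions: the function name load_config and the file you pass to it are hypothetical and are not part of the original script.

```shell
#!/bin/sh
# Hypothetical helper, not part of the original script.
load_config ()
{
    # Start from built-in defaults so that a missing file is harmless.
    debug=1
    sleeptime=3
    kill_plist="dhcpd:pcpu:15:30:1 sshd:pcpu:15:30:1"
    # If the named file is readable, source it; its values override the defaults.
    if [ -r "$1" ]
    then
        . "$1"
    fi
}
```

Calling load_config with the full path of the settings file at the top of the main loop would pick up any edits made to that file while the monitor is running.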
The following function performs all notifications and process terminations in the script. It is called with seven sequentially numbered parameters. The positional variables are somewhat difficult to understand and their values could have been assigned to more meaningfully named variables before they were used, for ease of debugging later. To streamline the script a little, I didn't do this.
notify ()
{
case $2 in
0)
# Warn/error level and don't kill..
echo "$1: $3 process id $4 found with $5 $7. Should be less than $6."
;;
1)
# Warn/error level and kill..
echo "$1: $3 process id $4 found with $5 $7. Should be less than $6."
test $debug -eq 0 && kill $4
;;
2)
# Warning level only...
echo "Warning: $3 process id $4 found with $5 $7. Should be less than $6."
test $debug -eq 0 && kill $4
;;
3)
# Just kill, don't warn at all..
test $debug -eq 0 && kill $4
;;
*)
echo "Warning: killoption not set correctly, please validate configuration."
;;
esac
}
Here, for ease of reference, I define all of the positional parameters passed to this function:
$1: Text used for building the notification string; it distinguishes a warning from an error
$2: The kill option, which has a possible value of 0-3
$3: The process name that is being monitored
$4: The process ID of the process being monitored
$5: The current value of the indicator you are tracking
$6: The monitor's lower threshold
$7: The text equivalent of the indicator you are tracking
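For comparison, here is a rough sketch of the same function with the positional parameters copied into named variables first. The name notify_named and the n_-prefixed variables are assumptions made for this illustration (the prefix avoids clobbering the script's own globals, since plain sh has no local variables); the messages follow the function shown earlier.

```shell
#!/bin/sh
debug=1    # non-zero means report only; nothing is killed

# Hypothetical variant of notify() using named variables.
notify_named ()
{
    n_level=$1        # "Warning" or "Error"
    n_killoption=$2   # kill option, 0-3
    n_process=$3      # command name being monitored
    n_pid=$4          # process ID
    n_current=$5      # current value of the indicator
    n_threshold=$6    # lower threshold
    n_label=$7        # text describing the indicator
    n_message="$n_process process id $n_pid found with $n_current $n_label. Should be less than $n_threshold."
    case $n_killoption in
    0)
        echo "$n_level: $n_message"
        ;;
    1)
        echo "$n_level: $n_message"
        if [ "$debug" -eq 0 ]; then kill "$n_pid"; fi
        ;;
    2)
        echo "Warning: $n_message"
        if [ "$debug" -eq 0 ]; then kill "$n_pid"; fi
        ;;
    3)
        if [ "$debug" -eq 0 ]; then kill "$n_pid"; fi
        ;;
    *)
        echo "Warning: killoption not set correctly, please validate configuration."
        ;;
    esac
}
```

The extra assignments cost a few lines, but each branch now reads without having to look up what $4 or $7 means.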
This is also a good example of how a function can reduce the length and complexity of a script. The body of this function is code that would have to be repeated eight times throughout the script if it were not placed in a function. An older version of this script was written this way. Putting the code into a function reduced the script's length by roughly 40 percent.
The following code is the beginning of the main loop. The script is intended to be run at system startup; it will then be run continuously through an infinite loop. After each iteration completes, the script will sleep for a predetermined time before the next iteration. The first part here is a nested loop that progresses through each record in the configuration string to parse its fields and set up the monitor.
while :
do
for pline in $kill_plist
do
process=`echo $pline | cut -d: -f1`
process="`echo $process | sed -e "s/%20/ /g"`"
type=`echo $pline | cut -d: -f2`
value=`echo $pline | awk -F: '{print $3}'`
errval=`echo $pline | awk -F: '{print $4}'`
killoption=`echo $pline | awk -F: '{print $5}'`
The process variable is assigned the first field in the configuration record (pline). It is possible that the process command name you're monitoring will consist of more than one word, separated by spaces. Because the records in the configuration string are themselves separated by spaces, such a name must be written with each space replaced by %20, a commonly used substitute for the space character (as in URL encoding, for example). The sed command then converts each %20 back into a space.
The type variable is the second field in the configuration record. As mentioned, it specifies the performance indicator to watch: cputime (amount of CPU time consumed), etime (elapsed time or age of process), pcpu (current percentage of the CPU consumed), or vsize (virtual-memory size).
The value variable holds the lower warning threshold for the monitored value, taken from the third field.
The errval variable is assigned the value of the upper error threshold for the monitored value, taken from the fourth field.
The killoption variable is assigned the final field of the configuration record and specifies an action to perform when the process deviates from the normal range.
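As an alternative to calling cut and awk once per field, the five fields can be split in a single step with the shell's read built-in. The sketch below assumes a hypothetical helper named parse_record, which is not part of the original script; it sets the same five variables and performs the same %20 conversion.

```shell
#!/bin/sh
# Hypothetical helper: split one configuration record into its fields.
parse_record ()
{
    # IFS=: applies only to this read, so the rest of the script is unaffected.
    IFS=: read -r process type value errval killoption <<EOF
$1
EOF
    # Turn each %20 in the command name back into a space.
    process=`echo "$process" | sed -e 's/%20/ /g'`
}
```

A record such as my%20daemon:vsize:1000:2000:1 would leave process set to "my daemon". A record with only four fields leaves killoption empty.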
If the kill option was not specified initially, we set it to be the default kill option. This makes sure no processes are killed unless one of the options for doing so is explicitly used.
if [ "$killoption" = "" ]
then
killoption=0
fi
test $debug -gt 0 && echo "Kill $process processes if $type is greater than $errval"
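The default assignment can also be written with the shell's ${parameter:-word} expansion, which substitutes a fallback when the parameter is empty or unset. A minimal sketch, where default_killoption is a hypothetical name used only for illustration:

```shell
#!/bin/sh
# Hypothetical helper: print the kill option, or 0 when none was given.
default_killoption ()
{
    echo "${1:-0}"
}
```

Inside the loop, the single line killoption=${killoption:-0} would have the same effect as the if block shown earlier.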
Next we pare down the full list of processes running on the system to the ones running the command being monitored. Then we start a loop that iterates through the remaining processes.
for pid in `ps -eo pid,comm | egrep "${process}$|${process}:$" | grep -v grep |
awk '{print $1}'`
do
For each process ID, the script has to gather the pertinent information. The embedded ps command gathers only the specific information we want.
test $debug -gt 0 && echo "$process pid $pid"
pid_string=`ps -eo pid,cputime,etime,pcpu,vsize,comm |
grep $pid | egrep "${process}$|${process}:$" | grep -v grep`
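One caveat with filtering on grep $pid is that the pattern matches anywhere in the line, so process ID 12 would also match 123, or a memory size that happens to contain those digits. A possible refinement, sketched here with a hypothetical helper named select_pid_line, is to compare the first column exactly with awk.

```shell
#!/bin/sh
# Hypothetical helper: read ps-style lines on standard input and print
# only the line whose first column is exactly the requested process ID.
select_pid_line ()
{
    awk -v want="$1" '$1 == want'
}
```

It could take the place of the first grep in the pipeline, for example ps -eo pid,cputime,etime,pcpu,vsize,comm | select_pid_line "$pid".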
The following case statement is the heart of the monitor. The script tests for the monitor type (cputime, etime, pcpu, or vsize); cputime is the first monitor type listed. The code for each type is slightly different, but all are very similar. Here we obtain the process time from the ps output, the number of colon-separated fields that the proc_time variable contains, and the minutes portion of the time.
case $type in
"cputime")
proc_time=`echo $pid_string | awk '{print $2}'`
fields=`echo $proc_time | awk -F: '{print NF}'`
proc_time_min=`echo $proc_time | awk -F: '{print $(NF-1)}'`
Both of these are needed because the format of the time value varies depending on the amount of time it represents. The cputime and etime values take the form days-hours:minutes:seconds, hours:minutes:seconds, or, for short times, just minutes:seconds. A low value might look something like 00:28 for 28 seconds. A high value could be 1-18:32:29 for 1 day, 18 hours, 32 minutes, and 29 seconds. Every form has to be processed and converted to minutes. (Seconds are dropped for simplicity.)
Of the four performance indicators, the logic for handling the cputime and etime values is the most complex because of these varying formats.
if [ $fields -lt 3 ]
then
proc_time_hr=0
proc_time_day=0
else
proc_time_hr=`echo $proc_time | awk -F: '{print $(NF-2)}'`
fields=`echo $proc_time_hr | awk -F- '{print NF}'`
if [ $fields -ne 1 ]
then
proc_time_day=`echo $proc_time_hr | awk -F- '{print $1}'`
proc_time_hr=`echo $proc_time_hr | awk -F- '{print $2}'`
else
proc_time_day=0
fi
fi
Once all time values have been determined, we convert them to minutes for comparison with the monitor thresholds.
curr_cpu_time=`echo "$proc_time_day*1440+$proc_time_hr*60+$proc_time_min" | bc`
test $debug -gt 0 && echo "Current cpu time for $process pid $pid is $curr_cpu_time minutes"
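Since the same parsing is needed again for etime, the whole conversion could be folded into one helper. The following is a sketch under assumptions, not the author's code: to_minutes is a hypothetical name, and the work is done in a single awk call instead of several awk calls plus bc.

```shell
#!/bin/sh
# Hypothetical helper: convert [[days-]hours:]minutes:seconds to whole minutes.
to_minutes ()
{
    echo "$1" | awk -F: '{
        min = $(NF-1)        # minutes are always the second-to-last field
        hr = 0
        day = 0
        if (NF >= 3) {
            # The hours field may carry a "days-" prefix, as in 1-18.
            n = split($(NF-2), parts, "-")
            if (n == 2) {
                day = parts[1]
                hr = parts[2]
            } else {
                hr = parts[1]
            }
        }
        print day * 1440 + hr * 60 + min
    }'
}
```

Using the sample values from the text, to_minutes 00:28 prints 0 and to_minutes 1-18:32:29 prints 2552 (1440 + 1080 + 32).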
If the current cputime value is between the warning and error thresholds, we call the notify() function with the appropriate switches. It will handle output and process termination, as described earlier.
if test $curr_cpu_time -gt $value -a $curr_cpu_time -lt $errval
then
notify "Warning" $killoption "$process" $pid $curr_cpu_time $value "minutes of CPU time"
If the current cputime is at or above the error threshold, we call the notify() function with a different set of options.
elif test $curr_cpu_time -ge $errval
then
notify "Error" $killoption "$process" $pid $curr_cpu_time $value "minutes of CPU time"
The final condition handles the case where there is no issue with the running process: the script just issues a message saying so.
else
test $debug -gt 0 && echo "process cpu time ok"
fi
;;
The etime monitor is nearly the same as the cputime monitor. The primary difference is the field that is extracted from the ps output to get the current process age.
"etime")
proc_age=`echo $pid_string | awk '{print $3}'`
fields=`echo $proc_age | awk -F: '{print NF}'`
proc_age_min=`echo $proc_age | awk -F: '{print $(NF-1)}'`
Once again, you convert the age of the process to values that will then be used to calculate the age in minutes.
if [ $fields -lt 3 ]
then
proc_age_hr=0
proc_age_day=0
else
proc_age_hr=`echo $proc_age | awk -F: '{print $(NF-2)}'`
fields=`echo $proc_age_hr | awk -F- '{print NF}'`
if [ $fields -ne 1 ]
then
proc_age_day=`echo $proc_age_hr | awk -F- '{print $1}'`
proc_age_hr=`echo $proc_age_hr | awk -F- '{print $2}'`
else
proc_age_day=0
fi
fi
Now expressing the process age in minutes makes the threshold check very simple.
curr_age=`echo "$proc_age_day*1440+$proc_age_hr*60+$proc_age_min" | bc`
test $debug -gt 0 && echo "Current age of $process pid $pid is $curr_age minutes"
We now perform the comparison checks against the monitor thresholds as before. The first check determines if the current process age is between the low and high thresholds. The second sees if the current age is at or above the high threshold. In both these cases, call the notify() function for end-user output and process termination. The final possibility is that there is no issue, and in this case the script gives a message stating that the process is OK.
if test $curr_age -gt $value -a $curr_age -lt $errval
then
notify "Warning" $killoption "$process" $pid $curr_age $value "minutes of elapsed time"
elif test $curr_age -ge $errval
then
notify "Error" $killoption "$process" $pid $curr_age $value "minutes of elapsed time"
else
test $debug -gt 0 && echo "process age ok"
fi
;;
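The three-way comparison is identical for every indicator, so it could be captured once in a small function. This sketch assumes a hypothetical helper named classify, which is not part of the original script; it applies the same rules as the tests above (a warning strictly between the thresholds, an error at or above the upper one).

```shell
#!/bin/sh
# Hypothetical helper: classify a value against the two thresholds.
# $1 = current value, $2 = lower (warning) threshold,
# $3 = upper (error) threshold; all are assumed to be integers.
classify ()
{
    if [ "$1" -ge "$3" ]
    then
        echo error
    elif [ "$1" -gt "$2" ]
    then
        echo warning
    else
        echo ok
    fi
}
```

With thresholds of 15 and 30, a value of 20 is classified as warning, 30 as error, and 15 as ok.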
The test for percentage CPU usage is quite simple. The value to be compared to the thresholds is obtained directly from the ps output. There is no need for further calculation as was needed in the code for the cputime and etime monitors.
"pcpu")
curr_proc_cpu=`echo $pid_string | awk '{print $4}' |
awk -F. '{print $1}'`
test $debug -gt 0 && echo "Current percent cpu of $process pid $pid is $curr_proc_cpu"
Once again, we compare the percentage CPU value with the configured low and high thresholds and call the notify() function to alert the user and perform any required process termination. If the CPU percentage has not crossed the low threshold, the code outputs an "OK" message.
if test $curr_proc_cpu -gt $value -a $curr_proc_cpu -lt $errval
then
notify "Warning" $killoption "$process" $pid $curr_proc_cpu $value "percent of the CPU"
elif test $curr_proc_cpu -ge $errval
then
notify "Error" $killoption "$process" $pid $curr_proc_cpu $value "percent of the CPU"
else
test $debug -gt 0 && echo "process cpu percent ok"
fi
;;
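The second awk call in the pcpu code exists only to drop the fractional part, because test compares integers. The same truncation can be done with a shell parameter expansion, as in this sketch; whole_percent is a hypothetical name used only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: drop the decimal point and everything after it.
whole_percent ()
{
    echo "${1%%.*}"
}
```

For example, whole_percent 12.7 prints 12. Note that this truncates; it does not round.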
The vsize monitor is as simple as the percent-CPU monitor. We obtain the current process's memory footprint directly from the ps output.
"vsize")
curr_proc_size=`echo $pid_string | awk '{print $5}'`
test $debug -gt 0 && echo "Current size of $process pid $pid is $curr_proc_size"
We have to check the current memory size against the monitor thresholds one last time. If it falls in the warning or error range, we call the notify() function for output and termination. If not, the code outputs that the process size is OK.
if test $curr_proc_size -gt $value -a $curr_proc_size -lt $errval
then
notify "Warning" $killoption "$process" $pid $curr_proc_size $value "blocks of virtual size"
elif test $curr_proc_size -ge $errval
then
notify "Error" $killoption "$process" $pid $curr_proc_size $value "blocks of virtual size"
else
test $debug -gt 0 && echo "process virtual size ok"
fi
;;
Finally we close the monitor case statement and the two inner processing loops. The script then goes to sleep for the configured amount of time before starting over again. It will then continue its monitoring until the monitor itself dies or is killed or the system is shut down.
esac
done
done
sleep $sleeptime
done