CHAPTER 9

Generating Reports and
Analyzing Logs

You need to know about errors on the systems in your environment before they turn into major problems. You also need the ability to see the actions your automation system is performing. This means you'll need two types of reporting:

  • Log reports that capture bad strings, report them immediately, and ignore the rest
  • Log reports that ignore "okay" strings and report on what's left

You want to know right away if a system has serious hardware issues or major application issues. We're going to run a reporting system that looks for unwanted words or phrases. Here is one such unwanted message:

using specific query-source port suppresses port randomization and can be insecure

This particular message means that you're running a version of BIND that works around a serious vulnerability, but that a configuration directive overrides the workaround. If this were going on in your network, you'd want to know as soon as possible. We'll use real-time alerting to pick up on this condition as well as others.

Reporting on cfengine Status

You have two main ways of tracking the actions and changes that cfengine makes across your infrastructure. First, you set syslog=on globally at your site, so that cfengine logs all actions to syslog. Second, you have the output of cfagent itself, which is (of course) not included in the syslog entries. When cfexecd runs cfagent (as it does at our site), it always stores the output, including the output of any commands run by cfagent, in the $workdir/outputs directory. cfexecd also e-mails the output of cfagent to the sysadm e-mail address as defined in cfagent.conf (or, as in our case, in a file imported from cfagent.conf).

We are very interested in the contents of $workdir/outputs, and would like to aggregate them centrally. We can later run interactive checks such as simple grep searches, or directly view the files when we need to do some investigation. For monitoring purposes, we can write scripts to flag and e-mail particular output-file contents to the administrators. This sort of scheme is useful if you'd rather use custom reporting instead of the e-mail functionality of cfexecd.

The first step is to get the outputs directory contents from all hosts aggregated to a single host. This presents something of a challenge, because cfengine uses a pull model. You don't want to keep an explicit list of all your systems and have one system try to pull the outputs directory contents from each. You'd rather have each system be responsible for pushing its outputs directory contents to a central location. We can take advantage of the rsync daemon that we placed on our cfengine master to accomplish this.

This approach brings with it some important security considerations:

  • We need to grant access via a mechanism that won't allow a malicious user to access or destroy parts of the filesystem on the cfengine master.
  • We need to prevent hosts from overwriting one another's files, whether accidentally or maliciously.

We can chroot the incoming copy with a feature of rsync, but that won't work in our case because we run our rsync daemon as a non-root user. Furthermore, we don't want to start running it as root because software bugs such as buffer overflows would result in remote attackers' ability to execute code as root on our system. We'd rather run the daemon as a non-root user and protect ourselves another way. Rsync allows a pre-exec script to be run before the copy is initiated, and we'll use that functionality to do some basic security checks.


Caution You'll see the code-continuation character (➥) in some of this chapter's code sections. This character signifies that the line in which it appears is actually a single line, but could not be represented as such because of print-publishing restrictions. It is important that you incorporate the line in question as a single line in your environment. You can download all the code for this book from the Downloads section of the Apress web site at http://www.apress.com.


First, we add this section to the goldmaster rsyncd.conf, located in our masterfiles repository at PROD/repl/root/etc/rsync/rsyncd.conf-www:

[outputs-upload]
        comment = cfexecd outputs dir uploads
        path = /var/log/cfoutputs
        use chroot = no
        max connections = 400
        lock file = /var/tmp/rsyncd5.lock
        read only = no
        write only = yes
        list = no
        hosts allow = *.campin.net
        hosts deny = 0.0.0.0/0
        timeout = 3600
        refuse options = delete delete-excluded delete-after ignore-errors
                                    max-delete partial force
        dont compress = *
        incoming chmod = Do-rwx,Fo-rwx
        pre-xfer exec = /usr/local/bin/rsync-outputs-dir-pre-exec

We define the pre-xfer exec script location, allow incoming copies via the write only setting, and set restrictive permissions on the copied files via the incoming chmod setting (the D and F prefixes apply the mode change to directories and files, respectively, so Do-rwx,Fo-rwx strips all access for other users).


Note The refuse options option allows you to specify a space-separated list of rsync command-line options that will be refused by your rsync daemon. We utilize it to keep clients from deleting files or from leaving partially copied files.
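For example, if a client tries to use one of the refused options, the daemon rejects the transfer outright and reports an error to the client before any copying starts. A hypothetical client invocation that would be refused (assuming the default cfengine workdir of /var/cfengine):

rsync -a --delete /var/cfengine/outputs/ goldmaster::outputs-upload/somehost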


On goldmaster we create the directory PROD/repl/root/usr/local/bin, and put a script into it named rsync-outputs-dir-pre-exec, with these contents:

#!/bin/sh
PATH=/bin:/usr/bin

if [ -z "$RSYNC_REQUEST" -o -z "$RSYNC_HOST_NAME" ]
then
        echo "We need to run in a rsyncd pre-exec environment."
        exit 1
fi

HOST_NAME=`basename "$RSYNC_HOST_NAME" .campin.net`
case $RSYNC_REQUEST in
outputs-upload/${HOST_NAME}|outputs-upload/${HOST_NAME}.campin.net)
        # We need to rsync to a path with our hostname in it only, looks good
        :
        ;;
*)
        echo "You can only upload to a path based on your own hostname."
        exit 1
        ;;
esac

This script performs a simple check (using a case statement) to make sure the path that the client is copying to matches what we expect. We allow a client to copy only to a directory matching either its short or fully qualified name. Note that this scheme relies on the security and integrity of the DNS. You could easily modify this technique to use IP addresses instead. Feel free to implement it that way at your site, especially if you don't control your DNS servers.
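Here's a minimal sketch of the IP-based variant. It relies on the RSYNC_HOST_ADDR variable that the rsync daemon sets in the pre-exec environment; the module path convention (directories named after client IP addresses) is an assumption carried over from the hostname-based script:

#!/bin/sh
PATH=/bin:/usr/bin

# RSYNC_HOST_ADDR holds the client's IP address in the pre-exec environment
if [ -z "$RSYNC_REQUEST" -o -z "$RSYNC_HOST_ADDR" ]
then
        echo "We need to run in a rsyncd pre-exec environment."
        exit 1
fi

case $RSYNC_REQUEST in
outputs-upload/${RSYNC_HOST_ADDR})
        # the client may copy only to a path matching its own IP address
        :
        ;;
*)
        echo "You can only upload to a path based on your own IP address."
        exit 1
        ;;
esac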

Next, we'll enhance our current rsync server task at PROD/inputs/tasks/app/rsync/cf.enable_rsync_daemon so it looks like this:

classes:  # synonym groups:
        have_usr_bin_rsync      = ( "/usr/bin/test -f /usr/bin/rsync " )
        have_etc_inetd_conf     = ( "/usr/bin/test -f /etc/inetd.conf " )

control:
        any::
                AllowRedefinitionOf    = ( rsyncd_conf )
                AddInstallable               = ( hup_inetd  )

        web_master::
                rsyncd_conf     = ( "rsyncd.conf-www" )

copy:
        web_master::
                $(master_etc)/rsync/rsyncd.conf-www
                                dest=/etc/rsyncd.conf
                                mode=444
                                owner=root
                                group=root
                                type=checksum
                                server=$(fileserver)
                                encrypt=true
                $(master)/repl/root/usr/local/bin/rsync-outputs-dir-pre-exec
                                dest=/usr/local/bin/rsync-outputs-dir-pre-exec
                                mode=555
                                owner=root
                                group=root
                                type=checksum
                                server=$(fileserver)
                                encrypt=true
editfiles:
        web_master.have_etc_inetd_conf::
                { /etc/inetd.conf
                        AppendIfNoSuchLine      "rsync   stream  tcp     nowait  ➥
daemon   /usr/bin/rsync rsyncd --daemon --config=/etc/rsyncd.conf"
                        DefineClasses           "hup_inetd"
                }

processes:
        web_master.hup_inetd::
                "inetd" signal=hup inform=true

directories:
        web_master::
               /var/log/cfoutputs   mode=750 owner=daemon group=root inform=false
               /usr/local/bin          mode=755 owner=root group=root inform=false

tidy:
        web_master::
                /var/log/cfoutputs pattern=* age=60 rmdirs=false

In this task, we distribute the pre-exec script as well as create the directory where we'll upload the files from clients. We also include a tidy action to remove files older than 60 days from this new directory. The directory will grow without bounds if we don't do this, and a filled disk will surely come back to bite us later.

We don't currently have the tidy action defined in our actionsequence. Let's add it to PROD/inputs/control/cf.control_cfagent_conf, so that it has this for the actionsequence:

actionsequence = (
        directories
        disable
        packages
        copy
        editfiles
        links
        files
        processes
        shellcommands
        tidy
)

To upload the $workdir/outputs directory to the central host, create a task at PROD/inputs/tasks/app/cfengine/cf.upload_cfoutputs_dir with these contents:

control:
        solaris|solarisx86::
                rsync_path      = ( "/opt/csw/bin/rsync" )
        linux::
                rsync_path      = ( "/usr/bin/rsync" )

shellcommands:
        any::
                # the web_master variable holds the name of the host
                # where we run rsync as a daemon
                "$(rsync_path) -a --exclude 'previous' $(workdir)/outputs/ image
$(web_master)::outputs-upload/'hostname'"
                        timeout=600 inform=false umask=022

Add this task to the cf.any hostgroup, so all hosts upload their outputs directory on every run.
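Assuming the cf.any hostgroup file follows the same import format as the other hostgroup files in this chapter, the line to add looks like this:

tasks/app/cfengine/cf.upload_cfoutputs_dir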

We do have a small problem, though. We're missing the rsync program in our base Solaris installation process. We'll want to install it from the Blastwave repository as part of the JumpStart process, as we do for the rest of our open source software additions. Modify the JumpStart postinstall script so that rsync is installed by pkg-get, by changing this line on hemingway (the JumpStart host) in the script /jumpstart/profiles/aurora/finish_install.sh:

pkg-get install wget gnupg textutils openssl_rt openssl_utils ➥
berkeleydb4 daemontools_core daemontools sudo cfengine subversion

We simply want to append rsync to the list:

pkg-get install wget gnupg textutils openssl_rt openssl_utils berkeleydb4 ➥
daemontools_core daemontools sudo cfengine subversion rsync

Now that the central host goldmaster is being populated with the outputs directory contents from all clients, you're free to use it as a source for reports. At this early stage in our environment's history, we still like getting the direct e-mail from cfexecd for all runs on each host. Once our site grows beyond a few hundred systems running cfengine, we'll probably find it difficult to keep up with and make sense of the e-mails as a whole.

A simple hourly or daily script to summarize and e-mail the aggregated outputs directory contents would make more sense at that point. Create a simple script for this purpose at PROD/repl/admin-scripts/cfoutputs-report with these contents:

#!/bin/sh
###############################################################################
# This script only works with GNU find, so make sure it's a Linux host.
###############################################################################
PATH=/sbin:/usr/sbin:/bin:/usr/bin:/opt/admin-scripts

case `hostname` in
goldmaster*)
     echo "This is the proper host on which to run cfoutputs reports, continuing..."
     ;;
*)
     echo "This is NOT the proper host on which run cfoutputs reports, exiting now."
     exit 1
     ;;
esac

THRESHOLD=60
[email protected]

cd /var/log/cfoutputs &&
for dir in *
do
        find $dir -mmin -$THRESHOLD -type f | xargs cat
done | mail -s"cfoutputs report for last $THRESHOLD minutes" $RECIPIENTS

This shell script simply looks for any files modified in the centralized cfoutputs directory during the last 60 minutes (it assumes GNU find is in the path; GNU find is the find command included with Debian GNU/Linux). It concatenates the contents of those files with cat and pipes the result to the mail command.


Note The pipe to the mail command is outside the for loop, so we don't get a separate e-mail for each directory under /var/log/cfoutputs. If you're not sure you understand why this is necessary, try moving the pipe to the mail command to the same line as the find command. Experimenting with shell scripts is one of the best ways to increase your shell-scripting knowledge.


The contents of PROD/repl/admin-scripts/ are already synchronized to all hosts at the location /opt/admin-scripts, so we don't need to make any changes for the script to be copied to the goldmaster host. Because this script will be on every host at our site, we make sure that it attempts to run only when invoked on the correct host.

Create a simple task to run the script on an hourly basis, in a file called PROD/inputs/tasks/app/cfengine/cf.run_cfoutputs_report, with these contents:

shellcommands:
        # since the script needs GNU find, make sure this is a Linux host
        web_master.linux.Min00_05::
                "/opt/admin-scripts/cfoutputs-report"
                        timeout=300 inform=true

The cfoutputs directory is stored on the host serving the role of web_master—because that's where we run the rsync daemon. To have the task used, add this line to the file PROD/inputs/hostgroups/cf.web_master:

tasks/app/cfengine/cf.run_cfoutputs_report

The e-mail output of the script is very basic:

From:    root <[email protected]>
To:      [email protected]
Subject: cfoutputs report for last 60 minutes

cfengine:goldmaster: Executing script /etc/init.d/autofs reload...
(timeout=60,id=-1,gid=-1)
cfengine:goldmaster:t.d/autofs relo: Reloading automounter: checking for changes ...
cfengine:goldmaster:t.d/autofs relo: Reloading automounter map for: /home
cfengine:goldmaster:t.d/autofs relo: Reloading automounter map for: /mnt/pkg
cfengine:goldmaster: Finished script /etc/init.d/autofs reload
cfengine:aurora: Object /usr/dt/bin/dtsession had permission 4555, changed it to 555
cfengine:aurora: Removing setuid (root) flag from /etc/lp/alerts/printer...
cfengine:aurora: Object /etc/lp/alerts/printer had permission 4555,
changed it to 555

This is a good way to report on cfagent output from cfexecd. When you have new reporting needs, you can build on this short example script. Note that cfexecd offers a useful feature that our example script lacks: it doesn't send a new e-mail when the output of the current run matches that of the previous run. A sketch of how you might add similar de-duplication follows.
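Here's one minimal way to approach it, replacing the end of the report script so that it compares a checksum of the current report against the previous run before mailing. The file locations under /var/tmp are illustrative:

REPORT=/var/tmp/cfoutputs-report.txt
PREV=/var/tmp/cfoutputs-report.md5

cd /var/log/cfoutputs &&
for dir in *
do
        find $dir -mmin -$THRESHOLD -type f | xargs cat
done > $REPORT

# mail the report only when it differs from the previous run
NEW_SUM=`md5sum $REPORT | awk '{print $1}'`
OLD_SUM=`cat $PREV 2>/dev/null`
if [ "$NEW_SUM" != "$OLD_SUM" ]
then
        mail -s"cfoutputs report for last $THRESHOLD minutes" $RECIPIENTS < $REPORT
        echo $NEW_SUM > $PREV
fi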

When all systems are functioning properly, we should see syslog log messages like this on the cfengine master, regarding all clients:

goldmaster cfservd[2004]: Accepting connection from ::ffff:192.168.1.236

This lets us know that the host with the IP address 192.168.1.236 is connected to our cfengine master. This is something we expect and require, and if it stops happening, something is wrong. We won't use log reports to tell us if hosts stop contacting the cfengine master; instead, we'll use another method that leverages cfengine itself.

Add a file at the location PROD/inputs/control/cf.friendstatus with these contents:

control:
        #
        # This fragment, only run on the policy server, generates warnings
        # regarding cfengine clients that have connected before, but have not
        # connected within the past 24 hours (a sign that there is likely a
        # problem with the client in question). Records for clients that have
        # not connected for 14 days are purged from the database, so a down host
        # should generate warnings at that time.
        #
        # Clients unseen for 14 days purged from cf_lastseen.db
        # http://www.cfengine.org/docs/cfengine-Reference.html#lastseenexpireafter
        #
        policyhost::
                LastSeenExpireAfter = ( 14 )

alerts:
        policyhost::
                #
                # Warn about hosts that have not connected within the last X hours
                # http://www.cfengine.org/docs/cfengine-Reference.html#alerts
                #
                FriendStatus(4)
                ifelapsed=60

We define a class for the "policyhost" machine because we currently have only a variable by that name (used in copy actions). Add this line to PROD/inputs/classes/cf.main_classes:

policyhost              = ( goldmaster )

Then import the file into cfagent.conf by adding this line:

# alert on missing hosts
control/cf.friendstatus

Now if a host stops contacting the cfengine master for more than four hours, you'll see syslog messages and alerts in the cfexecd e-mails from the cfengine master host goldmaster.

Doing General syslog Log Analysis

Syslog daemons make it easy to centralize syslog messages. They universally have the ability to forward some or all log messages to other hosts on the network.

We'll use this functionality to send the syslog output from all of our hosts to a single system. Using the syslog-ng open source syslog daemon, we'll store all logs in a directory structure sorted by hostnames and the log-message date. Earlier in the chapter, we set syslog=on globally at our site, so we can use syslog log entries to keep track of the actions that cfengine takes. Between the outputs directory and the syslog messages, we have a complete history and output of the cfengine activity at our site.

Configuring the syslog Server

We'll set up a new role on our network, that of the syslog loghost. We'll use a DNS alias to refer to the host (instead of using its actual hostname), as well as a role-based class in cfengine to control which host collects all the logs. We'll call the role sysloghost, and add a new physical host for this role. All of our Debian hosts are already imaged with the syslog-ng package installed, but we'll need to install a small amount of additional software.

We need to make some additions to FAI on the host goldmaster to support this new installation class. Modify /srv/fai/config/class/60-more-host-classes so that it has these contents:

#! /bin/bash

case $HOSTNAME in
    etchlamp|lamphost|etchlamp*)
        echo "WEB" ;;
    loghost1)
        echo "LOGHOST" ;;
esac

Then set up disk partitioning so that it resembles the setup for the WEB class, by copying /srv/fai/config/disk_config/WEB to /srv/fai/config/disk_config/LOGHOST. Next, set up packages for this class in the file /srv/fai/config/package_config/LOGHOST:

PACKAGES aptitude
sec
ccze
logtail
mailx

Add the new system into the DNS, using the IP address 192.168.1.234. The entry for PROD/repl/root/etc/bind/debian-ext/db.192.168 is:

234        IN      PTR  loghost1.campin.net.

Here's the entry for PROD/repl/root/etc/bind/debian-ext/db.campin.net:

loghost1      IN      A       192.168.1.234
loghost       IN      CNAME   loghost1
sysloghost    IN      CNAME   loghost1

Add an entry to /etc/dhcp3/dhcpd.conf on goldmaster like this (substitute your host's MAC address), and restart dhcpd:

host loghost1 {hardware ethernet 00:0c:29:09:c1:10;fixed-address loghost1;}

Now we need to set up the imaging job in FAI with this command:

# fai-chboot -Ifv loghost1

Boot the system using PXE, after which FAI will set up the host as our site's syslog server. In 10 or 15 minutes, you can log into the host loghost1 if you'd like. No need to, though, as we'll do everything from the cfengine master, as usual. You might be noticing that pattern by now.

We'll create a syslog-ng configuration file for the syslog server role first. Place a file at PROD/repl/root/etc/syslog-ng/syslog-ng.conf-sysloghost by copying the Debian /etc/syslog-ng/syslog-ng.conf file to that location. We first need to make sure that this file has the desired udp and tcp listen lines enabled. Here's our source s_all section; the udp() and tcp() lines at the end are our additions:

source s_all {
        # message generated by Syslog-NG
        internal();
        # standard Linux log source (this is the default place for the syslog()
        # function to send logs to)
        unix-stream("/dev/log");
        # messages from the kernel
        file("/proc/kmsg" log_prefix("kernel: "));
        udp();
        tcp(port(51400) keep-alive(yes) max-connections(30));
};

Now that we have syslog-ng on the syslog host configured to listen for syslog connections on the network, we need to add these lines to the end of the file to store the logs in a sorted manner:

# set it up
destination std {
file("/var/log/HOSTS/$HOST/$YEAR/$MONTH/$DAY/${FACILITY}_${HOST}_${YEAR}_image
${MONTH}_${DAY}" owner(root) group(root) perm(0640) dir_perm(0750) image
create_dirs(yes));
};

log {
        source(s_all);
        destination(std);
};
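Thanks to create_dirs(yes), syslog-ng builds this directory tree on demand. For example, a daemon-facility message from a client named aurora received on October 14, 2008 would land in a file like this (hostname and date are illustrative):

/var/log/HOSTS/aurora/2008/10/14/daemon_aurora_2008_10_14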

Now create a file at PROD/repl/root/etc/syslog-ng/syslog-ng.conf-debian for all our Debian hosts, again by copying /etc/syslog-ng/syslog-ng.conf to this new file name. Add these lines to the end of the file:

destination loghost {
        tcp("sysloghost.campin.net" port(51400));
};

# send everything to loghost, too
log {
        source(s_all);
        destination(loghost);
};
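Once both ends are in place, you can verify forwarding from any Debian client with the standard logger utility (the message text is arbitrary):

logger -p user.notice "syslog-ng forwarding test"

The message should appear both in the client's local logs and under /var/log/HOSTS/<client hostname>/ on loghost1.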

Now we'll create a task for syslog configuration across all systems at our site. Create a file on the cfengine master at the location PROD/inputs/tasks/os/cf.configure_syslog with these contents:

control:
        any::
                addinstallable          = (     hup_syslog_ng
                                                restart_syslog_ng
                                                restartsyslogd
                                                )
                AllowRedefinitionOf     = ( syslog_ng_conf_file )

        debian::
                syslog_ng_conf_file     = ( syslog-ng.conf-debian )

        debian.sysloghost::
                syslog_ng_conf_file     = ( syslog-ng.conf-sysloghost   )
copy:
        debian::
                $(master_etc)/syslog-ng/$(syslog_ng_conf_file)
                                dest=/etc/syslog-ng/syslog-ng.conf
                                mode=444
                                owner=root
                                group=root
                                type=checksum
                                server=$(fileserver)
                                encrypt=true
                                define=hup_syslog_ng

editfiles:
        redhat::
                { /etc/syslog.conf
                        AppendIfNoSuchLine '*.*         @sysloghost.campin.net'
                        DefineClasses "restartsyslogd"
                }

        solaris|solarisx86::
                { /etc/syslog.conf

                     #AutoCreate
                     AppendIfNoSuchLine '*.info              @sysloghost.campin.net'
                     DefineClasses "restartsyslogd"
                }

processes:
        hup_syslog_ng::
                # when syslog-ng.conf is updated, HUP syslog-ng
                "syslog-ng" signal=hup inform=false

        debian::
                "/sbin/syslog-ng" restart "/etc/init.d/syslog-ng start"
                        inform=false umask=022

        restartsyslogd::
                "syslogd" signal=hup inform=true
shellcommands:
        restart_syslog_ng::
                "/etc/init.d/syslog-ng stop; sleep 1 ; /etc/init.d/syslog-ng start"
                        timeout=10

disable:
        debian::
                # this breaks logrotate, remove them
                /etc/cron.daily/exim4-base
                /etc/logrotate.d/exim4-base

In this task, we copy a complete syslog-ng.conf file to Debian hosts, and we edit the syslog.conf file on Solaris and Red Hat hosts so that their stock syslog daemons forward messages to the central loghost. Be sure to insert Tab characters into the syslog.conf file when editing it; putting Tab characters directly into the editfiles entry will do the right thing. We also remove some logrotate configuration files that get left behind when the Debian postfix package replaces Exim, the default Debian mail-transfer agent. When those logrotate configuration files are in place but the user accounts for Exim are missing (as they are at our site), the logrotate program fails to run. Removing the files works around this problem.

Outputting Summary Log Reports

We want to keep a general eye on the syslog messages at our site. Programs like logcheck compile a summary of the message traffic: they ignore particular messages and display the rest. This means that over time, we'll have to add messages to an ignore file if we want to stop seeing them in the reports.

The useful nature of such reports becomes apparent when you see new sorts of messages that indicate problems or issues. Once you find these issues, you can program them into your real-time alerting tool (such as Simple Event Correlator, or SEC, covered in the next section) to see them immediately if they recur. The problem is that you have to learn about the errors first, and that's where logcheck comes in.

We'll utilize newlogcheck, a modification of the logcheck tool that reports from a central loghost instead of from stand-alone hosts. A second feature of newlogcheck is that it summarizes entries to report how many times each one happened, as opposed to the logcheck default, which shows each and every message in its entirety.

The first step is to download logcheck from http://sourceforge.net/project/showfiles.php?group_id=80573&abmode=1. Untar it, and move the systems/linux directory to PROD/repl/root/usr/pkg/logcheck on your cfengine master. Then download newlogcheck from http://www.campin.net/download/newlogcheck.tgz and place the newlogcheck.sh and sort_logs.pl scripts into your logcheck directory. They'll be distributed by cfengine and ready to run from /usr/pkg/logcheck.
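The staging steps look roughly like this; the tarball names and the masterfiles checkout location are illustrative, so substitute your own paths:

cd /tmp
tar xzf logcheck-*.tar.gz
mv logcheck-*/systems/linux /path/to/masterfiles/PROD/repl/root/usr/pkg/logcheck
tar xzf newlogcheck.tgz
cp newlogcheck.sh sort_logs.pl /path/to/masterfiles/PROD/repl/root/usr/pkg/logcheck/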

Create a task at PROD/inputs/tasks/app/logcheck/cf.logcheck with these contents:

copy:
        sysloghost::
                $(master)/repl/root/usr/pkg/logcheck
                        dest=/usr/pkg/logcheck/
                                        r=inf
                                        mode=750
                                        type=checksum
                                        ignore=/usr/pkg/logcheck/tmp
                                        ignore=tmp
                                        ignore=check
                                        ignore=host
                                        purge=true
                                        server=$(fileserver)
                                        encrypt=true

directories:
        sysloghost::
                /usr/pkg/logcheck/ mode=750 owner=root group=root inform=false

shellcommands:
        #
        # Running at 0600 lets us catch the logs before they're rotated.  We
        # then run one more time during the workday to see what's been going
        # on.
        #
        sysloghost.(Hr06|Hr16).Min00_05.weekday::
                "/usr/pkg/logcheck/newlogcheck.sh "
                        timeout=7200 inform=true

        #
        # we rotate logs daily shortly after 6am, so we need a single run on
        # weekend mornings too.
        #
        sysloghost.Hr06.Min00_05.weekend::
                "/usr/pkg/logcheck/newlogcheck.sh "
                        timeout=7200 inform=true

Add this line to PROD/inputs/classes/cf.main_classes:

sysloghost              = (     loghost1 )

Create the hostgroup file for the syslog host role with this file at the location PROD/inputs/hostgroups/cf.sysloghost:

import:
        any::
                tasks/app/logcheck/cf.logcheck

Use this entry to add the new hostgroup into the hostgroup-mappings file PROD/inputs/hostgroups/cf.hostgroup_mappings:

sysloghost::                    hostgroups/cf.sysloghost

This will get logcheck copied to loghost1, and it will start running reports at the times listed in the task. Run it manually once to make sure it works properly and to get a good start on adding entries to the ignore files.

Once you get your first report, note the log entries that you don't want to see in the reports any longer, and add the corresponding ignore patterns to the PROD/repl/root/usr/pkg/logcheck/*ignore files. The egrep command provides the ignore functionality, so the patterns in the files are extended grep patterns.
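For example, to suppress the automounter reload messages shown in the sample report earlier, you might append a pattern like this to the appropriate ignore file (an illustrative pattern, not part of the logcheck distribution):

cfengine:.*Reloading automounter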

Doing Real-Time Log Reporting

We will utilize our centralized syslog loghost to alert the SA staff when particularly important or notable syslog messages appear. These might be alert messages sent to a pager about DNS server problems or full disks, or they might simply be informational messages sent to the administrator's inbox about how many SSH logins occurred that day.

The Simple Event Correlator (SEC) can do all of that and more. It is an open source program intended to perform event-correlation duties with network-management systems such as HP OpenView or Micromuse (now IBM) OMNIbus. SEC allows you to use messages to set a particular state, and subsequently other states, depending on further events or the passage of time. At any one of these state changes, SEC can send e-mail or execute other commands, or simply drop the current state. (Visit SEC's home page at http://www.estpak.ee/~risto/sec/.)

SEC will easily handle our simple log-alerting needs, such as e-mailing when it sees a particular event. It will also handle more advanced reporting such as tracking the process identifier (PID) of a process servicing a particular user's FTP login, recording all log messages from that process, and finally e-mailing all those events when the user logs out.

We will call SEC directly from syslog-ng on the loghost, by piping all messages directly into it. Note that the SEC program itself is located in /usr/bin because we installed the Debian SEC package at installation time (using FAI), and that's where Debian places it. First, we'll need a configuration file. Let's place a file at PROD/repl/root/usr/pkg/sec/etc/sec.conf with these contents:

type=SingleWithSuppress
ptype=RegExp
pattern=([-\w]+)\s+SCSI transport failed: reason 'tran_err': giving up
desc=SCSI giving up error on $1
action=shellcmd echo $0 | /usr/bin/mail -s"syslog alert: SCSI errors" ➥
[email protected]
window=14400
thresh=1

type=single
continue = dontcont
ptype=regexp
pattern=([-\w]+)\s+named: .*CNAME and other data
desc=BIND CNAME error on $1
action=shellcmd echo $0 | /usr/bin/mail -s"syslog alert: BIND CNAME errors" ➥
[email protected]

type=single
continue = dontcont
ptype=regexp
pattern=([-\w]+)\s+named.*rejected due to errors
desc=$0
action=shellcmd echo $0 | /usr/bin/mail -s"syslog alert: BIND errors" ➥
[email protected]

type=SingleWithSuppress
ptype=RegExp
pattern=([-\w]+)\s+genunix: .ID \d+ kern.warning. WARNING: Sorry, ➥
no swap space to grow stack for pid
desc=Swap space problem on $1
action=shellcmd echo $0 | /usr/bin/mail -s"syslog alert: memory errors" ➥
[email protected]
window=7200

type=SingleWithSuppress
ptype=RegExp
pattern=([-\w]+)\s+ufs: .ID \d+ kern.notice. NOTICE: alloc: (\S+): file system full
desc=full filesystem $2 on host $1
action=shellcmd echo $0 | /usr/bin/mail -s"syslog alert: disk full errors" ➥
[email protected]
window=7200

type=SingleWithSuppress
ptype=RegExp
pattern=([-\w]+)\s+postfix.\w+.*: No space left on device
desc=full filesystem reported by postfix on $1
action=shellcmd echo $0 | /usr/bin/mail -s"syslog alert: disk full errors" ➥
[email protected]
window=14400

type=SingleWithSuppress
ptype=RegExp
pattern=([-\w]+)\s+cfengine:.* Clocks differ too much to do copy by date
desc=time problem found by cfengine on $1
action=shellcmd echo $0 | /usr/bin/mail -s"syslog alert: clock sync errors" ➥
[email protected]
window=3600

The type=single rules execute an action based on a match—in this case, piping the message to the mail command—and the rule simply ends there. The type=SingleWithSuppress rules are more interesting, in that they're basically a throttling mechanism. SEC uses a unique key to identify the message that it's throttling; the key is composed of the rule file name, the rule ID, and the event-description string derived from the desc parameter of the rule definition. This means that if you want to throttle a message across different hosts, or across multiple messages that might differ in fields other than the hostname, you'll need to manipulate the desc field in order to normalize the key. Read the SEC man page for more information.
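For example, to throttle the SCSI alert site-wide rather than per host, you could drop the hostname from the desc string. This hypothetical variation on the first rule suppresses duplicate alerts within the window no matter which hosts report the error:

type=SingleWithSuppress
ptype=RegExp
pattern=([-\w]+)\s+SCSI transport failed: reason 'tran_err': giving up
desc=SCSI giving up error on some host
action=shellcmd echo $0 | /usr/bin/mail -s"syslog alert: SCSI errors" ➥
[email protected]
window=14400
thresh=1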

Place a task at PROD/inputs/tasks/app/sec/cf.sync_sec_config with these contents to copy the SEC configuration directory:

copy:
        any::
                $(master)/repl/root/usr/pkg/sec
                        dest=/usr/pkg/sec
                                        r=inf
                                        mode=755
                                        type=checksum
                                        purge=true
                                        server=$(fileserver)
                                        encrypt=true
                                        define=restart_syslog_ng

directories:
        debian::
                /usr/pkg/sec mode=755 owner=root group=root inform=false

Add this task to the PROD/inputs/hostgroups/cf.sysloghost file with this entry:

tasks/app/sec/cf.sync_sec_config

Make the syslog host utilize SEC by adding these lines to PROD/repl/root/etc/syslog-ng/syslog-ng.conf-sysloghost:

destination d_sec {

        program("/usr/bin/sec -input="-" -conf=/usr/pkg/sec/etc/sec.conf image
-intevents -log=/var/log/sec.log");
};

# send all logs to sec
log {
        source(s_all);
        destination(d_sec);
};

Now you'll get e-mail alerts when any of those "bad" messages come through. Read the SEC man page to learn more about what it can do, and examine the many example rules that are distributed with the package from the web site.
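To confirm the rules fire end to end, you can inject a matching message from any client with the logger utility; this test line (the message text is illustrative) should trip the BIND "rejected due to errors" rule and generate an alert e-mail:

logger -t named "zone test.campin.net/IN: rejected due to errors"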

Seeing the Light

At the start of this chapter, we had an infrastructure with many applications running in it, such as mail, cfengine, Apache, DHCP, DNS, and more, but we had no visibility into any log messages sent from any of those applications.

Now our infrastructure is in good shape with regard to reporting:

  • We have regular e-mails summarizing syslog messages from all our systems.
  • We have the capability to alert based on particular messages.
  • We have alerts from cfengine when clients stop contacting the cfengine master.
  • We have full cfagent output (as collected by cfexecd) automatically pushed to our cfengine master system for troubleshooting and custom reporting.

One very important area where we're still blind: the availability of the network services at our site. We have no automated method of finding out if our web site goes down, if SSH stops working, or if the disks fill up on any of our hosts. We'll address network and host monitoring in the next chapter.
