Chapter 10. Cluster Management

Clusters need care and feeding, just like humans. Of course, you won’t find most clusters chowing down at Burger King, but they need attention nonetheless: regular maintenance and a watchful eye to keep them running at optimal efficiency.

All clusters need some sort of monitoring software to keep tabs on what’s going on. It’s better to find out what might cause problems down the road than to be called by a user wanting to know why they can’t access certain resources. It’s also wise to have safeguards in place that notify you when processes or servers stop functioning. With the right tools and configurations, you can know that something has gone wrong before your supervisor does, so you’re already working on the problem when the call comes. Surprises aren’t a good thing in this game.

Larger clusters need a team of dedicated administrators watching them at all times to ensure that the infrastructure keeps running, and they need consistent monitoring software to maintain hundreds of servers at once. Have you ever tried to log on to each of your servers and diagnose every one by hand? It’s just not feasible, and that’s why people build tools that monitor the health of many servers at one time.

Learn to Use the Right Tools

Believe it or not, one of the best tools possible for maintaining the health of your servers is plain old pen and paper. Even though most of us like documenting procedures as much as we enjoy stabbing ourselves with a spork, it’s a great way to keep up with changes to our environment and servers. Keeping a binder with up-to-the-minute data on equipment is a great way to forecast possible events and to keep a record of things. It’s also a handy, one-stop shop that can disseminate information about the history, applications, and maintenance contracts associated with each server.

Although at first glance the idea of a paperless office is appealing, you might want to stay away from publishing your documentation on the web. On one hand, keeping documentation online makes it easily accessible to anyone who needs it. On the other hand, it also makes it accessible to potentially undesirable individuals: if security isn’t up to snuff, anyone who can reach the web page can extract information about each node and possibly exploit its weaknesses. Also, if your documentation server goes down, you lose all hope of pulling up information about the other servers.

What’s good to include in documentation about your servers? Basically anything that can potentially help you to recover from a crash or potential problem. It’s also a great place to house information about serial numbers, maintenance contracts, and application information. Following is a list of things that you might want to keep track of in your binder:

  • Server type

  • Number of processors

  • Amount of RAM

  • Maximum amount of RAM

  • BIOS information

  • Hard drive/Partitioning information

  • Serial numbers

  • Change management information

  • Maintenance contracts; support numbers

  • Installed applications

  • Application owners

  • Service level agreements (SLAs)

  • List of phone numbers to call should anything happen, with escalation procedures

  • Instructions on how to bring the server and applications back up to production mode from a cold power off

  • Instructions on how to bring the server down to power off

  • Backup schedule and offsite rotation; offsite numbers for tape retrieval

  • Emergency restore procedures (Kickstart, SystemImager, CD, and so on)

It’s a good idea to grab the binder of information each time that there’s a problem with a server. Not only does this keep all the information about each server at your fingertips, but it also makes it convenient to enter any new information.

Change management can’t be stressed enough. Although it’s recommended for each node in the cluster, it’s essential for systems that are touched by numerous people. When several people work on one system or cluster, there’s the possibility of overlapping work or undocumented changes, which can leave an administrator not knowing what has actually been modified. It’s also a good idea to hold meetings about changes; it’s often difficult to know what ramifications a change can have on the rest of the cluster, much less on the environment as a whole.

Configuring syslogd for Your Cluster

One of the best, yet sometimes overlooked, tools for administering Linux servers is the simple syslogd facility. This daemon is potentially your best friend. syslogd is responsible for taking all available information about your machine and putting it into neat, perusable log files that you can manipulate however you like. syslogd is great for capturing and displaying information: it notifies you of login events (most importantly, failed login attempts), File Transfer Protocol (FTP) logins, daemon information, sendmail information, and so on.

One of the best features of syslogd is that it can write information about the events that it receives to a remote host. That means that you can get information about each node in your cluster sent to one main event server. This logging is essential for every type of cluster that you maintain, with the possible exception of distributed clusters. You might not want to get information on 300,000 users’ syslogd events.

syslogd is event driven. Each event carries a facility and a severity level, and syslogd routes the message depending on how you want it handled. You can send the information to the console, to particular log files, or even have certain events emailed to you through scripting.

syslogd’s configuration file is typically shipped as /etc/syslog.conf, and the daemon is usually started at boot from /etc/init.d/syslog or a similar link. syslogd is started with several options, commonly the -m or -r switch. Starting syslogd with the -m switch tells it to write a timestamp ("-- MARK --") entry at a set interval, whether or not any event took place, so you can tell that logging is still alive. Using syslogd with the -r switch allows it to accept logging information from remote hosts. The -d switch turns on debugging, which writes all debugging information to the current tty.
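
As a sketch, the startup invocation on a central log host might look like the following; whether the options live in /etc/init.d/syslog itself or in a distribution-specific options file (such as /etc/sysconfig/syslog on Red Hat-style systems) varies, so treat the location as an assumption:

# Accept messages from remote hosts (-r) and write a "-- MARK --" line every 20 minutes (-m 20) 
syslogd -r -m 20 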

The main format of the syslog.conf file is facility.level action, as in the following example:

mail.debug /var/log/mail 

This takes all information of level debug and higher (level) generated by mail (facility) and logs it to /var/log/mail (action). You can also use the asterisk wildcard with syslogd, as follows:

cron.* /var/log/cron 

This logs all levels of cron facility information to /var/log/cron.

You can enable remote logging by specifying @<hostname> as the action, which directs the logging information to a remote log host instead of a local file. It’s a good idea to place an entry for the log host in /etc/hosts. Be sure that syslogd on the log host is running with the -r switch.
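
For example, to forward everything of level info and higher from a node to a central log host, an entry like the following would do the job (the name loghost and its address are placeholders for whatever your log server is actually called):

# /etc/syslog.conf on each node 
*.info                                               @loghost 

# corresponding /etc/hosts entry, so the name resolves even if DNS is down 
192.168.1.5     loghost 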

You can have services from your /etc/inetd.conf logged through syslogd by adding the -l switch to the service in question. To log incoming FTP connections, you change the FTP entry to read as follows:

ftp    stream  tcp     nowait  root    /usr/sbin/tcpd  in.ftpd -l 

You can handle basically any service this way, as long as it’s listed in /etc/services.
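
Keep in mind that inetd rereads its configuration only when signaled; after editing /etc/inetd.conf, send the daemon a HUP signal (the exact command varies slightly by distribution, but something like the following works):

# killall -HUP inetd 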

Creating syslog.conf Entries

It is fairly simple to enable logging when creating these entries. The trick is in deciding what logs to where, and how often. For example, you don’t typically want your FTP server to send you messages about email problems. The goal is to keep the load as light as possible without sacrificing vital logs. You don’t want to spend half your day poring over log messages.

Following is a listing of various levels of importance for syslogd from the man page. You can assign your own levels of importance based on which criteria best suits your operation:

EMERG          system is unusable 
ALERT          action must be taken immediately 
CRIT           critical conditions 
ERR            error conditions 
WARNING        warning conditions 
NOTICE         normal, but significant, condition 
INFO           information messages 
DEBUG          debug-level messages 

Following is a listing of facility arguments for built-in subsystems:

AUTHPRIV               security/authorization messages 
CRON                   clock daemon (cron and at) 
DAEMON                 other system daemons 
KERN                   kernel messages 
LOCAL0 through LOCAL7  reserved for local use 
LPR                    line printer subsystem 
MAIL                   mail subsystem 
NEWS                   USENET news subsystem 
SYSLOG                 messages generated internally by syslog 
USER                   generic user-level messages 
UUCP                   UUCP subsystem 

Remember that you can use combinations of these arguments, including the asterisk wildcard. If you want to send all line printer (lpr) messages to /var/log/lpr, you can do that with the following code:

lpr.*  /var/log/lpr 

You can also notify the users root and news of all news error conditions by entering the following:

news.err  root,news 

You can even send messages to a tty by specifying /dev/ttyS3 for the action:

news.err  /dev/ttyS3 

You can specify multiple facility.level selectors by separating them with a semicolon (;), and you can use the ! expression to denote not. The following example logs all cron entries to /var/log/cronlog, except for informational messages about cron:

cron.*;cron.!=info  /var/log/cronlog 

Following is a sample configuration file that is shipped with Red Hat:

# Log all kernel messages to the console. 
# Logging much else clutters up the screen. 
kern.*               /dev/console 

# Log anything (except mail) of level info or higher. 
# Don't log private authentication messages! 
*.info;mail.none;authpriv.none;cron.none             /var/log/messages 

# The authpriv file has restricted access. 
authpriv.*                                           /var/log/secure 
# Log all the mail messages in one place. 
mail.*                                               /var/log/maillog 

# Log cron stuff 
cron.*                                               /var/log/cron 

# Everybody gets emergency messages 
*.emerg                                              * 

# Save news errors of level crit and higher in a special file. 
uucp,news.crit                                       /var/log/spooler 

# Save boot messages also to boot.log 
local7.*                                             /var/log/boot.log 

You can also use the logger command to test and place entries in your syslog files. It provides a shell interface to syslogd.

The basic format is logger -p <facility.level> -t <tag> <message>. The -p switch determines the priority to log at, the -t switch marks each entry with a user-defined tag, and the -f switch logs the contents of a specified file instead of a message given on the command line. You can manually use the syslogd facility by entering the following:

# logger -p user.notice -t PRINT_NOTICE "Printer queue successfully flushed." 

This logs the message PRINT_NOTICE: Printer queue successfully flushed. at the user.notice priority; where it ends up depends on your syslog.conf (with the sample Red Hat configuration shown earlier, it lands in /var/log/messages).

You can even try out logger’s capabilities simply by entering the following:

# logger This is a test, boyo. 

Check out your /var/log/messages after the test message. You see something similar to the following:

Nov 14 16:04:22 matrix root: This is a test, boyo. 

General-Purpose Reporting with mon

Although parsing the system messages logged by syslogd can reveal problems that are fast approaching, it takes a lot of time to go through messages on a daily basis. Because of time constraints, it’s helpful to be able to glance at icons to determine the status of your cluster, or of your entire environment.

mon takes service monitoring a few steps beyond syslogd reporting. It is designed to monitor application, network, and server availability, and to respond to events with alerts you define. mon is written completely in Perl 5, so you need the following modules installed before you install it:

  • Time::Period

  • Time::HiRes

  • Convert::BER

  • mon::*

You can grab these all from www.cpan.org. Actually, most of the scripts that mon uses are built around these modules. Bookmark www.cpan.org if you haven’t already.

mon comes packaged with several monitoring scripts that you can add to your configuration, but it doesn’t include the Perl modules that they depend on. You need to grab those from CPAN if you’re going to enable more services, depending on what you’re implementing.
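
Before building mon, it can save time to verify that the required modules are actually present. A quick check like the following works on any system with Perl 5; each command exits silently if the module is installed and prints a "Can't locate" error if it still needs to come from CPAN:

# perl -MTime::Period -e 1 
# perl -MTime::HiRes -e 1 
# perl -MConvert::BER -e 1 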

Configuring the mon Client

The mon client is distributed separately from the server. You can get it from ftp://ftp.kernel.org/pub/software/admin/mon. Uncompress the software; the client programs are in the clients subdirectory and are a series of Perl scripts that query the mon server. You also need to install the current version of the mon Perl module, which at the time of this writing is available as Mon-0.11.tar.gz from www.perl.com/CPAN.

To install the module, uncompress the file and enter the following lines:

# perl Makefile.PL 
# make 
# make test 
# make install 

Implementing the mon Server

Now that you’ve installed the required Perl modules, you need to grab the latest mon source from ftp://ftp.kernel.org/pub/software/admin/mon/. Uncompress the file, edit either the example.cf or the m4-based example.m4, and save it as either mon.cf or mon.m4. These files are in the <MonRoot>/etc directory. Make sure that it best reflects your environment.

The resulting mon.cf file can reflect any number of configurations.

Following is a sample mon.cf file:

# 
# Example "mon.cf" configuration for "mon". 
# 
# $Id: example.cf 1.1 Sat, 26 Aug 2000 15:22:34 -0400 trockij $ 
# 
# global options 
# 
cfbasedir   = /root/mon/etc 
alertdir    = /root/mon/alert.d 
mondir      = /root/mon/mon.d 
maxprocs    = 20 
histlength = 100 
randstart = 60s 
# 
# authentication types: 
#   getpwnam      standard Unix passwd, NOT for shadow passwords 
#   shadow        Unix shadow passwords (not implemented) 
#   userfile      "mon" user file 
# 
authtype = getpwnam 

# 
# NB:  hostgroup and watch entries are terminated with a blank line (or 
# end of file).  Don't forget the blank lines between them or you lose. 
# 

# 
# group definitions (hostnames or IP addresses) 
# 
hostgroup serversbd1 matrix neo dumass 

# 
# For the servers in building 1, monitor ping and telnet 
# BOFH is on weekend call :) 
# 
watch serversbd1 
    service ping 
        description ping servers in bd1 
        interval 5m 
        monitor fping.monitor 
        period wd {Mon-Fri} hr {7am-10pm} 
            alert mail.alert [email protected] 
            alert page.alert [email protected] 
            alertevery 1h 
        period NOALERTEVERY: wd {Mon-Fri} hr {7am-10pm} 
            alert mail.alert [email protected] 
            alert page.alert [email protected] 
        period wd {Sat-Sun} 
            alert mail.alert [email protected] 
            alert page.alert [email protected] 
    service telnet 
        description telnet to servers in bd1 
        interval 10m 
        monitor telnet.monitor 
        depend serversbd1:ping 
        period wd {Mon-Fri} hr {7am-10pm} 
            alertevery 1h 
            alertafter 2 30m 
            alert mail.alert [email protected] 
            alert page.alert [email protected] 

        monitor telnet.monitor 
        depend routers:ping serversbd2:ping 
        period wd {Mon-Fri} hr {7am-10pm} 
            alertevery 1h 
            alertafter 2 30m 
            alert mail.alert [email protected] 
            alert page.alert [email protected] 

You don’t need to place mon in any special directory to run, but make sure that the mon configuration file reflects where you put it.

Next, edit the auth.cf file. This file controls which users are allowed to perform which commands.

Add the following to /etc/services of your machines so that mon can run as a service:

mon             2583/tcp                        # MON 
mon             2583/udp                        # MON traps 

You also want to add a CNAME of monhost in your Domain Name System (DNS) server as an alias for your mon server. mon does a lookup of monhost through DNS, so make sure that this is set. If you’re not running DNS, place an entry in /etc/hosts. Also, set an environment variable pointing to this host, such as the following:

# export MONHOST="host.domain.com" 

Next, try starting mon from the distribution directory. You can start the mon.cf version with the following:

# ./mon -f -c mon.cf -b `pwd` 

You can start the mon.m4 version with the following:

# ./mon -f -M -c mon.m4 -b `pwd` 

To test whether mon is running correctly on your local machine, query it with the moncmd client: 

# ./clients/moncmd -s localhost list pids 

Big Brother Is Watching

Although mon is a great, lightweight reporting tool, it’s not designed to give the administrator or operators a quick, at-a-glance view of the environment. Thankfully, there are graphical monitoring tools that react to user-defined events, much as syslogd does, while also giving the administrator an overall picture of the environment. Tivoli and OpenView do a great job of monitoring and reacting to events, although they might be cost prohibitive for smaller organizations.

Enter Big Brother, which is free for non-commercial use under its “better-than-free” license scheme. This means that you get to try out the software if you’re a commercial organization, or simply use it if you’re not making any money with your large parallel cluster while trying to solve the mysteries of the universe.

Big Brother shows data about several user-defined processes on a web page, or on Wireless Markup Language (WML)-enabled Wireless Application Protocol (WAP) devices. Six different icons representing severity levels are displayed for any number of processes, which lets an administrator know at a glance how any number of nodes is behaving. Big Brother not only offers these status icons, but also turns the background red, yellow, or green, so that you know just how your environment is doing.

You can get the Big Brother server and clients for Linux, other UNIX variants, and Windows NT/2000, and clients for NetWare, Mac OS 9, VMS, and AS/400. Big Brother also supports extensible plugins, with several already available for download. There are plugins to monitor toner levels in laser printers, NetApp filers, Oracle tablespace usage, and even stock price fluctuations.

Installing and Configuring Big Brother

The first thing you must remember is that Big Brother needs a working web server because its display is a web page. Apache works well for this because it’s included in many Linux distributions by default. Big Brother also needs to place its scripts in your web server’s cgi-bin directory to work correctly.

You can get Big Brother at http://bb4.com/download.html. You can download the server and clients from there after you accept the license agreements.

To install it, create a separate user that Big Brother can run under. Call the user whatever you want, but be sure that this user owns the resulting directories and files. It’s recommended that you install the files initially as root and then chmod and chown them to the Big Brother user afterwards. After creating the bb account, log in to it, uncompress all the files into its home directory, and change to the superuser to install the files.
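
As a rough sketch, that sequence might look like the following; the account name bb and the tarball name are placeholders, so use whatever matches what you actually downloaded:

# useradd -m bb 
# su - bb 
$ tar xzf /tmp/bb-1.8c1-btf.tar.gz 
$ exit 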

Change to the bb<version>/install directory and run bbconfig <OS-NAME>. In this case, you probably choose Linux; however, the Big Brother installation is distribution-specific, so you can choose from Red Hat, Debian, Caldera, Mandrake, or generic Linux. For example, the following starts an install of Big Brother on Caldera Linux:

# ./bbconfig caldera 

This starts the install script, which asks you a series of questions. It’s helpful to prepare a checklist first, or to have a good memory of where files are installed on your system. The configure script displays a copy of the license. If you agree to the license, hit Enter to continue and install the program. The installation script goes through its installation routine, and suggests options as you go. Remember that it’s not a great idea to run Big Brother as root.

After running the configure script, change back into the src directory (cd ../src), type make, and then make install.

After that’s done, change into the etc directory (cd ../etc) and edit bb-hosts, which lists the hosts that Big Brother monitors. You also edit bbwarnsetup.cfg to tell Big Brother how to contact someone if something goes wrong. The install/README has more information on this; read the documentation and edit as necessary.
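
For orientation, bb-hosts uses an /etc/hosts-like layout, with the Big Brother directives for each host placed after a # on the same line. A minimal sketch might look like the following; the addresses, hostnames, and directives here are only placeholders, so check the documentation shipped with the distribution for the directives your site needs:

192.168.1.10    matrix    # BBDISPLAY BBNET BBPAGER 
192.168.1.11    neo       # 
192.168.1.12    dumass    # 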

Run ./bbchkcfg.sh and then ./bbchkhosts.sh. These check your configuration files for errors.

Make a symbolic link from your bb<version> directory to bb. This makes things easier in the future, and allows you to use the preconfigured scripts without modifying everything. For example, make a link that is similar to the following:

# ln -s /home/bb/bb18c1 /home/bb/bb 

Go into the bb directory by using the link that you just created and chown everything to the Big Brother user that you created:

# chown -R <bbuser> . 
# cd ..; chown -R <bbuser> bbvar 

Make a symbolic link from the full path to the bb/www directory to the bb directory in the web server’s document root directory. Make sure that this directory has the proper permissions. You might need to specify a different document root directory in your server configuration. For Apache, the default configuration file is httpd.conf:

# ln -s /home/bb/bb/www /usr/local/apache/htdocs/bb 

Change into your bb directory (/home/bb/bb) and type ./runbb.sh start.

Adding More Clients to Big Brother

Now that your server is installed, you also need to install the bb monitoring service on each of the clients that you wish to monitor. Because this is potentially a time-consuming operation, you might be better off writing a script to do this on each client and sharing both the script and the executables over the Network File System (NFS); a rough sketch of such a script follows the next paragraph. Either that, or you can make a golden client with SystemImager and install Big Brother from there (see Chapter 3, “Installing and Streamlining the Cluster”).

To install the client on each machine, copy the distribution to each client and uncompress it there. The installation is mostly the same as for the server; in the client’s bb-hosts file, however, you only point to the master server, and then run ./runbb.sh start. The Big Brother service runs as a daemon on each client and sends its status reports to the master server, so you only need a web server running on the master.
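
If you do script the rollout, a rough sketch along these lines could work; the node list, the NFS path /shared, the tarball name, and the bb account are all placeholders for whatever your site actually uses:

#!/bin/sh 
# Hypothetical rollout sketch: unpack the Big Brother client on each node, 
# drop in a prepared bb-hosts that points at the master, and start the daemon. 
NODES="node01 node02 node03" 
for node in $NODES 
do 
    ssh bb@$node 'tar xzf /shared/bb-client.tar.gz; cp /shared/bb-hosts bb*/etc/; cd bb* && ./runbb.sh start' 
done 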

Using Big Brother

Using the Big Brother monitor is self-explanatory. The default web screen shows either a green, yellow, or red background on a black web interface. Each server shows the status of its services on the main screen with any one of six different status markers. The status markers are as follows:

  • ok—Everything is fine. No problems noted here. If everything is green, you see a green and black background.

  • attention—These reflect warning messages in the message logs.

  • trouble—These reports reflect a loss of functionality in the desired service(s). A report of this nature turns the background red.

  • no report—This reflects a loss of connectivity with that service for over a half-hour. This turns the screen purple. This might also potentially be the result of a severely overloaded system. Further investigation is warranted.

  • unavailable—This results in a clear icon, and no background is changed.

  • offline—This appears when monitoring for this service is disabled.

You can click each status icon to get a more detailed report of the error. Big Brother also shows a history of the status when you click the History button. You can add more hosts and services by editing the bb-hosts file and restarting Big Brother.

Summary

Configuring clusters, nodes, and servers is only half the battle. As the environment grows, it becomes difficult to keep tabs on each system, and as the administrator slips from a proactive stance into firefighting, more tools are needed at everyone’s disposal to keep systems operational and in line.

Some of the best tools at the system administrator’s disposal are monitoring tools that keep a watchful eye on all of their servers and critical systems. There’s nothing like having processes watch other processes so that you don’t have to. Thankfully, there are enough monitoring tools out there that you don’t have to write your own. Between syslogd messages that you can configure to your heart’s content and free (including free for non-commercial use) monitoring software, an administrator can keep a tight rein on their clusters. You might also want to look at other freeware monitoring tools such as netsaint, netview, nocol, and mrtg. These monitors all work well enough that you won’t need to spend tons of money on commercially available software.
