I get knocked down, but I get up again, 'cause you're never gonna keep me down.
—From the song Tubthumping by Chumbawamba
Much of modern computer hardware is self-monitoring and self-correcting. It tests itself and reports real and impending errors so that preemptive maintenance can be performed, often in the form of "hot swap" components that can be replaced without interrupting system activity. What would a similar approach to system software look like? It would need a framework for identifying and classifying services and their dependencies, for monitoring and reporting their status, and for some form of autorecovery. UNIX has historically lacked such a framework, relying instead on ad hoc solutions to determine which services are running, which services are not running that should be and why, and which potential services are available.
Think about how you have typically configured service programs in early versions of UNIX and Linux. You created a shell script in one of the /etc/rc.d
directories, prefixed the script name with an S or a K to identify it as a start or kill script, and gave the script a number that determines when it is run. Thus, the /etc/rc.d/rc5.d/S80sendmail
script starts the sendmail
service when the system enters run level 5. It starts up after the sshd
service specified in /etc/rc.d/rc5.d/S55sshd
. So, does sendmail
depend on the sshd
service being ready before it starts? You can't tell from the scripts! An error in the name or location of your service script can prevent it from running; locating such errors has also been difficult. Administrators usually resort to searching the system log files and process tables using grep
; such a simplistic approach often results in incomplete information about the nature of the problem.
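The naming convention described above can be made concrete with a small sketch. The helper below, parse_rc_name (a hypothetical name, not a system tool), splits a script name into its action, sequence number, and service name. It assumes the common two-digit sequence number. Notice that the result expresses only an ordering, never a dependency.

```shell
#!/bin/sh
# parse_rc_name: hypothetical helper that decodes an rc script name such as
# S80sendmail. Assumes a two-digit sequence number after the S or K prefix.
parse_rc_name() {
    base=${1##*/}                 # strip the directory part, if any
    case $base in
        S[0-9][0-9]*) action=start ;;
        K[0-9][0-9]*) action=kill ;;
        *) echo "not an rc script name: $base" >&2; return 1 ;;
    esac
    rest=${base#?}                # drop the leading S or K
    name=${rest#??}               # everything after the two digits
    seq=${rest%"$name"}           # the two digits themselves
    echo "$action $seq $name"
}

parse_rc_name /etc/rc.d/rc5.d/S80sendmail   # prints: start 80 sendmail
parse_rc_name /etc/rc.d/rc5.d/S55sshd       # prints: start 55 sshd
```

The output tells you that sshd starts before sendmail, but not whether sendmail actually needs it.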
To address these issues, OpenSolaris includes the Service Management Facility (SMF), which defines a framework and administrative tools for configuring and monitoring system services.
You already know what a service is; it's a persistently running application, usually started at system boot time and generally not associated with an interactive user's login session. Programs such as the Apache web server, the MySQL database, NFS file servers, the sendmail
email daemon, firewalls, DNS servers, and the sshd
login daemon are all typical examples of services started when you boot your system. Services listen for and respond to requests for some action such as opening and sending a file with NFS, queuing and printing files, delivering and forwarding email, or responding to database queries.
SMF provides a framework to assign OpenSolaris services a standard state model, naming standards, dependency assignments, and restarter methods, all under control of a service daemon (svc.startd
), which is notified of service outages and recovers them according to your specifications. You can install OpenSolaris and use it to develop and run applications without using its special features such as containers and DTrace, but because SMF replaces the familiar /etc/rc*
files and methods for managing services, this is one new feature of OpenSolaris that you shouldn't ignore.
Note Your custom rc
service scripts and those installed by certain ISV application software will still work; they are executed within their assigned run level after SMF-managed services are started. You just won't be able to manage these services with SMF until you prepare and register a service manifest that calls your script.
To understand OpenSolaris services, you need to learn how to refer to them by their true names. Services are referenced using a Fault Management Resource Identifier (FMRI), which is a character string that looks a lot like a URL. For example, the sshd
service daemon's full FMRI on your local system is svc://localhost/network/ssh:default
.
FMRIs have the following components:
The scheme: svc for an SMF-managed service or lrc for a legacy rc script–managed service.
The location: currently localhost, but later versions of SMF will allow other locations for dependency purposes.
inetd
So, the FMRI for the sshd
service daemon, svc://localhost/network/ssh:default
, indicates that ssh
is an SMF-managed network service running with one default instance on the local system.
When you need to refer to a service using any of the SMF programs, you often don't need to give its full FMRI. As with OpenSolaris's path conventions for file names, you can use the FMRI's absolute or relative name depending on where you are or what program you are using. So, when referring to the FMRI of the sshd
service, you could use any one of the following:
svc://localhost/network/ssh:default
svc:/network/ssh:default
network/ssh:default
ssh
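To see how the two absolute forms relate, here is a minimal sketch that splits an FMRI into its parts with plain shell parameter expansion. The split_fmri function is a hypothetical helper, not an SMF tool, and it assumes that a missing location means localhost. Relative names such as ssh are not handled, because resolving them requires SMF's own service repository.

```shell
#!/bin/sh
# split_fmri: hypothetical helper that breaks an absolute FMRI into parts.
# Assumption: when the location is omitted (svc:/...), it is localhost.
split_fmri() {
    fmri=$1
    scheme=${fmri%%:*}            # text before the first colon, e.g. svc
    rest=${fmri#*:}               # everything after "scheme:"
    case $rest in
        //*)                      # long form: //location/category/name
            rest=${rest#//}
            location=${rest%%/*}
            rest=${rest#*/} ;;
        /*)                       # short form: /category/name
            location=localhost
            rest=${rest#/} ;;
        *)
            echo "not an absolute FMRI: $fmri" >&2
            return 1 ;;
    esac
    case $rest in
        *:*) instance=${rest##*:}; service=${rest%:*} ;;
        *)   instance=; service=$rest ;;
    esac
    printf 'scheme=%s location=%s service=%s instance=%s\n' \
        "$scheme" "$location" "$service" "$instance"
}

split_fmri svc://localhost/network/ssh:default
split_fmri svc:/network/ssh:default
```

Both calls print the same line, scheme=svc location=localhost service=network/ssh instance=default, which is why SMF treats the two spellings as the same service instance.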
Now that you've seen how to refer to services by their names, you can start using the SMF tools shown in Table 5-1 to monitor and manage services.
Program Name | Description |
svcs | Reports service status information, dependencies, instances, and error diagnostics |
svcadm | Administers individual service instances: enables, disables, and restarts them |
svccfg | Configures service parameters and data files |
svcprop | Reports service properties and privileges |
*svcs and svcprop are in /usr/bin, and svcadm and svccfg are in /usr/sbin; set your path appropriately.
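Because the four tools live in two directories, it can help to make sure both are on your search path. The following is one possible sketch; add_to_path is a hypothetical helper name, and you would normally place something like this in your shell startup file.

```shell
#!/bin/sh
# add_to_path: hypothetical helper that appends a directory to PATH
# only if it is not already listed there.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;              # already present, nothing to do
        *) PATH="$PATH:$1" ;;
    esac
}

add_to_path /usr/bin              # svcs and svcprop
add_to_path /usr/sbin             # svcadm and svccfg
export PATH
```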
Every service has a state that indicates its current functional activity. Services move from one state to another because of system events (such as run-level changes), error conditions, or administrator actions. A service might not be able to move to a desired state because of unfulfilled dependencies or other conditions. Table 5-2 shows the possible states for OpenSolaris services.
State | Description |
uninitialized | This is the starting state for all services before svc.startd moves the service to a new state. |
disabled | The service has been disabled by the administrator. |
offline | The service is enabled but not yet online, usually because it's waiting for a dependency to be satisfied. |
online | The service has been enabled and has successfully started; all its dependencies have been satisfied. |
degraded | The service is enabled and running but with a level of degraded performance that is specified in the service's configuration. |
maintenance | The service cannot be started by svc.startd because of an error or unsatisfied dependency and must be manually administered to clear the fault conditions. |
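For a quick overview of how many services are in each of these states, you can summarize the state column of the svcs output. The sketch below uses the svcs options -H (omit the header line) and -o state (print only the state column); count_states is a hypothetical helper name. The svcs call is guarded so that the script does nothing on a system without SMF.

```shell
#!/bin/sh
# count_states: hypothetical helper that reads one state name per line
# on standard input and prints "state count" pairs, sorted by state.
count_states() {
    sort | uniq -c | awk '{ printf "%s %d\n", $2, $1 }'
}

# Only query SMF when the svcs command is actually available.
if command -v svcs >/dev/null 2>&1; then
    svcs -aH -o state | count_states
fi
```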
You can now explore the services on your OpenSolaris system using the SMF tools and observe their states. We'll show some examples next to get you started. First, list the services on your system using the svcs
command; use the -a
flag to list all the registered services. Figure 5-1 shows typical output from this command (some output lines have been deleted to shorten the list for printing).
Notice the variety of service types and their states. The pppd
point-to-point network protocol daemon, for example, which is started by the /etc/init.d/pppd
script, is listed as an lrc
, or legacy rc
, service. Remember that this is all you can learn from SMF about such services—the fact that they are running and the time that they were started—because only svc
services are managed by SMF. Also note that some services are in the disabled state, while some are running, that is, in the online state.
You'll notice in the output listed in Figure 5-1 that there are several milestone services listed. Milestones group services for administrative and end user availability. These groups correspond with the traditional UNIX/Linux run levels shown in Table 5-3.
SVR4 Run Level | SMF Milestone |
- | none; no services are enabled, and only the kernel is running |
s, S | single-user; traditional single-user mode for administrative purposes |
2 | multi-user |
3 | multi-user-server |
5 | all |
If you need to put your system into single-user
mode, for example, you can still use the /usr/sbin/init s
command for this. SMF recognizes that run levels are groupings of services, so it provides specific FMRIs for each run level. Thus, the "SMF way" of going to single-user
mode is as follows:
# svcadm milestone single-user
and the command to return to run level 3 would be as follows:
# svcadm milestone multi-user-server
The following command will enable all services dependent on the multi-user-server
milestone:
# svcadm milestone all
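If you are used to thinking in run levels, the mapping in Table 5-3 can be captured in a small lookup. This is only a sketch: milestone_for_runlevel is a hypothetical helper, and it covers just the run levels that have a same-named milestone service, printing the full milestone FMRI for each.

```shell
#!/bin/sh
# milestone_for_runlevel: hypothetical helper that prints the milestone
# FMRI corresponding to a traditional run level (see Table 5-3).
milestone_for_runlevel() {
    case $1 in
        s|S) echo svc:/milestone/single-user:default ;;
        2)   echo svc:/milestone/multi-user:default ;;
        3)   echo svc:/milestone/multi-user-server:default ;;
        *)   echo "no milestone service for run level: $1" >&2
             return 1 ;;
    esac
}

milestone_for_runlevel 3    # prints: svc:/milestone/multi-user-server:default
```

You could then pass the result to svcadm, for example: svcadm milestone "$(milestone_for_runlevel 3)".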
Let's examine the ssh
service in more detail. The old ways of stopping this service would be to kill its process or to run its rc
script with the stop
parameter, something like this:
# /etc/rc.d/init.d/sshd stop
If you kill the OpenSolaris sshd
process, as shown in Figure 5-2, and then check to see whether it's been stopped, you see that it's still there but running with a new process ID! How did that happen? It was restarted by the SMF service daemon, svc.startd
.
So, how do you stop the sshd
service? You use the svcadm
command-line program, as shown in Figure 5-3, or use the System Administration Services menu and uncheck the SSH Server box, as shown in Figure 5-4.
Note that in Figure 5-3 we first used both the ps
command and the svcs ssh
command to show that the ssh
service was running. We then disabled the service with svcadm
and verified that it was indeed disabled and that its process was gone. Also note that we did not need to give the full FMRI for the service since there was only one local instance; recall that this is like using absolute or relative path names for files.
You use the svcadm
command for typical service administration tasks by using the flags shown in Table 5-4.
Action Flag | Description |
enable | Sets the service as enabled and starts it if all of its dependencies are satisfied |
disable | Sets the service as disabled; stops it and doesn't restart it |
restart | Stops and restarts the service, assuming its dependencies are satisfied |
refresh | Rereads the service's configuration and, if the service defines a refresh method, runs it |
clear | Removes the "maintenance" state after a repair; if the service was previously set as enabled, restarts it |
When you disable a service, it stays disabled even after a system reboot unless you indicate that the service is being disabled only for the current boot session. For example, the following command will disable the ssh
service, and it will not restart at the next reboot:
# svcadm disable ssh
If you intended to disable ssh
for only the current boot session, you would use the -t
(temporary) flag so that normally enabled services will start again at the next system reboot:
# svcadm disable -t ssh
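The difference between the two forms is easy to forget, so you might wrap it in a small function that makes the choice explicit. The following is a sketch only: disable_service and the DRY_RUN variable are hypothetical names. With DRY_RUN=1 the function prints the svcadm command it would run instead of running it.

```shell
#!/bin/sh
# disable_service: hypothetical wrapper around "svcadm disable".
#   $1  service name or FMRI
#   $2  "temporary" to disable only until the next reboot (default: permanent)
# Set DRY_RUN=1 to print the command instead of executing it.
disable_service() {
    svc=$1
    scope=${2:-permanent}
    if [ "$scope" = temporary ]; then
        set -- svcadm disable -t "$svc"
    else
        set -- svcadm disable "$svc"
    fi
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$@"
    else
        "$@"
    fi
}

DRY_RUN=1
disable_service ssh temporary    # prints: svcadm disable -t ssh
disable_service ssh              # prints: svcadm disable ssh
```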
The power of SMF is really revealed in your ability to define and manage interservice dependencies in the service's manifest file; if a service is not working, you need to know whether something is amiss with the service program itself or with some file or process that the service needs in order to function. SMF's svcs
program lets you display a service's dependency relationships along with critical state information. Figure 5-5 shows a series of example svcs
commands.
The first command, svcs ssh
, simply displays the current state and start time of the default instance of the ssh
service. More detail is shown in the "long" listing using svcs -l ssh
; this listing provides a wealth of information about the service. Table 5-5 briefly explains this output. Later in this chapter you'll see where all this configuration detail is defined.
Field | Description |
fmri | The registered FMRI of the service |
name | The name given to the service by the writer of the service definition |
enabled | Indicator of whether the service has its enabled state set (true/false) |
state | The current state of the service |
next_state | The state the service is transitioning to, or none if no transition is in progress |
state_time | The time the service entered its current state |
logfile | The location of the log file used by the service |
restarter | The service responsible for restarting this service; this can be the default system restarter or a custom one |
contract_id | The ID of the process contract that tracks the service's processes |
dependency | Listing of services and files needed to be online and available in order for the service to start |
It's worth reemphasizing the value of such service details. Each service can have its own log file and restarter process, making it much easier to diagnose service startup errors. Additionally, all of the services needed to support a service are easy to determine. It's almost always the case that services fail because some dependency is not met. Let's see how that works by creating an artificial missing dependency example.
You may already know that sshd
needs the /etc/ssh/sshd_config
file to configure itself before starting up. Suppose this file is missing. What can SMF tell you when you discover that sshd
is not running? Figure 5-6 shows this scenario. The administrator notices that ssh
is offline and tries unsuccessfully to enable it. The -x
flag of the svcs
program provides an explanation.
The svcs -x ssh
command reveals that the reason the service is offline is the missing configuration file. Additionally, it refers you to the man page for the service daemon and its log file, along with a URL that provides an online interpretation of the error condition (Figure 5-7). On other UNIX and Linux systems, depending on your system and logging configuration, information about the missing file might not even be logged by sshd
in /var/adm/messages
. SMF identifies the exact problem for you.
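The diagnostic steps just described, checking the state and then asking for an explanation, can be combined in a short sketch. The check_service function is a hypothetical helper; it relies only on svcs -H -o state to read the state and svcs -x to request the explanation. The example call is guarded so that nothing runs on a system without SMF.

```shell
#!/bin/sh
# check_service: hypothetical helper that prints a service's state and,
# if the state is anything other than online, asks SMF to explain why.
check_service() {
    state=$(svcs -H -o state "$1") || return 1
    echo "$1: $state"
    if [ "$state" != online ]; then
        svcs -x "$1"
    fi
}

# Only query SMF when the svcs command is actually available.
if command -v svcs >/dev/null 2>&1; then
    check_service ssh
fi
```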
The URL lists details about the error, its impact on the system, and suggestions for administrator action. In fact, any system error will generate and log a message ID that you can enter into the Solaris/OpenSolaris search tool at http://www.sun.com/msg/ to get an explanation of the error condition. Admittedly, some of the explanations and suggested actions at this site can be somewhat generic, but even that is far more helpful than silent service failures or indecipherable error codes.
Occasionally, simply fixing a dependency is not enough to restart a service; it will remain in an offline or maintenance state until all the error conditions are eliminated and all dependencies are met. After you have diagnosed the problems and taken appropriate administrative actions, you can clear the maintenance state and restart the service. Figure 5-8 shows you such a scenario.
Say the administrator notices that the keyserv
service for storing private encryption keys is not running and is in the maintenance state. Checking the man page, she discovers that the keyserv
daemon won't start if the system has no domain name, so she assigns one. She then attempts to restart the keyserv
service, but it stubbornly remains in a maintenance state. But she soon remembers that this state must be explicitly cleared using the svcadm clear keyserv
command, after which the service enters the online state.
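To find every service that is waiting for this kind of manual attention, you can filter the svcs listing for the maintenance state. This is a sketch under the assumption that the listing is produced with svcs -aH -o state,fmri (state first, FMRI second); in_maintenance is a hypothetical helper name.

```shell
#!/bin/sh
# in_maintenance: hypothetical filter that reads "state fmri" lines on
# standard input and prints the FMRI of each service in maintenance.
in_maintenance() {
    awk '$1 == "maintenance" { print $2 }'
}

# Only query SMF when the svcs command is actually available.
if command -v svcs >/dev/null 2>&1; then
    svcs -aH -o state,fmri | in_maintenance
fi
```

For each FMRI listed, fix the underlying problem first and then run svcadm clear on it; clearing the state without a repair will usually just send the service back into maintenance.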
Tip If a service has multiple dependencies that are not yet enabled, you can enable them all recursively at one time using the -r
flag of the svcadm
command: svcadm enable -r ssh
.
If you take another look at Figure 5-5, you'll note that the ssh
service has a contract_id
. A contract defines a relationship (dependency) between a service process and another resource managed by the kernel, such as processors, memory, devices, or other service processes. SMF uses contracts to organize notification events; if a device or service fails, the kernel will notify the owner of the contract for that resource. SMF services are contracted to the svc.startd
daemon so that if they fail or exit, then the appropriate restarter action can be taken. The default restarter will try to restart a service if any of the service's contract members fail.
You can examine and monitor service contract relationship activities using the ctstat
and ctwatch
commands; they provide a means to get detailed information on failing services.
You've seen that existing OpenSolaris services have their own FMRIs, service names, log files, restarters, and dependencies. Where are these characteristics defined? And, more importantly, how can you define your own services?
Each OpenSolaris service is configured using a manifest file that defines the service's name, start and stop methods, restart conditions, and dependencies. Manifests are XML files that reside in the /var/svc/manifest
directory tree; each service functional category has its own subdirectory for its manifest files. For example, the manifest file for the ssh
service that we have been examining is /var/svc/manifest/network/ssh.xml
.
Tip Before you decide to create your own service manifest, remember that you are part of the OpenSolaris developer community and that there are other users who may have already created one that you can use. You can find sample manifest files for many types of services at http://blastwave.org/smf/manifests.php and at http://opensolaris.org/os/community/smf/manifests.
Service manifests can be easily created by copying and modifying existing manifests or by using generic manifest templates such as the one at http://www.sun.com/bigadmin/content/selfheal/smf-hds/template.xml, shown in Figure 5-9 (note the "REPLACE_ME"
locations in this template; that's where you define the service name, timeout values, and other characteristics of your service).
These are the key manifest components you need to define:
The service name, such as application/oracle or network/nfsV4.
The method scripts, stored in the /lib/svc/method directory, which call the service programs. These scripts are very much like the familiar rc scripts.
Other manifest components include the number of service instances, service model, fault response, and reference documentation. Let's continue examining the ssh
manifest, /var/svc/manifest/network/ssh.xml
, to see how each of these components has been defined; because the file is rather long, we'll show only the relevant sections of the manifest and highlight the key components in bold.
The ssh
service name tag also includes a version number for change documentation purposes:
<service
name='network/ssh'
type='service'
version='1'>
The ssh
service is dependent on other services such as the local file system, network, and crypto services. It's also dependent on the presence of the sshd_config
file and is started within the multi-user-server
milestone; in turn, that milestone is defined to be dependent on the ssh
service and will not complete until the ssh
service is online.
<dependency name='fs-local'
grouping='require_all'
restart_on='none'
type='service'>
<service_fmri
value='svc:/system/filesystem/local' />
...
<dependency name='net-physical'
grouping='require_all'
restart_on='none'
type='service'>
<service_fmri value='svc:/network/physical' />
</dependency>
<dependency name='cryptosvc'
grouping='require_all'
restart_on='none'
type='service'>
<service_fmri value='svc:/system/cryptosvc' />
</dependency>
...
<dependency name='config_data'
grouping='require_all'
restart_on='restart'
type='path'>
<service_fmri
value='file://localhost/etc/ssh/sshd-/>
</dependency>
...
<dependent
name='ssh_multi-user-server'
grouping='optional_all'
restart_on='none'>
<service_fmri
value='svc:/milestone/multi-/>
</dependent>
The service's start
and refresh
methods reference shell scripts in the /lib/svc/method
directory that accept the parameters start
or refresh
as input and execute the sshd
daemon. The stop
method executes a kill
on the service's process, as you would expect. All of these actions, however, are performed under the control of the SMF daemon to provide and manage the service's states and transitions.
You can also specify online documentation references for the service; this assists administrators when error reports are logged:
<template>
<common_name>
<loctext xml:lang='C'>
SSH server
</loctext>
</common_name>
<documentation>
<manpage title='sshd' section='1M' manpath='/usr/share/man' />
</documentation>
</template>
After you have copied or created your service's manifest file, move it to the appropriate functional category directory. You can verify that your file is valid using the svccfg
command since it has a built-in XML validator. The following command will validate your file and register it with the SMF service daemon:
# svccfg import yourmanifest.xml
You will then be able to see that your service is available (using the svcs
command), and if you've specified your dependencies correctly, you can use the svcadm
command to enable your service and the svcs
command to examine its state.
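If you prefer to check a manifest before registering it, svccfg also offers a separate validate subcommand. The sketch below chains the two steps so that a manifest is imported only if it validates; import_manifest is a hypothetical helper name, and the manifest path is whatever file you pass on the command line.

```shell
#!/bin/sh
# import_manifest: hypothetical helper that validates a manifest file
# and imports it into SMF only if validation succeeds.
import_manifest() {
    manifest=$1
    if [ ! -r "$manifest" ]; then
        echo "cannot read manifest: $manifest" >&2
        return 1
    fi
    svccfg validate "$manifest" || return 1
    svccfg import "$manifest"
}

# When the script is given a file name, process that manifest.
if [ $# -gt 0 ]; then
    import_manifest "$1"
fi
```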
Service manifests can be complicated, but you can create some that are quite basic, such as this simple example for starting the MySQL database (after downloading and installing mysql
using Package Manager). Create a file, mysql.xml
, containing the following:
<?xml version="1.0"?>
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
<service_bundle type="manifest" name="MySQL">
<service name="application/database/mysql" type="service" version="1">
<single_instance/>
<dependency name="filesystem" grouping="require_all"
restart_on="none" type="service">
<service_fmri value="svc:/system/filesystem/local"/>
</dependency>
<exec_method type="method" name="start" exec="/etc/sfw/mysql/mysql.server start"
timeout_seconds="120"/>
<exec_method type="method" name="stop" exec="/etc/sfw/mysql/mysql.server stop"
timeout_seconds="120"/>
<instance name="default" enabled="false"/>
<stability value="Unstable"/>
<template>
<common_name>
<loctext xml:lang="C">MySQL RDBMS</loctext>
</common_name>
<documentation>
<manpage title="mysql" section="1" manpath="/usr/sfw/share/man"/>
</documentation>
</template>
</service>
</service_bundle>
Note the bundle name, MySQL, and the service name, application/database/mysql;
its dependency on the local file system service svc:/system/filesystem/local
; the start
and stop
methods that call the /etc/sfw/mysql/mysql.server
executable; and the documentation pointer to the mysql
man page.
Copy the file into the /var/svc/manifest/application/database
directory, activate it by running svccfg import mysql.xml
, and then enable the service using svcadm enable mysql
.
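The deployment steps above can be collected into one script. This is a sketch rather than a finished installer: run and deploy_mysql_service are hypothetical helper names, and the script assumes mysql.xml is in the current directory. With DRY_RUN=1, as set here, it only prints the commands so that you can review them before running anything as root.

```shell
#!/bin/sh
# run: hypothetical helper that prints a command and, unless DRY_RUN=1,
# executes it.
run() {
    echo "+ $*"
    [ "${DRY_RUN:-0}" = 1 ] || "$@"
}

# deploy_mysql_service: copy the manifest into place, import it,
# enable the service, and show its resulting state.
deploy_mysql_service() {
    dest=/var/svc/manifest/application/database
    run cp mysql.xml "$dest/" &&
    run svccfg import "$dest/mysql.xml" &&
    run svcadm enable mysql &&
    run svcs mysql
}

DRY_RUN=1        # change to 0 to actually perform the steps
deploy_mysql_service
```

Each step runs only if the previous one succeeded, so a failed import never leads to an attempt to enable a service that was not registered.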
Editing manifest files can be tedious, and it's easy to introduce XML syntax errors as well as SMF errors. Fortunately, there are tools to assist you in creating and managing these files. One such tool is the Java-based SMF Manifest Creator that was a prize winner in the OpenSolaris Community Innovation Awards contest; download it at http://opensolaris.org/os/project/awards/awards_land/Entries/. Another tool that we've mentioned in earlier chapters is Webmin, a community-developed system management tool for Linux and UNIX systems, including OpenSolaris (see http://webmin.com/). Webmin is also in the OpenSolaris software repository's Administration and Configuration collection, so you can download and install it using Package Manager. After it's installed, you access it with your browser at http://localhost:10000, as shown in Figure 5-10.
Webmin includes interfaces to most OpenSolaris system management and configuration tasks (Figure 5-11) including the creation and activation of SMF services, which creates the service manifest files for you (Figure 5-12).
OpenSolaris's Service Management Facility is designed to provide better control over system services and daemons than traditional UNIX/Linux initialization/termination scripts. It's the one "different" OpenSolaris feature you shouldn't ignore. Even though your legacy rc
script methods still work, you will benefit from converting these scripts to SMF-managed services.