Tivoli System Automation for Multiplatforms: Concepts
This chapter introduces the basic concepts of Tivoli System Automation for Multiplatforms. It describes only a subset of the components that are available in Tivoli System Automation for Multiplatforms, and which are required for automating Tivoli Workload Scheduler.
25.1 Overview
Tivoli System Automation for Multiplatforms delivers a high-availability environment for AIX, Linux, Solaris, and Windows in which systems and applications are continuously monitored. The self-healing features of Tivoli System Automation for Multiplatforms alleviate downtime that is caused by various kinds of problems. High availability is achieved through the use of automation policies in which common relationships between various components are defined. As a result, the time that is required to recover from an outage can improve significantly. The following sections describe the most important building blocks that can be used in the automation policies for Tivoli Workload Scheduler.
25.2 Resources
A resource is the most basic building block in a Tivoli System Automation for Multiplatforms cluster that represents an instance of a resource class. A resource is an abstract view of either a physical or logical component in the system, such as a file system, a network adapter or a piece of software. A resource class defines the set of common characteristics that instances of the class can have. A resource can have persistent and dynamic attributes.
25.2.1 Persistent resource attributes
Persistent attributes describe enduring properties of a resource, such as the name of a file system or the speed of a network adapter. The most important persistent attributes of a resource in the context of this chapter are NodeNameList and ResourceType:
The NodeNameList persistent attribute represents the collection of nodes in the cluster on which the resource is eligible to run.
The ResourceType persistent attribute specifies whether a resource is fixed or floating:
 – Fixed
The NodeNameList attribute of the resource contains a single entry. As a result, the resource can only run on the specified node in the cluster.
 – Floating
The NodeNameList attribute of the resource contains multiple entries. Although multiple nodes are defined, only one instance of the resource may be active at any time.
25.2.2 Dynamic resource attributes
Dynamic attributes represent changing characteristics of a resource. The most important dynamic attribute of a resource in the current context is OpState, which specifies the operational state of a resource.
Table 25-1 lists the possible states.
Table 25-1 Dynamic resource attributes
State            Description
Offline          The resource is not started.
Pending Online   The resource has been started but is not yet ready for work.
Online           The resource is ready for work.
Pending Offline  The resource is in the process of being stopped.
Failed Offline   The resource is broken and cannot be used.
Stuck Online     The resource cannot be brought offline.
Unknown          Reliable state information about the resource cannot be obtained.
25.3 Resource managers
A resource class is a collection of resources of the same type, and it defines a common set of attributes for its instances. Resource classes are managed by their respective resource managers. The two most important resource classes are as follows, both of which are managed by the IBM.GblResRM global resource manager:
IBM.Application
This resource class is a generic class that can be used to manage any application on the system. The resource class provides the persistent attributes StartCommand, StopCommand, and MonitorCommand, which are used by the global resource manager to control and monitor the application.
IBM.ServiceIP
This resource class is used to manage virtual IP addresses that can be started, stopped, and moved between network adapters and nodes within a cluster. Each instance of this class is a floating resource that identifies one virtual IP address. These addresses are typically provided to clients that are connecting to some service offered by the cluster.
The IBM.GblResRM global resource manager only provides an interface to control resources. If a problem occurs with a specific resource, the IBM.RecoveryRM recovery resource manager automates the failover of the resource, based on the defined automation policies.
Figure 25-1 illustrates the most important resource managers.
Figure 25-1 Resource managers
 
Note: The Tivoli System Automation for Multiplatforms commands cannot be used to start or stop resources directly. These actions can only be performed on resource groups.
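The StartCommand, StopCommand, and MonitorCommand attributes of IBM.Application refer to ordinary executables. For the MonitorCommand, Tivoli System Automation for Multiplatforms maps the exit code to an OpState: 1 means Online, 2 means Offline, and 3 means Failed Offline (the same return codes appear in the tables later in this chapter). The following is a minimal sketch of such a monitor script, assuming the application writes a PID file at a known location; the PID-file path and function name are illustrative, not part of the product.

```shell
# Hypothetical MonitorCommand script for an IBM.Application resource.
# TSA MP maps the exit code to an OpState:
#   1 = Online, 2 = Offline, 3 = Failed Offline
# The PID-file path and function name are illustrative.

monitor_app() {
    pidfile="$1"
    # No PID file: the application is not started (Offline).
    [ -f "$pidfile" ] || return 2
    # PID file present and the process is alive: Online.
    if kill -0 "$(cat "$pidfile")" 2>/dev/null; then
        return 1
    fi
    # Stale PID file: the process died unexpectedly (Failed Offline).
    return 3
}

rc=0
monitor_app "/tmp/app1.pid" || rc=$?
echo "OpState return code: $rc"
```

Because the script is run every MonitorCommandPeriod seconds, it should complete quickly and avoid expensive checks.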
25.4 Resource groups
Resource groups are logical containers that contain a collection of resources and other nested resource groups. All members of the resource group can be treated as one logical instance. As a result, resource groups can be used to control all members collectively. A resource group is an instance of the IBM.ResourceGroup class and has persistent and dynamic attributes similar to a resource.
25.4.1 Persistent resource group attributes
A resource group has the following important persistent attributes:
MemberLocation
This persistent attribute indicates whether all resource group members have to be collocated on the same node in the cluster.
 
Important: By default, all members of a resource group are collocated on the same node in the cluster.
 
NominalState
This persistent attribute defines the state of the resource group. Tivoli System Automation for Multiplatforms will try to start and keep the resource group in this state.
25.4.2 Dynamic resource group attributes
Similar to a resource, a resource group also has an OpState dynamic attribute. Tivoli System Automation for Multiplatforms uses this attribute to indicate the aggregate operational state of the managed resources that are members of the resource group.
The possible values of the OpState dynamic attribute are listed in Table 25-2.
Table 25-2 OpState dynamic resource group attribute
State            Description
Offline          All member resources are Offline (their OpState is Offline).
Pending Online   The resource group's NominalState has just been set to Online; the resources in the resource group are being started.
Online           All mandatory resources are Online. See 25.5, “Managed relations” on page 750 for more details.
Pending Offline  The resource group is being brought Offline.
Failed Offline   One or more member resources are in the Failed Offline state.
Stuck Online     One or more member resources are in the Stuck Online state.
Unknown          One or more member resources are in the Unknown state.
25.4.3 Managed resources
A resource becomes a managed resource as soon as the resource has been inserted into a resource group or an equivalency, resulting in an instance of the IBM.ManagedResource class. The Mandatory persistent attribute of a managed resource specifies whether the resource is mandatory for the resource group. When a resource group is started, all managed resources within the group that are mandatory must be started. If a mandatory resource fails to start, the entire resource group is stopped and started on another node in the cluster. Non-mandatory resources may be sacrificed to activate the resource group.
Figure 25-2 on page 750 shows all components.
 
Important: The global resource manager IBM.GblResRM starts monitoring resources as soon as they become managed resources.
 
Figure 25-2 Managed Resources
25.5 Managed relations
A managed relationship can exist between a source resource or resource group and one or more target resources or resource groups. This section describes the relationships that are most relevant for the automation of Tivoli Workload Scheduler, because they are also the most commonly used.
25.5.1 Start and stop dependencies
Tivoli System Automation for Multiplatforms provides relationships that can be used to regulate the starting and stopping of resources. The source of a start or stop relationship can be one of the following sources:
Member of a resource group (managed resource)
Resource group
The target of a start or stop relationship can be one of the following targets:
Member of a resource group (managed resource)
Resource group
Equivalency (See “Equivalencies” on page 752 for an in-depth discussion)
The source and target can belong to separate resource groups.
StartAfter relationship
The StartAfter relationship ensures that the source resource is only started when the target resource is available, as shown in Figure 25-3 on page 751.
Figure 25-3 StartAfter relationship
When the App1 source resource has to be started, the App2 target resource is started first. As soon as the App2 target resource reaches the Online operational state, the App1 source resource is started. The start order only acts in the forward direction of the relationship. Setting the NominalState persistent attribute of resource group B to Online does not cause any action on resource App1 because resource App2 has no forward relationship with resource App1.
The StartAfter relationship will be used to automate the startup of the embedded WebSphere Application Server and the database.
DependsOn relationship
Similar to the StartAfter relationship, the DependsOn relationship is used to ensure that a source resource can only be started when the target resource is online. However, several important differences exist:
A DependsOn relationship implies an implicit collocation between the source and the target resources. See 25.5.2, “Location dependencies” on page 751 for more details.
If the target resource fails (OpState = Failed Offline) or is stopped gracefully (OpState = Offline), the source resource is also stopped.
The DependsOn relationship is used to automate the Event Processor and the Dynamic Workload Broker Component, which depend on the embedded WebSphere Application Server that is hosting them.
 
Important: Unlike the DependsOn relationship, the StartAfter relationship does not bring down the source resource if the target resource becomes unavailable.
25.5.2 Location dependencies
Tivoli System Automation for Multiplatforms also provides relationships that are used to define location constraints between floating resources. For automating Tivoli Workload Scheduler, the two relationships that are used are Collocated and Affinity:
Collocated relationship
The Collocated relationship is employed to ensure that the source resource and the target resource are located on the same node in the cluster. A DependsOn relationship between the source resource and the target resource implies this relationship automatically.
Affinity relationship
The Affinity relationship is used to indicate that, if possible, a source resource should run on the same node where the target resource is running. If another location relationship is inhibiting this, the source resource can also run on another node in the cluster.
Therefore, the Affinity relationship defines a soft-location relationship; the Collocated relationship is a hard-location relationship, as illustrated in Figure 25-4. The Affinity relationship between App4 and App1 is sacrificed because App4 has a Collocated relationship with App5, which is already bound to a separate node in the cluster.
Figure 25-4 Collocated / Affinity relationship
25.6 Equivalencies
Resources that provide similar functionality can be grouped into an equivalency, which can be the target of a managed relationship. The resources that make up an equivalency must be fixed resources of the same resource class.
25.6.1 Example
Assume three applications, App1, App2 and App3, that have the following characteristics:
App1 is an embedded application that can run in either App2 or App3. However, App1 cannot run in both App2 and App3 at the same time. As a result, App1 is a floating resource, and App2 and App3 are fixed resources.
Each application can be started and stopped independently.
If App1 is running in App2, and App2 suddenly fails, App1 must be restarted on App3.
If App1 is running in App2, and App3 suddenly fails, Tivoli System Automation for Multiplatforms attempts to restart only App3.
At first appearance, this automation problem can be solved easily by creating a resource group for each application and using DependsOn relationships between the resource groups, as illustrated in Figure 25-5 on page 753.
Figure 25-5 Automation problem: incorrect solution
Using this setup, Tivoli System Automation for Multiplatforms introduces the following unexpected automation behavior:
App1 is started only if both App2 and App3 are available. This requirement is too strict.
If App1 is running in App2, and App3 suddenly fails, App1 is stopped as well. This happens because App1 depends on both App2 and App3, although it should depend only on whichever of the two is currently hosting App1.
To achieve the required automation behavior, shadow resources and shadow equivalencies must be used.
25.6.2 Shadow resources and shadow equivalencies
A shadow resource is an instance of the IBM.Application class that has the following characteristics:
It is a fixed resource that monitors the OpState of another fixed resource.
It is a member of a shadow equivalency that instructs Tivoli System Automation for Multiplatforms to evaluate only the OpState of the member resources. The implication is that the shadow resources will not be started or stopped.
Using shadow resources and shadow equivalencies, the automation problem described in section 25.6, “Equivalencies” on page 752 can now be solved as follows:
1. Create a resource group for each application.
2. Define shadow resources of App2 and App3. These shadow resources monitor the OpState of App2 and App3, respectively.
3. Create a shadow equivalency, which contains the shadow resources.
4. Define a DependsOn relationship between the resource group of App1 and the shadow equivalency.
Figure 25-6 illustrates the correct solution.
Figure 25-6 Automation problem: correct solution
The solution bypasses the undesired automation behavior as follows:
App1 is started if either App2 or App3 is available.
If App1 is running in App2 and App3 suddenly fails, Tivoli System Automation for Multiplatforms attempts to restart App3. No action is performed on App1.
If App1 is running in App2 and App2 suddenly fails, App1 is restarted on App3.
25.7 Quorum
A cluster may split into two or more subclusters if no more communication is possible between the nodes in the cluster. If subclusters are unaware of each other’s existence, Tivoli System Automation for Multiplatforms might start a new instance of an application that is already running in one of the other subclusters. If that application is a critical resource that is not allowed to run on multiple nodes in the cluster simultaneously, data integrity of the cluster might be endangered.
If the cluster splits into two or more subclusters, Tivoli System Automation for Multiplatforms determines which of the subclusters has the majority of member nodes. The subcluster with the majority of nodes (more than 50%) has the operational quorum and becomes the active cluster. Other subclusters are dissolved, as shown in Figure 25-7.
Figure 25-7 Quorum
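The majority rule itself is simple integer arithmetic: a subcluster has operational quorum when it contains strictly more than half of the defined nodes. The following plain-shell sanity check (not a Tivoli System Automation for Multiplatforms command; the function name is illustrative) expresses the rule:

```shell
# has_quorum SUBCLUSTER_NODES TOTAL_NODES
# Exit status 0 when the subcluster holds a strict majority of the nodes.
has_quorum() {
    # Strict majority: comparing 2 * subcluster > total avoids
    # fractional arithmetic in the shell.
    [ $((2 * $1)) -gt "$2" ]
}

has_quorum 3 4 && echo "3 of 4 nodes: operational quorum"
has_quorum 2 4 || echo "2 of 4 nodes: tie, a tie breaker is required"
```

The 2-of-4 case is exactly the tie situation that the tie breaker described in 25.7.1 resolves.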
The dynamic attribute OpQuorumState of the IBM.PeerDomain class identifies the operational quorum of the cluster. The attribute has the following possible values:
OpQuorumState = 0 indicates that quorum is achieved. As a result, resources may be started.
OpQuorumState = 1 indicates that a tie situation occurs and is not yet resolved. See 25.7.1, “Tie breaker” on page 755 for more details.
OpQuorumState = 2 indicates that quorum is not achieved. As a result, Tivoli System Automation for Multiplatforms is not allowed to start resources.
Alternatively, the IBM.RecoveryRM can also be queried to obtain the Quorum State, as shown in Example 25-1.
Example 25-1 Quorum State
# lssrc -ls IBM.RecoveryRM | grep 'Operational Quorum State'
Operational Quorum State: HAS_QUORUM
25.7.1 Tie breaker
If a tie occurs in which the cluster is partitioned into subclusters with an equal number of nodes, Tivoli System Automation for Multiplatforms employs a tie breaker to determine which subcluster has operational quorum. Operational quorum becomes effective as soon as the subcluster reserves the tie breaker. Several types of tie breakers exist, of which the network tie breaker is the most common type in the context of automating Tivoli Workload Scheduler.
To resolve a tie situation, the network tie breaker uses an external IP address that must be reachable from all nodes in the cluster, as illustrated in Figure 25-8. This external IP address must be able to reply to Internet Control Message Protocol (ICMP) echo requests (ping command). If a firewall rule is installed that blocks ICMP traffic between the cluster nodes and the external IP address, the network tie breaker does not function properly. Using the network tie breaker has the following advantages over other tie breakers:
No additional hardware is required.
Unlike other tie breakers (for example, disk based), the network tie breaker evaluates the availability of TCP/IP communication between the cluster nodes.
Figure 25-8 Network tie breaker
 
Important: The network tie breaker must be used only when all cluster nodes are part of the same subnet. If the nodes are in separate subnets, both subclusters might still be able to reach the tie breaker address even though they cannot communicate with each other. As a result, a cluster split can go undetected.
25.7.2 Critical resource protection
The ProtectionMode persistent attribute of a resource specifies whether the resource is a critical resource. If critical resources are active in the dissolved subcluster, Tivoli System Automation for Multiplatforms uses the CritRsrcProtMethod attribute of each node in the dissolved subcluster to determine in which way the system should be terminated. The default (CritRsrcProtMethod = 1) is to simulate a kernel panic, as shown in Example 25-2.
Example 25-2 Critical resource protection
# lsrsrc -c IBM.PeerNode | grep CritRsrcProtMethod
CritRsrcProtMethod = 1
 
 
Note: By default, instances of the IBM.Application resource class are non-critical; instances of the IBM.ServiceIP resource class are critical.
25.8 Behavior patterns
This section demonstrates the most common behavior patterns of Tivoli System Automation for Multiplatforms, which are triggered by a change of the operational state of one or more managed resources. Figure 25-9 shows the example setup: two floating resources in a single resource group, which must be collocated on the same node in the cluster. The cluster in the example consists of two nodes, NodeA and NodeB.
 
Figure 25-9 Behavior patterns
25.8.1 MonitorCommand
The OpState of an instance of the IBM.Application class is determined by the return code of the MonitorCommand command, which is run every MonitorCommandPeriod seconds. As described in 25.4, “Resource groups” on page 748, the operational state of a resource group is an aggregation of the OpState attributes of all managed resources that it comprises. If the NominalState of the resource group has been set to Online, the OpState of the resource group is set to Pending Online until all the managed resources in the resource group are Online. Finally, if all resources of the resource group have reached the NominalState of the resource group, the OpState of the resource group changes to Online. The next sections describe how Tivoli System Automation for Multiplatforms reacts to OpState changes of the managed resources.
OpState change App1
Assuming that the current operational status of both App1 and App2 is Online on NodeA and that the NominalState and OpState of resource group A is set to Online, two situations might occur: Offline and Failed Offline.
Case 1: OpState = Offline
If the operational state of App1 changes to Offline, Tivoli System Automation for Multiplatforms attempts to restart it on NodeA. If restarting App1 completes successfully, the operational state of resource group A changes from Pending Online to Online.
Table 25-3 shows the sequence that Tivoli System Automation for Multiplatforms follows.
Table 25-3 OpState change App1: Offline
Each App column shows the return code of the MonitorCommand for that resource on that node, with the corresponding OpState in parentheses.

NodeA App1             NodeA App2             NodeB App1             NodeB App2             Group A OpState
RC=1 (Online)          RC=1 (Online)          RC=2 (Offline)         RC=2 (Offline)         Online
RC=2 (Offline)         RC=1 (Online)          RC=2 (Offline)         RC=2 (Offline)         Pending Online
RC=1 (Online)          RC=1 (Online)          RC=2 (Offline)         RC=2 (Offline)         Online
Case 2: OpState = Failed Offline
An operational state change to Failed Offline of App1 indicates that the resource is broken on NodeA and cannot be restarted there. As a result, App1 must be restarted on NodeB. However, because App1 depends on App2 and both resources are part of the same resource group, first App2 must be stopped on NodeA and restarted on NodeB. Assuming that the restart is successful for both resources on NodeB, App1 is started after App2. Finally, the OpState of the resource group changes back to Online. Table 25-4 outlines the sequence in detail.
Table 25-4 OpState change App1: Failed Offline
NodeA App1             NodeA App2             NodeB App1             NodeB App2             Group A OpState
RC=1 (Online)          RC=1 (Online)          RC=2 (Offline)         RC=2 (Offline)         Online
RC=3 (Failed Offline)  RC=1 (Online)          RC=2 (Offline)         RC=2 (Offline)         Pending Offline
RC=3 (Failed Offline)  RC=2 (Offline)         RC=2 (Offline)         RC=2 (Offline)         Offline
RC=3 (Failed Offline)  RC=2 (Offline)         RC=2 (Offline)         RC=1 (Online)          Pending Online
RC=3 (Failed Offline)  RC=2 (Offline)         RC=1 (Online)          RC=1 (Online)          Online
 
 
Note: The OpState of App1 on NodeA remains in the state Failed Offline. To change the state to Offline, resource App1 on NodeA must be reset with the resetrsrc command.
OpState change App2
Assuming that the current operational status of both App1 and App2 is Online on NodeA and that the NominalState and OpState of resource group A is set to Online, two events might occur: Offline and Failed Offline.
Case 1: OpState = Offline
If the OpState of resource App2 changes from Online to Offline, Tivoli System Automation for Multiplatforms stops App1 because of the DependsOn relationship. Subsequently, App2 is restarted on the same node in the cluster. Finally, App1 is restarted on NodeA also. Table 25-5 summarizes the events.
Table 25-5 OpState change App2: Offline
NodeA App1             NodeA App2             NodeB App1             NodeB App2             Group A OpState
RC=1 (Online)          RC=1 (Online)          RC=2 (Offline)         RC=2 (Offline)         Online
RC=1 (Online)          RC=2 (Offline)         RC=2 (Offline)         RC=2 (Offline)         Pending Offline
RC=2 (Offline)         RC=2 (Offline)         RC=2 (Offline)         RC=2 (Offline)         Offline
RC=2 (Offline)         RC=1 (Online)          RC=2 (Offline)         RC=2 (Offline)         Pending Online
RC=1 (Online)          RC=1 (Online)          RC=2 (Offline)         RC=2 (Offline)         Online
 
Case 2: OpState = Failed Offline
If the operational state of resource App2 changes to Failed Offline while App1 is Online, Tivoli System Automation for Multiplatforms stops App1 on NodeA first. Next, App2 is started on NodeB. To honor the dependency, App1 is then started on NodeB, as shown in Table 25-6.
Table 25-6 OpState change App2: Failed Offline
NodeA App1             NodeA App2             NodeB App1             NodeB App2             Group A OpState
RC=1 (Online)          RC=1 (Online)          RC=2 (Offline)         RC=2 (Offline)         Online
RC=1 (Online)          RC=3 (Failed Offline)  RC=2 (Offline)         RC=2 (Offline)         Pending Offline
RC=2 (Offline)         RC=3 (Failed Offline)  RC=2 (Offline)         RC=2 (Offline)         Offline
RC=2 (Offline)         RC=3 (Failed Offline)  RC=2 (Offline)         RC=1 (Online)          Pending Online
RC=2 (Offline)         RC=3 (Failed Offline)  RC=1 (Online)          RC=1 (Online)          Online
25.8.2 StartCommand
Another important trigger for automation actions is the result of the StartCommand command. This section gives more details about how this command affects the behavior of Tivoli System Automation for Multiplatforms.
StartCommand completed successfully
If the StartCommand command completes successfully (RC=0), Tivoli System Automation for Multiplatforms acts on the result of the subsequent MonitorCommand command.
Return code of MonitorCommand = 1 (Online)
The automation goal is reached. The OpState of the resource is set to Online.
Return code of MonitorCommand = 2 (Offline)
Tivoli System Automation for Multiplatforms sets the resource in the Pending Online state and waits for the outcome of the subsequent MonitorCommand command executions. If the resource is still Offline after StartCommandTimeout seconds, the resource is reset. If the reset is successful, an attempt to restart the resource is made and a counter is incremented. If the counter reaches a configurable threshold, the OpState of the resource is set to Failed Offline and will therefore trigger a failover.
The upper threshold is regulated by the RetryCount tunable, which can be set by the samctrl command. The default value is set to 3 as shown in Example 25-3.
Example 25-3 RetryCount tunable
# lssamctrl | grep RetryCount
RetryCount = 3
StartCommand completed with an error
If the StartCommand command completes unsuccessfully (RC=1), Tivoli System Automation for Multiplatforms acts on the result of the previous MonitorCommand command.
Figure 25-10 on page 761 shows a state diagram of the StartCommand command.
Return code of MonitorCommand = 1 (Online)
In this case, the return code of the StartCommand command is ignored. The OpState of the resource is set to Online, which implies that the automation goal is reached.
Return code of MonitorCommand = 2 (Offline)
If the MonitorCommand command indicates that the resource is Offline, the OpState of the resource is put in a Failed Offline state and therefore triggers a failover.
Figure 25-10 StartCommand state diagram
25.8.3 StopCommand
The logic behind the StopCommand command is similar to the behavior of the StartCommand command. If a stop or reset request is issued to an instance of the IBM.Application class, the StopCommand command is executed. This section describes the logic in more detail.
StopCommand completed successfully
If the StopCommand command completes successfully (RC=0), Tivoli System Automation for Multiplatforms acts on the outcome of the subsequent MonitorCommand command.
Return code of MonitorCommand = 2 (Offline)
In this case, the automation goal is reached. The OpState of the resource is set to Offline.
Return code of MonitorCommand = 1 (Online)
Tivoli System Automation for Multiplatforms sets the resource in the Pending Offline state and acts on the result of the subsequent MonitorCommand commands. If the resource is still Online after StopCommandTimeout seconds, the resource is reset. If after the reset timeout (equal to StopCommandTimeout seconds) the resource is still in the Online state, Tivoli System Automation for Multiplatforms will put the resource in the Stuck Online state. Operator intervention will be required to unblock the situation.
StopCommand completed with an error
If the StopCommand command completes unsuccessfully (RC=1), Tivoli System Automation for Multiplatforms will act as follows.
Return code of MonitorCommand = 2 (Offline)
The return code of the StopCommand command is ignored. The OpState of the resource is set to Offline. The automation goal is reached.
Return code of MonitorCommand = 1 (Online)
In this case, Tivoli System Automation for Multiplatforms sets the resource in the Pending Offline state. Afterward, a reset of the resource is performed. If the resource is still Online after the reset timeout (equal to StopCommandTimeout seconds), the resource is put in the Stuck Online state. Figure 25-11 illustrates the states.
Figure 25-11 StopCommand state diagram