Message isolation
This chapter describes a z/OS V2R2 enhancement to assist with message isolation.
Message isolation is the process by which Cross-system Coupling Facility (XCF) monitoring identifies a member that is not processing messages in a timely manner, and then arranges to have sending systems reject or delay messages that are targeted at that member.
z/OS V2R2 introduced enhancements that help alleviate potential issues with message traffic, and that help recognize, contain, and manage the effect of stalled systems.
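This chapter includes the following topics:
2.1, “Messaging overview”
2.2, “Operational considerations”
2.3, “Message isolation”
2.4, “Messages that show the isolation effect”
2.5, “Migration and coexistence considerations”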
2.1 Messaging overview
XCF groups are made up of members that have a relationship with each other. Each member of the group is within a z/OS instance. The multiple members can be grouped to constitute a major application. The individual member within the group represents a specific application that is a subset of the major application.
The XCF group members often communicate with each other to ensure that the XCF group is functioning satisfactorily to meet the major application and individual application requirements.
During the configuration stages, names are assigned to each of the XCF groups. The XCF group name in the sysplex must be unique to avoid a member intruding erroneously into another XCF group. The communication between the group members is known as message traffic.
Messages are held in the following types of buffers:
Outbound message: Used to store the message on the sending system
Inbound message: Used to receive the message on the target system
Local message: Used to send and receive messages within the same system
XCF is a z/OS component that passes messages between the members of a group in a Parallel Sysplex. A group is a set of members. To use the XCF services, each member must register with XCF by joining a group (IXCJOIN macro).
When a member sends a message to another member, the receiving member is known as the target member.
2.2 Operational considerations
The successful delivery of messages and responses between members is reliant on the following factors:
The configuration of components and operational processes that are required to perform messaging traffic is available and sufficiently functional.
The performance of all the related parts is balanced, so that each part handles its messaging without dominating a resource or disrupting the apportionment of capacity across the sysplex.
If the target member is experiencing issues and is not recycling its inbound message buffers, the buffers can fill up and the member becomes stalled. The stall is a result of a change in expected behavioral patterns. This change in status can result from one of the following situations:
A software failure within or outside of XCF
A hardware failure within or outside of the CF
An erroneous change
An unnoticed consistent creep in capacity that is required by the applications
A single sharp increase in demand for resources
MsgExit SRB suspended (waiting):
 – Local lock contention
 – Latch contention
MsgExit SRB not dispatched (ready state):
 – Application dispatch priority too low
 – Single CPU dominated by higher priority work
 – Tight loop in unrelated work unit
 – LPAR weight too low
New members joining the group
Effect of a stalled system
A stalled system can have a further effect on other members because the other members that are attempting to send messages to the stalled system still have their outbound message buffers in use as they wait for the stalled system to become available again.
If the situation persists, the outbound message buffers of the sending system become saturated because of the delay and any further attempt to send a message is denied. The sending systems then can experience a secondary effect and stall because of the issue with the target system that is still experiencing difficulties and stalled. This secondary effect can spread to other members.
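The secondary effect can be illustrated with a small toy model. The following sketch is not XCF code: the Sender class, the pool size, and the one-message-per-tick traffic pattern are all assumptions chosen for illustration. It shows only the mechanism that is described above, where buffers that are held for a stalled target eventually cause sends to a healthy target to be denied as well.

```python
from dataclasses import dataclass, field


@dataclass
class Sender:
    """Toy sending system with a fixed pool of outbound message buffers."""
    name: str
    pool_size: int  # hypothetical pool size, not a real MAXMSG value
    in_use: dict = field(default_factory=dict)  # target name -> buffers held

    def send(self, target: str, target_stalled: bool) -> bool:
        """Return True if the message is accepted, False if it is denied."""
        if sum(self.in_use.values()) >= self.pool_size:
            return False  # pool saturated: every further send is denied
        if target_stalled:
            # The buffer stays in use while the sender waits for the target.
            self.in_use[target] = self.in_use.get(target, 0) + 1
        # In this model, a healthy target frees its buffer immediately.
        return True


def simulate(ticks: int, pool_size: int) -> list:
    """Each tick, send one message to a stalled target and one to a healthy one."""
    sender = Sender("SY1", pool_size)
    history = []
    for _ in range(ticks):
        stalled_ok = sender.send("STALLED", target_stalled=True)
        healthy_ok = sender.send("HEALTHY", target_stalled=False)
        history.append((stalled_ok, healthy_ok))
    return history


if __name__ == "__main__":
    for tick, (stalled_ok, healthy_ok) in enumerate(simulate(ticks=5, pool_size=3), 1):
        print(f"tick {tick}: to stalled={stalled_ok}, to healthy={healthy_ok}")
```

In this model, the healthy target is denied as soon as the stalled target holds the whole pool, which is the spread that message isolation is intended to prevent.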
When a stalled system affects other members, it issues message IXC631I to the hardcopy log to indicate that a local stalled XCF member might be affecting other members. This message is issued once for each of the other members that are affected by the stalled member. Figure 2-1 shows an excerpt from the manual SA38-0677 z/OS V2R2 MVS™ Systems Messages Volume 10 (IXC-IZP) for message IXC631I.
IXC631I GROUP grpname MEMBER membername JOB jobname ASID asid STALLED, IMPACTING SYSTEM sysname
Explanation: The indicated XCF Group Member is not processing its XCF work in a timely manner. The stall is considered critical because it is affecting the indicated system. For example, the indicated system might not be able to send signals to the local system because the stalled member is holding XCF signal buffers that are needed to receive such signals.
 
Figure 2-1 Excerpt of system message IXC631I
The stalled system issues message IXC640E once to indicate that the stalled situation occurred and there is a possible resolution, as shown in Figure 2-2. The situation becomes more complicated if the consoles on the offending system cannot process the IXC631I and IXC640E messages.
IXC640E type XCF GROUP MEMBERS ON SYSTEM sysname IMPACTING SYSPLEX text
Explanation: One of the following conditions exists:
System sysname has at least one XCF group member that appears to be stalled and is not processing its XCF work in a timely manner. Failure to process this work appears to be affecting the sysplex.
System sysname has at least one critical XCF group member that appears to be impaired. See the explanation of message IXC633I for a description of situations that can make a member appear impaired.
Figure 2-2 Extract of system message IXC640E
Operational practices can use the capabilities of the automation component to catch these messages. With sufficient logic, automation can determine whether to resolve the situation, escalate it for immediate attention, or report it. That logic can involve issuing the D XCF,GROUP command and interrogating the response to determine what course of action to follow.
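As an illustration of such automation logic, the following sketch extracts the fields of message IXC631I by using the layout that is shown in Figure 2-1. It is a minimal sketch under stated assumptions: the function names, the choose_action policy, and the idea of a set of critical groups are hypothetical, and the sample line is constructed from the values in this chapter rather than captured from a system. Real console output might carry prefixes or line wrapping that this pattern does not handle.

```python
import re
from typing import Optional

# Pattern follows the IXC631I layout that is quoted in Figure 2-1.
IXC631I = re.compile(
    r"IXC631I\s+GROUP\s+(?P<group>\S+)\s+MEMBER\s+(?P<member>\S+)\s+"
    r"JOB\s+(?P<job>\S+)\s+ASID\s+(?P<asid>[0-9A-Fa-f]+)\s+"
    r"STALLED,\s+IMPACTING\s+SYSTEM\s+(?P<impacted>\S+)"
)


def parse_ixc631i(line: str) -> Optional[dict]:
    """Return the message fields as a dict, or None if the line is not IXC631I."""
    match = IXC631I.search(line)
    return match.groupdict() if match else None


def choose_action(stall: dict, critical_groups: frozenset) -> str:
    """Hypothetical policy: escalate stalls in critical groups, report the rest."""
    if stall["group"] in critical_groups:
        return "escalate"
    return "report"


if __name__ == "__main__":
    # Constructed sample line (an assumption, not captured console output).
    sample = ("IXC631I GROUP A0000000 MEMBER SY2 JOB XCAT0C01 ASID 0025 "
              "STALLED, IMPACTING SYSTEM SY1")
    stall = parse_ixc631i(sample)
    print(stall)
    print(choose_action(stall, frozenset({"A0000000"})))
```

A production rule set would typically also correlate IXC640E and the response of the D XCF,GROUP command before it acts.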
The RMF XCF Activity Report is also a useful source of reference to help understand the behavior of the members.
2.3 Message isolation
Message isolation is the process by which XCF monitoring identifies a member that is not processing its inbound buffers in a timely manner. It then arranges to have sending XCF systems reject or delay messages that are targeted to that member.
z/OS V2R2 enhancements provide the following new options for message isolation for the recognition, containment, and management of stalled systems and their potential effect:
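IXCJOIN macro: New MSGISO keyword with which the joining member can request the isolation-specific reason code. (Message isolation itself is enabled by default; the default for the keyword is MSGISO=NONE.)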
IXCMSGO macro and IXCMSGOX macro interface: A new isolation reason code is introduced.
Hex Return code 0C Hex Reason code 3C
Equate Symbol: ixcMsgRsnTargetIsolated
Meaning: The target member is not processing messages in a timely manner and is message isolated. Messages that are targeted to such a member are rejected or delayed by the sending system. If and when the target member makes sufficient progress, normal message flow resumes.
This reason code applies only if the sending member specified MSGISO=MSGORSN when it invoked the IXCJOIN macro to become an active group member. If MSGISO=NONE is specified (or taken as the default), a request to send a message to a target member that is message isolated is rejected with a no buffer reason code (ixcMsgoRsnNoBuffer).
Action: Retry the request after allowing time for the condition to clear. If TIMEOUT was not specified, consider retrying the request with a nonzero TIMEOUT value to have XCF try to handle the condition. If the target member is no longer isolated before the TIMEOUT expires, XCF sends the message.
IXCYCON macro: Provides equate symbols for the return and reason codes.
Hex Return code 0C Hex Reason code 3C Equate Symbol: ixcMsgRsnTargetIsolated
Query services IXCQUERY and IXCMG: To determine the member status and recognize the affected and isolated status. IXCYQUAA and IXCYAMDA are the mapping macros that describe how the application maps the data that is returned by these query services.
DISPLAY XCF,GROUP command (which uses query services to access the information) to show status information that is related to message isolation of XCF group members.
IXC637I: System message displaying the impact window for a specific member, as shown in Figure 2-3.
IXC637I GROUP grp_name MEMBER isolated_memname JOB jobname ASID asid
MEMTOKEN memtoken1 memtoken2 ON SYSTEM isosysnm ISO#: isosysslot.sysiso#
MESSAGE ISOLATION IMPACT FOR SYSTEM impsysnm RPT#: report#
IMPACTED : impactdate impacttime SEQ#: impactiso# whyclosed: closedate closetime
RESUMED : SEQ#: closeiso#
DELAYED : delayeddate delayedtime #MSG: #msgdelayed
REJECTED : rejecteddate rejectedtime #MSG: #msgrejected
 
Explanation: Member isolated_memname of group grp_name on system isosysnm is “message isolated”. XCF delays or rejects messages targeted to a member that is message isolated. When a sending member has a message delayed or rejected because the target member appears to be isolated, the sending member is said to be “impacted”. System impsysnm issues message IXC637I to summarize the isolation impact experienced by the members of group grp_name residing on system impsysnm. Message IXC637I indicates the time when the impact started, the number of messages that were delayed or rejected, and the time when messages were most recently delayed or rejected.
Figure 2-3 Message IXC637I impact window for a specific member
IXC638I: System message displaying the isolation window for a specific member, as shown in Figure 2-4.
IXC638I GROUP grp_name MEMBER member_name JOB jobname ASID asid
MEMTOKEN memtoken1 memtoken2 ON SYSTEM sysname ISO#: isosysslot.sysiso#
MESSAGE ISOLATION STATUS FOR SYSTEM sysname RPT#: report#
ISOLATED : isolatedate isolatetime : SEQ#: memberiso# whyclosed: closedate closetime
RESUMED : SEQ#: resumeiso#
DELIVERYQ : deliveryqdate deliveryqtime #MSG: #msgqueued
LAST MSGX : activedatesi activetimesi SEQ#: signalqueueseq#
 
Explanation: Member member_name of group grp_name on system sysname is “message isolated”. XCF isolates a member when it fails to make adequate progress with respect to the processing of its messages. Message isolation helps keep problematic group members from impeding the delivery of messages to other members. Message IXC638I indicates the time when XCF isolated the member and provides information about the XCF work pending for the member, as well as information about the progress of the signal exit routines that are expected to process that work.
Figure 2-4 Message IXC638I isolation window for a specific member
IXC640E alerts the operator that there is at least one group member that appears to be stalled and is affecting the sysplex.
IXC645E: System message alerting the operator to the existence of isolated members, as shown in Figure 2-5.
IXC645E SYSTEM sysname HAS ISOLATED XCF GROUP MEMBERS
 
Explanation: One or more XCF group members on system sysname are “message isolated”. XCF isolates a member when it fails to make adequate progress with respect to the processing of its messages. Message isolation helps keep problematic group members from impeding the delivery of messages to other members.
Figure 2-5 Message IXC645E alerting the operator to the existence of isolated members
The goal is to isolate stalled or poorly performing members and to avoid the secondary effect on other members. After the cause is remedied, the member can resume message communication.
Automation can use the enhancements to monitor the systems regularly. Automation can also assess the behavior of each member to determine how likely the member is to become stalled and, where appropriate, take preventive steps to avoid such a situation or at least minimize the effect.
Consider the following points regarding the automation process:
Preventing a target member from using too much of available resources, which impedes the delivery of message signals to other members:
 – Detect the high consumption of inbound signal buffer pools
 – Monitor the high consumption of virtual common storage
 – Identify the high consumption of real frames
After a message is accepted by the XCF signal service through the IXCMSGOX macro, it must be delivered. The inbound side (member and XCF) cannot refuse messages, so the sending XCF must be the one to reject requests that are targeted to an offending member.
The inbound XCF side identifies ill-behaved members and isolates them by instructing other XCFs to stop accepting message signals that are targeted to that member. After the issue is resolved and the member is functioning appropriately, the other XCFs must be notified to resume message communications.
The Sysplex Failure Management (SFM) policy has the MEMSTALLTIME(seconds) parameter, which allows XCF to take automatic action to alleviate stalled XCF group members that are affecting other members. XCF stops the stalled member if the condition is severe enough that manual intervention or automation cannot respond to the situation.
An isolated member is a target member whose messages are being rejected or delayed by XCF. In this instance, the member is message isolated.
The isolation window is the period during which a target member is message isolated. For more information, see message IXC638I in 2.3, “Message isolation”.
XCF delays (internally) or rejects incoming message signals that target an isolated member. However, message communication with non-isolated members continues.
Delayed messages: The sending member does not need to retry the message.
If XCF accepts an incoming message signal but the target member is isolated before the signal can be transferred to the target system, the message is held. This situation is an example of a message signal being delayed rather than rejected. Delayed message signals are held by the sending XCF until the target member is no longer isolated or the sending member cancels them (for example, because a timeout expires).
However, if the message times out before it can be sent, it can be argued that the delay was transformed into a reject. Through the notify exit, XCF tells the application that the message was not sent. At that point, the application might decide to retry the send.
Rejected messages: The sending member must retry the message or give up. XCF passes the reason for the rejection to the sending member. The reason can be one of the following:
 – By default, “No buffer”
 – The sending member can optionally request in advance a unique reason code of isolated instead of no buffer.
In such a circumstance, the sending member must specify MSGISO=MSGORSN when issuing the IXCJOIN macro to join the group. The IXCMSGO macro and the IXCMSGOX macro interface can then return this new isolated reason code.
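The retry decision that a sending member faces can be sketched as follows. This sketch is an illustration only and does not call XCF: the send callable is a hypothetical stand-in for an IXCMSGO or IXCMSGOX request, the SendResult names are invented for this example, and the attempt count and wait time are arbitrary assumptions. The sketch retries only when the isolation-specific result (return code 0C, reason code 3C) is reported, which presumes that the member joined with MSGISO=MSGORSN.

```python
import time
from enum import Enum
from typing import Callable


class SendResult(Enum):
    """Hypothetical outcomes of a send request (names are assumptions)."""
    ACCEPTED = "accepted"
    TARGET_ISOLATED = "target isolated"  # return code 0C, reason code 3C
    NO_BUFFER = "no buffer"


def send_with_retry(
    send: Callable[[], SendResult],
    max_attempts: int = 3,
    wait_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> SendResult:
    """Retry while the target is reported as isolated; return the final result."""
    result = send()
    attempts = 1
    while result is SendResult.TARGET_ISOLATED and attempts < max_attempts:
        sleep(wait_seconds)  # allow time for the condition to clear
        result = send()
        attempts += 1
    return result


if __name__ == "__main__":
    # Simulated target that is isolated for the first two attempts.
    outcomes = iter([SendResult.TARGET_ISOLATED,
                     SendResult.TARGET_ISOLATED,
                     SendResult.ACCEPTED])
    final = send_with_retry(lambda: next(outcomes), wait_seconds=0.01)
    print(final)
```

A no buffer result is returned to the caller without a retry in this sketch, because with the distinct reason code requested it points to a different condition that the caller might want to handle separately.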
When the z/OS V2R2 enhancements are aligned with refined automation, the affected sending members might see an increase in the number and frequency of rejects, delays, or timeouts. Therefore, monitor activity levels and refine the processes to identify what circumstances must be evident for the appropriate actions to be taken.
The suggested approach is to consider the following tasks:
Identify or predict actions that are likely to result in a member being stalled.
Assess the potential effect a stalled system might have on its peers.
Review and adjust parameters to help maintain availability and performance.
2.4 Messages that show the isolation effect
In Example 2-1, the set of messages shows an affected sending member. It identifies the group and member name that is causing the effect and the time of the occurrence. In the second set of messages in the example, you can see information about the times of resumption, delay, and rejection.
Example 2-1 IXC637I messages showing a summary effect
IXC637I GROUP A0000000 MEMBER SY2 JOB XCAT0C01 ASID 0025
MEMTOKEN 03000008 00150001 ON SYSTEM SY2 ISO#: 3.1
MESSAGE ISOLATION IMPACT FOR SYSTEM SY1 RPT#: 1
IMPACTED : 02/02/2015 17:28:46.515464 SEQ#: 1
RESUMED : SEQ#: 0
DELAYED : #MSG: 0
REJECTED : 02/02/2015 17:28:47.023326 #MSG: 977
*IXC440E SYSTEM SY1 IMPACTED BY ISOLATED XCF GROUP MEMBERS ON SYSTEM SY2
 
IXC637I GROUP A0000000 MEMBER SY2 JOB XCAT0C01 ASID 0025
MEMTOKEN 03000008 00150001 ON SYSTEM SY2 ISO#: 3.1
MESSAGE ISOLATION IMPACT FOR SYSTEM SY1 RPT#: 2
IMPACTED : 02/02/2015 17:28:46.515464 SEQ#: 1
RESUMED : 02/02/2015 17:28:58.285585 SEQ#: 1
DELAYED : 02/02/2015 17:28:54.077545 #MSG: 15300
REJECTED : 02/02/2015 17:28:49.164130 #MSG: 5100
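For automation or reporting, the IXC637I report can be reduced to a small record. The following sketch parses the second report from Example 2-1. The function name and the dictionary keys are assumptions, and the patterns are derived only from the layout that is shown in this chapter, so real output with different spacing or prefixes might need adjustments.

```python
import re

# Second report from Example 2-1, copied as shown in this chapter.
REPORT_2 = """\
IXC637I GROUP A0000000 MEMBER SY2 JOB XCAT0C01 ASID 0025
MEMTOKEN 03000008 00150001 ON SYSTEM SY2 ISO#: 3.1
MESSAGE ISOLATION IMPACT FOR SYSTEM SY1 RPT#: 2
IMPACTED : 02/02/2015 17:28:46.515464 SEQ#: 1
RESUMED : 02/02/2015 17:28:58.285585 SEQ#: 1
DELAYED : 02/02/2015 17:28:54.077545 #MSG: 15300
REJECTED : 02/02/2015 17:28:49.164130 #MSG: 5100
"""


def parse_ixc637i(text: str) -> dict:
    """Extract group, member, impacted system, report number, and counts."""
    header = re.search(r"IXC637I\s+GROUP\s+(\S+)\s+MEMBER\s+(\S+)", text)
    impact = re.search(r"IMPACT FOR SYSTEM\s+(\S+)\s+RPT#:\s*(\d+)", text)
    if header is None or impact is None:
        raise ValueError("text does not look like an IXC637I report")
    # The date and time are optional on these lines (see the first report).
    counts = dict(
        re.findall(r"^\s*(DELAYED|REJECTED)\s*:.*?#MSG:\s*(\d+)", text, re.M)
    )
    return {
        "group": header.group(1),
        "member": header.group(2),
        "impacted_system": impact.group(1),
        "report": int(impact.group(2)),
        "delayed": int(counts.get("DELAYED", 0)),
        "rejected": int(counts.get("REJECTED", 0)),
    }


if __name__ == "__main__":
    print(parse_ixc637i(REPORT_2))
```

Comparing the counts across successive report numbers gives a simple view of whether the impact on the sending system is growing or was resolved.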
In Example 2-2, the set of messages shows an isolated member. It identifies the group name and member name that are suffering from the isolation and the time of the occurrence. In the second set of messages in this example, you can see the time of resumption.
Example 2-2 Messages showing an isolated receiver member
IXC638I GROUP A0000000 MEMBER SY2 JOB XCAT0C01 ASID 0025
MEMTOKEN 03000008 00150001 ON SYSTEM SY2 ISO#: 3.1
MESSAGE ISOLATION STATUS FOR SYSTEM SY2 RPT#: 1
ISOLATED : 02/02/2015 17:28:44.644972 SEQ#: 1
RESUMED : SEQ#: 0
DELIVERYQ : 02/02/2015 17:28:38.855734 #MSG: 5084
LAST MSGX : SEQ#: 17
*IXC645E SYSTEM SY2 HAS ISOLATED XCF GROUP MEMBERS
 
IXC638I GROUP A0000000 MEMBER SY2 JOB XCAT0C01 ASID 0025
MEMTOKEN 03000008 00150001 ON SYSTEM SY2 ISO#: 3.1
MESSAGE ISOLATION STATUS FOR SYSTEM SY2 RPT#: 2
ISOLATED : 02/02/2015 17:28:44.644972 SEQ#: 1
RESUMED : 02/02/2015 17:28:58.285542 SEQ#: 1
DELIVERYQ : #MSG: 0
LAST MSGX : 02/02/2015 17:28:58.285631 SEQ#: 5101
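The length of the isolation window can be derived from the ISOLATED and RESUMED time stamps in the second IXC638I report. The following sketch performs that subtraction. The function name is an assumption, and the format string assumes that the dates are written as month/day/year, which the sample (02/02/2015) cannot confirm. For the values in Example 2-2, the window is approximately 13.6 seconds.

```python
from datetime import datetime

# Assumed layout of the time stamps: month/day/year with microseconds.
STAMP = "%m/%d/%Y %H:%M:%S.%f"


def window_seconds(isolated: str, resumed: str) -> float:
    """Return the length of the isolation window in seconds."""
    start = datetime.strptime(isolated, STAMP)
    end = datetime.strptime(resumed, STAMP)
    return (end - start).total_seconds()


if __name__ == "__main__":
    # ISOLATED and RESUMED values from the second report in Example 2-2.
    print(window_seconds("02/02/2015 17:28:44.644972",
                         "02/02/2015 17:28:58.285542"))
```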
Figure 2-6 shows the output of the DISPLAY XCF,GROUP,grpname,membername command at the z/OS V2R2 level. The output includes the new fields about isolated and affected members and their respective groups.
Figure 2-6 Output of the command D XCF,GROUP,grpname,member
2.5 Migration and coexistence considerations
The new enhancements apply only to systems that are running z/OS V2R2. However, consider the following points:
They do apply to local message traffic, though issues are seldom seen there.
We are not likely to see any new behavior until there are at least two systems running z/OS V2R2 in the sysplex.
Initially loading a system with z/OS V2R2 activates the new behavior. However, when that system communicates with a system at an earlier release, the old behaviors apply and no benefit is derived.
Systems at earlier releases do not require any compatibility support.
With z/OS V2R2, XCF might now selectively indicate no buffer for messages that are targeted to an isolated member.
Some XCF users issue messages to complain when their msgout request is rejected for a no buffer condition.
In the past, you might then look at your MAXMSG specifications. But with z/OS V2R2, those user messages might be the result of the target member being message isolated.
Therefore, with z/OS V2R2, you must first look to see whether message isolation might apply.
XCF query services (and therefore measurement products, such as RMF) indicate only “no buffer” for true MAXMSG constraints.