Basis for proactive Fault Management

For TOGAF architects, the preceding diagram will bear some resemblance to generic horizontal layering, similar to the OSI Reference Model, where each layer provides services to the surrounding layers. We exclude some generic enterprise layers in order to focus on the SOA enterprise model as it is realized in OFM. The difference is that OSI mostly depicts seven layers between applications, from one API to another (that is, integration), which is not applicable to service compositions, where an individual service spans several technical and logical layers. We have no intention here of mapping the TOGAF/OSI model to SOA (actually, this theoretical exercise has already been done). The sole purpose of this diagram is to illustrate the KPI-monitoring sources, their types, and input according to rules 5, 9, 14, and 15. WLS is, obviously, the main OFM server, together with the Oracle database (XE, Standard, or Enterprise), so technical monitoring for proactive error handling/prevention will be focused around WLS JMX (green) and database server metrics (red). Functional monitoring is based on the information provided by SOA applications and their APIs (SCA and OSB, yellow flows); generally, that is what we get from BPEL activities such as Audit, Catch, CatchAll, and Throw, and from OSB's Log() action. We have to omit the monitoring of the network infrastructure for brevity; as an enterprise architect, you should keep this in mind.

Technical monitoring for proactive Fault Management

Just declaring the rules and listing the SOA patterns for fault prevention provides you with little practical help. For the green information flows (JMX), we have grouped the core MBeans and their attributes that you should monitor; please see the following table:

Resource category | MBean name | MBean attributes
------------------|------------|-----------------
Threads | MinThreadsConstraintRuntime | OutOfOrderExecutionCount, PendingRequests, CompletedRequests, MaxWaitTime, CurrentWaitTime, ExecutingRequests, DeploymentState, InvocationTotalCount, ExecutionTimeTotal, ExecutionTimeAverage, PoolMaxCapacity, HealthState, ConnectionsCount, MessagesSentCount, ServerConnectionRuntimes, MaxCapacity
Threads | MaxThreadsConstraintRuntime | CurrentCapacity, MaxCapacity, ExecutionTimeTotal, ExecutionTimeAverage, PoolMaxCapacity, HealthState
Threads | ThreadPoolRuntime | PendingUserRequestCount, CompletedRequestCount, ExecuteThreadIdleCount, QueueLength, PoolMaxCapacity, ExecutionTimeHigh, MaxCapacity, JMSThreadPoolSize, MaxMessageSize, DestinationsTotalCount, HoggingThreadCount
JVM | JVMRuntime | HeapFreePercent, JavaVersion, HeapFreeCurrent, HeapSizeMax, HeapSizeCurrent, InvocationTotalCount, ExecutionTimeTotal, ExecutionTimeAverage, ExecutionTimeHigh, MessagesPendingCount, MessagesReceivedCount
JMS | JMSRuntime | JMSServersCurrentCount, JMSServersTotalCount, HealthState, JMSPooledConnections
JMS | JMSDestinationRuntime | MessagesCurrentCount, MessagesPendingCount, MessagesHighCount
Queues | ExecuteQueueRuntime | PendingRequestCurrentCount
JDBC | JDBCServiceRuntime | HealthState, JDBCMultiDataSourceRuntimeMBeans, JDBCDriverRuntimeMBeans, JDBCDataSourceRuntimeMBeans, ConsumersTotalCount, ConsumersCurrentCount, MessagesPendingCount, MessagesReceivedCount, MessagesSentCount
JTA | JTARuntime | TransactionRolledBackTimeoutTotalCount, TransactionRolledBackTotalCount, TransactionAbandonedTotalCount
OFM Server Engine | ServerRuntime | HealthState, State

The preceding attributes have been selected from the thousands that are available on WLS; you can add (or exclude) any MBean attribute as you deem prudent. Please refer to the documentation at http://docs.oracle.com/cd/E12839_01/apirefs.1111/e13951/core/index.html for the meaning and metrics of each of them; also, check for new and deprecated ones (we do not have space for this here). The ones presented here are the most common (actually, they are the ones that we use, and we strongly advise you to do the same); they are sufficient for detecting and reacting to most bottlenecks in the Oracle SOA infrastructure.
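As a minimal sketch of how a few of the thread-related attributes from this table can be read, the following WLST script (run with wlst.sh or wlst.cmd) navigates the AdminServer's Domain Runtime MBean tree; the host, port, credentials, and the managed server name soa_server1 are placeholders you must adjust to your environment:

connect('weblogic', 'welcome1', 't3://adminhost:7001')
domainRuntime()
# Navigate to the ThreadPoolRuntime MBean of the managed server in question
cd('ServerRuntimes/soa_server1/ThreadPoolRuntime/ThreadPoolRuntime')
print 'HoggingThreadCount:      ', get('HoggingThreadCount')
print 'PendingUserRequestCount: ', get('PendingUserRequestCount')
print 'CompletedRequestCount:   ', get('CompletedRequestCount')
print 'QueueLength:             ', get('QueueLength')
disconnect()

The same navigation pattern applies to the JVM, JMS, JDBC, and JTA runtime MBeans listed in the table.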

The way in which you implement the regular polling of these attributes may differ, and it depends on your consolidated KPI solution. If you already use the previously mentioned Nagios, check the Nagios Plugins project and the WLS plugins on Nagios Exchange. Naturally, Oracle has quite a lot of its own tools in addition to the classic Enterprise Manager console on top of the Diagnostic Framework (DFW), and one of the most powerful is the Remote Diagnostic Agent (RDA) utility.

To get RDA, you must have an Oracle support account (the old MetaLink); usually, it is used on one of those unhappy days when you need to generate diagnostic dumps for Oracle's technical specialists, for instance, for OSB: rda.cmd -vSCRP OSB.

So, by default, RDA is a manual tool for collecting static configuration information and runtime statistics. That is not really useful for proactive runtime monitoring and fault prevention, is it? Yes, it is part of the disaster recovery plan (your Orange Book), but we have one good use for tools such as RDA. A typical scenario runs as follows:

  • A new release of the SOA application bundle (SCA and OSB) was delivered from UAT to ORT (the operation readiness environment) after passing all the tests.
  • Performance issues were detected during the Operation Readiness Test (ORT), and some fine-tuning of the WLS/SOA server was applied by the admins after consulting the developers. Traditionally, these last-minute changes were not logged on the ops wiki (that has never happened in your organization, right?).
  • After deployment to Production, considerable changes in performance became obvious. Stabilization attempts had little or no positive effect.
  • Whatever anyone says, you now obviously have three different SOA runtime environments (even assuming that the VMs or physical servers are identical).

Dumping the MBean attributes (as XML, for instance) and automatically comparing the static configurations and runtime metrics gathered under the same load from the servers in question will reveal the differences. With RDA, the problem can be fixed without the traditional ORT <-> Production swap. In any case, creating and keeping a healthy dump as a reference will be a good start for proactive monitoring.
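As a hedged illustration of this reference-dump idea (the server name, MBean paths, and attribute selection below are assumptions you will need to adapt), a short WLST script can write selected runtime attributes to a flat name=value file that is easy to keep under version control and to diff against later dumps:

connect('weblogic', 'welcome1', 't3://adminhost:7001')
domainRuntime()
# MBean paths (relative to the domain runtime tree) and the attributes to capture
watched = {
    'ServerRuntimes/soa_server1/ThreadPoolRuntime/ThreadPoolRuntime':
        ['HoggingThreadCount', 'PendingUserRequestCount', 'QueueLength'],
    'ServerRuntimes/soa_server1/JVMRuntime/soa_server1':
        ['HeapFreePercent', 'HeapSizeCurrent', 'HeapSizeMax'],
}
out = open('soa_server1_runtime_reference.txt', 'w')
for path in watched.keys():
    cd('/' + path)
    for attr in watched[path]:
        out.write('%s/%s=%s\n' % (path, attr, get(attr)))
out.close()
disconnect()

Comparing such dumps taken from ORT and Production under the same load (with an ordinary diff) usually points to the divergent setting much faster than a manual walkthrough.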

Actually, the ability to automatically collect and store diagnostic dumps comes out of the box with SOA Suite; it is provided by the Oracle Diagnostic Framework (DFW). As DFW is primarily related to SOA Suite, it does not cover the WLS configuration and runtime metrics, but it works seamlessly with the WebLogic Diagnostic Framework (WLDF), which supplies DFW with notifications about exceptions. You even get a preconfigured FMWDFW notification in your DFW after the SOA Suite is deployed.

With this, you understand that these dumps must be stored somewhere, and the Automatic Diagnostic Repository (ADR) exists in every managed server in a domain for this purpose (look at <SERVER_HOME>/adr). Bear in mind that ADR is not the SOA Suite log that you usually read first when looking for initial diagnostic records. Oracle Diagnostic Logging (ODL) is the basic and primary means of logging for any OFM application, recording every single step in great detail.

In addition to the traditional Timestamp (actually, several of them), Message ID, and Message text, you get MODULE_ID, THREAD_ID, and PROCESS_ID, which can be correlated with the MBean data acquired using JMX/WLST from the other preventive monitoring flows (see the previous figure). Talking about correlation, what is particularly interesting is that ODL presents us with the Execution Context ID (ECID), a globally unique identifier for a service request and the execution environment handling that request. The main purpose of this ID is to link error messages from different components, and it is a good idea to also use it as the basis for your Business Correlation ID within your SCA. We will show you how to do this a little later.
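For illustration only (the timestamp, component names, thread, and ECID value below are invented), a typical ODL text record looks roughly like this, with the ecid field carrying the correlation identifier across components:

[2013-07-01T12:15:42.123+02:00] [soa_server1] [ERROR] [] [oracle.soa.bpel.engine] [tid: [ACTIVE].ExecuteThread: '12' for queue: 'weblogic.kernel.Default (self-tuning)'] [ecid: 11d1def534ea1be0:7a9b5f8b:13c6a3e1d2f:-8000-0000000000001234,0] [APP: soa-infra] Error while invoking the partner link ...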

What is the role of RDA in this log's life cycle? The latest incidents (the last 10 by default in Version 11.1.1.6) will be collected by RDA and incorporated into the other information gathered using JMX, both static and runtime. Diagnostic dumps and incidents (an incident is a collection of dumps; they are created by DFW, stored in ADR, and aggregated by RDA) can be:

  • Adjusted to be more sensitive to log details, that is, additional details can be collected at a certain level
  • Bundled and packaged for uploading to Oracle support
  • Purged from time to time after they are uploaded for proper analysis

The preceding points are essential tasks for Log Centralization, and Oracle has two additional instruments for them. The first one, Selective Tracing, is available in Version 11.1.1.4; it is a response to one of the major requirements: a low monitoring footprint with an adequate level of tracing. This feature is managed through OEM or, more selectively, through WLST, and the selectivity can be fine-tuned based on any attribute (field) of ODL.

The last two tasks are covered by the Automatic Diagnostic Repository Command Interpreter (ADRCI). Later OFM versions include a Perl-based utility, the so-called Incident Packaging System (IPS), which is capable of packaging offline RDA bundles for uploading; to some extent, it competes with RDA itself.
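As a hedged illustration of those two tasks in ADRCI (the ADR home path and incident number below are placeholders for your environment), a typical interactive session for packaging and purging incidents looks like this:

adrci> show homes
adrci> set homepath diag/ofm/base_domain/soa_server1
adrci> show incident
adrci> ips pack incident 4711 in /tmp
adrci> purge -age 43200 -type incident

The ips pack command creates the ZIP bundle for Oracle Support, and purge -age removes incidents older than the given number of minutes (43200 minutes is 30 days).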

Everything mentioned in the previous points fits very well with Log Centralization, incident management, and SOA governance in general, but RDA, ADRCI, DFW, and even WLDF are tools that are too reactive to be truly preventive. Indeed, aftermath dumps will not contribute much to the pulse monitoring of a running SOA. Yes, that is true, but as an architect you will be aware that these tools work well to advise your ops where to start when it comes to investigating complex composition error scenarios (not just bare metal or bare OFM infrastructure faults). It is also true that all these tools and instruments share one critical part, the main runtime provider of diagnostic information: the Dynamic Monitoring Service (DMS).

DMS is delivered by default in the form of a servlet web app: <ORACLE_HOME>/modules/oracle.dms_<your_version>/dms.war. It can be accessed using http://telco.ctu.com:7778/dms/Spy (set your host and port accordingly). You will find a broad range of parameters, grouped into a collection of noun types with their individual attributes and exposed as MBean instances. DMS presents them at runtime for monitoring, which is performed through the already mentioned WLDF. This monitoring is organized in the form of WLDF watches, which monitor particular DMS attributes and their thresholds. Should we say again how important it is to include all the parameters from the previous table in your monitoring pattern?
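To give an idea of what such a watch looks like, here is a sketch of a Harvester watch rule as it might appear in the watch-notification section of a WLDF module descriptor; the watch name, threshold, managed server name, and notification name are assumptions, and the exact descriptor layout should be verified against your own WLDF module before use:

<watch>
  <name>SoaHoggingThreadsWatch</name>
  <rule-type>Harvester</rule-type>
  <rule-expression>${com.bea:Name=ThreadPoolRuntime,ServerRuntime=soa_server1,Type=ThreadPoolRuntime//HoggingThreadCount} &gt;= 10</rule-expression>
  <notification>FMWDFW notification</notification>
</watch>

Wired to the preconfigured FMWDFW notification mentioned earlier, a watch like this routes the threshold breach to DFW, which then creates the incident described next.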

A notification is fired by a WLDF watch every time a threshold is crossed, and DFW will create an incident in addition to the incidents that DFW can derive from ODL events. You should know that three WLDF watches are configured by default in the SOA Suite: for deadlocks, stuck threads, and unchecked exceptions. These are the bare essentials; please extend the watches according to the information from the previous table. You should also be aware of the latest SOA Suite diagnostic dumps, based on watches for the following:

  • soa.composite.trail: These are notifications from your running composites. They are highly important for dynamically running compositions.
  • soa.config: These are errors in deployment configuration, and they include MDS as well.
  • soa.db: This provides DB information on SOA-Infra DB and its repository.
  • soa.wsdl: This provides information on contracts/endpoints.

The preceding list is not complete. For more information about scopes, errors, and error codes, please refer to http://docs.oracle.com/cd/E17904_01/core.1111/e10113/chapter_ows_messages.htm.

The very quick walkthrough we have just had around the Fusion DFW may seem rather complex. We have tried to simplify the DFW's composite relations at a functional level in the following figure, but it cannot be used as a reference model for operations planning:

Technical monitoring for proactive Fault Management

Actually, the main purpose of the preceding diagram is to demonstrate how all three information flows can be consolidated around the centralized Logs Aggregator and thereby ultimately feed the Automated Recovery Tool. It is also obvious that you have complete freedom to implement your own lightweight JMX monitoring client for preselected parameters in addition to the existing standard OFM DMS. It will also give you an understanding of how a DMS servlet is organized. As it is not the purpose of this book, the MBean attribute-parsing servlet code is omitted for brevity, but you can find perfectly working examples provided by Steven Haines in his book Pro Java EE 5 Performance Management and Optimization, 2006 (yes, this book is probably a bit old, but it is still brilliant and useful).

Just to give you some ideas about the servlet's implementation in conjunction with another tool, keep in mind that different log aggregators have different MEPs around the data flow; for instance, Nagios is generally a puller, so you can even consider converting Steven's JMX servlet into a REST service by providing the MBean attributes as JSON or XML. The proposed servlet has two main parts. The first one is the abstract class where all operations are declared, including the main one, that is, service():

public abstract class AbstractStatsServlet extends HttpServlet {

  public void service( HttpServletRequest req, HttpServletResponse res ) throws ServletException {
    // The MBeanServer was placed in the servlet context at initialization time
    MBeanServer server = ( MBeanServer )this.ctx.getAttribute( "mbean-server" );

    // Ask the Servlet instance for the root of the document
    Element root = this.getPerformanceRoot( server, objectNames );

    // Dump the MBean info, grouped by domain -> type -> mbean
    Element mbeans = new Element( "mbeans" );
    for( Iterator i = objectNames.keySet().iterator(); i.hasNext(); ) {
      String key = ( String )i.next();
      Element domain = new Element( "domain" );
      domain.setAttribute( "name", key );
      Map typeNames = ( Map )objectNames.get( key );
      for( Iterator j = typeNames.keySet().iterator(); j.hasNext(); ) {
        String typeName = ( String )j.next();
        Element typeElement = new Element( "type" );
        typeElement.setAttribute( "name", typeName );
        List beans = ( List )typeNames.get( typeName );
        for( Iterator k = beans.iterator(); k.hasNext(); ) {
          ObjectName on = ( ObjectName )k.next();
          Element bean = new Element( "mbean" );
          bean.setAttribute( "name", on.getCanonicalName() );
          // List the attributes
          if( showAttributes ) {
            try {
              MBeanInfo info = server.getMBeanInfo( on );
              Element attributesElement = new Element( "attributes" );
              MBeanAttributeInfo[] attributeArray = info.getAttributes();
              for( int x = 0; x < attributeArray.length; x++ ) {
                // One <attribute> element per MBean attribute (its creation is elided in the excerpt)
                Element attributeElement = new Element( "attribute" );
                String attributeClass = attributeArray[ x ].getType();
                // Set XML attributes for the class: is-getter, readable, writable
                attributeElement.setAttribute( "description", attributeArray[ x ].getDescription() );
                // ... the rest of the per-attribute handling is elided in the excerpt ...
              }
            // ... catch blocks and the closing of the nested loops are elided in the excerpt ...

    // Output the XML document to the caller
    XMLOutputter out = new XMLOutputter();
    out.output( root, res.getOutputStream() );
    ......
  }

The result is a complete XML document containing your MBeans' attribute tree. Using the domain keys and attribute names, you can filter it or construct just the required part.

This abstract class is extended by the JMX statistic servlet; this is where you connect to your server and gather statistics:

public class StatsServlet extends AbstractStatsServlet
....
      String config = getServletContext().getResource("/WEB-INF/xml/stats.xml").toString();
      SAXBuilder builder = new SAXBuilder();
      Document doc = builder.build( config );
      Element root = doc.getRootElement();
      Element adminServer = root.getChild( "admin-server" );
      String port = adminServer.getAttributeValue( "port" );
      url = "t3://localhost:" + port;
      username = adminServer.getAttributeValue( "username" );
      password = adminServer.getAttributeValue( "password" );

As you can see from Steven Haines' code, the connection is established with the admin server in order to acquire information from the runtime servers. A direct connection to a managed server's MBeans, as depicted in the earlier figure, is not advisable.

Why this is so important is clear from the second table in this chapter, and we advise you to familiarize yourself with Steven Haines' book, and with the Oracle DMS servlet in particular. There are more than 16,000 MBean attributes in WLS, and you have to pick the correct ones from the beginning, understand their roles and relations, and monitor them diligently.

If, for some reason, you have no time for this, but the necessity of configuring Log Centralization/Aggregation using external tools is clear to you (see rule 9), then you can look at open source tools such as Jolokia (http://www.jolokia.org/). Jolokia is a JMX-HTTP connector with adapters for many servers, including WebLogic (9.2.3.0, 10.0.2.0, and 10.3.6.0 at the time of writing this book). Technically, it is the same kind of servlet as the one mentioned earlier, and you should use it together with DMS for complete monitoring.
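As a simple, hedged illustration of why such a connector appeals to pull-based aggregators (the host, port, and managed server name below are placeholders), once the Jolokia agent WAR is deployed, an MBean attribute from our first table can be read with a plain HTTP GET and comes back as JSON:

http://adminhost:7001/jolokia/read/com.bea:Name=ThreadPoolRuntime,ServerRuntime=soa_server1,Type=ThreadPoolRuntime/HoggingThreadCount

The JSON response (roughly {"value":0,"status":200,...}) is trivial to parse in a Nagios check script or any other poller.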

Continuing with rule 5 (Log Centralization), we cannot avoid the discussion, as old as the hills, about the consolidation of security, operational, and business logs. Our advice from Chapter 7, Gotcha! Implementing Security Layers, was to deploy the Oracle API Gateway for a proper implementation of all eight SOA security patterns (for static service deployment and messages in transit). Please see the following sample of Jython code for extracting web service names from the gateway and displaying them on the Nagios dashboard. You can extend it by gathering more service attributes for completeness and better visibility:

# List the web services in a Gateway
from java.lang import Integer
from deploy import DeployAPI
from esapi import EntityStoreAPI
from vtrace import Tracer
import common

t = Tracer(Tracer.INFO)  # Set trace to info level
dep = DeployAPI(common.gw_deployURL, common.defUserName, common.defPassword)
es = dep.getStore('')
webServices = es.getAll('/[WebServiceRepository]name=Web Service Repository/[WebServiceGroup]**/[WebService]')
i = 0
for webService in webServices:
    name = webService.getStringValue('name')
    # Gather here all the statuses for each web service
    # ....
    t.info(Integer.toString(i) + ': ' + name)
    i = i + 1
es.close()

Regarding the consolidation of different log types, we have to stress that so far we have been talking about different technical types from/within the generic SOA technical frameworks (the first and third of the preceding figures). We are quite far from suggesting that you should put all logs in one location (a single DB) and have them monitored from the same dashboard by the same personnel. Looking for an analogy: even if it is possible to put all corporate traffic onto one TCP/IP-based backbone (for cost optimisation, for instance), the security guys and the emergency brigades would never be happy to have fire sensor wires, VIP landline phones, and intrusion detection channels combined with the regular business network (simply put, it's a bit more than just a SPOF).

Similarly, we have not one, not two, but many types of logs, which will be grouped for diligent monitoring by different teams of experts, sometimes quite reasonably separated; please see the following figure. Although it could be annoying, there are strong reasons why SOA business ops cannot have immediate access to Secure Gateway or IDS logs. At the same time, Security and SOA Architects must coordinate their efforts on a regular basis, and what is absolutely unacceptable is when DB logs are kept for the DBAs alone.

Thus, a comprehensive but still minimal model, shown in the following figure, must be designed as the basis; however, it should be carefully adjusted according to your business model and industry policies/regulations (PCI, healthcare, and so on).

Technical monitoring for proactive Fault Management

To conclude, we have to highlight once again the importance of proactive monitoring in adhering to rules 5, 7-10, and 14-15; further on, we will show how it can leverage the implementation of Automated Recovery Tools to fulfil rule 1. Now, we have to spend a little time discussing how Oracle contributes to the first and second lines of exception handling: rules 6 and 11-13.
