Flood event detection
This scenario demonstrates how automations in IBM Netcool Operations Insight can detect and manage event floods. It shows how the event visualization tools can graphically depict the flood by using gauges, maps, and monitor boxes on dashboards.
Probes can be coded to discard events or send them to a backup ObjectServer if a flood occurs. You can see how network operators can drill into the events that are causing the flood by using the Active Event List, from which tools can be started to instruct network engineers to resolve the underlying problems in the infrastructure.
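This chapter includes the following topics:
7.1 Scenario description
7.2 Scenario topology
7.3 Scenario steps
7.4 Using launch-in-context tools from Netcool Web GUI
7.5 Summary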
7.1 Scenario description
This scenario describes how the event flood automations, which are included with the ObjectServer, can be used to automatically detect an event flood and prevent both the system and the operators in the Operations Center from being overwhelmed by it. It also describes how customized dashboards can warn the Operations Center manager that a flood is occurring. The event search integration with Log Analysis can be used to retrospectively analyze an event flood and isolate the cause.
You can see how network operators can drill into the events that are causing the flood by using the Active Event List, from which tools can be started to instruct network engineers to resolve the underlying problems in the infrastructure. You can also see how an administrator can retrospectively review the events via the Log Analysis tool and, by categorizing the event data, determine the root cause and effect of the events.
In this scenario, we introduce a fictional organization, Company C, that uses IBM Netcool Operations Insight to manage a globally distributed network of data centers. A change that is made to the configuration of one of the devices causes it to malfunction. As a result, many monitored systems and services go offline. Jane, one of the network operators in Company C, is assigned to work on the problem.
7.1.1 Business value
A misconfigured network device causes a flood of events. This flood can put a high load on the monitoring infrastructure and overwhelm operators with events, which affects the Operations Center’s ability to monitor and deal with other events that occur in the infrastructure.
The centralized flood detection and management functionality of Netcool/OMNIbus can reduce the effect of the flood on operators who are monitoring the event management system. It also can provide out-of-band accelerated notification of key events during the flood and provide the tools that enable operators to take decisive action to fix the problem.
7.2 Scenario topology
System components and default settings in the test environment are described in Chapter 1, “IBM Netcool Operations Insight overview” on page 3. The solution that is used in this scenario includes the following system components that are installed in the IBM test environment:
Web GUI:
LogAnalysis server
WebGUI/ObjectServer/Gateway/Probe
The following steps were used to set up the scenario:
1. Set up a simnet probe to issue a background stream of events at four events per second.
2. Set up the syslog probe to receive events from the test cell. The test cell includes misconfigured devices and generates a high number of events.
3. Switch on the syslog probe for a short period.
4. Use the Log Analysis user interface to find the spike in events and report on the categories of events that were observed.
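The spike that is looked for in the last setup step can be illustrated with a small sketch. It buckets event time stamps into one-minute intervals and flags the minutes whose count is well above the background stream of four events per second. The function names and the `factor` threshold are assumptions made for this illustration; they are not part of the Log Analysis product.

```python
from collections import Counter

# From the setup steps: the simnet probe emits four events per second.
BACKGROUND_PER_MINUTE = 4 * 60


def events_per_minute(timestamps):
    """Count events per one-minute bucket (time stamps are in seconds)."""
    return Counter(int(ts // 60) for ts in timestamps)


def spike_minutes(timestamps, factor=2.0):
    """Return the minute indexes whose event count exceeds
    factor * background rate. The factor is a hypothetical threshold."""
    counts = events_per_minute(timestamps)
    return sorted(
        minute
        for minute, count in counts.items()
        if count > factor * BACKGROUND_PER_MINUTE
    )
```

With three minutes of background traffic and a short burst from the test cell in the second minute, only that minute is reported as a spike.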
7.3 Scenario steps
This section describes the process that is used to manage the event flood issue.
Tivoli Netcool/OMNIbus includes a set of resources that you can use to extend the product with event flood detection. These resources also warn of event floods and inform users of actions that can be taken, as required, to prevent abnormal behavior from affecting the entire Netcool Operations Insight system. The customization is added to the probe rules file and the target ObjectServers.
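The general idea behind this kind of flood handling can be sketched in a few lines of Python. This is only a conceptual illustration, not the rules file or automation code that is supplied with Netcool/OMNIbus: the class name, the thresholds, the window length, and the numeric severity scale are all assumptions. The sketch counts arrivals in a sliding window, enters flood mode when the count passes a start threshold, leaves it when the count falls below a lower end threshold, and reroutes lower severity events while flood mode is active.

```python
from collections import deque
from dataclasses import dataclass, field

# Hypothetical values; the chapter does not state the real thresholds.
FLOOD_START_RATE = 50          # events per window that activate flood control
FLOOD_END_RATE = 20            # events per window that deactivate it
WINDOW_SECONDS = 10            # length of the sliding window
MIN_SEVERITY_DURING_FLOOD = 4  # assumed scale where 5 is critical


@dataclass
class FloodController:
    in_flood: bool = False
    _arrivals: deque = field(default_factory=deque)

    def route(self, timestamp: float, severity: int) -> str:
        """Return 'primary' or 'backup' for one incoming event."""
        self._arrivals.append(timestamp)
        # Forget arrivals that have left the sliding window.
        while self._arrivals and self._arrivals[0] <= timestamp - WINDOW_SECONDS:
            self._arrivals.popleft()
        rate = len(self._arrivals)

        # Two thresholds give hysteresis, so the state does not flap.
        if not self.in_flood and rate >= FLOOD_START_RATE:
            self.in_flood = True
        elif self.in_flood and rate <= FLOOD_END_RATE:
            self.in_flood = False

        if self.in_flood and severity < MIN_SEVERITY_DURING_FLOOD:
            return "backup"
        return "primary"
```

Using a lower threshold to end flood control than to start it is a common design choice for this kind of detector, because it keeps the rerouting from switching on and off repeatedly while the event rate hovers near a single limit.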
The following process is used:
1. A device’s configuration is changed, which causes the device to malfunction. As a result, many monitored systems and services suddenly go offline. This issue results in approximately 2,000 critical outage events being sent to the operator.
2. The event management infrastructure is subjected to an event flood. The Netcool Operations Insight event flood automations detect the event flood and then centrally manage it by using bidirectional probe communication to implement flood management policies in the probe.
3. As the flood starts, the event rate starts to climb and the flood is detected. The ObjectServer then instructs the probe to reroute lower severity events, which reduces the number of events that operators see.
4. On the dashboard, Jane can see the event rate gauge climb as more events are received until it is close to the red (critical) section, as shown in Figure 7-1.
Figure 7-1 NOI KPI dashboard
5. Jane reviews the Active Event List (AEL) dashboard. By using this dashboard, she can view event details and run context-sensitive tools on events. She double-clicks the alert record to display details, as shown in Figure 7-2.
Figure 7-2 Active Event List
6. The MTTrapd probe informs Jane that flood control has ended and that events are no longer rerouted.
After approximately 10 minutes, the event rate gauge settles down at around 560 events per minute, as shown in Figure 7-3.
Figure 7-3 Last Minute event rate drops down
7. Jane switches to Event Dashboard. The network monitor window contains critical event alerts, as shown in Figure 7-4.
Figure 7-4 Event Dashboard
When IBM Operations Analytics - Log Analysis is integrated with IBM Tivoli Netcool/OMNIbus, its text analytics features can be used to find patterns and trends in event data. With the integration of these two products, historical and real-time event data from IBM Tivoli Netcool/OMNIbus can be viewed and searched in the IBM Operations Analytics - Log Analysis user interface.
IBM Operations Analytics - Log Analysis parses event data into a format suitable for searching and indexing. The event data is transferred from IBM Tivoli Netcool/OMNIbus to IBM Operations Analytics - Log Analysis by the IBM Tivoli Netcool/OMNIbus Message Bus Gateway.
8. Jane opens the IBM Operations Analytics - Log Analysis console to see the event rate chart. She starts the Event Trend By Severity view by clicking Search Dashboards → OMNIbusInsightPack → Event Analysis and Reduction → Event Trend By Severity from the top left menu, as shown in Figure 7-5.
Figure 7-5 Event Trend by Severity menu
She sees graphs that are similar to the graphs that are shown in Figure 7-6.
Figure 7-6 Event trends
9. To get into event details, Jane double-clicks a peak on the chart. The view changes. A “Discovered Patterns” menu is shown in the lower left part of the window.
10. Jane selects the node host name and drills down to see the events that contain that host name. By carrying out this search further, she sees the events that show where the problem is occurring. She clicks the NOTPubType tab and sees the devices or locations that are responsible for the flood, as shown in Figure 7-7.
Figure 7-7 Event details
11. Jane clicks the small grid icon (as shown in the red box in Figure 7-7).
12. The window changes to a grid view. Jane clicks the Node column. She can see that the “May” node is problematic, as shown in Figure 7-8.
13. Jane clicks the Chart icon that can be seen in the upper right corner (see the red box in Figure 7-8).
Figure 7-8 Alert group report
14. She makes sure that the Generate Count option is selected and then clicks the Plot Chart (All Data) button, as shown in Figure 7-9.
Figure 7-9 Generating plot with all data chart
15. She selects the spanner icon on the right side of the chart. She then clicks Chart type and sees a drop-down menu of the charts to create. Now, she can experiment with chart types to get meaningful data; for example, the bubble chart that is shown in Figure 7-10.
Figure 7-10 Bubble chart visualization
This chart shows that the event flood was caused by a CHASSIS warning on a node that is named “May”.
16. The new chart shows the relative quantity of each event severity for this node. Jane hovers over a data series in the chart. A tool-tip shows the actual event count for the corresponding severity.
17. It is clear to Jane that such a misconfiguration is a significant problem. She immediately dispatches an engineer to correct this problem. With the misconfiguration corrected, the monitored devices and services all auto-clear and the event flood is halted.
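The grouping that Jane performs in the grid and chart views amounts to counting events per node and per severity. The following sketch shows that aggregation on event records that are represented as plain dictionaries. The record layout and the function names are assumptions for illustration; the Log Analysis user interface performs the equivalent counting when the Generate Count option is selected.

```python
from collections import Counter


def count_by_node_and_severity(events):
    """Count events for each (Node, Severity) pair."""
    return Counter((event["Node"], event["Severity"]) for event in events)


def noisiest_node(events):
    """Return (node, count) for the node with the most events, or None."""
    per_node = Counter(event["Node"] for event in events)
    return per_node.most_common(1)[0] if per_node else None
```

Applied to the events from the flood window, the first function yields the per-severity counts that the tooltip shows for a node, and the second identifies the node that dominates the flood.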
7.4 Using launch-in-context tools from Netcool Web GUI
The integration of IBM Tivoli Netcool/OMNIbus with IBM Operations Analytics - Log Analysis also provides right-click tools in Web GUI. These tools are for users to access from the Active Event List (AEL) or the Event Viewer within the Web GUI component of IBM Tivoli Netcool/OMNIbus.
The automated search that is implemented with the launch-in-context tools uses the FirstOccurrence time stamp in the event record as the basis of the search. FirstOccurrence is used because the tools are designed to find other events, not the event that is used as the basis for the search. The search criterion is designed to look for events with a time stamp that is less than the FirstOccurrence. This feature eliminates the possibility of finding the event that is used to start the search.
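The effect of searching strictly before FirstOccurrence can be shown with a short sketch. The function names are hypothetical and the code is not taken from the launch-in-context tools; it only illustrates a window whose upper bound is exclusive, so the event that started the search never matches.

```python
from datetime import datetime, timedelta


def search_window(first_occurrence: datetime, lookback: timedelta):
    """Return (start, end) for the search; end equals FirstOccurrence."""
    return first_occurrence - lookback, first_occurrence


def in_window(timestamp: datetime, window) -> bool:
    """True if start <= timestamp < end (the upper bound is exclusive)."""
    start, end = window
    return start <= timestamp < end


# Example: a search that covers one day before the selected event.
selected = datetime(2015, 3, 1, 12, 0, 0)  # hypothetical FirstOccurrence
window = search_window(selected, timedelta(days=1))
```

An event that is time stamped one hour before the selected event falls inside this window, whereas the selected event itself does not.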
Complete the following steps to use the launch-in-context tools from the Netcool Web GUI:
1. Log in to Dashboard Application Services Hub as ncouser.
2. Click the flag icon and select Event Viewer, as shown in Figure 7-11.
Figure 7-11 Selecting the Event Viewer option
When a problem with a device is investigated, one question that often comes to mind is whether other devices are experiencing the same or similar issues. The Search for similar events right-click tool is designed for this scenario.
3. Use the tool to find all devices with Critical Problems that are affecting the application. Locate a Critical event with APP1 in the AlertGroup field. Click the event to select it. Right-click and select Event Search → Search for similar events → 1 day before event, as shown in Figure 7-12.
Figure 7-12 Search for similar events function
A new browser tab opens. You are logged in to Operations Analytics - Log Analysis and a search starts. After a short time, the results open in the window, as shown in Figure 7-13.
Figure 7-13 Results window
The following important points are shown in Figure 7-13:
The search text is configured based on values for the AlertGroup, Type, and Severity event columns. The values are extracted from the event record.
The time range is defined based on the value of FirstOccurrence that is extracted from the event record.
From the text analytics, we can see that the application experienced Network and Virtual Server problems on five nodes.
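How a search text might be assembled from the AlertGroup, Type, and Severity columns is sketched below. The `column:"value"` form joined with AND is an assumed, generic query style that is used here only for illustration; the exact text that the tool generates is the one that is visible in the results window. The sample Type value is also hypothetical.

```python
# Columns that the search text is based on, as listed above.
SEARCH_COLUMNS = ("AlertGroup", "Type", "Severity")


def quote(value) -> str:
    """Wrap a value in double quotation marks, escaping embedded ones."""
    text = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{text}"'


def build_search_text(event: dict) -> str:
    """Build one search clause per column from the selected event record."""
    return " AND ".join(
        f"{column}:{quote(event[column])}" for column in SEARCH_COLUMNS
    )


# Hypothetical event record; only the three search columns are used.
example_event = {
    "AlertGroup": "APP1",
    "Type": "Problem",
    "Severity": "Critical",
    "Node": "May",
}
search_text = build_search_text(example_event)
```

Because the node name is not part of the search text, the search matches similar events on any device, which is what allows it to reveal other nodes with the same kind of problem.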
7.5 Summary
There are instances when a failure can lead to a flood of events. An example of this issue can be an air conditioner failing in a specific data center, which can lead to many events from systems in this data center that can fail because of overheating.
The centralized flood detection and management functionality of Tivoli Netcool/OMNIbus can reduce the effect of the flood on operators who are monitoring the event management system. This example demonstrates how an operator, who often might take hours to manually review many hundreds of events, quickly found the root cause of the major outage within a few mouse clicks.
The ability to perform in-context keyword searches of the entire event history within a specified time window by using the Log Analysis tool allows an operator to distill down, summarize, and make sense of vast quantities of event data.
 
Notes: The Tivoli Netcool/OMNIbus includes a set of resources that you can use to extend the product to include event flood detection. For more information, see the following articles in the IBM Knowledge Center:
Detecting event floods and anomalous event rates:
Protecting the ObjectServer against event floods:
Extending the functionality of Tivoli Netcool/OMNIbus:
 
 