Visual correlation and shared influencers

Alongside the job for the transactions processed KPI (in which an unexpected dip occurred), the other three ML jobs (for the network metrics, the application logs, and the SQL database metrics) were superimposed on the same time frame in the Anomaly Explorer. The following screenshot shows the result:

Notice that while the KPI was exhibiting problems on February 8th, 2017, the three other jobs also showed correlated anomalies (see the vertical stripe of significant anomalies in the annotated red circle across all four jobs). Upon closer inspection (by clicking on the red tile for the it_ops_sql job), you can see that several of the SQL Server metrics were going haywire at the same time:

Notice that the gray-shaded area of the thumbnail charts is highlighting the window of time associated with the width of the selected red tile in the preceding swim lane. This window of time might be larger than the bucket span of the analysis (as is the case here) and therefore the gray-shaded area can contain many individual anomalies during that time frame.
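To make the relationship between a swim lane tile and the individual anomalies more concrete, here is a minimal sketch. It is not the Anomaly Explorer's implementation: the function name tile_anomalies, the timestamps, the scores, and the one-hour tile width are all hypothetical, and the sketch assumes a tile is summarized by the highest score found inside it.

```python
from collections import defaultdict

def tile_anomalies(anomalies, tile_seconds):
    """Group (timestamp, score) pairs into fixed-width tiles.

    Hypothetical helper for illustration only. Each tile reports how many
    anomalies fell inside it and the highest score among them.
    """
    tiles = defaultdict(list)
    for timestamp, score in anomalies:
        # Round the timestamp down to the start of its tile
        tile_start = timestamp - (timestamp % tile_seconds)
        tiles[tile_start].append(score)
    return {
        start: {"count": len(scores), "max_score": max(scores)}
        for start, scores in sorted(tiles.items())
    }

# Made-up anomalies from a job with a 15-minute (900 s) bucket span
anomalies = [(0, 10.0), (900, 75.0), (1800, 92.0), (3600, 5.0)]

# A one-hour tile is wider than the bucket span, so it can hold several anomalies
tiles = tile_anomalies(anomalies, tile_seconds=3600)
```

In this made-up data, the first tile covers three separate anomalies, which is why selecting a single tile can surface many anomalies in the charts.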

If we look at the anomalies in the ML job for the application log, there is an influx of errors all referencing the database (further corroborating an unstable SQL Server):

However, interesting things were also happening on the network:

Specifically, there was a large spike in network traffic (shown by the Out_Octets metric), and a sharp spike in packets being dropped at the network interface (shown by the Out_Discards metric).

At this point, there was a clear suspicion that this network spike might have something to do with the database problem. And, while correlation is not always causation, it was enough of a clue to entice the operations team to look back over some historical data from prior outages. On every previous occurrence of the outage, the same pattern of a large network spike and packet drops was also present.

The ultimate cause of the network spike was VMware moving VMs to new ESX servers. Someone had misconfigured the network switch, and VMware was sending this massive burst of traffic over the application VLAN instead of the management VLAN. When this occurred (randomly, of course), the transaction processing app would temporarily lose its connection to the database and attempt to reconnect. However, there was a critical flaw in the reconnection code: it did not attempt the reconnection at the remote IP address that belonged to SQL Server. Instead, it attempted the reconnection to localhost (IP address 127.0.0.1), where, of course, there was no such database. The clue to this bug was seen in one of the example log lines that ML displayed in the examples column (circled in the following screenshot):

Once the problem occurred, the connection to the SQL Server was therefore only possible if the application server was completely rebooted, the startup configuration files were re-read, and the IP address of SQL Server was relearned. This was why a full reboot always fixed the problem.
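The kind of flaw described here can be illustrated with a small sketch. This is not the application's actual code, which is not shown in this case study; the class DbClient, its method names, and the address 10.0.0.5 are hypothetical and exist only to contrast a reconnect path that falls back to the loopback address with one that reuses the configured address.

```python
from dataclasses import dataclass

@dataclass
class DbClient:
    """Hypothetical client that records which host it would connect to."""
    configured_host: str      # assumed to be read from startup configuration
    current_host: str = ""

    def connect(self) -> str:
        # Initial connection at startup uses the configured address
        self.current_host = self.configured_host
        return self.current_host

    def reconnect_buggy(self) -> str:
        # Flawed path: ignores the configured address and targets localhost
        self.current_host = "127.0.0.1"
        return self.current_host

    def reconnect_fixed(self) -> str:
        # Corrected path: reuse the address learned at startup
        self.current_host = self.configured_host
        return self.current_host

client = DbClient(configured_host="10.0.0.5")  # made-up SQL Server address
startup_target = client.connect()
buggy_target = client.reconnect_buggy()
fixed_target = client.reconnect_fixed()
```

With the buggy path, only a fresh call to connect (the equivalent of a reboot in this sketch) restores the correct target, which mirrors why a full restart always appeared to fix the problem.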

One key thing to notice is how the influencers in the UI also help narrow down which entity is at fault for the anomalies:

The top scoring influencers over the time span selected in the dashboard are listed in the Top Influencers section on the left. For each influencer, the maximum influencer score (in any bucket) is displayed, together with the total influencer score over the dashboard time range (summed across all buckets). And, if multiple jobs are being displayed together, then those influencers that are common across jobs have higher sums, thus pushing their ranking higher.

This is a key point, because it makes it easy to see commonalities in offending entities across jobs. If esxserver1.acme.com is the only physical host that surfaces as an influencer when viewing multiple jobs, then we immediately know which machine to focus on; we know it is not a widespread problem.
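The ranking logic described above (a maximum score per influencer, plus a sum across all buckets and jobs) can be sketched in a few lines. This is only an illustration of the arithmetic, not the UI's implementation: the function rank_influencers, the record layout, the scores, the job names other than it_ops_sql, and the host sqlhost2.acme.com are all made up.

```python
from collections import defaultdict

def rank_influencers(records):
    """Return (name, max_score, total_score) tuples, highest total first.

    Hypothetical helper: each record is one influencer score from one
    bucket of one job.
    """
    max_score = defaultdict(float)
    total_score = defaultdict(float)
    for record in records:
        name = record["influencer"]
        max_score[name] = max(max_score[name], record["score"])
        total_score[name] += record["score"]
    return sorted(
        ((name, max_score[name], total_score[name]) for name in total_score),
        key=lambda item: item[2],
        reverse=True,
    )

# Made-up scores from a single bucket, viewed across three jobs
records = [
    {"job": "it_ops_kpi", "influencer": "esxserver1.acme.com", "score": 80.0},
    {"job": "it_ops_sql", "influencer": "esxserver1.acme.com", "score": 90.0},
    {"job": "it_ops_network", "influencer": "esxserver1.acme.com", "score": 70.0},
    {"job": "it_ops_sql", "influencer": "sqlhost2.acme.com", "score": 95.0},
]

ranking = rank_influencers(records)
```

In this made-up data, sqlhost2.acme.com has the highest single score, but esxserver1.acme.com appears in all three jobs, so its summed score places it first in the ranking.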
