Statistical influencers

Within the ML job configuration, you can define fields as influencers (sometimes called Key Fields in the user interface). An influencer is a field that describes an entity you'd like to identify as being to blame for an anomaly, or at least as having contributed significantly to it. A field chosen as a candidate influencer doesn't need to be part of the detection logic, although it is natural to also pick the fields that are used as splits. To see how influencers work, let's discuss an example.

Imagine we have data that describes financial transactions at merchants for a family's monthly credit card bill. The data includes the following fields:

  • timestamp
  • merchant
  • purchase_amount
  • location
  • user (the name of the person who invokes the transaction)

Note that a credit card bill for a shared account doesn't actually track the purchasing user, but let's assume it does for the sake of the example.

Let's also assume that we want to find unusually high purchases per merchant. To do so, we would likely create an ML job that uses high_sum(purchase_amount) by merchant (that is, the analysis is split on the merchant field). In this case, which fields would be interesting to choose as influencers? Ideally, the following ones:

  • merchant: Because this is the field the analysis is split on, and naturally we'd like to know which merchant was the most influential
  • location: Was there a specific location in which all (or most) of the anomalous transactions occurred?
  • user: What person, if any, invoked the majority of the transactions that caused the anomaly?

Notice that we purposely did not choose purchase_amount as a candidate influencer. The numerical value of a purchase is essentially arbitrary, so no particular purchase amount is likely to dominate an anomaly.
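
To make the choice of influencers concrete, here is a minimal sketch of what such a job configuration might look like, written as a Python dictionary that is serialized to JSON. The bucket_span value is an assumption made purely for illustration and is not taken from the example. The detector splits on merchant with partition_field_name so that it lines up with the partial anomaly record in this section.

```python
import json

# Sketch of an anomaly detection job body for the credit card example.
# Assumption: bucket_span of one day is illustrative only.
job_config = {
    "description": "Unusually high purchases per merchant (illustrative)",
    "analysis_config": {
        "bucket_span": "1d",  # assumed value, not from the example
        "detectors": [
            {
                "function": "high_sum",
                "field_name": "purchase_amount",
                # Split the analysis so each merchant gets its own baseline.
                "partition_field_name": "merchant",
            }
        ],
        # Candidate influencers: the split field plus two descriptive fields.
        # purchase_amount is deliberately left out.
        "influencers": ["merchant", "location", "user"],
    },
    "data_description": {"time_field": "timestamp"},
}

print(json.dumps(job_config, indent=2))
```

Listing a field under influencers only makes it a candidate. Whether it appears in the results depends on the data behind each anomaly.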

Given the choice of these three influencers, let's now imagine that we've detected an anomalously large amount of money spent at Starbucks for the current billing period. Here's a partial anomaly record for this fictitious example:

... 
          "timestamp": 1514764800000, 
          "partition_field_name": "merchant", 
          "partition_field_value": "Starbucks", 
          "function": "high_sum", 
          "function_description": "high_sum", 
          "typical": [ 
            10.21 
          ], 
          "actual": [ 
            104.52 
          ], 
          "field_name": "purchase_amount", 
          "influencers": [ 
            { 
              "influencer_field_name": "merchant", 
              "influencer_field_values": [ 
                "Starbucks" 
              ] 
            }, 
            { 
              "influencer_field_name": "user", 
              "influencer_field_values": [ 
                "Rich" 
              ] 
            } 
          ], 
... 

See Chapter 6, Alerting on ML Analysis, for information on how to query the anomaly results indices to get anomaly record results.

We can see that over a hundred dollars was spent this month, when usually only about ten dollars is spent. Who is to blame? Who/what are the influencers? They seem to be as follows:

  • merchant=Starbucks: Again, this one is obvious because that's how the analysis was split. It is a bit of a tautology to say that Starbucks is the influencer of this anomaly because the anomaly is already for the merchant Starbucks.
  • user=Rich: In this case, the bulk of the transactions for Starbucks were invoked by Rich. Other family members may also have purchased items from Starbucks during the month, but Rich's transactions dominate the feature that we are analyzing: the amount of money spent.
  • location: In our fictitious example, Rich's transactions at Starbucks occurred in many different locations throughout the month, due to his heavy business travel. As such, no single location dominated, so location does not emerge as an influencer and is not listed in the results.

To summarize, two of the three candidate fields were identified as influencers. Although location was not an influencer in this scenario, choosing it as a candidate still makes sense, and it may prove useful for some future anomaly.
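
As a rough illustration of reading such a record programmatically, the following sketch parses the fields from the partial record above (the elided parts are left out) and reports which candidate fields emerged as influencers. The helper name summarize_influencers and the candidates list are hypothetical names introduced for this example.

```python
import json

# The fields shown in the partial anomaly record above, as a JSON string.
record_json = """
{
  "timestamp": 1514764800000,
  "partition_field_name": "merchant",
  "partition_field_value": "Starbucks",
  "function": "high_sum",
  "function_description": "high_sum",
  "typical": [10.21],
  "actual": [104.52],
  "field_name": "purchase_amount",
  "influencers": [
    {"influencer_field_name": "merchant", "influencer_field_values": ["Starbucks"]},
    {"influencer_field_name": "user", "influencer_field_values": ["Rich"]}
  ]
}
"""


def summarize_influencers(record):
    """Map each influencer field name to the values reported for it."""
    return {
        item["influencer_field_name"]: item["influencer_field_values"]
        for item in record.get("influencers", [])
    }


record = json.loads(record_json)
found = summarize_influencers(record)

# Candidate fields configured on the job (from the example in the text).
candidates = ["merchant", "location", "user"]
not_found = [field for field in candidates if field not in found]

for field, values in found.items():
    print(f"influencer {field}: {', '.join(values)}")
print("candidates that did not emerge:", not_found)
```

A candidate that was not influential is simply absent from the influencers array, which is why location has to be inferred by comparing against the configured candidates.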

It is also key to understand that the process of finding potential influencers happens after ML finds the anomaly. In other words, it does not affect any of the probability calculations that are made as part of the detection. Once the anomaly has been determined, ML will systematically go through all instances of each candidate influencer field and remove that instance's contribution to the data in that time bucket. If, once removed, the remaining data is no longer anomalous, then via counterfactual reasoning, that instance's contribution must have been influential and is scored accordingly (with an influencer_score in the results).
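
The counterfactual idea can be sketched with a toy example. This is not ML's actual algorithm or scoring: the transactions are invented so that they sum to the 104.52 in the record, and the rule "anomalous if the total exceeds three times the typical value" is a hypothetical stand-in for the real probability model. The sketch removes each candidate value's contribution in turn and checks whether what remains would still be anomalous.

```python
from collections import defaultdict

TYPICAL = 10.21          # typical monthly spend, from the example record
THRESHOLD_FACTOR = 3.0   # hypothetical stand-in for the real anomaly test

# Invented transactions for illustration; they total 104.52.
transactions = [
    {"merchant": "Starbucks", "user": "Rich", "location": "Boston", "purchase_amount": 24.00},
    {"merchant": "Starbucks", "user": "Rich", "location": "Denver", "purchase_amount": 23.50},
    {"merchant": "Starbucks", "user": "Rich", "location": "Austin", "purchase_amount": 24.02},
    {"merchant": "Starbucks", "user": "Rich", "location": "Seattle", "purchase_amount": 23.50},
    {"merchant": "Starbucks", "user": "Ann", "location": "Boston", "purchase_amount": 5.25},
    {"merchant": "Starbucks", "user": "Bob", "location": "Boston", "purchase_amount": 4.25},
]


def is_anomalous(total):
    """Toy anomaly test: the total is far above the typical value."""
    return total > THRESHOLD_FACTOR * TYPICAL


def find_influencers(rows, candidate_fields):
    """Return (field, value) pairs whose removal makes the anomaly vanish."""
    total = sum(row["purchase_amount"] for row in rows)
    if not is_anomalous(total):
        return []  # nothing to explain
    found = []
    for field in candidate_fields:
        # Sum each value's contribution to the bucket for this field.
        contribution = defaultdict(float)
        for row in rows:
            contribution[row[field]] += row["purchase_amount"]
        for value, amount in contribution.items():
            # Counterfactual check: is the remainder still anomalous?
            if not is_anomalous(total - amount):
                found.append((field, value))
    return found


influencers = find_influencers(transactions, ["merchant", "location", "user"])
print(influencers)
```

With this invented data, removing Rich's purchases leaves only 9.50, which is no longer anomalous, while removing any single location still leaves a large total. That mirrors why user=Rich emerges as an influencer and location does not. The real analysis also assigns each influencer an influencer_score, which this sketch does not attempt to reproduce.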

More important still is how we're going to leverage these influencers when viewing the results of not just a single ML job, but potentially several related jobs. Let's move on and discuss the process of grouping and viewing jobs together to assist with root cause analysis.
