© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
K. Feasel, Finding Ghosts in Your Data, https://doi.org/10.1007/978-1-4842-8870-2_12

12. Copula-Based Outlier Detection (COPOD)

Kevin Feasel, Durham, NC, USA

So far, we have looked at clustering-based models for multivariate outlier detection. In this chapter, we will review a novel nonclustering technique that uses a concept called copulas. First, we will define what a copula is and how we can perform outlier detection using it. Then, we will implement application changes to integrate the new technique. After that, we will update our tests and website. Finally, we will wrap up Part III with some concluding notes.

Copula-Based Outlier Detection

In September 2020, Zheng Li and four coauthors released a paper entitled COPOD: Copula-Based Outlier Detection. This paper introduced a novel technique for multivariate outlier detection. Prior to this paper, several popular multivariate outlier detection techniques (including LOF, COF, and LOCI) focused on the idea of clustering: calculating the distances between points and calling out those points that are sufficiently distant from their neighbors. By contrast, COPOD relies on a concept known as a copula.

What’s a Copula?

The term “copula” comes from Latin and means a link or a bond. In language, we use the term “copula” to represent the link between the subject and the predicate, often using a verb like “to be” in English. In statistics, a copula is “a probability model that represents a multivariate uniform distribution, which examines the association or dependence between many variables” (Kenton, 2021).

Breaking that definition into pieces, we start with a dataset containing multiple variables. We may (or may not!) know the distribution of each variable—for example, one variable might follow a normal distribution, another a beta distribution, and a third may be uniformly distributed. So far, we have spoken of these variables as if they were entirely independent, but this is often not a good assumption when working with multivariate data. In practice, there tends to be some joint probability distribution, as inputs often have some influence upon one another. For example, the height, weight, age, blood pressure, and shoe size of a person will be related variables, so treating them as entirely independent will leave out important information. The problem that we quickly hit is, how do we model the joint probability distribution of any pair or combination of these variables? This is where copulas come in.

With a copula, we can decompose a joint probability distribution into its marginal distributions (also known as marginals), separating what each variable does on its own from how the variables move together. The copula is the function that allows us to make this transformation, coupling the individual marginal distributions back into the single joint distribution.
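To build intuition, here is a minimal sketch of the transformation at the heart of copula modeling: the probability integral transform, which maps each variable onto a uniform (0, 1) marginal via its empirical CDF. This example assumes NumPy, and the names are illustrative rather than taken from the book's code:

```python
import numpy as np

rng = np.random.default_rng(42)

# Two correlated variables: x2 partially depends on x1
x1 = rng.normal(size=1000)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=1000)

def to_uniform_marginal(data):
    # Rank-based empirical CDF: each observation maps into (0, 1)
    ranks = np.argsort(np.argsort(data))
    return (ranks + 1) / (len(data) + 1)

u1 = to_uniform_marginal(x1)
u2 = to_uniform_marginal(x2)
# u1 and u2 are each (approximately) uniform on (0, 1); the joint behavior
# of (u1, u2) is the empirical copula, which carries only the dependence
# structure between the two variables
```

Regardless of the original distributions, the transformed marginals look the same; any remaining pattern between u1 and u2 is pure dependence.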

Note

Copula theory is very useful in certain avenues of statistics, but a detailed understanding of the topic is beyond the scope of this book. For an intuitive, visual-heavy approach to understanding copula functions, see Wiecki (2018). For a more detailed survey of copula theory, see Durante and Sempi (2010) as well as Pakdaman (2011). For our purposes, this high-level understanding should suffice.

Intuition Behind COPOD

COPOD takes advantage of copulas to break down the relationships between multiple variables, even in cases in which we do not know the underlying distribution of the individual variables themselves. There are three stages for detecting outliers using COPOD (Li et al., 2). The first stage is to compute the empirical cumulative distribution function (also known as an empirical CDF or ECDF), which describes the probability at each point that a random variable will be less than or equal to that point. In Chapter 3, we introduced the concept of a probability distribution function (PDF), including an image like that in Figure 12-1. To oversimplify the explanation a bit, the height of the curve represents the probability that a random variable drawn from this distribution takes on the value on the X axis.

A line graph with a bell-shaped curve. The line starts at (negative 4, 0.0), rises, peaks at (0, 0.39), and ends at (4, 0.0). Values are estimated.

Figure 12-1

The probability distribution function of a normal distribution

The probability distribution function is a “point” measure. The cumulative distribution function of the same graph, which appears in Figure 12-2, represents a “flow” measure. In other words, what is the probability that we will draw a point from this distribution that is less than or equal to a given value on the X axis?

A line graph. The line starts at (negative 4, 0.00), rises, and ends at (4, 1.00) with an increasing upward trend. Values are estimated.

Figure 12-2

The cumulative distribution function of a normal distribution

These visuals represent different views of the same underlying distribution—the cumulative distribution function is the integral of the probability distribution function over the range negative infinity to x (or, in other words, the probability distribution function is the first derivative of the cumulative distribution function, assuming the CDF is differentiable).
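As a quick sketch (function and variable names here are illustrative), the empirical CDF at a point x is simply the fraction of observed values at or below x:

```python
import numpy as np

def empirical_cdf(sample, x):
    # Fraction of observations less than or equal to x
    sample = np.asarray(sample)
    return np.mean(sample <= x)

data = [2, 4, 4, 5, 7, 9]
print(empirical_cdf(data, 4))  # 0.5, since three of the six values are <= 4
```

No assumption about the underlying distribution is needed, which is exactly why COPOD can work even when we do not know each variable's distribution.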

Once we have the empirical CDF for each variable, the second stage of the process is to determine an empirical copula function that translates the joint probability distributions of variables into marginal distributions.

The final stage of the process uses the empirical copula to approximate the tail probability, which is the probability that a sampled data point would be greater than (for a right tail) or less than (for a left tail) some specified value. We do this for each variable and multiply the probabilities together to create a multivariate tail probability—a product we can form because the copula transformation puts every variable on the same uniform scale, allowing us to treat the dimensions as if they were independent.
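A simplified sketch of the scoring idea follows. This omits COPOD's skewness correction and other details of the real algorithm, so treat it as illustrative only: per dimension, convert each point's left- and right-tail empirical probabilities to negative logs and sum across dimensions, so rare tail values accumulate large scores.

```python
import numpy as np

def copod_style_scores(X):
    """Simplified COPOD-style outlier scores for an (n, d) array.

    Sums negative log tail probabilities per dimension; larger means
    more outlying. Omits the skewness correction the real algorithm uses.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    scores = np.zeros(n)
    for j in range(d):
        col = X[:, j]
        # Empirical left- and right-tail probabilities for each point
        left = np.array([np.mean(col <= v) for v in col])
        right = np.array([np.mean(col >= v) for v in col])
        # Small tail probability -> large -log contribution
        scores += np.maximum(-np.log(left), -np.log(right))
    return scores

X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [10.0, 10.0]])
scores = copod_style_scores(X)
# The extreme point (10, 10) receives the highest score
```

Because every point is less than or equal to itself, the tail probabilities are never zero, so the negative logs are always finite.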

Given a tail probability, Li et al. show how to use this to estimate the likelihood of a data point being an outlier. The authors have also provided an implementation of their algorithm in Python, which they have incorporated into the PyOD library.

Implementing COPOD

Similar to LOCI in Chapter 11, COPOD does not have any required user-defined parameters. We can see this in Listing 12-1 as we implement a COPOD check.
from pyod.models.copod import COPOD

def check_copod(col_array):
  clf = COPOD()
  clf.fit(col_array)
  diagnostics = {
    "COPOD Threshold": clf.threshold_
  }
  return (clf.labels_, clf.decision_scores_, diagnostics)
Listing 12-1

The function to run COPOD testing against our multivariate dataset

COPOD’s implementation is quite fast, so we are safe to run this no matter the number of records. Listing 12-2 shows a simplified version of the run_tests() function that now includes COF, LOCI, and COPOD as algorithms in our ensemble.
def run_tests(df, max_fraction_anomalies, n_neighbors):
  num_records = df['key'].shape[0]
  if (num_records > 1000):
    run_loci = 0
  else:
    run_loci = 1
  tests_run = {
    "cof": 1,
    "loci": run_loci,
    "copod": 1
  }
  diagnostics = {
    "Number of records": num_records
  }
  col_array = df.drop(["key", "vals"], axis=1).to_numpy()
  # COF and LOCI code elided; the elided code initializes anomaly_score
  # with the COF (and, when run, LOCI) scores
  (labels_copod, scores_copod, diag_copod) = check_copod(col_array)
  df["is_raw_anomaly_copod"] = labels_copod
  diagnostics["COPOD"] = diag_copod
  df["anomaly_score_copod"] = scores_copod
  anomaly_score = anomaly_score + scores_copod
  df["anomaly_score"] = anomaly_score
  return (df, tests_run, diagnostics)
Listing 12-2

The run_tests() function now includes COPOD.

Now that we have a third algorithm in our ensemble, we will need to adjust weights to determine our cutoffs. Li et al. suggest using a threshold such as -ln(0.01) ≈ 4.61 (Li et al., 4) per dimension. In our case, we will use a more aggressive score of 2.3 (approximately -ln(0.10)) above the median as our cutoff for COPOD. We can see the final form of determine_outliers(), which incorporates COPOD, in Listing 12-3.
def determine_outliers(
  df,
  tests_run,
  sensitivity_factors,
  sensitivity_score,
  max_fraction_anomalies
):
  tested_sensitivity_factors = {
    sf: sensitivity_factors.get(sf, 0) * tests_run.get(sf, 0)
    for sf in set(sensitivity_factors).union(tests_run)
  }
  median_copod = df["anomaly_score_copod"].median()
  sensitivity_threshold = sum([tested_sensitivity_factors[w] for w in tested_sensitivity_factors]) + median_copod
  diagnostics = { "Sensitivity threshold": sensitivity_threshold, "COPOD Median": median_copod }
  second_largest = df['anomaly_score'].nlargest(2).iloc[1]
  sensitivity_score = (100 - sensitivity_score) * second_largest / 100.0
  diagnostics["Raw sensitivity score"] = sensitivity_score
  max_fraction_anomaly_score = np.quantile(df['anomaly_score'], 1.0 - max_fraction_anomalies)
  diagnostics["Max fraction anomaly score"] = max_fraction_anomaly_score
  if max_fraction_anomaly_score > sensitivity_score and max_fraction_anomalies < 1.0:
    sensitivity_score = max_fraction_anomaly_score
  diagnostics["Sensitivity score"] = sensitivity_score
  return (df.assign(is_anomaly=df['anomaly_score'] > np.max([sensitivity_score, sensitivity_threshold])), diagnostics)
Listing 12-3

The determine_outliers() function now includes calculations for COPOD.

Our calculation for tested_sensitivity_factors remains the same as in Chapter 11, iterating over each test and multiplying the weight by the value (0 or 1) of tests_run.

The next step calculates the median for COPOD. Unlike LOCI and COF, which use absolute measures, COPOD's score is relative to the median, meaning we need to perform this calculation and add it to the sensitivity threshold. The net result is that our minimum sensitivity threshold is 3.65, adding the 1.35 sensitivity factor for COF to the minimum 2.3 sensitivity factor for COPOD. If we include LOCI as well (as we do when we have no more than 1000 observations), our minimum sensitivity threshold will be 6.65. After that, the rest of the function remains unchanged.
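We can sanity-check that arithmetic directly. Note that the LOCI weight of 3.0 below is inferred from the gap between the two thresholds (6.65 - 3.65); only the COF and COPOD factors are stated explicitly in the text:

```python
import math

# Ensemble weights: 1.35 (COF) and 2.3 (COPOD) come from the text;
# 3.0 (LOCI) is implied by the difference between the two thresholds
sensitivity_factors = {"cof": 1.35, "loci": 3.0, "copod": 2.3}

# COPOD's weight is -ln(0.10); Li et al. suggest -ln(0.01) per dimension
print(round(-math.log(0.10), 2))  # 2.3
print(round(-math.log(0.01), 2))  # 4.61

# More than 1000 records (no LOCI): 1.35 + 2.3 = 3.65
without_loci = sensitivity_factors["cof"] + sensitivity_factors["copod"]
# 1000 records or fewer (LOCI included): 3.65 + 3.0 = 6.65
with_loci = without_loci + sensitivity_factors["loci"]
```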

Now that we have covered all of the necessary changes to the code, let’s continue on to tests and website updates.

Test and Website Updates

As we have seen several times throughout the book, introducing a new algorithm into an ensemble will affect existing tests. In this section, we will review the net effects on unit and integration tests. Because this is the final chapter in which we work on multivariate outlier detection, we will also extend the website to make it a bit more user-friendly.

Unit Test Updates

When running unit tests, the big change comes with the number of outliers in the sample input. Interestingly, the range of valid values has shrunk considerably, as we can see in Table 12-1.
Table 12-1

Number of outliers for differing sensitivity scores given a max fraction of anomalies of 1.0

Sensitivity Score | Ch. 10 | Ch. 11 | Ch. 12
100               | 11     | 8      | 6
50                | 11     | 8      | 6
40                | 11     | 8      | 6
25                | 4      | 8      | 6
5                 | 2      | 2      | 5
1                 | 1      | 2      | 3

In Chapter 10, we saw a range from 1 to 11 outliers depending on the sensitivity score, with a “dead zone” from 40 to 100 in which the number of outliers did not change. In Chapter 11, we saw either 2 or 8 outliers, with the cutoff happening somewhere between 25 and 5. In this chapter, the minimum number of outliers has increased to 3, and the maximum has dropped to 6. This has further reduced the relative importance of sensitivity score.

Integration Test Updates

When running the integration tests for Chapter 12, we end up with a test failure that is actually a positive outcome. In Chapter 10, we introduced an integration test that includes three outliers, one of which is fairly extreme, while the other two are reasonably small. In the Postman tests, this integration test is called Ch10 – Multivariate – Debug – Three Outliers, 2S1L – Two Recorded. We noted near the end of Chapter 10 that only the two largest outliers were caught by COF using our input parameters. In Chapter 11, the ensemble of COF and LOCI did not change the results of this integration test. With the introduction of COPOD, the anomaly score for the third data point increases enough to clear the threshold for detection. Figure 12-3 shows this result in visual form.

A bar graph plots the anomaly score versus the key. The highest bar is (10, 110). Values are estimated. Starting from the left, the bars have a decreasing trend.

Figure 12-3

Three outliers are detected, whereas previously, COF would detect only two

This is a positive outcome for us, as we now are able to show that one large outlier and one small-to-medium outlier will not necessarily overwhelm a small outlier.

Website Updates

Before we close the books on multivariate outlier detection, there is a small amount of functionality we should add to the Streamlit site. Listing 12-4 shows the first set of changes, in which we include the separate anomaly score components in the hover data for each point as well as our results table. This will allow us to see which algorithms have the most to say about a given data point.
g = px.bar(df, x=df["key"], y=df["anomaly_score"], color=df["is_anomaly"], color_discrete_map=colors,
      hover_data=["vals", "anomaly_score_cof", "anomaly_score_loci", "anomaly_score_copod"], log_y=True)
st.plotly_chart(g, use_container_width=True)
tbl = df[['key', 'vals', 'anomaly_score', 'is_anomaly', 'anomaly_score_cof', 'anomaly_score_loci', 'anomaly_score_copod']]
st.write(tbl)
Listing 12-4

Include separate anomaly score components for COF, LOCI, and COPOD

We can see the result of this change in Figure 12-4. For the most part, COPOD seems to drive the overall anomaly scores, though the result is not quite as drastic as it may first appear. We do need to factor in the COPOD median of 6.08 and the COPOD sensitivity factor of 2.3, meaning that the COPOD score must be above 8.38 to move the needle in any significant way.

A bar graph titled anomaly score per data point plots the anomaly score versus key. A table with 8 columns and 10 rows is provided below the graph.

Figure 12-4

Using a large, artificially generated dataset (sample_input in the multivariate unit test suite), we can see that each data point has a breakdown of its anomaly score components, covering COF, LOCI, and COPOD separately

The other change involves accepting a list of data. For univariate data, we accept a list of data points in Python format, something like [ 1, 2, 3, 4, 5 ]. We then convert this to key-value pairs and send the results to the outlier detection engine. For multivariate data, it would be nice to accept data in the same format as our unit tests, in which each row is a list that contains two elements: a key and a list of values. An example of one data point in this format is [ "key1", [ 1, 2, 3, 4, 5 ] ]. We could then create the dataset by generating a list of these data points. Once we do that, we should be able to parse the list, translate it to JSON, and send that result to the outlier detection engine. Listing 12-5 shows how to do just that using Python’s Abstract Syntax Tree (ast) module.
import ast
import json

@st.cache
def convert_multivariate_list_to_json(multivariate_str):
  mv_ast = ast.literal_eval(multivariate_str)
  return json.dumps([{"key": k, "vals": v} for [k, v] in mv_ast])
Listing 12-5

Create an abstract syntax tree to parse a list of lists and convert results to appropriate JSON.

Using the literal_eval() function, we evaluate the string and build out a list of lists. The json.dumps() function translates this list into valid JSON in the format we require by enumerating over each data point in the dataset, parsing out the key and value, and tagging them with the appropriate names.
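As a quick check, here is a standalone version of the conversion (without the Streamlit cache decorator) applied to a hypothetical two-point input string:

```python
import ast
import json

def convert_multivariate_list_to_json(multivariate_str):
    # Safely parse the Python-style list of [key, values] pairs
    mv_ast = ast.literal_eval(multivariate_str)
    return json.dumps([{"key": k, "vals": v} for k, v in mv_ast])

sample = '[ ["key1", [1, 2, 3]], ["key2", [4, 5, 6]] ]'
result = convert_multivariate_list_to_json(sample)
print(result)
# [{"key": "key1", "vals": [1, 2, 3]}, {"key": "key2", "vals": [4, 5, 6]}]
```

Because literal_eval() only evaluates Python literals, a user cannot sneak arbitrary code into the input string, which makes it a safer choice than eval() here.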

To wire this function into the site, we need a basic check that runs input_data = convert_multivariate_list_to_json(input_data) whenever we are using the multivariate method and the user has selected the checkbox to convert list data to JSON. Figure 12-5 shows an example of this, taking the sample_input unit test dataset as the input and building a proper JSON body for our outlier detection engine.

A checkbox to convert data in list to J S O N format. Data to process is provided in a list below and a detect option at the bottom left corner.

Figure 12-5

Convert a list of data point lists into valid input for outlier detection

Conclusion

Over the course of this chapter, we learned just enough about copulas to understand the basic workings of the Copula-Based Outlier Detection (COPOD) method. We then incorporated COPOD into our multivariate outlier detection ensemble to good effect. We concluded by making some small but useful improvements to the accompanying outlier detection application.

Before we move on to Part IV and time series analysis, we should take a moment to reflect on what we’ve been able to accomplish and how we could make the current multivariate outlier detection system better. Throughout Part III, we have built an independent ensemble of three algorithms: Connectivity-Based Outlier Factor (COF), Local Correlation Integral (LOCI), and COPOD. These three algorithms work in different ways to discover whether some data point appears to be an outlier, with two of the algorithms using density-based approaches and the third a distributional approach using copula functions. We have seen that the combination of these three algorithms provides us with a rather stable base for determining outliers, meaning we do not see radical differences in the number of data points marked as outliers given different sensitivity scores or numbers of nearest neighbors. We also have the benefit of choosing two algorithms with no required user input and a third with limited user input, which fits extremely well with our philosophy of making it easy for less statistically inclined users to work with our service.

There are dozens of other algorithms available for multivariate outlier detection that we could investigate, though each additional algorithm adds new complexity and should be carefully evaluated before inclusion. COF and LOCI (specifically ALOCI, a linear approximation of the LOCI algorithm that is not available in PyOD) work fairly well based on Mehrotra’s research (Mehrotra et al., 116-117), which came out before the COPOD paper. With that as our starting point, more is not necessarily better. To ensure that we make solid decisions on further algorithmic choices, a deeper analysis over additional, labeled datasets would be critical.

This deeper analysis would also help us fine-tune the weights. The LOCI algorithm comes with strong hyperparameter guidance from its authors, but COF and COPOD do not. We made reasonable decisions for each of these weights, but additional datasets could allow us to tweak these measures for better results.

We are now done with multivariate outlier detection and will shift to Part IV, time series analysis. In the next chapter, we will get an overview of time series problems.
