So far, we have looked at clustering-based models for multivariate outlier detection. In this chapter, we will review a novel nonclustering technique that uses a concept called copulas. First, we will define what a copula is and how we can perform outlier detection using it. Then, we will implement application changes to integrate the new technique. After that, we will update our tests and website. Finally, we will wrap up Part III with some concluding notes.
Copula-Based Outlier Detection
In September 2020, Zheng Li and four coauthors released a paper entitled COPOD: Copula-Based Outlier Detection. This paper introduced a novel technique for multivariate outlier detection. Prior to this paper, several popular multivariate outlier detection techniques (including LOF, COF, and LOCI) focused on the idea of clustering: calculating the distances between points and calling out those points that are sufficiently distant from their neighbors. By contrast, COPOD relies on a concept known as a copula.
What’s a Copula?
The term “copula” comes from Latin and means a link or a bond. In language, we use the term “copula” to represent the link between the subject and the predicate, often using a verb like “to be” in English. In statistics, a copula is “a probability model that represents a multivariate uniform distribution, which examines the association or dependence between many variables” (Kenton, 2021).
Breaking that definition into pieces, we start with a dataset containing multiple variables. We may (or may not!) know the distribution of each variable—for example, one variable might follow a normal distribution, another a beta distribution, and a third may be uniformly distributed. So far, we have spoken of these variables as if they were entirely independent, but this is often not a good assumption when working with multivariate data. In practice, there tends to be some joint probability distribution, as inputs often have some influence upon one another. For example, the height, weight, age, blood pressure, and shoe size of a person will be related variables, so treating them as entirely independent will leave out important information. The problem that we quickly hit is, how do we model the joint probability distribution of any pair or combination of these variables? This is where copulas come in.
With a copula, we can decompose a joint probability distribution into its marginal distributions (also known as marginals) and a separate function that captures the dependence between variables. The copula is that function: it allows us to move back and forth between a single joint distribution and a coupling of the marginal distributions.
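To put a little formality behind this, Sklar's theorem guarantees that any joint cumulative distribution function F over variables X1 through Xd can be expressed as a copula function C applied to the individual marginal CDFs:

```latex
% Sklar's theorem: any joint CDF factors into a copula C
% applied to the marginal CDFs.
F(x_1, \ldots, x_d) = C\left(F_1(x_1), \ldots, F_d(x_d)\right)
```

In other words, the marginals F1 through Fd capture each variable's individual behavior, while C captures everything about how the variables relate to one another.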
Copula theory is very useful in certain avenues of statistics, but a detailed understanding of the topic is beyond the scope of this book. For an intuitive, visual-heavy approach to understanding copula functions, see Wiecki (2018). For a more detailed survey of copula theory, see Durante and Sempi (2010) as well as Pakdaman (2011). For our purposes, this high-level understanding should suffice.
Intuition Behind COPOD
The first stage of the process is to fit an empirical cumulative distribution function (CDF) to each variable in our dataset. The probability density function and the cumulative distribution function are simply different views of the same underlying distribution: the cumulative distribution function is the integral of the probability density function over the range negative infinity to x (or, in other words, the probability density function is the first derivative of the cumulative distribution function, assuming the CDF is differentiable).
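To make this first stage concrete, here is a minimal sketch of computing an empirical CDF for a single variable. This is illustrative code of our own rather than part of the application, and the function name is arbitrary:

```python
import numpy as np

def empirical_cdf(values: np.ndarray) -> np.ndarray:
    """Evaluate the empirical CDF at each observed value.

    For each observation x, returns the fraction of observations
    less than or equal to x (assuming distinct values).
    """
    n = len(values)
    ranks = values.argsort().argsort() + 1  # 1-based rank of each element
    return ranks / n

# Ten draws from a standard normal distribution.
rng = np.random.default_rng(seed=42)
sample = rng.normal(size=10)
print(empirical_cdf(sample))  # values in (0, 1], increasing with rank
```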
Once we have the empirical CDF for each variable, the second stage of the process is to determine an empirical copula function, which models how the individual marginal distributions couple together to form the joint probability distribution of the variables.
The final stage of the process uses the empirical copula to approximate the tail probability, which is the probability that a sampled data point would be greater than (for a right tail) or less than (for a left tail) some specified value. We do this for each variable and multiply the probabilities together to create a multivariate tail probability. The copula makes this multiplication workable because it separates each variable's marginal behavior from the dependence structure, letting us treat the per-variable tail probabilities as if they were independent.
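The following sketch shows this idea in simplified form: compute left- and right-tail probabilities per dimension from the empirical CDFs, then combine them across dimensions. This illustrates the intuition only; the actual COPOD algorithm also includes a skewness-based correction that we omit here.

```python
import numpy as np

def tail_scores(X: np.ndarray) -> np.ndarray:
    """Simplified COPOD-style scores for an (n, d) data matrix.

    Higher scores indicate points that sit further out in the tails.
    Illustrative only: the real algorithm also applies a
    skewness-based correction that we omit here.
    """
    n, d = X.shape
    left = np.empty((n, d))
    right = np.empty((n, d))
    for j in range(d):
        ranks = X[:, j].argsort().argsort() + 1  # 1-based ranks
        left[:, j] = ranks / n              # P(X_j <= x): left-tail ECDF
        right[:, j] = (n - ranks + 1) / n   # P(X_j >= x): right-tail ECDF
    # Summing negative logs is equivalent to taking the negative log
    # of the product of per-dimension tail probabilities.
    left_score = -np.log(left).sum(axis=1)
    right_score = -np.log(right).sum(axis=1)
    # Score each point by its more extreme tail.
    return np.maximum(left_score, right_score)
```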
Given a tail probability, Li et al. show how to use this to estimate the likelihood of a data point being an outlier. The authors have also provided an implementation of their algorithm in Python, which they have incorporated into the PyOD library.
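Using the PyOD implementation is straightforward. Here is a minimal usage sketch with made-up data; contamination is PyOD's standard parameter for the expected fraction of outliers:

```python
import numpy as np
from pyod.models.copod import COPOD

# A toy dataset: 200 inliers plus a handful of extreme points.
rng = np.random.default_rng(seed=106)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 3))
extremes = rng.normal(loc=8.0, scale=1.0, size=(5, 3))
X = np.vstack([inliers, extremes])

clf = COPOD(contamination=0.025)  # expected fraction of outliers
clf.fit(X)

print(clf.labels_[-5:])           # 1 = outlier, 0 = inlier
print(clf.decision_scores_[-5:])  # higher score = more anomalous
```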
Implementing COPOD
The function to run COPOD testing against our multivariate dataset
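As a sketch, such a function might look like the following. The function and column names here are our own illustration, not the application's exact code:

```python
import pandas as pd
from pyod.models.copod import COPOD

def run_copod(df: pd.DataFrame, contamination: float = 0.05) -> pd.DataFrame:
    """Fit COPOD against a DataFrame of numeric features and attach
    its outputs as new columns. Hypothetical sketch; the application's
    actual function differs in its details."""
    clf = COPOD(contamination=contamination)
    clf.fit(df)
    out = df.copy()
    out["copod_score"] = clf.decision_scores_  # higher = more anomalous
    out["is_copod_outlier"] = clf.labels_      # 1 = outlier, 0 = inlier
    return out
```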
The run_tests() function now includes COPOD.
The determine_outliers() function now includes calculations for COPOD.
Our calculation for tested_sensitivity_factors remains the same as in Chapter 11, iterating over each test and multiplying the weight by the value (0 or 1) of tests_run.
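In code form, that calculation looks something like this sketch. Only the names tested_sensitivity_factors and tests_run come from the application; the dictionary shapes are our own assumption for illustration:

```python
# tests_run maps each test name to 1 if the test ran and 0 if it did not;
# sensitivity_factors maps each test name to its weight. Both shapes are
# assumptions for illustration.
tested_sensitivity_factors = {
    test: weight * tests_run[test]
    for test, weight in sensitivity_factors.items()
}
```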
The next step calculates the median COPOD score. Unlike LOCI and COF, which use absolute measures, COPOD's score is relative to the median, meaning we need to perform this calculation and add the result to the sensitivity threshold. The net result is that our minimum sensitivity threshold is 3.65: the 1.35 sensitivity factor for COF plus the minimum 2.3 sensitivity factor for COPOD. If we include LOCI as well (as we do when we have no more than 1,000 observations), our minimum sensitivity threshold will be 6.65. After that, the rest of the function remains unchanged.
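A quick check of the arithmetic behind those thresholds:

```python
cof_factor = 1.35       # sensitivity factor for COF
copod_min_factor = 2.3  # minimum sensitivity factor for COPOD
loci_factor = 3.0       # inferred from 6.65 - 3.65

min_threshold = cof_factor + copod_min_factor          # 3.65
min_threshold_with_loci = min_threshold + loci_factor  # 6.65, when n <= 1000
```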
Now that we have covered all of the necessary changes to the code, let’s continue on to tests and website updates.
Test and Website Updates
As we have seen several times throughout the book, introducing a new algorithm into an ensemble will affect existing tests. In this section, we will review the net effects on unit and integration tests. Because this is the final chapter in which we work on multivariate outlier detection, we will also extend the website to make it a bit more user-friendly.
Unit Test Updates
Number of outliers for differing sensitivity scores given a max fraction of anomalies of 1.0
| Sensitivity Score | Ch. 10 | Ch. 11 | Ch. 12 |
|---|---|---|---|
| 100 | 11 | 8 | 6 |
| 50 | 11 | 8 | 6 |
| 40 | 11 | 8 | 6 |
| 25 | 4 | 8 | 6 |
| 5 | 2 | 2 | 5 |
| 1 | 1 | 2 | 3 |
In Chapter 10, we saw a range from 1 to 11 outliers depending on the sensitivity score, with a “dead zone” from 40 to 100 in which the number of outliers did not change. In Chapter 11, we saw either 2 or 8 outliers, with the cutoff happening somewhere between 25 and 5. In this chapter, the minimum number of outliers has increased to 3, and the maximum has dropped to 6. This has further reduced the relative importance of sensitivity score.
Integration Test Updates
This is a positive outcome for us, as we are now able to show that one large outlier and one small-to-medium outlier will not necessarily overwhelm a small outlier.
Website Updates
Include separate anomaly score components for COF, LOCI, and COPOD
Create an abstract syntax tree to parse a stringified list of lists and convert the results to appropriate JSON.
Using the literal_eval() function, we evaluate the string and build out a list of lists. The json.dumps() function translates this list into valid JSON in the format we require by enumerating over each data point in the dataset, parsing out the key and value, and tagging them with the appropriate names.
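Here is a sketch of that pattern with a made-up input string; the actual keys and output shape in our application differ, but the flow from literal_eval() to json.dumps() follows the description above:

```python
import json
from ast import literal_eval

# A stringified list of [key, value] pairs. This input string is
# made up for illustration.
raw = "[['temperature', 87.2], ['pressure', 29.95], ['humidity', 0.61]]"

# literal_eval() safely evaluates the string into a Python list of lists.
parsed = literal_eval(raw)

# Build the JSON structure by enumerating each data point, pulling out
# its key and value, and tagging them with appropriate names.
result = json.dumps([
    {"key": key, "value": value}
    for key, value in parsed
])

print(result)
# [{"key": "temperature", "value": 87.2}, {"key": "pressure", ...}]
```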
Conclusion
Over the course of this chapter, we learned just enough about copulas to understand the basic workings of the Copula-Based Outlier Detection (COPOD) method. We then incorporated COPOD into our multivariate outlier detection ensemble to good effect. We concluded by making some small but useful improvements to the accompanying outlier detection application.
Before we move on to Part IV and time series analysis, we should take a moment to reflect on what we've been able to accomplish and how we could make the current multivariate outlier detection system better. Throughout Part III, we have built an independent ensemble of three algorithms: Connectivity-Based Outlier Factor (COF), Local Correlation Integral (LOCI), and COPOD. These three algorithms work in different ways to discover whether a data point appears to be an outlier, with two of the algorithms using density-based approaches and the third a distributional approach using copula functions. We have seen that the combination of these three algorithms provides us with a rather stable base for determining outliers, meaning we do not see radical differences in the number of data points marked as outliers given different sensitivity scores or numbers of nearest neighbors. We also have the benefit of choosing two algorithms with no required user input and a third with limited user input, which fits extremely well with our philosophy of making it easy for less statistically inclined users to work with our service.
There are dozens of other algorithms available for multivariate outlier detection that we could investigate, though each additional algorithm adds new complexity and should be carefully evaluated before we add it. COF and LOCI (specifically ALOCI, a linear approximation of the LOCI algorithm that is not available in PyOD) work fairly well based on Mehrotra's research (Mehrotra et al., 116–117), which came out before the COPOD paper. With that as our starting point, more is not necessarily better. To ensure that we make solid decisions on further algorithmic choices, a deeper analysis over additional, labeled datasets would be critical.
This deeper analysis would also help us fine-tune the weights. The LOCI algorithm comes with strong hyperparameter guidance from its authors, but COF and COPOD do not. We made reasonable decisions for each of these weights, but additional datasets could allow us to tweak these measures for better results.
We are now done with multivariate outlier detection and will shift to Part IV, time series analysis. In the next chapter, we will get an overview of time series problems.