© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
K. Feasel, Finding Ghosts in Your Data, https://doi.org/10.1007/978-1-4842-8870-2_7

7. Extending the Ensemble

Kevin Feasel
Durham, NC, USA

Chapter 6 provided us with the first three outlier detection tests. In this chapter, we will build upon the prior work and include several additional tests. We will also refactor existing code, rethink a few design choices, and wrap up the core elements of univariate statistical analysis.

Adding New Tests

The three tests we added in the prior chapter are quite useful and might even be enough on their own to solve many common univariate outlier detection problems. If your input data follows a specific shape, however, there are other tests that can be quite useful in tracking outliers. Specifically, there are several tests we can perform if the data fits closely enough to a normal distribution.

Note

In Chapter 3, we described various distributions, including the normal distribution. As a brief summary, the normal distribution (strictly speaking, the standard normal distribution) has a mean of 0 and a variance of 1. More generally, a dataset is normally distributed if we can, using only addition and multiplication, translate it to the standard normal distribution without distorting the relationships between data points. For the rest of this chapter, any reference to normal distributions (even mentions of “the” normal distribution) will be to this more generic concept, not the singular normal distribution with a mean of 0 and variance of 1.

Checking for Normality

There are several tests that can inform us as to whether a particular dataset appears to follow a normal distribution. The first test we will look at is called the Shapiro-Wilk test. This test, named after Samuel Shapiro and Martin Wilk, tends to be the most accurate at determining whether a given sample of data approximates a normal distribution. In Python, the Shapiro-Wilk test is available as the shapiro() function in the scipy.stats library.

The second test is called D’Agostino’s K-squared test, named after Ralph D’Agostino. This test measures how much skewness and kurtosis the dataset displays. In statistical terms, the skewness of a sample is a measure of whether the left tail is longer (negative skew), the right tail is longer (positive skew), or both tails are the same length (no skew). Figure 7-1 shows an example of a normal distribution with no skew followed by a log-normal distribution that exhibits positive skew.


Figure 7-1

The normal distribution exhibits no skewness. By contrast, the log-normal distribution has a positive skew

The kurtosis of a dataset indicates how much of the distribution's mass is contained in the tails. The kurtosis of a normal distribution is 3, and as the kurtosis increases, so does the likelihood of seeing outlier data points. Using the datasets from Figure 7-1, the skewness of our normal distribution is 0, and its kurtosis is 2.99. The log-normal dataset has a skewness of 5.34 and a kurtosis of 58.34.
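If you would like to compute these measures yourself, the scipy.stats library exposes skew() and kurtosis() functions. The following sketch generates its own illustrative samples rather than reproducing the exact datasets behind Figure 7-1, so the printed numbers will not match the quoted values exactly; kurtosis in particular is very sensitive to the extreme values that happen to appear in a random sample.
import numpy as np
from scipy.stats import skew, kurtosis
rng = np.random.default_rng(106)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=100_000)
lognormal_sample = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)
# fisher=False reports "raw" kurtosis, for which a normal distribution is 3.
print(skew(normal_sample), kurtosis(normal_sample, fisher=False))
print(skew(lognormal_sample), kurtosis(lognormal_sample, fisher=False))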

D’Agostino’s K-squared test, which is the normaltest() function in scipy.stats, uses these measures to estimate whether the dataset’s distribution is markedly different from the normal distribution.

The third test we will use is the Anderson-Darling test, named after Theodore Anderson and Donald Darling. This test is similar to the Shapiro-Wilk test and tends not to be quite as accurate, but in Python, the anderson() function evaluates the test statistic against critical values at several significance levels, meaning that you can test at multiple levels of significance in one go.

Listing 7-1 includes the new import statements for this chapter as well as two of the three function calls: Shapiro-Wilk and D’Agostino’s K-squared.
from scipy.stats import shapiro, normaltest, anderson, boxcox
import scikit_posthocs as ph
import math
def check_shapiro(col, alpha=0.05):
  return check_basic_normal_test(col, alpha, "Shapiro-Wilk test", shapiro)
def check_dagostino(col, alpha=0.05):
  return check_basic_normal_test(col, alpha, "D'Agostino's K^2 test", normaltest)
def check_basic_normal_test(col, alpha, name, f):
  # Run the test function f and compare its p-value against alpha.
  stat, p = f(col)
  return ( (p > alpha), (f"{name}, W = {stat}, p = {p}, alpha = {alpha}.") )
Listing 7-1

Run two basic tests for normality

Both of these tests take a similar shape, and so we centralize the code in check_basic_normal_test(), which takes four parameters: the values in our DataFrame (col), an indicator of just how unlike a normal distribution our result should be before we flag it as non-normal (alpha), the name of the test that we want to use for labeling (name), and the test function itself (f). In Python, we can pass a function as an input to another function. When doing so, note that we do not put parentheses after the function name, either in the parameter to check_basic_normal_test() or in the calls in check_shapiro() and check_dagostino(). We save the function call f(col) for the body of check_basic_normal_test(), in which we call shapiro() or normaltest() and pass in col as the parameter. These tests calculate a p-value, which we compare against our provided alpha parameter. If the p-value is greater than alpha, then we assume that the sample could have come from a normal distribution. If the p-value is at or below alpha, we believe that the data is not normal.
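As a quick illustration of how these wrappers behave (the sample here is generated on the fly rather than coming from our API), calling check_shapiro() on data drawn from a normal distribution should, in most runs, report that the data looks normal:
import numpy as np
rng = np.random.default_rng(185)
sample = rng.normal(loc=50, scale=5, size=100)
(looks_normal, explanation) = check_shapiro(sample)
print(looks_normal)    # Usually True for a sample drawn from a normal distribution
print(explanation)
(looks_normal, explanation) = check_dagostino(sample, alpha=0.05)
print(looks_normal)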

Listing 7-2 contains the Anderson-Darling check. Because there is more going on than with the other tests, it does not fit cleanly into the same mold, and therefore, we need to do things a little differently.
def check_anderson(col):
  # Start by assuming normality.
  anderson_normal = True
  return_str = "Anderson-Darling test. "
  result = anderson(col)
  return_str = return_str + f"Result statistic: {result.statistic}. "
  for i in range(len(result.critical_values)):
    sl, cv = result.significance_level[i], result.critical_values[i]
    if result.statistic < cv:
      return_str = return_str + f"Significance Level {sl}: Critical Value = {cv}, looks normally distributed. "
    else:
      anderson_normal = False
      return_str = return_str + f"Significance Level {sl}: Critical Value = {cv}, does NOT look normally distributed! "
  return ( anderson_normal, return_str )
Listing 7-2

The Anderson-Darling check for normality

Anderson-Darling provides a result statistic and allows you to compare that statistic vs. the critical value for each significance level—that is, each value of alpha. The critical value for each significance level depends on the number of input data points. Figure 7-2 shows what an example looks like for Shapiro-Wilk, D’Agostino, and Anderson-Darling tests against a dataset that passes the normality test.


Figure 7-2

The results of three normality checks. Note that the Anderson-Darling significance level scale runs from 0 to 100 rather than 0-1

In this figure, we can see that the p-values for Shapiro-Wilk and D’Agostino are both well above 0.05, so this dataset passes those normality tests. For Anderson-Darling, the value of the result statistic is 0.141. This number should be below the critical value at each significance level in order for us to consider the dataset as coming from a normal distribution. As we can see, the statistic value of 0.141 is well below the 15% significance level’s critical value of 0.501 and therefore below the larger critical values at the lower significance levels as well. By contrast, Figure 7-3 shows an example in which the input dataset does not appear to come from a normal distribution.


Figure 7-3

A sample that is decidedly not normal

In this case, the Shapiro-Wilk and D’Agostino p-values are well below 0.05, meaning that both tests indicate that this dataset does not appear to have come from a normal distribution. The Anderson-Darling result statistic has a value greater than 5.0, well above the 0.501 at the 15% significance level and even the 0.951 of the 1% significance level.

Now that we have the three checks in place, we want to build another function that knows when it makes sense to call each method, makes the calls, collates the responses, and makes a determination as to whether the input dataset appears to follow from a normal distribution. Listing 7-3 contains the code for this function.
def is_normally_distributed(col):
  alpha = 0.05
  if col.shape[0] < 5000:
    (shapiro_normal, shapiro_exp) = check_shapiro(col, alpha)
  else:
    shapiro_normal = True
    shapiro_exp = f"Shapiro-Wilk test did not run because n >= 5k. n = {col.shape[0]}"
  if col.shape[0] >= 8:
    (dagostino_normal, dagostino_exp) = check_dagostino(col, alpha)
  else:
    dagostino_normal = True
    dagostino_exp = f"D'Agostino's test did not run because n < 8. n = {col.shape[0]}"
  (anderson_normal, anderson_exp) = check_anderson(col)
  diagnostics = {"Shapiro-Wilk": shapiro_exp, "D'Agostino": dagostino_exp, "Anderson-Darling": anderson_exp}
  return (shapiro_normal and dagostino_normal and anderson_normal, diagnostics)
Listing 7-3

A function that determines whether an input dataset follows from a normal distribution

The function begins by setting an alpha value of 0.05, following our general principle of not having users make choices that require detailed statistical knowledge. Then, if the dataset has fewer than 5000 observations, we run the Shapiro-Wilk test; otherwise, we skip it, as the p-value from the SciPy implementation may not be accurate for samples larger than 5000. If we have at least eight observations, we can run D’Agostino’s K-squared test. Finally, regardless of sample size, we always run the Anderson-Darling test.

When it comes to making a determination based on these three tests, we have to make a decision similar to how we weight our ensembles. We could perform simple voting or weight some tests more heavily than others. In this case, we will stick with a simple unanimity rule: if all three tests agree that the dataset is normally distributed, then we will assume that it is; otherwise, we will assume that it is not. The diagnostics generate the JSON snippets that we saw in Figures 7-2 and 7-3.

Approaching Normality

Now that we have a check for normality, we can then ask the natural follow-up question: If we need normally distributed data and our dataset is not already normally distributed, is there anything we can do about that? The answer to that question is an emphatic “Probably!”

There are a few techniques for transforming data to look more like it would if it followed from a normal distribution. Perhaps the most popular among these techniques is the Box-Cox transformation, named after George Box and David Cox. Box-Cox transformation takes a variable and transforms it based on the formula in Listing 7-4.

Listing 7-4. The equation behind Box-Cox transformation

$$ y = \begin{cases} \log(x) & \text{if } \lambda = 0 \\ \dfrac{x^{\lambda} - 1}{\lambda} & \text{otherwise} \end{cases} $$

Suppose we have some target variable x. This represents the single feature of our univariate dataset. We want to transform the values of x into a variable that follows from a normal distribution, which we’ll call y. Given some value of lambda (λ), we can do this by performing one of two operations. If lambda is 0, then we take the log of x, and that becomes y. This works well for log-normally distributed data like we saw in Chapter 3. Otherwise, we have some lambda and raise x to that power, subtract 1 from the value, and divide it by lambda. Lambda can be any value, positive or negative, although common values in Box-Cox transformations range between -5 and 5.
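For illustration only, here is a direct translation of that formula into Python; in practice, we will rely on SciPy’s implementation rather than this hand-rolled version.
import numpy as np
def box_cox_manual(x, lmbda):
  # Straightforward translation of the Box-Cox formula in Listing 7-4.
  x = np.asarray(x, dtype=float)
  if lmbda == 0:
    return np.log(x)
  return (x ** lmbda - 1) / lmbda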

This leads to the next question: How do we know what the correct value of lambda is? Well, the SciPy library has a solution for us: scipy.stats.boxcox() takes an optional lambda parameter (lmbda). If you pass in a value for lmbda, the function will use that value to perform the Box-Cox transformation. Otherwise, if you do not specify a value, the function will determine the optimal value of lambda and return it to you. Even though we get back an optimal value for lambda, there is no guarantee that the transformed data actually follows from a normal distribution, and therefore, we should check the results afterward to see how our transformation fares.
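A small sketch of that behavior (the sample data here is arbitrary): calling boxcox() without a lambda returns both the transformed data and the fitted lambda, while supplying the lmbda parameter returns only the transformed data.
from scipy.stats import boxcox
import numpy as np
rng = np.random.default_rng(42)
data = rng.lognormal(mean=0.0, sigma=1.0, size=200)
# No lmbda: fit the optimal lambda and return it alongside the transformed data.
transformed, fitted_lambda = boxcox(data)
# With lmbda: reuse a previously fitted lambda; only the transformed data comes back.
same_transform = boxcox(data, lmbda=fitted_lambda)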

Taking the aforementioned into consideration, Listing 7-5 shows the code we use to perform normalization on an incoming dataset.
def normalize(col):
  l = col.shape[0]
  col_sort = sorted(col)
  col80 = col_sort[ math.floor(.1 * l) + 1 : math.floor(.9 * l) ]
  temp_data, fitted_lambda = boxcox(col80)
  # Now use the fitted lambda on the entire dataset.
  fitted_data = boxcox(col, fitted_lambda)
  return (fitted_data, fitted_lambda)
Listing 7-5

The function to normalize our input data using the Box-Cox technique

The first thing we do is find the middle 80% of our input data. The rationale behind this is that Box-Cox transformation is quite effective at transforming data. So effective, in fact, that it can smother outliers in the data and make the subsequent tests lose a lot of their value in identifying outlier values. Therefore, we will focus on the central 80% of data, which we typically do not expect to contain many outliers. Running boxcox() on the central 80% of data, we get back a fitted lambda value. We can then pass that value in as a parameter to perform the transformation against our entire dataset, returning the entirety of fitted data as well as the fitted lambda we built from the central 80%.

We make use of this in Listing 7-6, which contains the rules around whether we can normalize the data and how successful that normalization was.
def perform_normalization(base_calculations, df):
  use_fitted_results = False
  fitted_data = None
  (is_naturally_normal, natural_normality_checks) = is_normally_distributed(df['value'])
  diagnostics = {"Initial normality checks": natural_normality_checks}
  if is_naturally_normal:
    fitted_data = df['value']
    use_fitted_results = True
  if ((not is_naturally_normal)
    and base_calculations["min"] < base_calculations["max"]
    and base_calculations["min"] > 0
    and df['value'].shape[0] >= 8):
    (fitted_data, fitted_lambda) = normalize(df['value'])
    (is_fitted_normal, fitted_normality_checks) = is_normally_distributed(fitted_data)
    use_fitted_results = True
    diagnostics["Fitted Lambda"] = fitted_lambda
    diagnostics["Fitted normality checks"] = fitted_normality_checks
  else:
    has_variance = base_calculations["min"] < base_calculations["max"]
    all_gt_zero = base_calculations["min"] > 0
    enough_observations = df['value'].shape[0] >= 8
    diagnostics["Fitting Status"] = f"Elided for space"
  return (use_fitted_results, fitted_data, diagnostics)
Listing 7-6

The function to perform normalization on a dataset

Listing 7-6 starts by checking to see if the data is already normally distributed. If so, we do not need to perform any sort of transformation and can use the data as is. Otherwise, we perform a Box-Cox transformation if we meet all of the following criteria: first, the data must not follow from a normal distribution; second, there must be some variance in the data; third, all values in the data must be greater than 0, as we cannot take the logarithm of 0 or a negative number; and fourth, we need at least eight observations before Box-Cox transformation will work.

If we meet all of these criteria, then we call the normalize() function and retrieve the fitted data and lambda. Then, we call is_normally_distributed() a second time on the resulting dataset. Note that the result may still not look normal because we fitted the value of lambda on the central 80% of our dataset, so extreme outliers may still be present in the data after transformation. Therefore, we log the result but do not allow is_normally_distributed() to prohibit us from moving forward.

In the event that we do not meet all of the relevant criteria, we figure out which criteria are not correct and write that to the diagnostics dictionary. Finally, we return all results to the end user. Now that we have a function to normalize our data should we need it, we can update the run_tests() function to incorporate this, as well as a slew of new tests.

A Framework for New Tests

So far, we have three tests that we want to run regardless of the shape of our data and number of observations. In addition to these three tests, we should add new tests that make sense in particular scenarios. Listing 7-7 shows the outline for the updated run_tests() function, one which will be capable of running all of our statistical tests.
def run_tests(df):
  base_calculations = perform_statistical_calculations(df['value'])
  (use_fitted_results, fitted_data, normalization_diagnostics) = perform_normalization(base_calculations, df)
  # Note: initialization and population of the diagnostics dictionary is elided here (see the listing caption).
  # For each test, execute and add a new score.
  # Initial tests should NOT use the fitted calculations.
  b = base_calculations
  df['sds'] = [check_sd(val, b["mean"], b["sd"], 3.0) for val in df['value']]
  df['mads'] = [check_mad(val, b["median"], b["mad"], 3.0) for val in df['value']]
  df['iqrs'] = [check_iqr(val, b["median"], b["p25"], b["p75"], b["iqr"], 1.5) for val in df['value']]
  tests_run = {
    "sds": 1,
    "mads": 1,
    "iqrs": 1,
    # Mark these as 0s to start and set them on if we run them.
    "grubbs": 0,
    "gesd": 0,
    "dixon": 0
  }
  # Start off with values of -1. If we run a test, we'll populate it with a valid value.
  df['grubbs'] = -1
  df['gesd'] = -1
  df['dixon'] = -1
  if (use_fitted_results):
    df['fitted_value'] = fitted_data
    col = df['fitted_value']
    c = perform_statistical_calculations(col)
    # New tests go here.
  else:
    diagnostics["Extended tests"] = "Did not run extended tests because the dataset was not normal and could not be normalized."
  return (df, tests_run, diagnostics)
Listing 7-7

An updated test runner. For the sake of parsimony, most mentions of the diagnostics dictionary have been removed.

The first thing we do in the function is to perform statistical calculations, retrieving data such as mean, standard deviation, median, and MAD. We have also added the minimum (min), maximum (max), and count of observations (len) to the dictionary, as they will be useful for the upcoming tests. After this, we perform a normalization check and get a response indicating whether we should use the fitted results. Regardless of whether we decide to use the fitted results, we first want to use the original data to perform our checks from Chapter 6. The pragmatic reason is that doing so means we don’t need to change as many unit tests in this chapter—transformed data will, by nature of the transformation, not necessarily share the same outlier points as untransformed data.
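We do not reprint the full perform_statistical_calculations() function here, but a minimal sketch of its shape, assuming the Chapter 6 version already returned the mean, standard deviation, median, MAD, 25th and 75th percentiles, and IQR, might look like the following; the exact implementation in the book’s code repository may differ.
from scipy import stats
def perform_statistical_calculations(col):
  p25 = col.quantile(0.25)
  p75 = col.quantile(0.75)
  return {
    "mean": col.mean(), "sd": col.std(),
    "median": col.median(), "mad": stats.median_abs_deviation(col),
    "p25": p25, "p75": p75, "iqr": p75 - p25,
    # New in this chapter: minimum, maximum, and number of observations.
    "min": col.min(), "max": col.max(), "len": col.shape[0]
  }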

After running these initial tests, we mark them as having run and add our new tests as not having run. We also create columns in our DataFrame for the new tests. This way, any references to the columns later on are guaranteed to work. Finally, if the use_fitted_results flag is True, we will set the fitted value to our (potentially) transformed dataset and perform a new set of statistical calculations on the fitted data. Then, we ready ourselves to run any new tests. The following sections cover the three new tests we will introduce, why they are important, and how we will implement them.

Grubbs’ Test for Outliers

Grubbs’ test for outliers is based on Frank Grubbs’ 1969 article, Procedures for Detecting Outlying Observations in Samples (Grubbs, 1969). This test outlines a strategy to determine if a given dataset has an outlier. The null hypothesis of the test is that there are no outliers in the dataset, with the alternative hypothesis being that there is exactly one outlier in the dataset. There is no capacity to detect more than one outlier.

Although this test is quite limited in the number of outliers it can detect, it does a good job of finding the value with the largest spread from the mean and checking whether that value is far enough away to matter.
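For reference, the two-sided Grubbs test statistic is the largest absolute deviation from the sample mean, scaled by the sample standard deviation:

$$ G = \frac{\max_i \left| Y_i - \bar{Y} \right|}{s} $$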

We can run Grubbs’ test using the scikit_posthocs package in Python. This third-party package includes several outlier detection checks, including Grubbs’ test and another test we will use shortly. Calling the test is simple: ph.outliers_grubbs(col) is sufficient, where col is the column containing fitted values in our DataFrame. The function call then returns the set of inliers, which means we need to find the difference between our initial dataset and the resulting inlier set to determine if there are any outliers. Listing 7-8 shows one way to do this using set operations in Python.
def check_grubbs(col):
  out = ph.outliers_grubbs(col)
  return find_differences(col, out)
def find_differences(col, out):
  # Convert column and output to sets to see what's missing.
  # Those are the outliers that we need to report back.
  scol = set(col)
  sout = set(out)
  sdiff = scol - sout
  res = [0.0 for val in col]
  # Find the positions of missing inputs and mark them
  # as outliers.
  for val in sdiff:
    indexes = col[col == val].index
    for i in indexes: res[i] = 1.0
  return res
Listing 7-8

Finding differences between sets to determine if there are outliers

The find_differences() function takes in two datasets: our fitted values (col) and the output from the call to Grubbs’ test (out). We convert each of these to a set and then find the set difference, meaning any elements in the fitted values set that do not appear in the results set. Note that duplicate values are not allowed in a set, and so all occurrences of a particular outlier value will become outliers. For example, if we have a fitted values dataset of [ 0, 1, 1, 2, 3, 9000, 9000 ], the fitted values set becomes { 0, 1, 2, 3, 9000 }. If our output dataset is [ 0, 1, 1, 2, 3 ], we convert it to a set as well, making it { 0, 1, 2, 3 }. The set difference between these two is { 9000 }, so every value of 9000 will become an outlier. We do this by creating a new results list with as many entries as there are elements in the fitted values dataset. Then, for each value in the set difference, we find all occurrences of that value in the fitted values dataset and set the matching positions in our results list to 1.0 to indicate that those data points represent outliers. We then return the results list to the caller.

The reasons we loop through sdiff, despite there being only one possible element in the set, are twofold. First, this handles cleanly the scenario in which sdiff is an empty set, as then it will not loop at all. Second, we’re going to need this function again for the next test, the generalized ESD (or GESD) test.
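Here is a small, hand-worked illustration of find_differences() using the example values from above; the inlier list is hard-coded to stand in for whatever an outlier test might return.
import pandas as pd
col = pd.Series([0, 1, 1, 2, 3, 9000, 9000])
inliers = [0, 1, 1, 2, 3]   # Stand-in for the output of a test such as ph.outliers_grubbs(col)
print(find_differences(col, inliers))
# Prints [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0]: both occurrences of 9000 are marked as outliers.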

Generalized ESD Test for Outliers

The generalized extreme Studentized deviate test, otherwise known as GESD or generalized ESD, is the general form of Grubbs’ test. Whereas Grubbs’ test requires that we have either no outliers or one outlier, generalized ESD only requires that we specify an upper bound on the number of outliers. At this point, you may note that we already keep track of a max_fraction_anomalies user input, so we could multiply this fraction by the number of input items and use that as the upper bound. This idea makes a lot of sense in principle, but it does not always work out in practice. The reason for this is a concept known as degrees of freedom.

Degrees of freedom is a concept in statistics that relates to how many independent values we have in our dataset and therefore how much flexibility we have in finding missing information (in our case, whether a particular data point is an outlier) by varying those values. If we have a dataset containing 100 data points and we estimate that we could have up to five outliers, that leaves us with 95 “known” inliers. Generalized ESD is built on the t (or Student’s t) distribution, whose degrees of freedom calculation is the number of data points minus one. For this calculation, we take the 95 expected inliers, subtract 1, and end up with 94 degrees of freedom. Ninety-four degrees of freedom to find up to five outliers is certainly sufficient for the job. By contrast, if we allow 100% of data points to be outliers, we have no degrees of freedom and could get an error. Given that our default fraction of outliers is 1.0, using this as the input for generalized ESD is likely going to leave us in trouble.

Instead, let us fall back to a previous rule around outliers and anomalies: something is anomalous in part because it is rare. Ninety percent of our data points cannot be outliers; otherwise, the sheer number of data points would end up making these “outliers” become inliers! Still, we do not know the exact number of outliers in a given dataset, and so we need to make some estimation. The estimation we will make is that we have no more than 1/3 of data points as potential outliers. Along with a minimum requirement of 15 observations, this ensures that we have no fewer than 5 potential outliers for a given dataset and also that we have a minimum of 9 degrees of freedom. As the number of data points in our dataset increases, the degrees of freedom increases correspondingly, and our solution becomes better at spotting outliers.

Listing 7-9 shows how we can find the set of outliers from generalized ESD. It takes advantage of the find_differences() function in Listing 7-8. The implementation for generalized ESD comes from the same scikit_posthocs library as Grubbs’ test and therefore returns a collection of inlier points. Thus, we need to perform the set difference between input and output to determine which data points this test considers to be outliers.
def check_gesd(col, max_num_outliers):
  out = ph.outliers_gesd(col, max_num_outliers)
  return find_differences(col, out)
Listing 7-9

The code to check for outliers using generalized ESD

Dixon’s Q Test

The final test we will implement in this chapter is Dixon’s Q test, named after Wilfrid Dixon. Dixon’s Q test is intended to identify outliers in small datasets, with a minimum of three observations and a small maximum. The original equation for Dixon’s Q test specified a maximum of ten observations, although Dixon later came up with a variety of tests for different small datasets. We will use the original specification, the one commonly known as the Q test, with a maximum of 25 observations as later research showed that the Q test would hold with datasets of this size. The formula for the Q test is specified in Listing 7-10. We first sort all of the data in x in ascending order, such that x1 is the smallest value and xn is the largest.

Listing 7-10. The calculation for Q

$$ Q_{exp} = \frac{x_2 - x_1}{x_n - x_1} $$

Once we have calculated the value of Q, we compare it against a table of known critical values for a given confidence level. In our case, we will use the 95% confidence level. We check whether our calculated Q value is greater than or equal to the critical value; if so, the smallest data point is an outlier. We can perform a similar test, comparing the largest data point to its nearest neighbor, to see if the maximum data point is an outlier. This means that in practice, we can spot up to two outliers with Dixon’s Q test: the maximum value and the minimum value.
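As a quick worked example using the critical values we will see in Listing 7-11, take the sorted sample { 1, 2, 3, 4, 5, 100 }, where n = 6 and the critical value at the 95% confidence level is 0.625. Testing the maximum gives (100 - 5) / (100 - 1), or roughly 0.96, which exceeds the critical value, so 100 is flagged as an outlier. Testing the minimum gives (2 - 1) / (100 - 1), or roughly 0.01, well below the critical value, so 1 is not flagged.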

Listing 7-11 includes the code to perform Dixon’s Q test. This test is not available in scikit_posthocs, so we will create it ourselves.
def check_dixon(col):
  # Critical values of Q at the 95% confidence level, starting at n = 3.
  q95 = [0.97, 0.829, 0.71, 0.625, 0.568, 0.526, 0.493, 0.466,
    0.444, 0.426, 0.41, 0.396, 0.384, 0.374, 0.365, 0.356,
    0.349, 0.342, 0.337, 0.331, 0.326, 0.321, 0.317, 0.312,
    0.308, 0.305, 0.301, 0.29]
  # Map each number of observations n to its critical value.
  Q95 = {n:q for n, q in zip(range(3, len(q95) + 1), q95)}
  Q_mindiff, Q_maxdiff = (0,0), (0,0)
  sorted_data = sorted(col)
  Q_min = (sorted_data[1] - sorted_data[0])
  try:
    Q_min = Q_min / (sorted_data[-1] - sorted_data[0])
  except ZeroDivisionError:
    pass
  Q_mindiff = (Q_min - Q95[len(col)], sorted_data[0])
  Q_max = abs(sorted_data[-2] - sorted_data[-1])
  try:
    Q_max = Q_max / abs(sorted_data[0] - sorted_data[-1])
  except ZeroDivisionError:
    pass
  Q_maxdiff = (Q_max - Q95[len(col)], sorted_data[-1])
  res = [0.0 for val in col]
  if Q_maxdiff[0] >= 0:
    indexes = col[col == Q_maxdiff[1]].index
    for i in indexes: res[i] = 1.0
  if Q_mindiff[0] >= 0:
    indexes = col[col == Q_mindiff[1]].index
    for i in indexes: res[i] = 1.0
  return res
Listing 7-11

The code for Dixon’s Q test

We first create a list of critical values for a given number of observations at a 95% confidence level. We turn this into a dictionary whose key is the number of observations and whose value is the critical value for that many observations. After sorting the data, we calculate the minimum value’s Q, following the formula in Listing 7-10. Note that the division is in a try-except block so that if there is no variance, we do not get an error; in that case, the numerator is guaranteed to be 0 as well. After calculating Q_min, we subtract the appropriate critical value and call the result Q_mindiff. We can perform a very similar test with the maximum data point, comparing it to the second-largest value and dividing by the total range of the dataset.

After calculating Q_mindiff and Q_maxdiff, we check to see if either is greater than or equal to zero. A value less than zero means that the data point is an inlier, whereas a value greater than or equal to zero indicates an outlier. Finally, as with the other tests in this chapter, in the event that there are multiple data points with the same value, we mark all such ties as outliers.

Now that we have three tests available to us, we will need to update our calling code to reference these tests.

Calling the Tests

In Listing 7-7, we left a spot for the new tests. Listing 7-12 completes the function call. For the sake of convenience, I have included an if statement and a few preparatory lines from Listing 7-7 to provide more context on where this new code slots in.
if (use_fitted_results):
  df['fitted_value'] = fitted_data
  col = df['fitted_value']
  c = perform_statistical_calculations(col)
  diagnostics["Fitted calculations"] = c
  if (b['len'] >= 7):
    df['grubbs'] = check_grubbs(col)
    tests_run['grubbs'] = 1
  else:
    diagnostics["Grubbs' Test"] = f"Did not run Grubbs' test because we need at least 7 observations but only had {b['len']}."
  if (b['len'] >= 3 and b['len'] <= 25):
    df['dixon'] = check_dixon(col)
    tests_run['dixon'] = 1
  else:
    diagnostics["Dixon's Q Test"] = f"Did not run Dixon's Q test because we need between 3 and 25 observations but had {b['len']}."
  if (b['len'] >= 15):
    max_num_outliers = math.floor(b['len'] / 3)
    df['gesd'] = check_gesd(col, max_num_outliers)
    tests_run['gesd'] = 1
else:
  diagnostics["Extended tests"] = "Did not run extended tests because the dataset was not normal and could not be normalized."
Listing 7-12

Add the three new tests.

In order to run Grubbs’ test, we need at least 7 observations; running Dixon’s Q test requires between 3 and 25 observations; and generalized ESD requires a minimum of 15 observations. If we run a given test, we set the appropriate column in our DataFrame to indicate the results of the test; otherwise, the score will be -1 for a test that did not run.

Because we have new tests, we need updated weights. Keeping the initial tests’ weights the same, we will add three new weights: Grubbs’ test will have a weight of 0.05, Dixon’s Q test a weight of 0.15, and generalized ESD a weight of 0.30. These may seem low considering how much work we put into the chapter, but it is for good reason: Grubbs’ test and Dixon’s Q test will mark a maximum of one and two outliers, respectively. If there are more anomalies in our data, these tests will incorrectly label those data points as inliers. In practice, this provides a good signal for an extreme outlier but a much weaker signal for non-extremes. Generalized ESD has less of a problem with this, as we allow up to one-third of the data points to be outliers, and so its weight is commensurate with the other statistical tests.
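As a rough sketch, the updated weights dictionary might look like the following. The values for the three original tests are placeholders standing in for whatever you settled on in Chapter 6; only the three new entries reflect the weights described above.
weights = {
  # Placeholder values for the Chapter 6 tests; keep whatever you used there.
  "sds": 0.25,
  "iqrs": 0.35,
  "mads": 0.45,
  # New tests introduced in this chapter.
  "grubbs": 0.05,
  "dixon": 0.15,
  "gesd": 0.30
}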

Adding these new weights leads to an interesting problem with scoring. In Chapter 6, we made the score a simple summation of weights, allowing that the sum could exceed 1.0. We now have to deal with a new problem: if a test does not run, we don’t want to include its weight in the formula. Listing 7-13 shows the new version of the score_results() function, which now includes a set of tests we have run.
def score_results(df, tests_run, weights):
  tested_weights = {w: weights.get(w, 0) * tests_run.get(w, 0) for w in set(weights).union(tests_run)}
  max_weight = sum([tested_weights[w] for w in tested_weights])
  return df.assign(anomaly_score=(
    df['sds'] * tested_weights['sds'] +
    df['iqrs'] * tested_weights['iqrs'] +
    df['mads'] * tested_weights['mads'] +
    df['grubbs'] * tested_weights['grubbs'] +
    df['gesd'] * tested_weights['gesd'] +
    df['dixon'] * tested_weights['dixon']
  ) / (max_weight * 0.95))
Listing 7-13

A new way to score results

The function first creates a dictionary called tested_weights, which includes weights for all of the statistical tests we successfully ran, using the value of the weight if we did run the test or 0 otherwise. Then, we sum up the total weight of all tests. Finally, we calculate a new anomaly score as the sum of our individual scored results multiplied by the test weight and divide that by 95% of the maximum weight. The 0.95 multiplier on maximum weight serves a similar purpose to the way we oversubscribed on weights in Chapter 6. The idea here is that even if we have a low sensitivity score, we still want some of the extreme outliers to show up. As a result, our anomaly score for a given data point may go above 1.0, ensuring that the data point always appears as an outlier regardless of the sensitivity level.

With these code changes in place, we are almost complete. Now we need to make a few updates to tests, and we will have our outlier detection process in place.

Updating Tests

A natural result of adding new statistical tests to our ensemble is that some of the expected results in our unit and integration tests will necessarily change. We are adding these new statistical tests precisely because they give us added nuance, so there may be cases in which we had four outliers based on the results of Chapter 6, but now we have five or three outliers with these new tests. As such, we will want to observe the changes that these new statistical tests bring, ensure that the changes are fairly reasonable, and update the unit and integration tests to correspond with our new observations. In addition, we will want to add new tests to cover additional functionality we have added throughout this chapter.

Updating Unit Tests

Some of the tests we created in Chapter 6 will need to change as a result of our work throughout this chapter. The test case with the greatest number of changes is the test that ensures that sensitivity level changes affect the number of anomalies reported. There are several instances in which the number of outliers we saw in Chapter 6 will differ from the number we see in Chapter 7. For example, when we set the sensitivity level to 95 in Chapter 6, we got back 15 outliers. With these new tests in place, we are down to 12 outliers. With the sensitivity level at 85, we went from eight down to five outliers, 75 moves from five to two, and a sensitivity level of 1 goes from one outlier to zero outliers.

This turns out to be a positive outcome, as a human looking at the stream { 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 7, 8, 9, 10, 2550, 9000 } sees two anomalies: 2550 and 9000. Ideally, our outlier detection engine would return only those two values as outliers. We can see that at least in the sensitivity range of 25 to 75 (which includes our default of 50), we do in fact get two outliers. There are a few other tests that need minor changes as well for similar reasons.

In addition to these alterations to existing tests, we should create new test cases. Inside the test_univariate.py file, there are several new test datasets. The first is uniform_data, which is a list ranging from 1 to 10. Then, we have normal_data, which is randomly generated data following from a normal distribution with a mean of 50 and a standard deviation of 5. The third dataset, skewed_data, is the normal data list with the exception of four values we multiplied by 1000. Next, we have single_skewed_data, in which only a single value was multiplied by 1000. Finally, larger_uniform_data includes the range from 1 to 50. Listing 7-14 shows a new test, which ensures that normalization calls return the expected results as to whether a given dataset appears to be normal or not. One interesting thing to note is that the Shapiro-Wilk and Anderson-Darling tests tend to evaluate uniformly distributed data as close enough to normal but D’Agostino’s K-squared test correctly catches—at least on the larger uniform dataset—that this is a different distribution.
@pytest.mark.parametrize("df_input, function, expected_normal", [
  (uniform_data, check_shapiro, True),
  (normal_data, check_shapiro, True),
  (skewed_data, check_shapiro, False),
  (single_skewed_data, check_shapiro, False),
  (larger_uniform_data, check_shapiro, True),
  (uniform_data, check_dagostino, True),
  (normal_data, check_dagostino, True),
  (skewed_data, check_dagostino, False),
  (single_skewed_data, check_dagostino, False),
  (larger_uniform_data, check_dagostino, False), # Note that this is different from the other tests!
  (uniform_data, check_anderson, True),
  (normal_data, check_anderson, True),
  (skewed_data, check_anderson, False),
  (single_skewed_data, check_anderson, False),
  (larger_uniform_data, check_anderson, True),
])
def test_normalization_call_returns_expected_results(df_input, function, expected_normal):
  # Arrange
  # Act
  (is_normal, result_str) = function(df_input)
  # Assert: the distribution is/is not normal, based on our expectations.
  assert(expected_normal == is_normal)
Listing 7-14

A test to ensure that our normalization tests execute as expected on various types of datasets

In this test, we do not need to perform any special preparation work, and so the Arrange section of our test is empty. We can also see that for the most part, there is agreement between these tests, at least for the five extreme cases I chose. We could find other situations in which the tests disagree with one another, but this should prove sufficient for the main purpose of our test.

The natural follow-up to this test is one ensuring that the call to is_normally_distributed returns the value that we expect. Listing 7-15 provides the code for this test.
@pytest.mark.parametrize("df_input, function, expected_normal", [
  (uniform_data, is_normally_distributed, True),
  (normal_data, is_normally_distributed, True),
  (skewed_data, is_normally_distributed, False),
  (single_skewed_data, is_normally_distributed, False),
  (larger_uniform_data, is_normally_distributed, False),
])
def test_is_normally_distributed_returns_expected_results(df_input, function, expected_normal):
  # Arrange
  df = pd.DataFrame(df_input, columns={"value"})
  col = df['value']
  # Act
  (is_normal, result_str) = function(col)
  # Assert: the distribution is/is not normal, based on our expectations.
  assert(expected_normal == is_normal)
Listing 7-15

Combining the three normalization tests provides the expected result

Because there is disagreement on the larger uniform dataset case, we expect this function to return False. In order for the test to return True, all three statistical tests must be in agreement.

After checking the results of valid tests, we should go further and make sure that the call to perform Box-Cox transformation only occurs when a given dataset meets our minimum criteria: we must have a minimum of eight data points, there must be some variance in the data, and we may not have any values less than or equal to 0 in our dataset. Listing 7-16 includes the code for this unit test, as well as a few new datasets for our test cases.
@pytest.mark.parametrize("df_input, should_normalize", [
  (uniform_data, True), # Is naturally normal (enough)
  (normal_data, True), # Is naturally normal
  (skewed_data, True),
  (single_skewed_data, True),
  # Should normalize because D'Agostino is false.
  (larger_uniform_data, True),
  # Not enough datapoints to normalize.
  ([1,2,3], False),
  # No variance in the data.
  ([1,1,1,1,1,1,1,1,1,1,1,1,1,1,1], False),
  # Has a negative value--all values must be > 0.
  ([100,20,3,40,500,6000,70,800,9,10,11,12,13,-1], False),
  # Has a zero value--all values must be > 0.
  ([100,20,3,40,500,6000,70,800,0,10,11,12,13,14], False),
])
def test_perform_normalization_only_works_on_valid_datasets(df_input, should_normalize):
  # Arrange
  df = pd.DataFrame(df_input, columns={"value"})
  base_calculations = perform_statistical_calculations(df['value'])
  # Act
  (did_normalize, fitted_data, normalization_diagnostics) = perform_normalization(base_calculations, df)
  # Assert: the distribution is/is not normal, based on our expectations.
  assert(should_normalize == did_normalize)
Listing 7-16

Ensure that we only perform normalization on valid datasets

The comments next to each test case explain why we expect the results we get. Listing 7-17 shows the final test we add, which indicates whether a given statistical test may run based on the dataset we pass in. For example, Grubbs’ test requires a minimum of seven data points, so if we have fewer than seven, we cannot run that check. The specific test cases are in the test_univariate.py file, with Listing 7-17 showing only the calling code.
def test_normalization_required_for_certain_tests(df_input, test_name, test_should_run):
  # Arrange
  df = pd.DataFrame(df_input, columns={"value"})
  # Act
  (df_tested, tests_run, diagnostics) = run_tests(df)
  # Assert: the distribution is/is not normal, based on our expectations.
  assert(test_should_run == diagnostics["Tests Run"][test_name])
Listing 7-17

The code to determine if we run a specific statistical test

We can confirm whether a test ran by reading through the diagnostics dictionary, looking inside the Tests Run object for a particular test name.

After creating all of the relevant tests, we now have 80 test cases covering a variety of scenarios. We will add more test cases and tests as we expand the product, but this will do for now. Instead, we shall update some existing integration tests and then add new ones for Chapter 7.

Updating Integration Tests

Just like with the unit tests, we will need to perform some minor alterations to integration tests. For example, the integration test named Ch06 – Univariate – All Inputs – No Constraint returned 14 outliers in Chapter 6, but with the new tests, that number drops to 8.

Aside from these changes, we will add three more integration tests indicating whether normalization succeeds, fails, or is unnecessary. Listing 7-18 shows the test cases for a Postman-based integration test in which we expect normalization to succeed.
pm.test("Status code is 200: " + pm.response.code, function () { pm.expect(pm.response.code).to.eql(200); });
pm.test("Response time is acceptable (range 5ms to 7000ms): " + pm.response.responseTime + "ms", function () { pm.expect(_.inRange(pm.response.responseTime, 5, 7000)).to.eql(true); });
var body = JSON.parse(responseBody);
pm.test("Number of items returned is correct (eq 17): " + body["anomalies"].length, function () { pm.expect(body["anomalies"].length).to.eql(17); });
pm.test("Debugging is enabled (debug_details exists): " + body["debug_details"], function () { pm.expect(body["debug_details"].exist); });
pm.test("Count of anomalies is correct (eq 2)", () => {
  let anomalies = body["anomalies"].filter(a => a.is_anomaly === true);
  pm.expect(anomalies.length).to.eql(2);
} );
pm.test("We have fitted results (Fitted Lambda exists): " + body["debug_details"]["Test diagnostics"]["Fitted Lambda"], function () { pm.expect(body["debug_details"]["Test diagnostics"]["Fitted Lambda"].exist); });
Listing 7-18

We expect normalization to take place for this call.

The final test, in which we check for the existence of a fitted lambda (indicating that we performed Box-Cox transformation), is the way in which we indicate that normalization did occur as we expected. We have similar checks in the other tests as well.

Multi-peaked Data

Now that our univariate anomaly detection engine is in place, we are nearly ready to wrap up this chapter. Before we do so, however, I want to cover one more topic: multi-peaked data.

A Hidden Assumption

Thus far, we have worked from a fairly straightforward and reasonable assumption: our data is approximately normally distributed. Furthermore, we assume that our data is single peaked. Figure 7-4 is a reprint of Figure 7-1, which we saw at the beginning of the chapter.


Figure 7-4

Two distributions of data, each of which has a single peak

This figure demonstrates two single-peaked curves. There is a clear central value in which density is highest. By contrast, Figure 7-5 shows an example of a multi-peaked curve with several competing groups of values.


Figure 7-5

A dataset with two clear peaks

In this case, values at the extremes—that is, less than approximately -3 or greater than approximately 20—are still outliers, but so are some central values ranging from approximately 3 to 8. It turns out that our existing anomaly detection tests do not fare well on multi-peaked data. For example, in the integration test library, I have added one more test entitled Ch07 – Univariate – Outliers Slip in with Multi-Peaked Data. This test checks the following dataset: { 1, 2, 3, 4, 5, 6, 50, 95, 96, 97, 98, 99, 100 }. Based on our human understanding of outliers, we can see that the outlier in this dataset is 50. If we run our existing outlier detection program on this dataset, however, we end up with the surprising result in Table 7-1.
Table 7-1

The results of a test with multi-peaked data

Value    Anomaly Score
1        0.169
2        0.164
3        0.158
4        0.153
5        0.149
6        0.146
50       0.001
95       0.148
96       0.151
97       0.155
98       0.160
99       0.165
100      0.171

Not only does 50 not get marked as an outlier, it’s supposedly the least likely data point to be an outlier! The reason for this is that our statistical tests are focused around the mean and median, both of which are (approximately) 50 in this case.
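A quick check bears this out: the 13 values sum to 656, so the mean is roughly 50.5, and the median (the seventh value in sorted order) is exactly 50. The value 50 therefore sits almost exactly on both measures of center, which is why every distance-from-center test scores it as the most ordinary point in the dataset.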

The Solution: A Sneak Peek

Now that we see what the problem is, we can come up with a way to solve this issue. The kernel of understanding we need for a solution is that we should not necessarily think of the data as a single-peaked set of values following from an approximately normal distribution. Instead, we may end up with cases in which we have groups of data, which we will call clusters. Here, we have two clusters of values: one from 1 to 6 and the other from 95 to 100. We will have much more to say about clustering in Part III of the book.

Conclusion

In this chapter, we extended our capability to perform univariate anomaly detection. We learned how to check whether an input dataset appears to follow from a normal distribution, how to perform a Box-Cox transformation on the data if it does not, and how to incorporate additional tests that require normalized data in order to work. Our univariate anomaly detector has grown to nearly 400 lines of code and has become better at picking out outliers, returning fewer spurious outliers as we increase the sensitivity level.

In the next chapter, we will take a step back from this code base. We will build a small companion project to visualize our results, giving us an easier method to see outputs than reading through hundreds of lines of JSON output.
