© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
K. Feasel, Finding Ghosts in Your Data, https://doi.org/10.1007/978-1-4842-8870-2_11

11. Local Correlation Integral (LOCI)

Kevin Feasel
Durham, NC, USA

Chapter 10 introduced us to one multivariate clustering technique in Connectivity-Based Outlier Factor (COF). COF is a popular technique for multivariate outlier detection, but it does come with the downside that there is little immediate guidance outside of “higher outlier scores are worse.” We lack solid guidance on what, exactly, represents a sufficiently large outlier score. In this chapter, we introduce a complementary technique to COF that does come with its own in-built guidance: Local Correlation Integral.

Local Correlation Integral

Local Correlation Integral (LOCI) is another density-based approach like Local Outlier Factor (LOF) and Connectivity-Based Outlier Factor (COF). Papadimitriou et al. describe one key difference between LOCI and other density-based measures: LOCI “provides an automatic, data-dictated cut-off to determine whether a point is an outlier” (Papadimitriou et al., 1). This built-in cutoff means that users will not need to provide any information outside of the data itself. Over the course of this section, we will see how the algorithm works; then, we will incorporate it into our multivariate outlier detection engine.

Discovering the Neighborhood

An important difference between LOCI and other density-based approaches like LOF or COF is that LOCI does not ask for an explicit number of neighbors in a neighborhood. Instead, it takes a parameter called alpha (α), with 0 < α < 1, which controls the size of the counting neighborhood relative to the sampling neighborhood. The default value for α is 0.5; Papadimitriou et al. chose this value for their experiments (Papadimitriou et al., 5), and we will stick with it.

Along with α, we need a variable r, which represents the radius of our sampling neighborhood: the set of points against which we will compare a given point. Specifically, we determine whether that data point is an outlier based on the number of neighbors within a circle of radius α * r, otherwise known as the counting neighborhood. Figure 11-1 provides a visual example of this.


Figure 11-1

A representation of the sampling neighborhood (with radius r) and the counting neighborhood (with radius α * r)

Within our counting neighborhood α * r, we count the number of data points that exist, giving us an idea of how densely packed data points are near our point of interest. Then, to gain an understanding of how that compares to other points, we repeat the process for each point inside our sampling neighborhood. Figure 11-2 fills in additional detail, introducing two additional points: p2 and p3. Because these two points exist inside p1’s sampling neighborhood, we use these two points to compare against p1. The results we come up with are that p1 has one data point in its counting neighborhood, p2 has three, and p3 has five, noting that counts are inclusive of the tested data point.


Figure 11-2

The counting neighborhoods of three data points within p1’s sampling neighborhood

From the image in Figure 11-2, we can determine two important criteria. The first is the number of data points inside the counting neighborhood, which we can represent as n(pi, αr). This comes out to 1 for p1, 3 for p2, and 5 for p3, respectively.

The second criterion we can determine is the mean of αr-neighbors within a given sampling neighborhood, which we can represent as $$ \hat{n}\left({p}_i,r,\alpha\right) $$. In this case, the mean is 3.
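
To make these two criteria concrete, here is a minimal sketch of both calculations, assuming a small NumPy array of points; it is an illustration rather than the implementation we will use later in the chapter.

import numpy as np
from scipy.spatial.distance import cdist

def neighborhood_counts(points, i, r, alpha=0.5):
  # Pairwise distances between every point in the dataset.
  dist = cdist(points, points)
  # The sampling neighborhood: all points within r of points[i].
  sampling = np.where(dist[i] <= r)[0]
  # Each sampling neighbor's counting-neighborhood size, n(p_j, alpha*r),
  # inclusive of the point itself.
  counts = {j: int(np.sum(dist[j] <= alpha * r)) for j in sampling}
  # The mean of those counts, n-hat(p_i, r, alpha).
  n_hat = np.mean(list(counts.values()))
  return (counts, n_hat)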

With these two criteria, we can calculate the Multi-granularity Deviation Factor.

Multi-granularity Deviation Factor (MDEF)

The Multi-granularity Deviation Factor (MDEF) is defined using the formula in Listing 11-1.
Listing 11-1

The formula for Multi-granularity Deviation Factor

$$ MDEF\left({p}_i,r,\alpha\right)=1-\frac{n\left({p}_i,\alpha r\right)}{\hat{n}\left({p}_i,r,\alpha\right)} $$

In other words, we take the size of our counting neighborhood and compare it to other counting neighborhoods in our sampling area. In our example, p1 has an MDEF of 2/3, p2 an MDEF of 0, and p3 an MDEF of -2/3.
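
As a quick sanity check on those numbers:

n_hat = (1 + 3 + 5) / 3   # Mean counting-neighborhood size: 3.0
mdef_p1 = 1 - 1 / n_hat   # 2/3: far fewer neighbors than its peers
mdef_p2 = 1 - 3 / n_hat   # 0: exactly as dense as the neighborhood average
mdef_p3 = 1 - 5 / n_hat   # -2/3: denser than the neighborhood average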

The actual outlier detection process uses the standard deviation of counting neighborhood sizes over the sampling neighborhood, normalized by their mean, as we see in Listing 11-2.

Listing 11-2

Calculating the standard deviation of MDEF

$$ {\sigma}_{MDEF}\left({p}_i,r,\alpha\right)=\frac{\sigma_{\hat{n}}\left({p}_i,r,\alpha\right)}{\hat{n}\left({p}_i,r,\alpha\right)} $$

Once we have the standard deviation of MDEF, we get to the other input parameter aside from α: k, the number of standard deviations of MDEF beyond which we consider a data point an outlier. To this, we can apply our rule of thumb from Part I of the book: three standard deviations from the mean marks a data point as an outlier. Papadimitriou et al. describe the formal logic behind this calculation and indicate that when k = 3, we can expect fewer than 1% of data points to trigger that threshold if the set of neighborhood sizes follows a normal distribution (Papadimitriou et al., 5).
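
Putting these pieces together, a data point pi is flagged as an outlier at a given sampling radius r when its deviation exceeds k standard deviations:

$$ MDEF\left({p}_i,r,\alpha\right)>k\cdot {\sigma}_{MDEF}\left({p}_i,r,\alpha\right) $$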

Now that we have some understanding of how the LOCI algorithm works, we can incorporate it into our multivariate outlier detection model.

Multivariate Algorithm Ensembles

As we begin to integrate LOCI with COF, it is a good idea to take a step back and understand the options available to us with respect to ensembling. The first part of this section will cover the key types of ensembles available to us. The second part of this section will give us a chance to tighten our use of COF via ensemble. The third part of this section will add LOCI to the mix as well.

Ensemble Types

There are two main classes of ensemble: independent ensembles and sequential ensembles. In Part II, we implemented a form of independent ensemble using weighted shares. The main idea behind an independent ensemble is that we run each test separately, generate each output separately, and combine results at the end. Typically, we return a data point as an anomaly if all—or at least enough—of the tests agree. We can also weight tests differently based on their quality, reducing, for example, the weight of noisy tests relative to their less error-prone counterparts.

Sequential ensembles, by contrast, stack models like a sieve: the output of one model serves as the input for the next. In practice, we tend to have the most sensitive (and often noisiest) model go first, giving us a provisional set of outliers. Then, for each additional algorithm, we need only perform the next set of calculations on the remaining set of candidate outliers; if the prior, noisier models think a data point is an inlier, we accept it as such and save the computational effort. The main benefit of a sequential ensemble is that it can quickly narrow down enormous datasets: if you are dealing with tens or hundreds of millions of records, a noisy first pass should still filter out 90–95% of the data points out of the gate, leaving a much smaller fraction for the less noisy but more computationally expensive methods.

The major downside to sequential ensembles is that they tend to require more programming effort, as out-of-the-box implementations of algorithms in libraries like PyOD assume you will care about the entire dataset rather than a small selection of data points. In an effort to minimize complexity, we will again use an independent ensemble for multivariate analysis. This is an area, however, in which it can make sense to invest in developing a sequential ensemble, as the implementations of COF and (especially) LOCI we will use can be time-intensive even for small datasets.
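
To make the contrast concrete, a minimal sketch of the sequential pattern might look like the following, where noisy_detect and precise_detect are hypothetical detector callables; this is not the approach we implement in this chapter.

import numpy as np

def sequential_ensemble(data, noisy_detect, precise_detect):
  # Stage 1: a fast, sensitive test over the full dataset returns a
  # Boolean mask of provisional outliers.
  candidates = noisy_detect(data)
  verdict = np.zeros(len(data), dtype=bool)
  if not candidates.any():
    return verdict
  # Stage 2: the expensive test runs only for the surviving candidates.
  # Note that a density-based second stage still needs the full dataset
  # as its reference set, which is part of the extra programming effort
  # mentioned above.
  candidate_idx = np.where(candidates)[0]
  verdict[candidate_idx] = precise_detect(data, candidate_idx)
  return verdict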

Before we bring LOCI into the mix, it is worth taking a return trip to COF.

COF Combinations

By the end of Chapter 10, we created a multivariate outlier detection system using COF as the centerpiece. COF is a great algorithm, but it does require that callers choose the size of the neighborhood and the cutoff point for what constitutes an outlier. Our implementation partially handled the second item by setting a minimum boundary of 1.35 but left the choice of number of neighbors entirely to the caller. This is a reasonable solution, but we can do better: we can try COF with different combinations of neighbor counts (within reason) and combine the results of each test together as an ensemble. To do so, we need the combination module in PyOD.

Listing 11-3 shows our updated run_tests() function. This function now takes the max fraction of anomalies and sends it to COF. We also perform some of the preparatory work, such as converting the encoded values to an array, in this function rather than inside check_cof(), because we will need the same data for LOCI.
from pyod.models.combination import aom, moa, average, median, maximization, majority_vote
import numpy as np
# Code elided
def run_tests(df, max_fraction_anomalies, n_neighbors):
  num_records = df['key'].shape[0]
  tests_run = {
    "cof": 1
  }
  diagnostics = {
    "Number of records": num_records
  }
  # Drop the non-feature columns, leaving a NumPy array for PyOD.
  col_array = df.drop(["key", "vals"], axis=1).to_numpy()
  # Try neighbor counts from the caller's value upward in increments of 5,
  # capped at n_neighbors + 100 or five less than the number of records.
  n_neighbor_range = range(n_neighbors, min(num_records - 5, n_neighbors + 100), 5)
  n_neighbor_range_len = len(n_neighbor_range)
  # One column of labels and one column of scores per COF run.
  labels_cof = np.zeros([num_records, n_neighbor_range_len])
  scores_cof = np.zeros([num_records, n_neighbor_range_len])
  for idx,n in enumerate(n_neighbor_range):
    (labels_cof[:, idx], scores_cof[:, idx], diag_idx) = check_cof(col_array, max_fraction_anomalies=max_fraction_anomalies, n_neighbors=n)
    k = "Neighbors_" + str(n)
    diagnostics[k] = diag_idx
  # Combine the runs: majority vote for labels, median for scores.
  df["is_raw_anomaly_cof"] = majority_vote(labels_cof)
  anomaly_score = median(scores_cof)
  df["anomaly_score"] = anomaly_score
  return (df, tests_run, diagnostics)
Listing 11-3

The run_tests() function for an ensemble of COF

In this function, we use the range() function to generate a series of nearest neighbor counts to try, starting with the user-defined total (or its default of 10) and moving up by increments of 5 to the lesser of 100 plus the original nearest neighbor count or five less than the number of records in the dataset. For example, with 1000 data points and a starting nearest neighbor count of 10, we will get back a total of 20 values, ranging from 10 to 105, because the upper bound of the range is exclusive. By contrast, if we have 20 data points and a starting nearest neighbor count of 4, we will get back three values: 4, 9, and 14. We then build out two-dimensional arrays named labels_cof and scores_cof to store the results of each test. For each value in the nearest neighbor range, we call check_cof(), sending in the column array, max fraction of anomalies, and number of neighbors, storing the results in our arrays.
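
A quick check confirms the arithmetic for the first example:

# 1000 records and a starting nearest neighbor count of 10.
neighbor_counts = list(range(10, min(1000 - 5, 10 + 100), 5))
print(neighbor_counts[0], neighbor_counts[-1], len(neighbor_counts))
# 10 105 20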

After we have taken care of running COF the prescribed number of times, we need to combine the results somehow. PyOD includes several mechanisms for determining the final value for an outlier score or label. Three options are intuitive: median(), average() (which computes the mean), and maximization(). Another useful option is majority_vote(), which we use for determining our label values. Two other interesting options are Average of Maximum (AOM) and Maximum of Average (MOA). AOM and MOA require a number of buckets, which defaults to 5. Then, we randomly allocate results to each of the buckets. For AOM, we find the maximum value in each bucket and calculate the mean across those bucket maxima. For MOA, we calculate the mean of each bucket and return the maximum of those values. With both measures, we can smooth out scores across runs for a given data point. For example, if a COF score for a given data point is normally in the 1.1 to 1.3 range but a single run results in a COF of 7, this can affect our average() and maximization() functions but will have a much smaller effect on median(), MOA, and AOM. There can be some benefit to using AOM or MOA to calculate our response, but median will work well enough and also ensures that we do not need to consider how many buckets we would need for any given dataset. For that reason, we will use the median to calculate our anomaly score.
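
To see how these combination functions behave, here is a small demonstration; the score values are invented for illustration.

import numpy as np
from pyod.models.combination import aom, moa, average, median, maximization, majority_vote

# Rows are data points; columns are ten COF runs. The first point has
# one wild run (7.0) among otherwise stable scores.
scores = np.array([
  [1.1, 1.2, 1.3, 1.2, 7.0, 1.1, 1.2, 1.3, 1.2, 1.1],
  [3.0, 3.1, 2.9, 3.2, 3.0, 2.8, 3.1, 3.0, 2.9, 3.2]
])
print(average(scores))           # Per-point mean, pulled upward by the 7.0
print(maximization(scores))      # Per-point max, dominated by the 7.0
print(median(scores))            # Per-point median, robust to the single 7.0
print(aom(scores, n_buckets=5))  # Mean of the five per-bucket maxima
print(moa(scores, n_buckets=5))  # Max of the five per-bucket means
labels = (scores > 1.35).astype(int)  # Per-run labels at the COF cutoff
print(majority_vote(labels))     # 1 only when most runs agree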

Moving on, Listing 11-4 shows the new version of check_cof(). Compared to the version in Chapter 10, this function accepts a column array rather than a DataFrame, includes the max fraction of anomalies as its contamination percentage, and returns arrays rather than DataFrame columns.
from pyod.models.cof import COF
def check_cof(col_array, max_fraction_anomalies, n_neighbors):
  # The contamination parameter is the expected proportion of outliers.
  clf = COF(n_neighbors=n_neighbors, contamination=max_fraction_anomalies)
  clf.fit(col_array)
  diagnostics = {
    "COF Contamination": clf.contamination,
    "COF Threshold": clf.threshold_
  }
  return (clf.labels_, clf.decision_scores_, diagnostics)
Listing 11-4

A simple function to run our COF test
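
As a quick smoke test of check_cof(), the following hypothetical invocation builds 200 well-clustered points plus one distant point that the test should flag:

import numpy as np

rng = np.random.default_rng(11)
col_array = np.vstack([rng.normal(size=(200, 3)), [[25.0, 25.0, 25.0]]])
(labels, scores, diag) = check_cof(col_array, max_fraction_anomalies=0.1, n_neighbors=10)
print(labels[-1], scores[-1], diag["COF Threshold"])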

By incorporating a variety of tests covering different numbers of neighbors, we have a more robust calculation of COF, and we can also remove an unnecessary decision from our users in how many neighbors to choose. We are now ready to incorporate LOCI into the mix.

Incorporating LOCI

The changes we made in the prior section to the run_tests() function will allow us to incorporate LOCI in a straightforward manner. Listing 11-5 includes an updated run_tests() function as well as the LOCI process itself.
from pyod.models.loci import LOCI
def run_tests(df, max_fraction_anomalies, n_neighbors):
  num_records = df['key'].shape[0]
  # PyOD's LOCI implementation is expensive, so skip it for larger datasets.
  if (num_records > 1000):
    run_loci = 0
  else:
    run_loci = 1
  tests_run = {
    "cof": 1,
    "loci": run_loci
  }
  diagnostics = {
    "Number of records": num_records
  }
  col_array = df.drop(["key", "vals"], axis=1).to_numpy()
  # COF-specific code removed for clarity
  anomaly_score = median(scores_cof)
  if (run_loci == 1):
    (labels_loci, scores_loci, diag_loci) = check_loci(col_array)
    df["is_raw_anomaly_loci"] = labels_loci
    # Add the LOCI score to the median COF score for each data point.
    anomaly_score = anomaly_score + scores_loci
    diagnostics["LOCI"] = diag_loci
  df["anomaly_score"] = anomaly_score
  return (df, tests_run, diagnostics)
def check_loci(col_array):
  # Keep the defaults of alpha=0.5 and k=3; LOCI determines r itself.
  clf = LOCI()
  clf.fit(col_array)
  diagnostics = {
    "LOCI Threshold": clf.threshold_
  }
  return (clf.labels_, clf.decision_scores_, diagnostics)
Listing 11-5

Incorporating LOCI in the multivariate outlier detection process

The first change we see determines whether to run LOCI. As of the time of writing, the implementation of LOCI in PyOD is very inefficient and can struggle with larger datasets. For this reason, if our dataset is too large, we will not run LOCI and will depend on the more efficient implementation of COF. Assuming we do run LOCI, we call the check_loci() function. Note that we do not pass in values of α or k; we leave those at their defaults of 0.5 and 3, respectively. The LOCI algorithm also determines the value of r for us automatically, so we do not need to pass anything in for that parameter. What we get back from the check function includes an array of scores, one outlier score per data point. The result of the median(scores_cof) function call is also an array of scores, containing the median data score per data point across all of our COF tests. We simply add these two arrays together, giving us a combined score with no weighting factor. The intuition here is that our minimum threshold for COF is 1.35 and our threshold for LOCI is 3. We expect both values to be fairly close to one another for inliers and near-outliers, so by adding them together, we let a very strong decision by one model out-vote a tepid response from the other. For example, suppose LOCI returns a score of 2.8, which is close but not quite an outlier. If COF returns a score of 20, well above the 1.35 minimum threshold, the strength of COF’s conviction outweighs LOCI’s lukewarm judgement.
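
The following hypothetical scores for three data points illustrate that intuition; the values are invented.

import numpy as np

scores_cof_median = np.array([1.2, 1.4, 20.0])   # Median COF score per point
scores_loci = np.array([1.0, 2.8, 2.8])          # LOCI score per point
anomaly_score = scores_cof_median + scores_loci  # [2.2, 4.2, 22.8]
# Against the combined threshold of 1.35 + 3 = 4.35, only the third
# point, carried by COF's strong conviction, clears the bar.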

We can see this play out in Listing 11-6, the updated function to determine outliers. This function previously took three inputs: a DataFrame, our sensitivity score, and the maximum fraction of anomalies we would allow. Now we add two more parameters to the mix: the set of tests run and the sensitivity factors for each test, which are 1.35 for COF and 3 for LOCI.
import numpy as np
def determine_outliers(
  df,
  tests_run,
  sensitivity_factors,
  sensitivity_score,
  max_fraction_anomalies
):
  # Sum the sensitivity factors of the tests we actually ran.
  tested_sensitivity_factors = {sf: sensitivity_factors.get(sf, 0) * tests_run.get(sf, 0) for sf in set(sensitivity_factors).union(tests_run)}
  sensitivity_threshold = sum([tested_sensitivity_factors[w] for w in tested_sensitivity_factors])
  diagnostics = { "Sensitivity threshold": sensitivity_threshold }
  # Scale the raw sensitivity score against the second-largest anomaly score.
  second_largest = df['anomaly_score'].nlargest(2).iloc[1]
  sensitivity_score = (100 - sensitivity_score) * second_largest / 100.0
  diagnostics["Raw sensitivity score"] = sensitivity_score
  max_fraction_anomaly_score = np.quantile(df['anomaly_score'], 1.0 - max_fraction_anomalies)
  diagnostics["Max fraction anomaly score"] = max_fraction_anomaly_score
  # Use the max fraction of anomalies cutoff if it is the stricter of the two.
  if max_fraction_anomaly_score > sensitivity_score and max_fraction_anomalies < 1.0:
    sensitivity_score = max_fraction_anomaly_score
    diagnostics["Sensitivity score"] = sensitivity_score
  # A data point is an outlier only if its score clears both the sensitivity
  # score and the sensitivity threshold.
  return (df.assign(is_anomaly=df['anomaly_score'] > np.max([sensitivity_score, sensitivity_threshold])), diagnostics)
Listing 11-6

The latest form of our function to determine which data points are outliers

We begin the function by calculating the sensitivity threshold, which will be 1.35 if we only ran COF and 4.35 if we ran both COF and LOCI. After that, we calculate the sensitivity score based on the second-largest data point, just as we did in Chapter 10. Then, we determine whether we need to use the sensitivity score or the max fraction of anomalies to act as our cutoff, again as in Chapter 10. Finally, the return statement creates a new is_anomaly column based on whether a data point's anomaly score clears the larger of our sensitivity score and our sensitivity threshold. One final difference in this function is that we return diagnostic data relating to the score calculations, allowing a user to diagnose issues.
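
A hypothetical invocation might look like the following, using the sensitivity factors discussed above:

sensitivity_factors = {"cof": 1.35, "loci": 3.0}
tests_run = {"cof": 1, "loci": 1}
(df_out, diagnostics) = determine_outliers(df, tests_run, sensitivity_factors,
  sensitivity_score=50, max_fraction_anomalies=0.1)
# With both tests run, a data point must clear the larger of the calculated
# sensitivity score and the 4.35 threshold for is_anomaly to be True.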

Test and Website Updates

Whenever we introduce a new algorithm into an ensemble or change the way we calculate things, it makes sense to review the tests and ensure that behaviors change in a way we expect. In this case, we have some tests that expect to return zero outliers, some that return one, and a broader sample dataset with no fixed number of outliers. In this section, we will review those test results and see what fails. Then, we will move to website updates. Note that there is no section on integration tests because those will continue to pass as expected.

Unit Test Updates

There are two changes of note with respect to unit tests. The first is that by running multiple iterations of COF and incorporating LOCI, the number of outliers becomes rather stable: shifts in the sensitivity score and max fraction of anomalies have very little impact until you get to the extremes. On the whole, this is an interesting outcome for us as it means end users should not need to worry about picking the “right” sensitivity score or max fraction of anomalies. The key downside, however, is the loss of control for more advanced users. In the sample input provided with the multivariate test cases, we see five outliers with a sensitivity score of 5. Trying again with a sensitivity score of 25, the number of outliers jumps to eight, and it stays at eight the rest of the way to a sensitivity score of 100. If users anticipate smooth (or at least regular) increases in the number of outliers based on the sensitivity score, they might be confused by this result.

The other change of note comes from the fact that LOCI is more sensitive than COF, at least on our test datasets. For example, the sample inputs with no outliers and one outlier both contain a record whose first input is 87.9 and whose third input is 6180.90844, both of which represent “extreme” values. 87.9 happens to be the largest value we see in the first input and 6180.90844 the smallest value we see in the third, but both are well within 1% of their respective medians. LOCI is capable of spotting this minute difference and flags the data point as an unexpected outlier. To ensure that the tests return the appropriate number of outliers, changing the third value to its median value of 6185.90844 was sufficient to resolve the issue. This level of sensitivity is something to watch out for when using algorithms like COF and LOCI, especially given the first insight around the relative lack of user-controllable sensitivity adjustments.

Website Updates

The website changes in this section are fairly minor and relate to additional debug information we now collect. Listing 11-7 shows the relevant code block.
if debug:
  # Two side-by-side columns: tests run and outlier determinants.
  col11, col12 = st.columns(2)
  with col11:
    st.header("Tests Run")
    st.write(res['debug_details']['Tests run'])
    st.write(res['debug_details']['Test diagnostics'])
  with col12:
    st.header("Outlier Determinants")
    st.write(res['debug_details']['Outlier determination'])
  # Below the columns, render the entire debug payload as JSON.
  st.header("Full Debug Details")
  st.json(res['debug_details'])
Listing 11-7

Writing out details if debug mode is enabled

The first column of our debug menu shows the set of tests run as well as diagnostic information. This diagnostic information includes the number of records in the dataset; the number of neighbors, contamination score, and threshold value for each run of COF; and the LOCI threshold. The second column shows outlier determinants, including our sensitivity threshold and the work we do to calculate sensitivity score. Figure 11-3 shows an example of this using the pre-created sample dataset. Below these two columns, we can see the full debug details as well.


Figure 11-3

The tests we have run and determinants for our outlier cutoff point

Conclusion

This chapter focused on extending the multivariate outlier detection model in two ways: first, by taking advantage of the combination module in PyOD to run the Connectivity-Based Outlier Factor with multiple nearest neighbor counts, and second, by incorporating another algorithm called Local Correlation Integral. The next chapter will extend this model further by incorporating an algorithm entirely different from COF and LOCI.
