
10. Connectivity-Based Outlier Factor (COF)

Kevin Feasel, Durham, NC, USA

The prior chapter provided us with an introduction to clustering techniques and the use of one such clustering technique, Gaussian Mixture Models. Over the course of this chapter, we will implement a separate technique for multivariate clustering: Connectivity-Based Outlier Factor.

Distance or Density?

There are two key approaches to outlier detection: distance-based approaches and density-based approaches. For univariate anomaly detection, we generally focused on distance-based approaches. As Mehrotra explains, “A large distance from the center of the distribution implies that the probability of observing such a data point is very small. Since there is only a low probability of observing such a point (drawn from that distribution), a data point at a large distance from the center is considered to be an anomaly” (Mehrotra, 97). We started off by implicitly assuming that our univariate datasets fit one cluster, a cluster whose shape might approximate a normal distribution. By Chapter 9, we relaxed that assumption and allowed for multiple clusters using Gaussian Mixture Models. In doing so, we also introduced a density-based approach to detecting outliers.

Figure 10-1 provides an example of the importance of density-based approaches for outlier detection. If we use distance as the sole measure of outlier-ness, then the marked data point in the left-hand cluster will have a higher anomaly score than the marked data point in the right-hand cluster. Our gestalt sensibilities, however, disagree with such an assessment: we can see that although vector A is in fact longer than vector B, the data point associated with vector A fits in a regular pattern with other data points. Each data point is roughly equidistant from at least two neighbors. By contrast, the data point associated with vector B is further away from a neighbor than any of its compatriots. In other words, for density-based approaches, we care about the distance between neighbors more than the distance from the centroid.


Figure 10-1

Vector A is longer than vector B. Therefore, a pure distance-based approach may lead us to believe that point A is more of an outlier than point B

With density-based approaches, we think in terms of neighborhoods, which we can define as a given number of nearest neighbors. We saw this in practice in Chapter 9 with the k-nearest neighbors (knn) algorithm. What differentiates knn from the density-based outlier detection algorithms we will use in subsequent chapters is that knn stops after defining the neighborhood. With density-based outlier detection algorithms, we go one step further: after determining what the neighborhood is for a given point, we then compare the density of that data point vs. its neighbors. If the data point’s density is considerably smaller than that of its nearest neighbors, then the data point is likely to be an anomaly.

Figure 10-2 shows an example of this concept. In this case, we show an example of points along a single dimension. Note, however, that the multidimensional version of this problem is conceptually the same, except that we need to use some distance measurement like Manhattan distance or Mahalanobis distance to calculate the distance between points. These measurement techniques are outside the scope of this book but are of some relevance when digging into algorithms.
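Although the details of these measures are outside our scope, the following sketch shows how we might compute both of them using SciPy (a library we do not otherwise use in this chapter) on a pair of made-up data points; the sample used to estimate the covariance matrix is invented as well.

import numpy as np
from scipy.spatial import distance

# Two made-up data points with three features each.
a = np.array([1.0, 5.0, 2.0])
b = np.array([2.0, 3.0, 8.0])

# Manhattan (city block) distance: the sum of absolute differences per feature.
manhattan = distance.cityblock(a, b)

# Mahalanobis distance requires the inverse covariance matrix of the dataset,
# so that it accounts for how features vary together. We estimate it here from
# a small, made-up sample; the pseudo-inverse guards against singular matrices.
sample = np.array([[1.0, 5.0, 2.0],
                   [2.0, 3.0, 8.0],
                   [1.5, 4.5, 3.0],
                   [0.5, 6.0, 1.0]])
inv_cov = np.linalg.pinv(np.cov(sample, rowvar=False))
mahalanobis = distance.mahalanobis(a, b, inv_cov)

print(f"Manhattan: {manhattan:.2f}, Mahalanobis: {mahalanobis:.2f}")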


Figure 10-2

In the top image, we show the distance from our data point (in black) to its two nearest neighbors. In the bottom image, we show the distance from each of those neighbors to their two nearest neighbors

Reviewing the results in Figure 10-2, we can see that the average distance between points is considerably higher for our data point vs. its neighbors. We define the local density of a point as the reciprocal of that average distance between points. Therefore, our data point is in a lower-density area than its neighbors, and this is an indicator of a data point being an outlier.
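To make the notion of local density concrete, the following sketch (illustrative only, not part of this chapter's code base) computes the reciprocal of the average distance to each point's k nearest neighbors using scikit-learn's NearestNeighbors on a made-up one-dimensional dataset.

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Made-up one-dimensional data: a tightly packed group plus one straggler at 9.0.
points = np.array([[1.0], [1.5], [2.0], [2.4], [2.9], [9.0]])

k = 2
# Ask for k+1 neighbors because each point's nearest neighbor list includes itself.
nn = NearestNeighbors(n_neighbors=k + 1).fit(points)
distances, _ = nn.kneighbors(points)

# Drop the zero distance to self, then average the remaining k distances.
avg_dist = distances[:, 1:].mean(axis=1)
local_density = 1.0 / avg_dist

for p, d in zip(points.ravel(), local_density):
    print(f"point {p:>4}: local density {d:.2f}")

The straggler at 9.0 ends up with a far lower local density than any of its neighbors, which is exactly the signal we are looking for.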

One important observation that makes this outcome possible is that the nearest-neighbor relationship is not symmetric. In other words, although we look at the two nearest neighbors of the black point, neither of those neighbors has the black point as one of its own two nearest neighbors. As a rule of thumb, the fewer of a data point's neighbors that count it among their own nearest neighbors, the more likely that data point is to be an outlier.

Before we can get to the Connectivity-Based Outlier Factor approach to outlier detection, we need to understand another algorithm first.

Local Outlier Factor

The Local Outlier Factor (LOF) algorithm is a density-based approach to finding outliers in a dataset. We measure how the local density around a data point compares to that of its k nearest neighbors and report a data point as an outlier if its LOF is sufficiently large compared to others in the same cluster. Suppose we have some data point x and a neighborhood size of k=5. We can then find the distance between x and its 5th neighbor, as we see in Figure 10-3. We can then visualize this as a circle around x, intersecting the kth point from x.


Figure 10-3

Calculating reachability distance for data point x with k=5 with respect to two other data points a and b. Note that the neighborhood size includes data point x

We then calculate the distance from x to each point, taking the larger of the actual distance or the distance to the kth point as the reachability distance, d_reach. For example, point a lies within the boundaries of our circle, and so we use the distance to the kth point to define the reachability distance from x to a. Point b, meanwhile, lies beyond our kth point, and so we use the actual distance from x to b as the reachability distance from x to b. After performing this operation for each pairing of data points in our extended neighborhood, we can calculate the average reachability distance for x by summing up all of the reachability distances from x to various points and dividing by the number of points in the cluster minus one (i.e., the denominator does not include data point x itself).

The average reachability distance gives us an idea of how far we need to travel in order to find similar data points. If we take the reciprocal of this, we now calculate the reachability density of a point: How packed-together are other data points near our point x?

Finally, we define the Local Outlier Factor as the ratio of the average local reachability density of point x's neighbors to the local reachability density of point x itself. A relatively high value for this measure means that there tends to be a larger distance between data point x and its neighbors compared to other data points in the cluster. This result comports with an intuitive understanding of outliers: data points that are far removed from others are more likely to be outliers than ones packed closely together. Once we have the LOF per data point, we can use that score to determine whether a particular data point is an outlier. One way we could do this is to take the top few percent of values and declare those data points outliers. Alternatively, we could compare the largest scores and use techniques like MAD to see if there is a significant enough spread from the median to declare any given point a likely outlier.
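Because we will use the Python Outlier Detection (PyOD) library for COF later in this chapter, the following sketch shows how its LOF implementation might look in practice; the dataset and parameter values here are invented purely for illustration.

import numpy as np
from pyod.models.lof import LOF

# A made-up two-dimensional dataset: a tight cluster plus one distant point.
rng = np.random.default_rng(42)
cluster = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
outlier = np.array([[5.0, 5.0]])
X = np.vstack([cluster, outlier])

clf = LOF(n_neighbors=5, contamination=0.1)
clf.fit(X)

# decision_scores_ holds the LOF value per data point; larger means more anomalous.
# labels_ is PyOD's own 0/1 outlier flag, driven by the contamination setting.
print("Highest LOF score:", clf.decision_scores_.max())
print("Points flagged as outliers:", clf.labels_.sum())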

Connectivity-Based Outlier Factor

The Local Outlier Factor algorithm can be quite useful for finding outliers, but it does have some practical limitations. The biggest limitation is that outliers do not always need to be in lower-density areas. For example, Tang et al. devise the following cluster that we see in Figure 10-4.


Figure 10-4

Any value of k that allows LOF to declare the highlighted point as an outlier will also declare all points in the circle as outliers

Here, we can see the outlier, which I have marked as a black dot for the sake of clarity. Note further that there is really only one outlier in this figure: we have a mass of points at the top of the image and a second cluster of points with some separation between them. Using the LOF algorithm, however, we run into a problem: in order to detect the marked point as an outlier, we would also classify all of the points in the circle as outliers because of the difference in distances between elements in the two clusters.

Connectivity-Based Outlier Factor (COF) is another density-based approach to outlier detection, one that attempts to resolve these sorts of problems with LOF. It does so by augmenting density with an idea called isolativity: the degree to which an object is connected to other objects. The points making up the circle are connected to one another, just as the points in the line at the top of the figure are connected to one another. The marked point, however, does not fit either of the established patterns: it is isolated from them and is therefore more likely to be an outlier than a point that participates in a pattern.

Following is a high-level summary of how COF works. For a given data point and the number of neighbors we care about, we will build a chain, starting with our given data point and iteratively including the nearest unconnected neighbor until we include a sufficient number of data points as neighbors. Figure 10-5 shows an example of this, starting with the marked point. We define the link to its nearest neighbor as 1. Then, we find the nearest point unconnected to our chain, labeling that as 2. In the example of Figure 10-5, we continue until we have six connections between seven points.


Figure 10-5

The six connections for the marked data point. Note that edges are enumerated sequentially in order of the shortest path from the already-connected chain to the next free point

Once we have these connections, we perform a weighted calculation, emphasizing the “earlier” (numerically lower) connections more than the “later” (numerically higher). This provides us the Connectivity-Based Outlier Factor (COF) value. Data points with higher COF values are more likely to be outliers than data points with lower COF values. This allows us to sort data points by COF value in descending order and find the biggest outliers, something that is particularly handy if the base algorithm returns more outliers than our maximum fraction of anomalies parameter allows.
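To make the chaining idea more tangible, here is a simplified sketch. It is not the exact formulation from Tang et al.: it greedily extends a chain to the nearest unconnected point and then takes a weighted average of the link lengths, with the weighting scheme here serving as an illustrative stand-in for the one PyOD implements.

import numpy as np

def chain_score(points, start_index, k):
    # Greedily build a chain of k links starting from the given point, always
    # attaching the nearest point not yet connected to the chain.
    # (Assumes k is smaller than the number of remaining points.)
    connected = [start_index]
    link_lengths = []
    while len(link_lengths) < k:
        best = None
        for c in connected:
            for i in range(len(points)):
                if i in connected:
                    continue
                d = np.linalg.norm(points[c] - points[i])
                if best is None or d < best[0]:
                    best = (d, i)
        link_lengths.append(best[0])
        connected.append(best[1])
    # Weight earlier links more heavily: k, k-1, ..., 1, then normalize.
    weights = np.arange(k, 0, -1, dtype=float)
    return float(np.average(link_lengths, weights=weights))

# Made-up data: five points in a line plus one isolated point.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0],
                [3.0, 0.0], [4.0, 0.0], [2.0, 3.0]])
print("Chain score, in-pattern point:", chain_score(pts, 0, k=3))
print("Chain score, isolated point: ", chain_score(pts, 5, k=3))

The isolated point ends up with a noticeably larger score because its very first, most heavily weighted link is long.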

The math for calculating COF is available in Tang et al., but because we are going to use the Python Outlier Detection (PyOD) library to calculate COF, we will not include it here. Instead, we will move on to implementation.

Introducing Multivariate Support

The code for multivariate anomaly detection will necessarily look somewhat different from univariate anomaly detection—after all, the data structure for multivariate anomaly detection includes a key and list of values vs. a key and a single value for univariate anomaly detection. Even so, there will be considerable overlap in the groundwork code.

Laying the Groundwork

Listing 10-1 shows the API endpoint definition for multivariate anomaly detection. In it, our detection method includes four parameters relevant to detection. Two of them, sensitivity score and max fraction of anomalies, match the univariate example. A third, input data, is functionally similar to the univariate scenario, although instead of having a single value val, we have a list of values vals. Each element in the list represents a different feature we care about. For example, if we wish to perform outlier detection on a set of prospective professional athletes, we might include attributes such as height, weight, straight-line running speed, agility tests, and so on. Each of these would be a separate value in our list, and we would compare prospective athletes across the entire set of measures, not just one. The fourth parameter is new: n_neighbors, which represents the desired number of neighbors. This parameter can make a considerable difference in whether we label something as an outlier, but there is limited general advice on selecting the best number. Typically, the advice is to perform a search along a range from 10 (assuming you have at least 30 data points) to as high as 100 or 125 (assuming a sufficiently large number of data points). For the sake of development, we will leave this as an option for the caller for now, though we will come back to this decision in the next chapter.
class Multivariate_Input(BaseModel):
  key: str
  vals: list = []
@app.post("/detect/multivariate")
def post_multivariate(
  input_data: List[Multivariate_Input],
  sensitivity_score: float = 50,
  max_fraction_anomalies: float = 1.0,
  n_neighbors: int = 10,
  debug: bool = False
):
  df = pd.DataFrame(i.__dict__ for i in input_data)
  (df, weights, details) = multivariate.detect_multivariate_statistical(df, sensitivity_score, max_fraction_anomalies, n_neighbors)
  results = { "anomalies": json.loads(df.to_json(orient='records')) }
  if (debug):
    results.update({ "debug_weights": weights })
    results.update({ "debug_details": details })
  return results
Listing 10-1

Offering multivariate outlier detection to outside callers
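As a quick illustration of what a caller might send, the sketch below posts a tiny request to the endpoint; the keys, feature values, host, and port are all invented for this example, and a real request would include far more rows than these three sample records.

import requests

# Three made-up rows; each row has a key and a list of feature values.
payload = [
    {"key": "athlete_1", "vals": [74, 215, 4.42, 6.9]},
    {"key": "athlete_2", "vals": [71, 188, 4.55, 7.1]},
    {"key": "athlete_3", "vals": [79, 310, 5.30, 8.4]}
]

# Adjust the host and port to wherever the API is running.
url = "http://localhost:80/detect/multivariate"
params = {"sensitivity_score": 60, "max_fraction_anomalies": 0.2,
          "n_neighbors": 10, "debug": True}
response = requests.post(url, params=params, json=payload)
print(response.json()["anomalies"])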

Based on this definition, we know that we will need a detect_multivariate_statistical() function in our code base. We will also want helper functions like run_tests() and determine_outliers(), similar to our univariate example.

Listing 10-2 shows the implementation of detect_multivariate_statistical(). Just as in the univariate scenario, we will weigh algorithms differently and therefore need a dictionary of weights. We determine the number of data points in the dataset and then perform some basic checks (removed from the following listing but available in the code accompanying this book at https://github.com/Apress/finding-ghosts-in-your-data), ensuring that we have a sufficient number of data points to perform multivariate outlier detection and that the inputs are reasonable. In addition, because our implementation of COF requires that no more than 50% of data points be anomalies, we silently cap max fraction of anomalies at 0.5 to ensure this.
def detect_multivariate_statistical(
  df,
  sensitivity_score,
  max_fraction_anomalies,
  n_neighbors
):
  weights = {"cof": 1.0}
  num_data_points = df['vals'].count()
  if max_fraction_anomalies > 0.5:
      max_fraction_anomalies = 0.5
  (df_encoded, diagnostics) = encode_string_data(df)
  (df_tested, tests_run, diagnostics) = run_tests(df_encoded, max_fraction_anomalies, n_neighbors)
  df_out = determine_outliers(df_tested, sensitivity_score, max_fraction_anomalies)
  return (df_out, weights, { … })
Listing 10-2

A function to detect outliers in multivariate data. Certain diagnostic functionality has been removed from this listing for the sake of brevity, but it is available in the accompanying repository.

Once we have determined that the dataset and input parameters are all viable, we need to encode any string data that comes in. Outlier detection algorithms deal almost exclusively in numeric features, but our datasets often have strings that represent values. We could have the caller make this conversion, but for the sake of convenience and to support a wider variety of datasets, we will introduce an encode_string_data() function that translates any string column to an ordinal. Listing 10-3 shows this function. The first step of the function is to break out the list in vals and turn it into a set of columns. We don’t know exactly what the columns represent, but the lack of meaningful names does not matter for our purposes. The next step is to determine whether we have any string columns using the select_dtypes() function in Pandas. If there are string columns, we create an ordinal encoder and translate each unique string into a corresponding unique number. We then merge together the two DataFrames, which we can do because they will contain the same number of rows and will be in the same order. This will ensure that we have the original data as well as numeric-only translations.
def encode_string_data(df):
  df2 = pd.DataFrame([pd.Series(x) for x in df.vals])
  string_cols = df2.select_dtypes(include=[object]).columns.values
  diagnostics = { "Number of string columns in input": len(string_cols) }
  if (len(string_cols) > 0):
    diagnostics["Encoding Operation"] = "Encoding performed on string columns."
    enc = OrdinalEncoder()
    enc.fit(df2[string_cols])
    df2[string_cols] = enc.transform(df2[string_cols])
  else:
    diagnostics["Encoding Operation"] = "No encoding necessary because all columns are numeric."
  return (pd.concat([df, df2], axis=1), diagnostics)
Listing 10-3

Encoding string data

Even though we do include an encoding function, it is important to note that the way we are encoding is easy but not necessarily a good practice. The reason is that our encoder is ordinal and does not include any measurement of string “nearness,” either in terms of how many letters apart two words are or how similar (or dissimilar) the meanings of two words are. As an example, the word “cat” might have an ordinal value of 1 but “cats” a value of 900. Our transformation is simple and likely will not work well on text-heavy datasets. The intent here is to provide an easy method for working with a mostly numeric dataset that happens to have one or two categorical string features, rather than analyzing the textual content of features and discerning meaning from them. It is possible to create a more sophisticated encoder that does this, but that is an exercise for another book on natural language processing.
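As a quick demonstration of that limitation (using made-up category values), OrdinalEncoder assigns codes based on sorted order, so the numeric gap between two codes says nothing about how related the underlying strings are.

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up categorical column with a mix of related and unrelated words.
df = pd.DataFrame({"animal": ["cat", "catfish", "cow", "dog"]})

enc = OrdinalEncoder()
df["animal_encoded"] = enc.fit_transform(df[["animal"]]).ravel()
print(df)
# Codes follow sorted order: cat -> 0, catfish -> 1, cow -> 2, dog -> 3.
# "cat" sits next to "catfish" purely because of spelling and farthest from
# "dog", so distances between codes carry no semantic meaning.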

Implementing COF

The actual implementation of COF is simple. Listing 10-4 includes the run_tests() function as well as the check_cof() function it calls.
def run_tests(df, max_fraction_anomalies, n_neighbors):
  tests_run = {
    "cof": 1
  }
  (df_cof, diagnostics) = check_cof(df, max_fraction_anomalies, n_neighbors)
  return (df_cof, tests_run, diagnostics)
def check_cof(df, max_fraction_anomalies, n_neighbors):
  col_array = df.drop(["key", "vals"], axis=1).to_numpy()
  clf = COF(n_neighbors=n_neighbors, contamination=max_fraction_anomalies)
  clf.fit(col_array)
  diagnostics = {
    "COF Contamination": clf.contamination,
    "COF Threshold": clf.threshold_
  }
  df["is_raw_anomaly"] = clf.labels_
  df["anomaly_score"] = clf.decision_scores_
  return (df, diagnostics)
Listing 10-4

Implementing the check for Connectivity-Based Outlier Factor

The run_tests() function is very similar to its univariate counterpart, saving the interesting code for check_cof(). This latter function accepts our DataFrame as well as the max fraction of anomalies and desired number of neighbors. It first creates an array out of all columns except key and vals, as the COF function requires inputs be shaped as an array rather than a DataFrame. After creating a classifier function using COF(), the function fits the classifier to our input array. We use the corresponding labels_ attribute to see if COF considers a particular data point as anomalous and decision_scores_ to gain an idea of how much of an outlier a given data point is. The contamination factor, which we capture as clf.contamination, is an optional input that defaults to 0.1 and represents the proportion of outliers in the dataset. This is similar to what we call the max fraction of anomalies in our engine, except that contamination factor requires that we mark a certain percentage of results as outliers, no matter how high or low the decision score is. If we set the contamination factor to 0.5, half of our data points will be marked as outliers, even if every data point is tightly clustered together and a human would not recognize any data points as anomalies. The threshold_ output provides the cutoff point for what COF labels as an outlier.
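To see the contamination behavior described above in action, consider the following sketch with made-up, tightly clustered data: even though no point is visually anomalous, COF still flags roughly half of them when contamination is set to 0.5.

import numpy as np
from pyod.models.cof import COF

# A made-up, tightly packed dataset with no obvious outliers.
rng = np.random.default_rng(0)
X = rng.normal(loc=0.0, scale=0.01, size=(40, 3))

clf = COF(n_neighbors=10, contamination=0.5)
clf.fit(X)

# Roughly half of the points end up labeled as outliers purely because of the
# contamination setting, not because any point is genuinely anomalous.
print("Points flagged:", clf.labels_.sum(), "out of", len(X))
print("Decision scores range from", clf.decision_scores_.min(),
      "to", clf.decision_scores_.max())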

The most important numeric attribute we get from COF is the decision score. This numeric value is what we send to the determine_outliers() function. Listing 10-5 shows the code for this function. The code is somewhat similar to the univariate scenario, but it also accounts for COF’s determination of what makes an outlier.
def determine_outliers(
  df,
  sensitivity_score,
  max_fraction_anomalies
):
  # Scale the sensitivity score against the second-largest anomaly score so that
  # one extreme outlier does not dominate the cutoff calculation.
  second_largest = df['anomaly_score'].nlargest(2).iloc[1]
  sensitivity_score = (100 - sensitivity_score) * second_largest / 100.0
  # Raise the cutoff if needed to respect the max fraction of anomalies.
  max_fraction_anomaly_score = np.quantile(df['anomaly_score'], 1.0 - max_fraction_anomalies)
  if max_fraction_anomaly_score > sensitivity_score and max_fraction_anomalies < 1.0:
    sensitivity_score = max_fraction_anomaly_score
  # The 1.35 floor keeps tightly packed, non-anomalous scores from being flagged.
  return df.assign(is_anomaly=df['anomaly_score'] > np.max([sensitivity_score, 1.35]))
Listing 10-5

The outlier determination function for multivariate outlier detection

We start by converting the sensitivity score into a cutoff value. Because our anomaly scores may range well beyond 0 to 1, we scale the sensitivity_score to run from 0 to the second-largest anomaly score, which handles cases in which one extreme outlier dominates everything else. After that, the calculations are the same as in the univariate scenario. An important note here is that, unlike the univariate scenario, we could get back a number of anomalies equal to max_fraction_anomalies times the number of data points whenever the sensitivity score is set to 100. COF has no built-in rule for whether a particular point is an outlier, making our selection of max fraction of anomalies and sensitivity score even more important than in the univariate scenario. That said, data points with anomaly scores closer to 1.0 are less likely to be outliers, and choosing some minimum threshold will reduce the number of spurious outliers. The value 1.35 is an arbitrary choice, but one intended to prevent an outcome in which a dataset with no actual anomalies triggers a large number of reported outliers.
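As a worked illustration of this logic (with invented anomaly scores), the following sketch mirrors the calculation in Listing 10-5 for a five-point dataset, a sensitivity score of 60, and a max fraction of anomalies of 0.4.

import numpy as np
import pandas as pd

# Invented anomaly scores: one extreme value, one moderate value, and three small ones.
df = pd.DataFrame({
    "key": ["a", "b", "c", "d", "e"],
    "anomaly_score": [120.0, 34.0, 5.8, 1.2, 1.1]
})

sensitivity_score = 60
max_fraction_anomalies = 0.4

second_largest = df["anomaly_score"].nlargest(2).iloc[1]              # 34.0
cutoff = (100 - sensitivity_score) * second_largest / 100.0           # 13.6
max_fraction_score = np.quantile(df["anomaly_score"], 1.0 - max_fraction_anomalies)
if max_fraction_score > cutoff and max_fraction_anomalies < 1.0:
    cutoff = max_fraction_score                                        # about 17.1
df["is_anomaly"] = df["anomaly_score"] > np.max([cutoff, 1.35])
print(df)   # only keys "a" and "b" end up flagged

Because the max fraction quantile (roughly 17.1) exceeds the sensitivity-based cutoff (13.6), it becomes the effective threshold, capping us at two flagged points out of five.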

Test and Website Updates

This new technique brings with it a new set of tests, as well as some challenges we will need to face regarding visualization of multivariate data. Let’s start first with unit test changes, move on to integration tests, and end with website updates.

Unit Test Updates

The file test_multivariate.py contains the unit tests we will use to confirm functionality in our multivariate outlier detection scenario. Listing 10-6 covers the first test, which ensures that we encode strings as ordinals properly. To make it easier to read, only one test case appears in the listing; the remaining test cases are available in the test file.
@pytest.mark.parametrize("df_input, requires_encoding, number_of_string_columns", [
  ([["s1a", [1, "Bob", 2, "Janice", 3]], ["s2", [4, "Jim", 5, "Alice", 6]], ["s3", [7, "Teddy", 8, "Mercedes", 9]]], True, 2),
])
def test_detect_multivariate_encoding_string_columns(df_input, requires_encoding, number_of_string_columns):
  # Arrange
  df = pd.DataFrame(df_input, columns=["key", "vals"])
  # Act
  (df_encoded, diagnostics) = encode_string_data(df)
  encoding_performed = (diagnostics["Encoding Operation"] == "Encoding performed on string columns.")
  num_string_columns = diagnostics["Number of string columns in input"]
  # Assert
  assert(requires_encoding == encoding_performed)
  assert(number_of_string_columns == num_string_columns)
Listing 10-6

Encode strings as ordinal values

Reviewing the test case above, there are two string columns per input array. The encoding function correctly detects this and dutifully encodes the strings in the second and fourth columns of the array.

Aside from this, there are several other functions intended to test COF’s outputs and how sensitivity score and max fraction of anomalies combine to affect outcomes. There are three datasets that we will use in the tests. The first is sample_input, which contains an artificially generated set of data spanning four columns. It includes a fairly wide range of values, making it difficult to determine a priori whether there are any outliers. The second dataset is sample_input_one_outlier, which includes dozens of extremely similar rows, differing by no more than ~1%, as well as one radically different row. Finally, sample_input_no_outliers is the prior dataset minus the one outlier.

The minimum anomaly score check in determine_outliers() is critical for scenarios like sample_input_no_outliers. If we have no minimum threshold and set maximum fraction of anomalies to 1.0, we would need to drop the sensitivity score down to 8 in order to have fewer than ten outliers returned. With no cutoff and very close data points, instead of having no outliers, everything becomes an outlier.

Integration Test Updates

Along with new unit tests, we also need new integration tests to handle multivariate data. These tests are in the Chapter 10 Postman collection and handle fairly basic scenarios ranging from zero to three outliers in a small, 15-record sample. The most interesting of these tests is Ch10 – Multivariate – Debug – Three Outliers, 2S1L – Two Recorded. This test includes three outliers at three different orders of magnitude. The first outlier has an anomaly score over 100, the second has a score of 34, and the third has a score of 5.8. All three are legitimate outliers, but unless we set the sensitivity score to 83 or higher, we only capture the two larger outliers because the largest outlier “sets the curve” and makes it harder for us to find smaller outliers. This points at a flaw in our solution, one we aim to rectify over the next two chapters by introducing additional algorithms and forming a new type of ensemble.

Website Updates

The final section of this chapter will cover updates to our Streamlit dashboard. We first review the process() function in Listing 10-7. This listing includes all of the parameters we will need to send in for univariate or multivariate analysis. Note that we do not include the number of neighbors in this function, as this is an implementation detail that we do not want to force users to handle.
@st.cache
def process(server_url, method, sensitivity_score, max_fraction_anomalies, debug, input_data_set):
  full_server_url = f"{server_url}/{method}?sensitivity_score={sensitivity_score}&max_fraction_anomalies={max_fraction_anomalies}&debug={debug}"
  r = requests.post(
    full_server_url,
    data=input_data_set,
    headers={"Content-Type": "application/json"}
  )
  return r
Listing 10-7

The definition for the process() function

The major change comes when selecting the Detect! button. We now need to break out the univariate and multivariate scenarios. Listing 10-8 shows this breakout, eliding over univariate code in the process.
    if method=="univariate" and convert_to_json:
      input_data = convert_univariate_list_to_json(input_data)
    resp = process(server_url, method, sensitivity_score, max_fraction_anomalies, debug, input_data)
    res = json.loads(resp.content)
    df = pd.DataFrame(res['anomalies'])
    if method=="univariate":
      # ...
    elif method=="multivariate":
      st.header('Anomaly score per data point')
      colors = {True: '#481567', False: '#3CBB75'}
      df = df.sort_values(by=['anomaly_score'], ascending=False)
      g = px.bar(df, x=df["key"], y=df["anomaly_score"], color=df["is_anomaly"], color_discrete_map=colors,
            hover_data=["vals"], log_y=True)
      st.plotly_chart(g, use_container_width=True)
      tbl = df[['key', 'vals', 'anomaly_score', 'is_anomaly']]
      st.write(tbl)
      if debug:
        col11, col12 = st.columns(2)
        with col11:
          st.header('Debug weights')
          st.write(res['debug_weights'])
        with col12:
          st.header("Tests Run")
          st.write(res['debug_details']['Tests run'])
          st.write(res['debug_details']['Test diagnostics'])
        st.header("Full Debug Details")
        st.json(res['debug_details'])
Listing 10-8

Creating a visual to expose multivariate outlier results

With univariate data, we were able to create a scatter plot, using the one variable as the X axis and the anomaly score as the Y axis. With multivariate data, we are not able to plot the data directly. There are two reasons for this: first, we do not know how many dimensions there will be, making it difficult to discern what kinds of plots might even be useful here. Second, humans have a lot of difficulty making sense of three-dimensional plots, so even if we use a technique like Principal Component Analysis (PCA) to narrow our results down to two dimensions plus an anomaly score (assuming we can successfully reduce the data to two dimensions), displaying both dimensions plus the anomaly score in a way that humans can easily interpret becomes a challenge. Instead, we display each data point as a bar in a column chart, with the height of the bar representing the anomaly score. Because there can be a wide range in score values, the anomaly score is measured on a logarithmic axis. This allows us to see the differences in those smaller scores while still acknowledging the anomalous values. Figure 10-6 shows an example of this.


Figure 10-6

The outcome of a multivariate anomaly detection test. This shows the key as the X axis and the anomaly score on a logarithmic scale as the Y axis

After that, we display a table of all keys and values. Finally, if the user wishes to see debug information, we can see debug weights, tests run, and test diagnostics below the table.

Conclusion

In this chapter, we looked at one useful algorithm for multivariate outlier detection: Connectivity-Based Outlier Factor (COF). This is a good first algorithm to look at because it tends to be quite effective. As we saw in the test cases, though, understanding the appropriate cutoffs can be a challenge. In the next chapter, we will introduce another, complementary technique to COF and see how to combine the two together in an ensemble to reduce some of the challenge of picking appropriate inputs and cutoff values.
