Unsupervised learning using outlier detection

The subject of finding outliers or anomalies in the data streams is one of the emerging fields in machine learning. This area has not been explored by researchers as much as classification and clustering-based problems have. However, there have been some very interesting ideas extending the concepts of clustering to find outliers from data streams. We will provide some of the research that has been proved to be very effective in stream outlier detection.

Partition-based clustering for outlier detection

The central idea here is to use an online partition-based clustering algorithm and based on either cluster size ranking or inter-cluster distance ranking, label the clusters as outliers.

Here we present one such algorithm proposed by Koupaie et al., using incremental k-Means.

Inputs and outputs

Only numeric features are used, as in most k-Means algorithms. The number of clusters k and the number of windows of outliers n, on which offline clustering happens, are input parameters. The output is constant outliers (local and global) and an updatable model that detects these.

How does it work?

  1. This algorithm works by having the k-Means algorithm in two modes, an offline mode and an online mode, both working in parallel.
  2. For the online mode:
    1. Apply k-Means on the given window w and find clusters and partitions of the data with clusters.
    2. Rank the clusters based on the cluster distances and cluster size. The clusters which are farthest apart and small in size are considered outliers.
    3. Store the outliers in memory for the window as a set Ow = {x1, x2..xn} and regard them as local outliers.
    4. The window is cleared and the process is repeated.
  3. For the offline mode:
    1. Get outliers from n, previous windows, and create a set: How does it work?
    2. Cluster this window with set S using k-Means and find clusters which are farthest away and small in size.
    3. These clusters are global outliers.
    4. The window is cleared and the process is repeated.

Advantages and limitations

The advantages and limitations are as follows:

  • It is very sensitive to the two parameters k and n and can generate lots of noise.
  • Only spherical clusters/outliers are found and outliers with different shapes get missed.

Distance-based clustering for outlier detection

Distance-based outlier detection is the most studied, researched, and implemented method in the area of stream learning. There are many variants of the distance-based methods, based on sliding windows, the number of nearest neighbors, radius and thresholds, and other measures for considering outliers in the data. We will try to give a sampling of the most important algorithms in this section.

Inputs and outputs

Most algorithms take the following parameters as inputs:

  • Window size w, corresponding to the fixed size on which the algorithm looks for outlier patterns
  • Sliding size s, corresponds to the number of new instances that will be added to the window, and old ones removed
  • The count threshold k of instances when using nearest neighbor computation
  • The distance threshold R used to define the outlier threshold in distances

Outliers as labels or scores (based on neighbors and distance) are outputs.

How does it work?

We present different variants of distance-based stream outlier algorithms, giving insights into what they do differently or uniquely. The unique elements in each algorithm define what happens when the slide expires, how a new slide is processed, and how outliers are reported.

Exact Storm

Exact Storm stores the data in the current window w in a well-known index structure, so that the range query search or query to find neighbors within the distance R for a given point is done efficiently. It also stores k preceding and succeeding neighbors of all data points:

  • Expired Slide: Instances in expired slides are removed from the index structure that affects range queries but are preserved in the preceding list of neighbors.
  • New Slide: For each data point in the new slide, range query R is executed, results are used to update the preceding and succeeding list for the instance, and the instance is stored in the index structure.
  • Outlier Reporting: In any window, after the processing of expired and new slide elements is complete, any instance with at least k elements from the succeeding list and non-expired preceding list is reported as an outlier.

Abstract-C

Abstract-C keeps the index structure similar to Exact Storm but instead of preceding and succeeding lists for every object it just maintains a list of counts of neighbors for the windows the instance is participating in:

  • Expired Slide: Instances in expired slides are removed from the index structure that affects range queries and the first element from the list of counts is removed corresponding to the last window.
  • New Slide: For each data point in the new slide, range query R is executed and results are used to update the list count. For existing instances, the count gets updated with new neighbors and instances are added to the index structure.
  • Outlier Reporting: In any window, after the processing of expired and new slide elements is complete, all instances with a neighbors count less than k in the current window are considered outliers.

Direct Update of Events (DUE)

DUE keeps the index structure for efficient range queries exactly like the other algorithms but has a different assumption, that when an expired slide occurs, not every instance is affected in the same way. It maintains two priority queues: the unsafe inlier queue and the outlier list. The unsafe inlier queue has sorted instances based on the increasing order of smallest expiration time of their preceding neighbors. The outlier list has all the outliers in the current window:

  • Expired Slide: Instances in expired slides are removed from the index structure that affects range queries and the unsafe inlier queue is updated for expired neighbors. Those unsafe inliers which become outliers are removed from the priority queue and moved to the outlier list.
  • New Slide: For each data point in the new slide, range query R is executed, results are used to update the succeeding neighbors of the point, and only the most recent preceding points are updated for the instance. Based on the updates, the point is added to the unsafe inlier priority queue or removed from the queue and added to the outlier list.
  • Outlier Reporting: In any window, after the processing of expired and new slide elements is complete, all instances in the outlier list are reported as outliers.

Micro Clustering based Algorithm (MCOD)

Micro-clustering based outlier detection overcomes the computational issues of performing range queries for every data point. The micro-cluster data structure is used instead of range queries in these algorithms. A micro-cluster is centered around an instance and has a radius of R. All the points belonging to the micro-clusters become inliers. The points that are outside can be outliers or inliers and stored in a separate list. It also has a data structure similar to DUE to keep a priority queue of unsafe inliers:

  • Expired Slide: Instances in expired slides are removed from both micro-clusters and the data structure with outliers and inliers. The unsafe inlier queue is updated for expired neighbors as in the DUE algorithm. Micro-clusters are also updated for non-expired data points.
  • New Slide: For each data point in the new slide, the instance either becomes a center of a micro-cluster, or part of a micro-cluster or added to the event queue and the data structure of the outliers. If the point is within the distance R, it gets assigned to an existing micro-cluster; otherwise, if there are k points within R, it becomes the center of the new micro cluster; if not, it goes into the two structures of the event queue and possible outliers.
  • Outlier Reporting: In any window, after the processing of expired and new slide elements is complete, any instance in the outlier structure with less than k neighboring instances is reported as an outlier.

Approx Storm

Approx Storm, as the name suggests, is an approximation of Exact Storm. The two approximations are:

  • Reducing the number of data points in the window by adding a factor ρ and changing the window to ρW.
  • Storing the number instead of the data structure of preceding neighbors by using the fraction of the number of neighbors which are safe inliers in the preceding list to the number in the current window.

The processing of expired and new slides and how outliers are determined based on these steps follows:

  1. Expired Slide: Same as Exact Storm—instances in expired slides are removed from the index structure that affects range queries but preserved in the preceding list of neighbors.
  2. New Slide: For each data point in the new slide, range query R is executed, results are used to compute the fraction discussed previously, and the index structure is updated. The number of safe inliers are constrained to ρW by removing random inliers if the size exceeds that value. The assumption is that most of the points in safe inliers are safe.
  3. Outlier Reporting: In any window, after the processing of expired and new slide elements has been completed, when an approximation (see References [17]) of the number of neighbors of an instance based on the fraction, window size, and preceding list is a value less than k, it is considered as an outlier.
Advantages and limitations

The advantages and limitations are as follows:

  • Exact Storm is demanding in storage and CPU for storing lists and retrieving neighbors. Also, it introduces delays; even though they are implemented in efficient data structures, range queries can be slow.
  • Abstract-C has a small advantage over Exact Storm, as no time is spent on finding active neighbors for each instance in the window. The storage and time spent is still very much dependent on the window and slide chosen.
  • DUE has some advantage over Exact Storm and Abstract-C as it can efficiently re-evaluate the "inlierness" of points (that is, whether unsafe inliers remain inliers or become outliers) but sorting the structure impacts both CPU and memory.
  • MCOD has distinct advantages in memory and CPU owing to the use of the micro-cluster structure and removing the pair-wise distance computation. Storing the neighborhood information in micro-clusters helps memory too.
  • Approx Storm has an advantage of time over the others as it doesn't process the expired data points over the previous window.

Validation and evaluation techniques

Validation and evaluation of stream-based outliers is still an open research area. In many research comparisons, we see various metrics being used, such as:

  • Time to evaluate in terms of CPU times per object
  • Number of outliers detected in the streams
  • Number of outliers that correlate to existing labels, TP/Precision/Recall/Area under PRC curve, and so on

By varying parameters such as window-size, neighbors within radius, and so on, we determine the sensitivity to the performance metrics mentioned previously and determine the robustness.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.191.234.150