Concept drift and drift detection

As discussed in the introduction of the chapter, the dynamic nature of infinite streams stands in direct opposition to the basic principle of stationary learning; that is, that the distribution of the data or patterns remains constant. Although changes can be swift or abrupt, the discussion here is around slow, gradual changes. These slow, gradual changes are fairly hard to detect, and separating them from noise becomes tougher still:

Figure 1 Concept drift illustrated by the gradual change in color from yellow to blue in the bottom panel. Sampled data reflects the underlying change in data distribution, which must be detected and a new model learned.

There have been several techniques described in various studies in the last two decades that can be categorized as shown in the following figure:

Figure 2 Categories of drift detection techniques

Data management

The main idea is to maintain a model in memory that stays consistent with the dynamic nature of the data.

Partial memory

These techniques keep the most recent data in a memory buffer and use it to learn or to derive summary information. The key question, as discussed previously, is: what is the right window size to detect the change quickly while still providing enough data to learn from? In fixed window size techniques, the buffer is a queue: when a new instance with a recent timestamp arrives, the instance with the oldest timestamp is evicted. The window thus contains all the recent enough examples, and its size is generally chosen based on the physical availability of memory and the size of the data elements in the queue. In adaptive window size techniques, the queue is used in conjunction with a detection algorithm. When the detection algorithm indicates signs of drift based on performance evaluation, the window size can be reduced to remove old examples which no longer help the model.
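
As a minimal sketch of both ideas, the following Python class keeps a fixed-size queue of recent examples and shrinks it when a separate detection algorithm signals drift; the class name, sizes, and shrink factor are illustrative choices and not part of any standard library:

    from collections import deque

    class WindowedBuffer:
        """Partial-memory buffer that holds only the most recent examples."""

        def __init__(self, max_size=1000, min_size=100):
            # Fixed window: a queue that evicts the oldest example once it is full.
            self.window = deque(maxlen=max_size)
            self.min_size = min_size

        def add(self, x, y):
            self.window.append((x, y))  # newest example in, oldest out when full

        def shrink_on_drift(self, factor=0.5):
            # Adaptive window: when a detector signals drift, keep only the most
            # recent fraction of examples so stale data stops influencing the model.
            keep = max(self.min_size, int(len(self.window) * factor))
            self.window = deque(list(self.window)[-keep:], maxlen=self.window.maxlen)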

Full memory

The idea is to store sufficient statistics over all the examples or data seen. One way to do this is to assign weights to the examples and let those weights decay over time. Exponential weighting using a decay rate λ can be very effective:

wλ(xi) = exp(–λ * i)

where i is the age of example xi (i = 0 for the most recent example).
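
As a quick illustration, the helper below simply applies this formula to a list of example ages; the function name and the λ values used are hypothetical:

    import math

    def exponential_weights(ages, lam=0.01):
        """Weight for each example given its age i, where i = 0 is the most recent."""
        return [math.exp(-lam * i) for i in ages]

    # The most recent example gets weight 1.0 and older ones decay toward 0.
    weights = exponential_weights(ages=range(5), lam=0.5)
    # [1.0, 0.606..., 0.367..., 0.223..., 0.135...]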

Detection methods

Given the probability P(X) that the given data is observed, the probability of patterns/classes P(C), and the probability of the data given the class P(X|C), which constitutes the model, detection methods can be divided into two categories at a high level:

  • Monitoring evolution or performance of the model, classifier, or P(C|X)
  • Monitoring distributions from the environment or observing P(X), P(C), and P(X|C)

Monitoring model evolution

Although these methods are based on the assumption that learning is stationary and the data comes from independent and identically distributed (i.i.d.) sources, which doesn't hold true in many applications, they have nevertheless been shown to be effective. Some of the well-known techniques are described next.

Widmer and Kubat

This is one of the earliest methods; it monitors the error or misclassification rate along with structural changes to the model, such as new branches being added to a tree. Using these indicators and known thresholds, the learning window size is increased or decreased.
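
A rough sketch of this style of window adjustment is shown below; the thresholds and step sizes are illustrative assumptions rather than the values used in the original work:

    def adjust_window_size(window_size, error_rate, model_changed,
                           high_error=0.2, low_error=0.05,
                           min_size=50, max_size=5000):
        """Grow or shrink the learning window from the error rate and model changes."""
        if error_rate > high_error or model_changed:
            # Signs of drift: shrink the window to forget outdated examples quickly.
            return max(min_size, window_size // 2)
        if error_rate < low_error:
            # Stable concept: grow the window to learn from more data.
            return min(max_size, window_size + 100)
        return window_size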

Drift Detection Method or DDM

This method assumes that the parameter being monitored, such as whether the classifier labels an example correctly or incorrectly, is a binary random variable that follows a binomial distribution. At the ith point in the sequence it estimates the probability of misclassification pi with standard deviation si = sqrt(pi * (1 – pi) / i). The method then uses two levels:

  • Warning level: when pi + si ≥ pmin + 2 * smin
  • Detection level: when pi + si ≥ pmin + 3 * smin

All the examples between the "warning" and "detection" levels are used to train a new classifier that will replace the "non-performing" classifier when the "detection" level is reached.
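
The following is a simplified sketch of DDM over a stream of 0/1 misclassification indicators, following the rules above; the warm-up length of 30 examples is a common choice but should be treated as an assumption here:

    import math

    class DDM:
        """Drift Detection Method over a stream of 0/1 misclassification indicators."""

        def __init__(self, warm_up=30):
            self.warm_up = warm_up
            self.i = 0
            self.p = 1.0                  # running estimate of the misclassification rate
            self.p_min = float("inf")
            self.s_min = float("inf")

        def update(self, error):
            """error is 1 if the classifier was wrong on this example, 0 otherwise."""
            self.i += 1
            self.p += (error - self.p) / self.i            # incremental mean p_i
            s = math.sqrt(self.p * (1 - self.p) / self.i)  # binomial standard deviation s_i
            if self.p + s < self.p_min + self.s_min:
                self.p_min, self.s_min = self.p, s
            if self.i < self.warm_up:
                return "stable"           # too few examples for a reliable test
            if self.p + s >= self.p_min + 3 * self.s_min:
                return "detection"        # swap in the model trained since the warning
            if self.p + s >= self.p_min + 2 * self.s_min:
                return "warning"          # start collecting examples for a new model
            return "stable"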

Early Drift Detection Method or EDDM

EDDM uses the same framework as DDM but with a key modification: instead of monitoring the error rate (1 – accuracy), it monitors the distance between errors, that is, the number of examples between two consecutive misclassifications, and triggers the levels based on how that distance evolves.

EDDM computes the mean distance between two consecutive errors, p'i, and its standard deviation, s'i. The levels are then:

  • Warning level: (p'i + 2 * s'i) / (p'max + 2 * s'max) < α
  • Detection level: (p'i + 2 * s'i) / (p'max + 2 * s'max) < β

The parameters α and β are normally tuned by the user to around 95% and 90% respectively, so that the warning level fires before the detection level.
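
A simplified sketch of EDDM is given below, assuming we feed it the number of examples observed between consecutive errors; the warm-up length is an illustrative assumption and the thresholds follow the values above:

    import math

    class EDDM:
        """Early Drift Detection Method based on the distance between consecutive errors."""

        def __init__(self, alpha=0.95, beta=0.90, warm_up=30):
            self.alpha, self.beta, self.warm_up = alpha, beta, warm_up
            self.n = 0
            self.mean = 0.0       # p'_i: mean distance between two consecutive errors
            self.m2 = 0.0         # running sum of squared deviations (Welford's method)
            self.max_score = 0.0  # largest p'_i + 2 * s'_i seen so far

        def update(self, distance):
            """distance is the number of examples since the previous misclassification."""
            self.n += 1
            delta = distance - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (distance - self.mean)
            std = math.sqrt(self.m2 / self.n)
            score = self.mean + 2 * std
            self.max_score = max(self.max_score, score)
            if self.n < self.warm_up:
                return "stable"
            ratio = score / self.max_score
            if ratio < self.beta:
                return "detection"
            if ratio < self.alpha:
                return "warning"
            return "stable"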

Monitoring distribution changes

When there are no models or classifiers available to detect changes, we apply techniques that use some form of statistical test to monitor the data distribution directly. Owing to their assumptions, whether parametric or non-parametric, and their different biases, it is difficult to say concretely which test works best. Here we describe some of the well-known statistical tests.

Welch's t test

This is an adaptation of the Student's t test with two samples. The test is adapted to take two windows of sizes N1 and N2, with means x̄1 and x̄2 and variances s1² and s2², and to compute the statistic:

t = (x̄1 – x̄2) / sqrt(s1² / N1 + s2² / N2)

The resulting p value is used to accept or reject the null hypothesis that the two window means are equal.
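
In practice this test is available off the shelf; for example, SciPy's scipy.stats.ttest_ind with equal_var=False performs the unequal-variance (Welch) variant. The sketch below wraps it for drift detection; the 0.05 significance threshold is an illustrative choice:

    from scipy import stats

    def welch_drift(reference_window, current_window, alpha=0.05):
        """Flag drift when the two windows' means differ significantly (Welch's t test)."""
        t_stat, p_value = stats.ttest_ind(reference_window, current_window,
                                          equal_var=False)  # Welch's variant
        return p_value < alpha

    # Example: compare an older reference window against the most recent window.
    drifted = welch_drift([0.10, 0.20, 0.15, 0.12, 0.18], [0.40, 0.50, 0.45, 0.48, 0.42])
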
Kolmogorov-Smirnov's test

This statistical test is normally used to compare the distance between two distributions and check whether it is below a certain threshold. It can be adapted for change detection by taking two windows of sample sizes N1 and N2, with empirical cumulative distribution functions F1(x) and F2(x), and computing the KS distance:

D = maxx |F1(x) – F2(x)|

The null hypothesis, which assumes the two distributions are similar, is rejected with confidence α if and only if

D > c(α) * sqrt((N1 + N2) / (N1 * N2))

where c(α) is obtained by a lookup in the Kolmogorov-Smirnov table.
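
SciPy also provides the two-sample test as scipy.stats.ks_2samp, which computes the distance and its p value directly, so the table lookup is handled internally; the 0.05 threshold in the sketch below is illustrative:

    from scipy import stats

    def ks_drift(reference_window, current_window, alpha=0.05):
        """Flag drift when the empirical CDFs of the two windows differ significantly."""
        statistic, p_value = stats.ks_2samp(reference_window, current_window)
        return p_value < alpha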

CUSUM and Page-Hinckley test

The cumulative sum (CUSUM) is designed to indicate when the mean of the input is significantly different from zero:

g0 = 0 , gt = max(0, gt–1 + (ϵt – v))

We raise change detection when gt > h, where (h, v) are user-defined parameters. Note that the CUSUM test is memoryless and is one-sided or asymmetric, detecting only the increase.
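
A minimal sketch of the one-sided CUSUM detector follows; h and v are the user-defined parameters from the text, and the default values are illustrative:

    class CUSUM:
        """One-sided CUSUM: signals when the cumulative positive deviation exceeds h."""

        def __init__(self, v=0.05, h=5.0):
            self.v = v      # allowed slack per observation
            self.h = h      # detection threshold
            self.g = 0.0    # g_0 = 0

        def update(self, eps):
            # g_t = max(0, g_{t-1} + (eps_t - v)); only increases accumulate.
            self.g = max(0.0, self.g + (eps - self.v))
            return self.g > self.h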

The Page Hinckley test is similar to CUSUM with a small variation as shown here:

g0 = 0 , gt = gt–1 + (ϵt – v)

To detect an increase, we track Gt = min(gt, Gt–1) and signal a change when gt – Gt > h; to detect a decrease, we track Gt = max(gt, Gt–1) and signal a change when Gt – gt > h.
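
The sketch below implements the increasing-mean variant, tracking the running minimum Gt and signalling a change when gt – Gt exceeds h; the parameter values are illustrative:

    class PageHinckley:
        """Page-Hinckley test for detecting an increase in the stream's mean."""

        def __init__(self, v=0.005, h=50.0):
            self.v = v            # tolerance per observation
            self.h = h            # detection threshold
            self.g = 0.0          # cumulative sum g_t
            self.g_min = 0.0      # G_t, the minimum of g_t over time

        def update(self, eps):
            self.g += eps - self.v                 # g_t = g_{t-1} + (eps_t - v)
            self.g_min = min(self.g_min, self.g)
            return self.g - self.g_min > self.h    # change when g_t - G_t > h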

Adaptation methods

Explicit and implicit adaptation are the two well-known techniques for adapting to environment changes when a change is detected.

Explicit adaptation

In explicit adaptation, an additional technique from among the following is used:

  • Retrain the model from scratch with new data so the previous model or data does not impact the new model
  • Update the model with the changes or new data such that the transition is smooth—assumes changes are gradual and not drastic
  • Create a sequence or ensemble of models that are learned over time—when a collaborative approach is better than any single model

Implicit adaptation

In implicit adaptation, we generally use ensemble algorithms/models to adapt to the concept change. This can range from combining the votes of several classifiers in an ensemble, to pairing each classifier with an adaptive windowing technique such as ADWIN; all of these fall within the choices for implicit adaptation.
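
As one simplified example of implicit adaptation, the ensemble below down-weights members that misclassify recent examples, so predictions gradually shift toward the models that fit the current concept; the decay factor is an illustrative assumption, and each member is assumed to expose hypothetical predict(x) and learn(x, y) methods for incremental learning:

    class DriftAwareEnsemble:
        """Implicit adaptation: member weights decay when they misclassify recent data."""

        def __init__(self, models, decay=0.9):
            self.models = models
            self.weights = [1.0] * len(models)
            self.decay = decay

        def predict(self, x):
            # Weighted vote across ensemble members.
            votes = {}
            for model, weight in zip(self.models, self.weights):
                label = model.predict(x)
                votes[label] = votes.get(label, 0.0) + weight
            return max(votes, key=votes.get)

        def update(self, x, y):
            # Penalize members that got the latest example wrong, then let all learn from it.
            for idx, model in enumerate(self.models):
                if model.predict(x) != y:
                    self.weights[idx] *= self.decay
                model.learn(x, y)  # assumes each member supports incremental learning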
