Concept drift and drift detection

As discussed in the introduction of the chapter, the dynamic nature of infinite streams stands in direct opposition to the basic principle of stationary learning; that is, that the distribution of the data or patterns remains constant. Although changes can be swift or abrupt, the discussion here is around slow, gradual changes. These slow, gradual changes are fairly hard to detect, and separating them from noise becomes tougher still:

Figure 1 Concept drift illustrated by the gradual change in color from yellow to blue in the bottom panel. Sampled data reflects the underlying change in data distribution, which must be detected and a new model learned.

There have been several techniques described in various studies in the last two decades that can be categorized as shown in the following figure:

Figure 2 Categories of drift detection techniques

Data management

The main idea is to maintain a model in memory that stays consistent with the dynamic nature of the data.

Partial memory

These techniques keep the most recent data in a memory buffer and use it to learn or to derive summary information. The key question, as discussed previously, is: what is the right window size to detect the change quickly while still providing enough data to learn from? In fixed window size techniques, the buffer is a queue: when a new instance with a recent timestamp arrives, the instance with the oldest timestamp is evicted. The window thus contains all the recent enough examples, and its size is generally chosen based on the physical availability of memory and the size of the data elements in the queue. In adaptive window size techniques, the queue is used in conjunction with a detection algorithm. When the detection algorithm indicates signs of drift based on performance evaluation, the window size can be reduced to remove old examples which no longer help the model.
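
As a minimal sketch of both ideas, the following Python class keeps a fixed-size queue of recent examples and shrinks it when a separate detection algorithm signals drift; the class name, sizes, and shrink factor are illustrative choices and not part of any standard library:

    from collections import deque

    class WindowedBuffer:
        """Partial-memory buffer that holds only the most recent examples."""

        def __init__(self, max_size=1000, min_size=100):
            # Fixed window: a queue that evicts the oldest example once it is full.
            self.window = deque(maxlen=max_size)
            self.min_size = min_size

        def add(self, x, y):
            self.window.append((x, y))  # newest example in, oldest out when full

        def shrink_on_drift(self, factor=0.5):
            # Adaptive window: when a detector signals drift, keep only the most
            # recent fraction of examples so stale data stops influencing the model.
            keep = max(self.min_size, int(len(self.window) * factor))
            self.window = deque(list(self.window)[-keep:], maxlen=self.window.maxlen)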

Full memory

The idea is to store sufficient statistics over all the examples or data seen. One way to do this is to assign weights to the examples and let those weights decay over time. Exponential weighting using a decay rate λ can be very effective:

wλ(xi) = exp(–λ * i)

where i is the age of example xi (i = 0 for the most recent example).
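
As a quick illustration, the helper below simply applies this formula to a list of example ages; the function name and the λ values used are hypothetical:

    import math

    def exponential_weights(ages, lam=0.01):
        """Weight for each example given its age i, where i = 0 is the most recent."""
        return [math.exp(-lam * i) for i in ages]

    # The most recent example gets weight 1.0 and older ones decay toward 0.
    weights = exponential_weights(ages=range(5), lam=0.5)
    # [1.0, 0.606..., 0.367..., 0.223..., 0.135...]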

Detection methods

Given the probability P(X) that the given data is observed, the probability of patterns/classes P(C), and the probability of the data given the class P(X|C), which constitutes the model, detection methods can be divided into two categories at a high level:

  • Monitoring evolution or performance of the model, classifier, or P(C|X)
  • Monitoring distributions from the environment or observing P(X), P(C), and P(X|C)

Monitoring model evolution

Although these methods are based on the assumption that learning is stationary and the data comes from independent and identically distributed (i.i.d.) sources, which doesn't hold true in many applications, they have nevertheless been shown to be effective. Some of the well-known techniques are described next.

Widmer and Kubat

This is one of the earliest methods; it monitors the error or misclassification rate along with structural changes to the model, such as new branches being added to a tree. Using these indicators and known thresholds, the learning window size is increased or decreased.
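
A rough sketch of this style of window adjustment is shown below; the thresholds and step sizes are illustrative assumptions rather than the values used in the original work:

    def adjust_window_size(window_size, error_rate, model_changed,
                           high_error=0.2, low_error=0.05,
                           min_size=50, max_size=5000):
        """Grow or shrink the learning window from the error rate and model changes."""
        if error_rate > high_error or model_changed:
            # Signs of drift: shrink the window to forget outdated examples quickly.
            return max(min_size, window_size // 2)
        if error_rate < low_error:
            # Stable concept: grow the window to learn from more data.
            return min(max_size, window_size + 100)
        return window_size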

Drift Detection Method or DDM

This method assumes that the parameter being monitored, such as whether the classifier labels an example correctly or incorrectly, is a binary random variable that follows a binomial distribution. At the ith point in the sequence it estimates the probability of misclassification pi with standard deviation si = sqrt(pi * (1 – pi) / i). The method then uses two levels:

  • Warning level: when pi + si ≥ pmin + 2 * smin
  • Detection level: when pi + si ≥ pmin + 3 * smin

All the examples between the "warning" and "detection" levels are used to train a new classifier that will replace the "non-performing" classifier when the "detection" level is reached.
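
The following is a simplified sketch of DDM over a stream of 0/1 misclassification indicators, following the rules above; the warm-up length of 30 examples is a common choice but should be treated as an assumption here:

    import math

    class DDM:
        """Drift Detection Method over a stream of 0/1 misclassification indicators."""

        def __init__(self, warm_up=30):
            self.warm_up = warm_up
            self.i = 0
            self.p = 1.0                  # running estimate of the misclassification rate
            self.p_min = float("inf")
            self.s_min = float("inf")

        def update(self, error):
            """error is 1 if the classifier was wrong on this example, 0 otherwise."""
            self.i += 1
            self.p += (error - self.p) / self.i            # incremental mean p_i
            s = math.sqrt(self.p * (1 - self.p) / self.i)  # binomial standard deviation s_i
            if self.p + s < self.p_min + self.s_min:
                self.p_min, self.s_min = self.p, s
            if self.i < self.warm_up:
                return "stable"           # too few examples for a reliable test
            if self.p + s >= self.p_min + 3 * self.s_min:
                return "detection"        # swap in the model trained since the warning
            if self.p + s >= self.p_min + 2 * self.s_min:
                return "warning"          # start collecting examples for a new model
            return "stable"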

Early Drift Detection Method or EDDM

EDDM uses the same framework as DDM but with a key modification: instead of monitoring the error rate (1 – accuracy), it monitors the distance between errors, that is, the number of examples between two consecutive misclassifications, and triggers the levels based on how that distance evolves.

EDDM computes the mean distance between two consecutive errors, p'i, and its standard deviation, s'i. The levels are then:

  • Warning level: (p'i + 2 * s'i) / (p'max + 2 * s'max) < α
  • Detection level: (p'i + 2 * s'i) / (p'max + 2 * s'max) < β

The parameters α and β are normally tuned by the user to around 95% and 90% respectively, so that the warning level fires before the detection level.
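
A simplified sketch of EDDM is given below, assuming we feed it the number of examples observed between consecutive errors; the warm-up length is an illustrative assumption and the thresholds follow the values above:

    import math

    class EDDM:
        """Early Drift Detection Method based on the distance between consecutive errors."""

        def __init__(self, alpha=0.95, beta=0.90, warm_up=30):
            self.alpha, self.beta, self.warm_up = alpha, beta, warm_up
            self.n = 0
            self.mean = 0.0       # p'_i: mean distance between two consecutive errors
            self.m2 = 0.0         # running sum of squared deviations (Welford's method)
            self.max_score = 0.0  # largest p'_i + 2 * s'_i seen so far

        def update(self, distance):
            """distance is the number of examples since the previous misclassification."""
            self.n += 1
            delta = distance - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (distance - self.mean)
            std = math.sqrt(self.m2 / self.n)
            score = self.mean + 2 * std
            self.max_score = max(self.max_score, score)
            if self.n < self.warm_up:
                return "stable"
            ratio = score / self.max_score
            if ratio < self.beta:
                return "detection"
            if ratio < self.alpha:
                return "warning"
            return "stable"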

Monitoring distribution changes

When there are no models or classifiers available to detect changes, we apply techniques that use some form of statistical test to monitor the data distribution directly. Owing to their assumptions, whether parametric or non-parametric, and their different biases, it is difficult to say concretely which test works best. Here we describe some of the well-known statistical tests.

Welch's t test

This is an adaptation of the Student's t test with two samples. The test is adapted to take two windows of sizes N1 and N2, with means x̄1 and x̄2 and variances s1² and s2², and to compute the statistic:

t = (x̄1 – x̄2) / sqrt(s1² / N1 + s2² / N2)

The resulting p value is used to accept or reject the null hypothesis that the two window means are equal.
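
In practice this test is available off the shelf; for example, SciPy's scipy.stats.ttest_ind with equal_var=False performs the unequal-variance (Welch) variant. The sketch below wraps it for drift detection; the 0.05 significance threshold is an illustrative choice:

    from scipy import stats

    def welch_drift(reference_window, current_window, alpha=0.05):
        """Flag drift when the two windows' means differ significantly (Welch's t test)."""
        t_stat, p_value = stats.ttest_ind(reference_window, current_window,
                                          equal_var=False)  # Welch's variant
        return p_value < alpha

    # Example: compare an older reference window against the most recent window.
    drifted = welch_drift([0.10, 0.20, 0.15, 0.12, 0.18], [0.40, 0.50, 0.45, 0.48, 0.42])
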
Kolmogorov-Smirnov's test

This statistical test is normally used to compare the distance between two distributions and check whether it is below a certain threshold. It can be adapted for change detection by taking two windows of sample sizes N1 and N2, with empirical cumulative distribution functions F1(x) and F2(x), and computing the KS distance:

D = maxx |F1(x) – F2(x)|

The null hypothesis, which assumes the two distributions are similar, is rejected with confidence α if and only if

D > c(α) * sqrt((N1 + N2) / (N1 * N2))

where c(α) is obtained by a lookup in the Kolmogorov-Smirnov table.
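
SciPy also provides the two-sample test as scipy.stats.ks_2samp, which computes the distance and its p value directly, so the table lookup is handled internally; the 0.05 threshold in the sketch below is illustrative:

    from scipy import stats

    def ks_drift(reference_window, current_window, alpha=0.05):
        """Flag drift when the empirical CDFs of the two windows differ significantly."""
        statistic, p_value = stats.ks_2samp(reference_window, current_window)
        return p_value < alpha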

CUSUM and Page-Hinckley test

The cumulative sum (CUSUM) is designed to indicate when the mean of the input is significantly different from zero:

g0 = 0 , gt = max(0, gt–1 + (ϵt – v))

We raise change detection when gt > h, where (h, v) are user-defined parameters. Note that the CUSUM test is memoryless and is one-sided or asymmetric, detecting only the increase.
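
A minimal sketch of the one-sided CUSUM detector follows; h and v are the user-defined parameters from the text, and the default values are illustrative:

    class CUSUM:
        """One-sided CUSUM: signals when the cumulative positive deviation exceeds h."""

        def __init__(self, v=0.05, h=5.0):
            self.v = v      # allowed slack per observation
            self.h = h      # detection threshold
            self.g = 0.0    # g_0 = 0

        def update(self, eps):
            # g_t = max(0, g_{t-1} + (eps_t - v)); only increases accumulate.
            self.g = max(0.0, self.g + (eps - self.v))
            return self.g > self.h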

The Page Hinckley test is similar to CUSUM with a small variation as shown here:

g0 = 0 , gt = gt–1 + (ϵt – v)

To detect an increase, we track Gt = min(gt, Gt–1) and signal a change when gt – Gt > h; to detect a decrease, we track Gt = max(gt, Gt–1) and signal a change when Gt – gt > h.
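
The sketch below implements the increasing-mean variant, tracking the running minimum Gt and signalling a change when gt – Gt exceeds h; the parameter values are illustrative:

    class PageHinckley:
        """Page-Hinckley test for detecting an increase in the stream's mean."""

        def __init__(self, v=0.005, h=50.0):
            self.v = v            # tolerance per observation
            self.h = h            # detection threshold
            self.g = 0.0          # cumulative sum g_t
            self.g_min = 0.0      # G_t, the minimum of g_t over time

        def update(self, eps):
            self.g += eps - self.v                 # g_t = g_{t-1} + (eps_t - v)
            self.g_min = min(self.g_min, self.g)
            return self.g - self.g_min > self.h    # change when g_t - G_t > h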

Adaptation methods

Explicit and implicit adaptation are the two well-known techniques for adapting to environment changes when a change is detected.

Explicit adaptation

In explicit adaptation, an additional technique from among the following is used:

  • Retrain the model from scratch with new data so the previous model or data does not impact the new model
  • Update the model with the changes or new data such that the transition is smooth—assumes changes are gradual and not drastic
  • Create a sequence or ensemble of models that are learned over time—when a collaborative approach is better than any single model

Implicit adaptation

In implicit adaptation, we generally use ensemble algorithms/models to adapt to the concept change. This can range from combining the votes of several classifiers in an ensemble, to pairing each classifier with an adaptive windowing technique such as ADWIN; all of these fall within the choices for implicit adaptation.
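
As one simplified example of implicit adaptation, the ensemble below down-weights members that misclassify recent examples, so predictions gradually shift toward the models that fit the current concept; the decay factor is an illustrative assumption, and each member is assumed to expose hypothetical predict(x) and learn(x, y) methods for incremental learning:

    class DriftAwareEnsemble:
        """Implicit adaptation: member weights decay when they misclassify recent data."""

        def __init__(self, models, decay=0.9):
            self.models = models
            self.weights = [1.0] * len(models)
            self.decay = decay

        def predict(self, x):
            # Weighted vote across ensemble members.
            votes = {}
            for model, weight in zip(self.models, self.weights):
                label = model.predict(x)
                votes[label] = votes.get(label, 0.0) + weight
            return max(votes, key=votes.get)

        def update(self, x, y):
            # Penalize members that got the latest example wrong, then let all learn from it.
            for idx, model in enumerate(self.models):
                if model.predict(x) != y:
                    self.weights[idx] *= self.decay
                model.learn(x, y)  # assumes each member supports incremental learning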
