As discussed in the introduction to this chapter, the dynamic nature of infinite streams stands in direct opposition to the basic premise of stationary learning, namely that the distribution of the data or patterns remains constant. Although changes can be swift or abrupt, the discussion here concerns slow, gradual changes. Such changes are hard to detect, and separating genuine change from noise is harder still:
Studies over the last two decades have described several techniques, which can be categorized as shown in the following figure:
The main idea is to maintain a model in memory that is consistent with the dynamic nature of the data.
These techniques keep the most recent data in a memory buffer and use it to learn or to derive summary information. The key question, as discussed previously, is: what is the right window size for detecting change and learning effectively? Fixed window size techniques use the idea of a queue: a new instance with a recent timestamp comes in and the instance with the oldest timestamp is evicted. The window thus contains all the sufficiently recent examples, and its size is generally chosen based on the physical availability of memory and the size of the data elements in the queue. Adaptive window size techniques use the queue in conjunction with a detection algorithm: when the detector signals drift based on performance evaluation, the window size is reduced to evict old examples that no longer help the model.
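Here is a minimal sketch of both behaviors in Java; the `SlidingWindow` class, its `onDriftDetected` method, and the capacity handling are illustrative assumptions rather than any particular library's API:

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Minimal sketch of a sliding window over a stream of instances. */
public class SlidingWindow<T> {
    private final Deque<T> window = new ArrayDeque<>();
    private int capacity;   // typically chosen from the memory budget and element size

    public SlidingWindow(int capacity) {
        this.capacity = capacity;
    }

    /** Fixed-size behavior: the newest instance comes in, the oldest is evicted. */
    public void add(T instance) {
        if (window.size() == capacity) {
            window.pollFirst();                  // evict the example with the oldest timestamp
        }
        window.addLast(instance);
    }

    /** Adaptive behavior: shrink the window when a drift detector raises a signal. */
    public void onDriftDetected(int reducedCapacity) {
        capacity = reducedCapacity;
        while (window.size() > capacity) {
            window.pollFirst();                  // old examples no longer help the model
        }
    }

    public int size() {
        return window.size();
    }
}
```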
The idea is to store sufficient statistics over all the examples seen so far. One way to do this is to assign each example a weight that decays over time. Exponential weighting with decay rate λ can be very effective:
wλ(xi) = exp(–λ * i), where i is the age of example xi (the number of time steps since it arrived)
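As a sketch of how such decayed sufficient statistics can be maintained incrementally, the following class keeps an exponentially weighted running mean; the `DecayedMean` name and the choice of tracked statistic are illustrative assumptions:

```java
/**
 * Sketch of sufficient statistics with exponential decay, following
 * w(xi) = exp(-lambda * i): at each step every past example's weight is
 * multiplied by exp(-lambda), so old data fades instead of being evicted.
 */
public class DecayedMean {
    private final double decay;        // exp(-lambda), with lambda chosen by the user
    private double weightedSum = 0.0;
    private double totalWeight = 0.0;

    public DecayedMean(double lambda) {
        this.decay = Math.exp(-lambda);
    }

    public void update(double x) {
        // Age all previous contributions by one step, then add the new example.
        weightedSum = decay * weightedSum + x;
        totalWeight = decay * totalWeight + 1.0;
    }

    public double mean() {
        return totalWeight == 0.0 ? Double.NaN : weightedSum / totalWeight;
    }
}
```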
Given the probability P(X) that the given data is observed, the probability of the patterns/classes P(C), and the probability of the data given the class P(X|C), which constitutes the model, the detection methods can be divided, at a high level, into two categories: those that monitor the evolution of the model's performance indicators, and those that monitor the data distributions over two different windows.
Although this approach rests on the assumption that learning is stationary and that the data comes from independent, identical distributions (i.i.d.), which doesn't hold in many applications, it has nevertheless been shown to be effective. Some of the well-known techniques are described next.
This is one of the earliest methods; it observes error rates or misclassification rates, along with changes to the model, such as new branches in a tree structure. Using these measures and known thresholds, the learning window size is increased or decreased.
This method assumes that the parameter being observed, such as whether the classifier labels an instance correctly or incorrectly, is a binary random variable that follows a binomial distribution. It models the probability of misclassification as pi with standard deviation si = sqrt(pi * (1 – pi) / i), where the values are computed at the ith point in the sequence. The method then uses two levels:

Warning level: pi + si ≥ pmin + 2 * smin
Detection level: pi + si ≥ pmin + 3 * smin

Here pmin and smin are the minimum values of pi and si recorded so far.
All the examples between the "warning" and "detection" levels are used to train a new classifier that will replace the "non-performing" classifier when the "detection" level is reached.
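A minimal sketch of this scheme, assuming the standard DDM formulation with its customary minimum of 30 examples before testing, might look as follows:

```java
/**
 * Sketch of a DDM-style detector. It tracks the running error rate pi with
 * si = sqrt(pi * (1 - pi) / i) and compares pi + si against the minimum
 * pmin + 2*smin (warning) and pmin + 3*smin (detection) recorded so far.
 */
public class DDMSketch {
    public enum Level { NONE, WARNING, DRIFT }

    private long i = 0;                // examples seen so far
    private double p = 0.0;            // running misclassification rate
    private double pMin = Double.MAX_VALUE;
    private double sMin = Double.MAX_VALUE;

    public Level update(boolean misclassified) {
        i++;
        p += ((misclassified ? 1.0 : 0.0) - p) / i;   // incremental mean of the 0/1 error
        double s = Math.sqrt(p * (1 - p) / i);
        if (i < 30) {
            return Level.NONE;         // customary minimum sample before testing
        }
        if (p + s < pMin + sMin) {     // remember the best point seen so far
            pMin = p;
            sMin = s;
        }
        if (p + s >= pMin + 3 * sMin) return Level.DRIFT;
        if (p + s >= pMin + 2 * sMin) return Level.WARNING;
        return Level.NONE;
    }
}
```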
EDDM uses the same framework as DDM but with a slight modification: instead of the error rate (1 – accuracy), it monitors the classification rate by tracking the distance between two consecutive errors, that is, the number of right predictions between two wrong predictions, and uses that distance to set the levels.
EDDM computes the running mean of the distance between two consecutive errors, pi′, and its standard deviation, si′. The levels are then:

Warning level: (pi′ + 2 * si′) / (pmax′ + 2 * smax′) < α
Detection level: (pi′ + 2 * si′) / (pmax′ + 2 * smax′) < β

Here pmax′ and smax′ are recorded at the point where pi′ + 2 * si′ reached its maximum.
The parameters α and β are normally tuned by the user to something around 0.95 and 0.90, respectively.
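A sketch under the same assumptions, tracking the distance between consecutive errors with a running mean and variance, could look like this; the 30-error minimum and the ALPHA/BETA defaults are illustrative:

```java
/**
 * Sketch of an EDDM-style detector: monitor the distance (number of examples)
 * between consecutive errors with a running mean/variance (Welford's method).
 */
public class EDDMSketch {
    public enum Level { NONE, WARNING, DRIFT }
    private static final double ALPHA = 0.95;    // warning threshold
    private static final double BETA = 0.90;     // detection threshold
    private static final int MIN_ERRORS = 30;

    private long n = 0, lastErrorAt = -1, numDistances = 0;
    private double mean = 0.0, m2 = 0.0;
    private double best = 0.0;                   // max of p' + 2*s' seen so far

    public Level update(boolean misclassified) {
        n++;
        if (!misclassified) return Level.NONE;
        if (lastErrorAt < 0) {                   // first error: nothing to measure yet
            lastErrorAt = n;
            return Level.NONE;
        }
        double d = n - lastErrorAt;              // distance between the two errors
        lastErrorAt = n;
        numDistances++;
        double delta = d - mean;                 // Welford update of p' and s'
        mean += delta / numDistances;
        m2 += delta * (d - mean);
        double s = Math.sqrt(m2 / numDistances);
        double current = mean + 2 * s;
        best = Math.max(best, current);
        if (numDistances < MIN_ERRORS || best == 0.0) return Level.NONE;
        double ratio = current / best;           // (p'i + 2*s'i) / (p'max + 2*s'max)
        if (ratio < BETA) return Level.DRIFT;
        if (ratio < ALPHA) return Level.WARNING;
        return Level.NONE;
    }
}
```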
When there are no models or classifiers with which to detect changes, we apply techniques that use some form of statistical test to monitor changes in the distribution itself. Owing to their differing assumptions, whether parametric or non-parametric, and their different biases, it is difficult to say concretely which works best. Some of the well-known statistical tests follow.
This is an adaptation of the Student's t test with two samples (Welch's t test). The test takes two windows of sizes N1 and N2, with means x̄1 and x̄2 and variances s1² and s2², computes the statistic

t = (x̄1 – x̄2) / sqrt(s1²/N1 + s2²/N2)

and uses the resulting p value to reject or accept the null hypothesis that the two windows have the same mean.
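A minimal sketch of the test over two windows follows; using the fixed critical value 1.96 (roughly 95% confidence for large windows) instead of a proper Student t lookup with Welch's degrees of freedom is a simplifying assumption:

```java
/** Sketch of the two-window Welch t test for detecting a shift in the mean. */
public final class WelchTTest {
    public static boolean meansDiffer(double[] w1, double[] w2) {
        double m1 = mean(w1), m2 = mean(w2);
        double v1 = variance(w1, m1), v2 = variance(w2, m2);
        double t = (m1 - m2) / Math.sqrt(v1 / w1.length + v2 / w2.length);
        return Math.abs(t) > 1.96;               // reject the null hypothesis of equal means
    }

    private static double mean(double[] w) {
        double sum = 0.0;
        for (double x : w) sum += x;
        return sum / w.length;
    }

    private static double variance(double[] w, double m) {
        double sum = 0.0;
        for (double x : w) sum += (x - m) * (x - m);
        return sum / (w.length - 1);             // unbiased sample variance
    }
}
```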
This statistical test is normally used to compare the distance between two distributions and validate whether it is below a certain threshold. It can be adapted for change detection by taking windows of two sample sizes, N1 and N2, with empirical cumulative distribution functions F1(x) and F2(x), and computing the KS distance:

D = maxx |F1(x) – F2(x)|
The null hypothesis, which assumes the two distributions are similar, is rejected with confidence α if and only if D > c(α) * sqrt((N1 + N2) / (N1 * N2)), where c(α) is obtained by a lookup in the Kolmogorov-Smirnov table.
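The following sketch computes the KS distance by merging the two sorted windows; the cAlpha parameter (for example, 1.36 for α = 0.05) is the table value described above:

```java
import java.util.Arrays;

/** Sketch of the two-sample Kolmogorov-Smirnov test over two windows. */
public final class KSTest {
    public static boolean distributionsDiffer(double[] w1, double[] w2, double cAlpha) {
        double[] x = w1.clone(), y = w2.clone();
        Arrays.sort(x);
        Arrays.sort(y);
        int i = 0, j = 0;
        double d = 0.0;                            // KS distance: max |F1(x) - F2(x)|
        while (i < x.length && j < y.length) {
            if (x[i] <= y[j]) i++; else j++;       // step the empirical CDFs in order
            double f1 = (double) i / x.length;
            double f2 = (double) j / y.length;
            d = Math.max(d, Math.abs(f1 - f2));
        }
        int n1 = x.length, n2 = y.length;
        double threshold = cAlpha * Math.sqrt((double) (n1 + n2) / ((double) n1 * n2));
        return d > threshold;                      // reject the null hypothesis
    }
}
```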
The cumulative sum (CUSUM) is designed to indicate when the mean of the input is significantly different from zero:
g0 = 0, gt = max(0, gt–1 + ϵt – v)
We raise change detection when gt > h, where (h, v) are user-defined parameters. Note that the CUSUM test is memoryless and one-sided or asymmetric, detecting only increases.
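A minimal sketch of this one-sided detector follows; resetting gt to zero after an alarm is a common convention rather than part of the definition:

```java
/**
 * Sketch of the one-sided CUSUM detector: gt = max(0, gt-1 + et - v),
 * raising a change when gt > h.
 */
public class Cusum {
    private final double v;   // allowed drift per observation
    private final double h;   // detection threshold
    private double g = 0.0;

    public Cusum(double v, double h) {
        this.v = v;
        this.h = h;
    }

    /** Feed the next observation (for example, a residual); true when change is flagged. */
    public boolean update(double e) {
        g = Math.max(0.0, g + e - v);
        if (g > h) {
            g = 0.0;          // reset so subsequent changes can be detected
            return true;
        }
        return false;
    }
}
```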
The Page-Hinkley test is similar to CUSUM, with a small variation as shown here:
g0 = 0, gt = gt–1 + (ϵt – v)
For detecting an increase, we keep the running minimum Gt = min(Gt–1, gt) and signal a change when gt – Gt > h; for detecting a decrease, we keep the running maximum Gt = max(Gt–1, gt) and signal a change when Gt – gt > h.
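A sketch of the increasing-mean variant follows, keeping the running minimum Gt as described above:

```java
/**
 * Sketch of the Page-Hinkley test for an increasing mean:
 * gt = gt-1 + (et - v), Gt = min(Gt-1, gt), change when gt - Gt > h.
 * The decreasing variant mirrors this with a running maximum.
 */
public class PageHinkley {
    private final double v;   // tolerated magnitude of change
    private final double h;   // detection threshold
    private double g = 0.0;
    private double gMin = 0.0;

    public PageHinkley(double v, double h) {
        this.v = v;
        this.h = h;
    }

    public boolean update(double e) {
        g += e - v;
        gMin = Math.min(gMin, g);
        return g - gMin > h;  // the cumulative rise above the minimum exceeds h
    }
}
```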
Explicit and implicit adaptation are the two well-known approaches to adapting to environmental changes once a change is detected.
In explicit adaptation, an additional technique from among the following is used:
In implicit adaptation, we generally use ensemble algorithms/models to adapt to the concept change. This can mean different combinations, ranging from selecting a single classifier from the ensemble for prediction, to combining the predictions of all the members, to pairing each classifier with an adaptive window such as ADWIN; all of these fall within the choices for implicit adaptation.
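As one illustrative sketch, not the only scheme available, the following ensemble decays the vote weight of any member that errs, in the style of weighted majority; the BETA penalty and the Function-based classifier signature are assumptions:

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/**
 * Sketch of implicit adaptation with a weighted ensemble: any member that
 * errs has its vote weight decayed, so the ensemble shifts toward
 * classifiers that fit the current concept.
 */
public class DriftingEnsemble<X> {
    private static final double BETA = 0.7;      // multiplicative penalty for a wrong vote
    private final List<Function<X, Boolean>> members;
    private final double[] weights;

    public DriftingEnsemble(List<Function<X, Boolean>> members) {
        this.members = members;
        this.weights = new double[members.size()];
        Arrays.fill(weights, 1.0);
    }

    /** Weighted-majority vote over the current member weights. */
    public boolean predict(X x) {
        double yes = 0.0, no = 0.0;
        for (int k = 0; k < members.size(); k++) {
            if (members.get(k).apply(x)) yes += weights[k]; else no += weights[k];
        }
        return yes >= no;
    }

    /** Once the true label arrives, decay the weights of members that erred. */
    public void update(X x, boolean label) {
        for (int k = 0; k < members.size(); k++) {
            if (members.get(k).apply(x) != label) {
                weights[k] *= BETA;
            }
        }
    }
}
```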