We will now describe some basic computations that can be performed on the stream of data. If we must run summary operations such as aggregations or histograms with limits on memory and speed, we can be sure that some kind of trade-off will be needed. Two well-known types of approximations in these situations are:
We will illustrate some basic computations and aggregations to highlight the difference between batch and stream-based calculations when we must compute basic operations with constraints on memory and yet consider the entire data:
Often, you don't need the entire data for computing statistics or summarizations but only the "recent past". In such cases, sliding window techniques are used to calculate summary statistics by keeping the window size either fixed or adaptable and moving it over the recent past.
ADaptable sliding WINdow (ADWIN) is one well-known technique used to detect change as well as estimating values needed in the computation. The idea behind ADWIN is to keep a variable-length window of last seen values with the characteristic that the window has a maximum length statistically consistent with the fact that there has been no change in the average value within the window. In other words, the older values are dropped if and only if a new incoming value would change the average. This has the two-fold advantage of recording change and maintaining the dynamic value, such as aggregate, over the recent streams. The determination of the subjective notion "large enough" for dropping items can be determined using the well-known Hoeffding bound as:
Here is the harmonic mean between two windows W0 and W1 of size |W0| and |W1| respectively, with W1 containing the more recent elements. Further, let and be the respective calculated averages.
The algorithm can be generalized as:
W ← W ∪ {xt} //add new instance xt to the head of Window W
ADWIN has also shown that it provides theoretical bounds on false positives and false negatives, which makes it a very promising technique to use.
In many stream-based algorithms, there is a need to reduce the data or select a subset of data for analysis. The normal methodology of sampling on the whole data must be augmented for stream-based data.
The key concerns in sampling that must be addressed are how unbiased the samples are and how representative they are of the population from which streams are being generated. In a non-streaming environment, this depends completely on the sample size and the sampling method. Uniform random sampling (Chapter 2, Practical Approach to Real-World Supervised Learning) is one of the most well-known techniques employed to reduce the data in the batch data world. The reservoir sampling technique is considered to be a very effective way of reducing the data given the memory constraints.
The basic idea of reservoir sampling is to keep a reservoir or sample of fixed size, say k, and every element that enters the stream has a probability k/n of replacing an older element in the reservoir. The detailed algorithm is shown here:
ReservoirSampling(x:inputstream, k:sizeOfReservoir) //add first k elements to reservoir for(i = 0; i < k; i++) addToReservoir(x) while (x){ for(i = 0; i < k; i++) //flip a coin to get random integer r = randomInteger[1..n] if(r ≤ k){ //move it inside the reservoir addToReservoir(x) //delete an instance randomly from reservoir position = randomInteger[1..k] removeInstance(position) } }
There are extensions to these such as the Min-wise Sampling and Load Shedding that overcome some issues associated with the base method.
3.15.137.59