Basic stream processing and computational techniques

We will now describe some basic computations that can be performed on the stream of data. If we must run summary operations such as aggregations or histograms with limits on memory and speed, we can be sure that some kind of trade-off will be needed. Two well-known types of approximations in these situations are:

  • ϵ Approximation: The computation is close to the exact value within the fraction ϵ of error.
  • (ϵ, δ) Approximation: The computation is close to the exact value within 1 ± ϵ with probability within 1 – δ.

Stream computations

We will illustrate some basic computations and aggregations to highlight the difference between batch and stream-based calculations when we must compute basic operations with constraints on memory and yet consider the entire data:

  • Frequency count or point queries: The generic technique of Count-Min Sketch has been successfully applied to perform various summarizations on the data streams. The primary technique is creating a window of size w x d. Then, given a desired probability (δ) and admissible error (ϵ), the size of data in memory can be created using w = 2/ ϵ and Stream computations. Associated with each row is a hash function: h(.). This uniformly transforms a value x to a value in the interval [1, 2 … w]. This method of lookup and updates can be used for performing point queries of values or dot products or frequency counts.
  • Distinct count: The generic technique of Hash-Sketch can be used to perform "distinct values" queries or counts. Given the domain of incoming stream values x ∈ [0,1,2….N-1], the hash function h(x) maps the values uniformly across [0,1,….2L-1], where L=O(log N).
  • Mean: Computing the mean without the need for storing all the values is very useful and is normally employed using a recursive method where only the number of observations (n) and sum of values seen so far (∑xn) is needed:
    Stream computations
  • Standard deviation: Like the mean, standard deviation can be computed using the memoryless option with only the number of observations (n), sum of values seen so far (∑xn), and sum of squares of the values (∑xn2):
    Stream computations
  • Correlation coefficient: Given a stream of two different values, many algorithms need to compute the correlation coefficient between the two which can be done by maintaining the running sum of each stream (∑xn and ∑yn), the sum of squared values (∑xn2 and ∑yn2), and the cross-product (∑xnx yn). The correlation is given by:
    Stream computations

Sliding windows

Often, you don't need the entire data for computing statistics or summarizations but only the "recent past". In such cases, sliding window techniques are used to calculate summary statistics by keeping the window size either fixed or adaptable and moving it over the recent past.

ADaptable sliding WINdow (ADWIN) is one well-known technique used to detect change as well as estimating values needed in the computation. The idea behind ADWIN is to keep a variable-length window of last seen values with the characteristic that the window has a maximum length statistically consistent with the fact that there has been no change in the average value within the window. In other words, the older values are dropped if and only if a new incoming value would change the average. This has the two-fold advantage of recording change and maintaining the dynamic value, such as aggregate, over the recent streams. The determination of the subjective notion "large enough" for dropping items can be determined using the well-known Hoeffding bound as:

Sliding windows

Here Sliding windows is the harmonic mean between two windows W0 and W1 of size |W0| and |W1| respectively, with W1 containing the more recent elements. Further, let Sliding windows and Sliding windows be the respective calculated averages.

The algorithm can be generalized as:

  1. ADWIN (x: inputstream, δ: confidence)
  2. init (W) //Initialize Window W
  3. while (x){

    W ← W ∪ {xt} //add new instance xt to the head of Window W

  4. repeat W ← W – xold //drop elements from tail of the window
  5. Sliding windows < Sliding windows holds for every split of W
  6. output Sliding windows
  7. }

ADWIN has also shown that it provides theoretical bounds on false positives and false negatives, which makes it a very promising technique to use.

Sampling

In many stream-based algorithms, there is a need to reduce the data or select a subset of data for analysis. The normal methodology of sampling on the whole data must be augmented for stream-based data.

The key concerns in sampling that must be addressed are how unbiased the samples are and how representative they are of the population from which streams are being generated. In a non-streaming environment, this depends completely on the sample size and the sampling method. Uniform random sampling (Chapter 2, Practical Approach to Real-World Supervised Learning) is one of the most well-known techniques employed to reduce the data in the batch data world. The reservoir sampling technique is considered to be a very effective way of reducing the data given the memory constraints.

The basic idea of reservoir sampling is to keep a reservoir or sample of fixed size, say k, and every element that enters the stream has a probability k/n of replacing an older element in the reservoir. The detailed algorithm is shown here:

ReservoirSampling(x:inputstream, k:sizeOfReservoir)
//add first k elements to reservoir
for(i = 0; i < k; i++)
  addToReservoir(x)
  while (x){
    for(i = 0; i < k; i++)
    //flip a coin to get random integer
    r = randomInteger[1..n]
    if(r ≤ k){
      //move it inside the reservoir
      addToReservoir(x)
      //delete an instance randomly from reservoir
      position = randomInteger[1..k]
      removeInstance(position)
    }
}

There are extensions to these such as the Min-wise Sampling and Load Shedding that overcome some issues associated with the base method.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.222.22.145