Anomaly detection in website traffic

In the second example, we'll focus on modeling the opposite of the previous example. Instead of modeling what fraudulent cases look like, we'll model the normal, expected behavior of the system. If something cannot be matched against our model of expected behavior, it will be considered anomalous.

Dataset

We'll work with a publicly available dataset released by Yahoo Labs that is useful for discussing how to detect anomalies in time series data. For Yahoo, the main use case is in detecting unusual traffic on Yahoo servers.

Even though Yahoo announced that their data is publicly available, you have to apply to use it, and it takes about 24 hours before the approval is granted. The dataset is available here:

http://webscope.sandbox.yahoo.com/catalog.php?datatype=s&did=70

The dataset comprises real traffic to Yahoo services, along with some synthetic data. In total, it contains 367 time series, each of which contains between 741 and 1680 observations recorded at regular intervals. Each series is stored in its own file, one observation per line, and is accompanied by a second indicator column that holds a one if the observation is an anomaly and a zero otherwise. The anomalies in the real data were determined by human judgment, while those in the synthetic data were generated algorithmically. A snippet of the synthetic time series data is shown in the following table:

[Table: a snippet of the synthetic time series data]

In the following section, we'll learn how to transform time series data into an attribute representation that allows us to apply machine learning algorithms.

Anomaly detection in time series data

Detecting anomalies in raw, streaming time series data requires some data transformations. The most obvious approach is to select a time window and sample the time series into segments of fixed length. In the next step, we want to compare a new segment to our previously collected set to detect whether something is out of the ordinary.
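As a toy illustration of this windowing step, the following sketch cuts a series of values into non-overlapping fixed-length samples; the window length of four and the example values are illustrative assumptions, not data from the Yahoo dataset:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Windowing {
  public static void main(String[] args) {
    List<Double> series = Arrays.asList(5.0, 7.0, 6.0, 8.0, 50.0, 6.0, 7.0, 5.0);
    int windowSize = 4; // illustrative window length

    // collect non-overlapping windows of fixed length
    List<List<Double>> windows = new ArrayList<>();
    for (int start = 0; start + windowSize <= series.size(); start += windowSize) {
      windows.add(new ArrayList<>(series.subList(start, start + windowSize)));
    }
    System.out.println(windows); // [[5.0, 7.0, 6.0, 8.0], [50.0, 6.0, 7.0, 5.0]]
  }
}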

The comparison can be done with various techniques, as follows:

  • Forecasting the most probable following value, as well as confidence intervals (for example, Holt-Winters exponential smoothing). If a new value falls outside the forecasted confidence interval, it is considered anomalous.
  • Cross-correlation compares a new sample to a library of positive samples and looks for an exact match. If no match is found, the sample is marked as anomalous.
  • Dynamic time warping is similar to cross-correlation, but allows signal distortion in the comparison.
  • Discretizing the signal into bands, where each band corresponds to a letter. For example, A=[min, mean/3], B=[mean/3, mean*2/3], and C=[mean*2/3, max] transforms the signal into a sequence of letters such as AABAACAABBA… (see the short sketch after this list). This approach reduces storage and allows us to apply the text-mining algorithms that we will discuss in Chapter 10, Text Mining with Mallet – Topic Modeling and Spam Detection.
  • Distribution-based approaches estimate the distribution of values in a specific time window. When we observe a new sample, we can compare its distribution to the previously observed ones.
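The band-based discretization can be illustrated with a small, self-contained sketch; the band boundaries follow the example above, while the class name and the toy signal are illustrative assumptions:

import java.util.Arrays;

public class BandDiscretizer {

  // Encode each value as a letter depending on which band it falls into,
  // using thirds of the signal mean as band boundaries, as in the example above.
  public static String discretize(double[] signal) {
    double mean = Arrays.stream(signal).average().orElse(0.0);
    StringBuilder encoded = new StringBuilder();
    for (double value : signal) {
      if (value < mean / 3) {
        encoded.append('A');
      } else if (value < mean * 2 / 3) {
        encoded.append('B');
      } else {
        encoded.append('C');
      }
    }
    return encoded.toString();
  }

  public static void main(String[] args) {
    double[] signal = {1.0, 2.0, 8.0, 1.5, 9.0, 0.5};
    System.out.println(discretize(signal)); // prints ABCBCA for this toy signal
  }
}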

This list is by no means exhaustive. Different approaches focus on detecting different kinds of anomalies (for example, in value, frequency, or distribution). We will focus on a version of the distribution-based approaches.

Histogram-based anomaly detection

In histogram-based anomaly detection, we split the signal into windows of a selected size, as shown in the following diagram.

For each window, we calculate a histogram; that is, for a selected number of buckets, we count how many values fall into each bucket. The histogram captures the basic distribution of values in the selected time window, as shown at the center of the diagram.

Histograms can then be directly presented as instances, where each bin corresponds to an attribute. Furthermore, we can reduce the number of attributes by applying a dimensionality-reduction technique such as Principal Component Analysis (PCA), which allows us to visualize the reduced-dimension histograms in a plot, as shown at the bottom right of the diagram, where each dot corresponds to a histogram.
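Weka ships a PrincipalComponents filter that can perform such a reduction. The following is a minimal sketch with default filter settings, assuming dataset already holds the histogram instances that we build later in this section:

import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.PrincipalComponents;

// Reduce the histogram attributes with PCA; 'dataset' is assumed to be the
// Instances object of histograms that we construct later in this section.
PrincipalComponents pca = new PrincipalComponents();
pca.setInputFormat(dataset);
Instances reduced = Filter.useFilter(dataset, pca);
System.out.println("Attributes after PCA: " + reduced.numAttributes());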

In our example, the idea is to observe website traffic for a couple of days and then create histograms for, say, four-hour time windows to build a library of positive behavior. If the histogram of a new time window cannot be matched against the positive library, we can mark it as an anomaly:

[Figure: histogram-based anomaly detection. A signal is split into time windows, each window is summarized as a histogram, and the histograms are projected into a low-dimensional plot]

To compare a new histogram to the set of existing histograms, we will use a density-based k-nearest neighbors algorithm, Local Outlier Factor (LOF) (Breunig et al., 2000). The algorithm is able to handle clusters with different densities, as shown in the following image. For example, the upper-right cluster is large and widespread compared to the bottom-left cluster, which is smaller and denser:

[Figure: LOF handles clusters with different densities, such as a large, widespread cluster at the upper right and a smaller, denser cluster at the bottom left]

Let's get started!

Loading the data

In the first step, we'll need to load the data from text files into Java objects. The files are stored in a folder; each file contains a single time series with one value per line. We'll load them into a list of lists of Double values:

// imports needed: java.io.BufferedReader, java.io.FileReader, java.util.ArrayList, java.util.List
String filePath = "chap07/ydata/A1Benchmark/real";
List<List<Double>> rawData = new ArrayList<List<Double>>();

We will need the min and max values for histogram normalization, so let's collect them in this data pass:

double max = Double.MIN_VALUE;
double min = Double.MAX_VALUE;

for(int i = 1; i <= 67; i++){
  List<Double> sample = new ArrayList<Double>();
  BufferedReader reader = new BufferedReader(new FileReader(filePath+i+".csv"));
  
  boolean isAnomaly = false;
  reader.readLine(); // skip the header line
  while(reader.ready()){
    String[] line = reader.readLine().split(",");
    double value = Double.parseDouble(line[1]);
    sample.add(value);
    
    max = Math.max(max, value);
    min = Math.min(min, value);
    
    if(line[2].trim().equals("1")) // the third column flags anomalous observations
      isAnomaly = true;
    
  }
  System.out.println(isAnomaly);
  reader.close();
  
  rawData.add(sample);
}

The data is loaded; now let's move on to histograms.

Creating histograms

We will create a histogram for each time window of WIN_SIZE width. The histogram will hold HIST_BINS value buckets. The histograms, each represented as an array of doubles, will be stored in an array list:

int WIN_SIZE = 500;
int HIST_BINS = 20;
int current = 0;

List<double[]> dataHist = new ArrayList<double[]>();
for(List<Double> sample : rawData){
  double[] histogram = new double[HIST_BINS];
  for(double value : sample){
    // normalize the value to [0, 1] and map it to one of the HIST_BINS buckets
    int bin = toBin(normalize(value, min, max), HIST_BINS);
    histogram[bin]++;
    current++;
    if(current == WIN_SIZE){
      // the window is full: store its histogram and start a new one
      current = 0;
      dataHist.add(histogram);
      histogram = new double[HIST_BINS];
    }
  }
  dataHist.add(histogram); // store the last, possibly partial, window
}
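The normalize() and toBin() helper methods used above are not shown in this excerpt; a minimal sketch of one possible implementation is as follows:

// Scale a value linearly into the [0, 1] range using the global min and max.
static double normalize(double value, double min, double max) {
  return (value - min) / (max - min);
}

// Map a normalized [0, 1] value to a bucket index in [0, bins - 1].
static int toBin(double normalized, int bins) {
  if (normalized >= 1.0) {
    return bins - 1; // keep the maximum value inside the last bucket
  }
  return (int) (normalized * bins);
}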

The histograms are now complete. The last step is to transform them into Weka's Instance objects. Each histogram value will correspond to one Weka attribute, as follows:

// imports needed: weka.core.Attribute, weka.core.DenseInstance, weka.core.Instances
ArrayList<Attribute> attributes = new ArrayList<Attribute>();
for(int i = 0; i < HIST_BINS; i++){
  attributes.add(new Attribute("Hist-"+i));
}
Instances dataset = new Instances("My dataset", attributes, dataHist.size());
for(double[] histogram : dataHist){
  // DenseInstance wraps the histogram values; 1.0 is the instance weight
  dataset.add(new DenseInstance(1.0, histogram));
}

The dataset is now loaded and ready to be plugged into an anomaly-detection algorithm.

Density-based k-nearest neighbors

To demonstrate how LOF calculates scores, we'll first split the dataset into a training set and a testing set using the testCV(int, int) function. The first parameter specifies the number of folds, while the second parameter specifies which fold to return.

// split data to train and test
Instances trainData = dataset.testCV(2, 0);
Instances testData = dataset.testCV(2, 1);

The LOF algorithm is not a part of the default Weka distribution, but it can be downloaded through Weka's package manager:

http://weka.sourceforge.net/packageMetaData/localOutlierFactor/index.html

The LOF algorithm is implemented with two interfaces: as an unsupervised attribute filter that calculates LOF values (known-unknowns) and as a supervised k-nearest-neighbors classifier (known-knowns). In our case, we want to calculate the outlier factor, so we'll use the unsupervised filter interface:

import weka.filters.unsupervised.attribute.LOF;

The filter is initialized in the same way as any other filter. We can specify the number of neighbors, for example, k=3, with the -min and -max parameters. LOF allows us to specify two different k values, which are used internally as the lower and upper bound on the number of neighbors when computing the LOF values:

LOF lof = new LOF();
// set the neighborhood bounds before determining the input format
lof.setOptions(new String[]{"-min", "3", "-max", "3"});
lof.setInputFormat(trainData);

Next, we load the training instances into the filter, where they will serve as a library of positive examples. After the loading is complete, we call the batchFinished() method to initialize the internal calculations:

for(Instance inst : trainData){
  lof.input(inst);
}
lof.batchFinished();

Finally, we can apply the filter to the test data. The filter will process the instances and append an additional attribute at the end containing the LOF score. We can simply output the score to the console:

Instances testDataLofScore = Filter.useFilter(testData, lof);

for(Instance inst : testDataLofScore){
  System.out.println(inst.value(inst.numAttributes()-1));
}

The LOF score of the first couple of test instances is as follows:

1.306740014927325
1.318239332210458
1.0294812291949587
1.1715039094530768

To understand the LOF values, we need some background on the LOF algorithm. It compares the density of an instance to the density of its nearest neighbors. The two scores are divided, producing the LOF score. A LOF score around 1 indicates that the density is approximately equal, while higher LOF values indicate that the density of the instance is substantially lower than the density of its neighbors. In such cases, the instance can be marked as anomalous.
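To turn these scores into anomaly decisions, one simple option is to flag the test windows whose LOF score exceeds a chosen threshold; the cutoff of 1.5 below is an illustrative assumption, not a value from the chapter:

// Flag test windows with a LOF score above an illustrative threshold.
double threshold = 1.5;
for (Instance inst : testDataLofScore) {
  double lofScore = inst.value(inst.numAttributes() - 1);
  if (lofScore > threshold) {
    System.out.println("Anomalous window, LOF = " + lofScore);
  }
}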
