Intrusion detection and density-based methods

This section gives a formal definition of outliers, built on concepts such as the Local Reachability Density (LRD) and the Local Outlier Factor (LOF). Generally speaking, an outlier is a data point that deviates so much from the others that it appears not to have been generated by the same distribution as the rest of the data.

Given a dataset D, a DB(x, y)-outlier p is defined as follows:

[Equation: definition of a DB(x, y)-outlier]
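
For reference, the usual distance-based formulation can be written as shown below. This is only a sketch of the standard definition: it assumes that x is a percentage of the dataset, that y is a distance threshold, and that d is the distance function. None of these roles is confirmed by the text.

```latex
p \text{ is a } DB(x, y)\text{-outlier}
\iff
\frac{\left|\{\, q \in D : d(p, q) > y \,\}\right|}{|D|} \;\ge\; \frac{x}{100}
```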

The k-distance of a data point p is the distance between p and a data point o that is a member of D:

[Equations: definition of the k-distance of p]
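
In the standard LOF formulation, the point o is chosen so that it is the k-th nearest neighbor of p. A sketch of that definition follows; the notation d(p, o) for the distance function is an assumption.

```latex
\begin{gather*}
k\text{-distance}(p) = d(p, o), \quad o \in D, \text{ where} \\
\left|\{\, o' \in D \setminus \{p\} : d(p, o') \le d(p, o) \,\}\right| \ge k \\
\left|\{\, o' \in D \setminus \{p\} : d(p, o') < d(p, o) \,\}\right| \le k - 1
\end{gather*}
```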

The k-distance neighborhood of an object p is defined as follows, where each q is a k-nearest neighbor of p:

[Equation: definition of the k-distance neighborhood of p]
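
A sketch of the standard form of this definition is given below. The symbol N_k(p) for the neighborhood is an assumption. Note that the set can contain more than k objects when several points lie at exactly the k-distance.

```latex
N_k(p) = \{\, q \in D \setminus \{p\} : d(p, q) \le k\text{-distance}(p) \,\}
```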

The following formula gives the reachability distance of an object, p, with respect to an object, o:

[Equation: reachability distance of p with respect to o]
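
In the standard LOF formulation, the reachability distance is the true distance between the two objects, but never less than the k-distance of o. A sketch, with the subscript k as an assumed notation:

```latex
\text{reach-dist}_k(p, o) = \max\{\, k\text{-distance}(o),\; d(p, o) \,\}
```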

The Local Reachability Density (LRD) of a data object o is defined as follows:

[Equation: definition of the LRD of o]
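
The standard definition is the inverse of the average reachability distance from the object to the members of its k-distance neighborhood. The sketch below assumes the notation N_k(o) for that neighborhood and lrd_k for the density.

```latex
\text{lrd}_k(o) =
\left( \frac{\sum_{q \in N_k(o)} \text{reach-dist}_k(o, q)}{|N_k(o)|} \right)^{-1}
```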

The Local Outlier Factor (LOF) is defined as follows. It measures the degree to which an object is an outlier:

[Equation: definition of the LOF of p]
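
In the standard formulation, the LOF of p is the average ratio between the local reachability density of each neighbor and that of p itself. Values close to 1 indicate that p is about as dense as its neighbors, and values well above 1 indicate an outlier. The notation below (N_k, lrd_k) is assumed.

```latex
\text{LOF}_k(p) =
\frac{\sum_{o \in N_k(p)} \dfrac{\text{lrd}_k(o)}{\text{lrd}_k(p)}}{|N_k(p)|}
```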

A property of LOF(p) is given by the following equations:

[Equations: property of LOF(p)]
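
One well-known property of LOF is a pair of bounds expressed through the reachability distances around p. Whether this is the exact property meant here is an assumption; the names direct and indirect and the subscript k are also assumed. The maxima are defined analogously to the minima.

```latex
\begin{gather*}
\text{direct}_{\min}(p) = \min\{\, \text{reach-dist}_k(p, q) : q \in N_k(p) \,\} \\
\text{indirect}_{\min}(p) = \min\{\, \text{reach-dist}_k(q, o) : q \in N_k(p),\; o \in N_k(q) \,\} \\
\frac{\text{direct}_{\min}(p)}{\text{indirect}_{\max}(p)}
\;\le\; \text{LOF}_k(p) \;\le\;
\frac{\text{direct}_{\max}(p)}{\text{indirect}_{\min}(p)}
\end{gather*}
```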

These equations are illustrated as follows:

[Figure: illustration of the preceding equations]
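
The chain from k-distance to LOF can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions, not the code from the book's bundle: the function name lof_scores is hypothetical, distances are Euclidean (via dist), and exact duplicate points are not handled (they would produce an infinite density).

```r
# Hypothetical helper: LOF score for every row of a numeric matrix.
# Assumes Euclidean distance and no exact duplicate rows.
lof_scores <- function(x, k) {
  x <- as.matrix(x)
  n <- nrow(x)
  stopifnot(k >= 1, k < n)

  d <- as.matrix(dist(x))  # pairwise Euclidean distances
  diag(d) <- Inf           # a point is never its own neighbor

  # k-distance: distance to the k-th nearest neighbor
  k_dist <- apply(d, 1, function(row) sort(row)[k])

  # k-distance neighborhood: every point within the k-distance (ties kept)
  neighbors <- lapply(seq_len(n), function(i) which(d[i, ] <= k_dist[i]))

  # Local reachability density: inverse of the mean reachability distance
  lrd <- vapply(seq_len(n), function(i) {
    nb <- neighbors[[i]]
    reach <- pmax(k_dist[nb], d[i, nb])
    1 / mean(reach)
  }, numeric(1))

  # LOF: mean density of the neighbors divided by the point's own density
  vapply(seq_len(n), function(i) {
    mean(lrd[neighbors[[i]]]) / lrd[i]
  }, numeric(1))
}

# Four corners of a unit square plus one distant point
pts <- rbind(c(0, 0), c(0, 1), c(1, 0), c(1, 1), c(10, 10))
scores <- lof_scores(pts, k = 2)
print(round(scores, 2))
```

In this toy dataset the four corner points have the same density as their neighbors, so their scores are 1, while the distant point receives a score well above 1.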

The OPTICS-OF algorithm

The input parameters for the OPTICS-OF algorithm are:

  • The dataset
  • A parameter
  • Another parameter

The output of the algorithm is the value of CBLOF for all records.

The pseudocode of the OPTICS-OF algorithm is summarized as follows:

[Figure: pseudocode of the OPTICS-OF algorithm]

The High Contrast Subspace algorithm

The pseudocode of the High Contrast Subspace (HiCS) algorithm is summarized as follows. The input parameters are S, M, and one further parameter. The output is a contrast, |S|.

[Figure: pseudocode of the HiCS algorithm]
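
The idea of a subspace contrast can be sketched in base R as a Monte Carlo estimate. This is a rough illustration under several assumptions, not the book's pseudocode: the function name hics_contrast is hypothetical, S is taken to be a vector of column indices, M the number of iterations, and the third parameter is assumed to be a slice-size fraction, called alpha here. The deviation measure (a two-sample Kolmogorov-Smirnov statistic) and the per-attribute slice size are also choices made for this sketch.

```r
# Hypothetical sketch of a HiCS-style contrast estimate for a subspace S.
# Assumptions: S = column indices, M = iterations, alpha = slice fraction.
hics_contrast <- function(data, S, M = 50, alpha = 0.1) {
  data <- as.matrix(data)
  n <- nrow(data)
  stopifnot(length(S) >= 2, n >= 2)

  # Each conditioning attribute keeps this many rows (an assumed sizing rule)
  block <- max(2, ceiling(n * alpha^(1 / (length(S) - 1))))

  # Two-sample Kolmogorov-Smirnov statistic computed from empirical CDFs
  ks_stat <- function(a, b) {
    z <- c(a, b)
    max(abs(ecdf(a)(z) - ecdf(b)(z)))
  }

  deviations <- numeric(M)
  for (i in seq_len(M)) {
    target <- S[sample.int(length(S), 1)]  # attribute to compare
    keep <- rep(TRUE, n)
    for (a in setdiff(S, target)) {
      # Random contiguous slice of the rows, ordered by attribute a
      ord <- order(data[, a])
      start <- sample.int(n - block + 1, 1)
      in_slice <- rep(FALSE, n)
      in_slice[ord[start:(start + block - 1)]] <- TRUE
      keep <- keep & in_slice
    }
    # Deviation between the marginal and the conditional distribution
    deviations[i] <- if (sum(keep) >= 2) {
      ks_stat(data[, target], data[keep, target])
    } else {
      0
    }
  }
  mean(deviations)
}

set.seed(1)
x <- runif(500)
correlated  <- cbind(x, x + rnorm(500, sd = 0.01))
independent <- cbind(runif(500), runif(500))
c_corr  <- hics_contrast(correlated,  S = c(1, 2))
c_indep <- hics_contrast(independent, S = c(1, 2))
```

A subspace whose attributes depend on each other should yield a noticeably higher contrast than one with independent attributes, which is what makes the measure useful for choosing subspaces before running an outlier ranking such as LOF.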

The R implementation

The R code for the previously mentioned algorithms is in the file ch_07_density_based.R, which is part of the code bundle. The code can be tested using the following command:

> source("ch_07_density_based.R")

Intrusion detection

Any malicious activity against systems, networks, and servers can be treated as an intrusion, and finding such activities is called intrusion detection.

Intrusion detection problems are typically characterized by a high volume of data, a lack of labeled data (which could otherwise serve as training data for a specific solution), time series data, and a false alarm rate in the input dataset.

Intrusion detection systems are of two types: host-based and network-based. A popular architecture for intrusion detection based on data mining is illustrated in the following diagram:

[Figure: architecture of a data mining-based intrusion detection system]

Because of these characteristics of intrusion detection, the core algorithms applied in an outlier detection system are usually semi-supervised or unsupervised.
