Classification algorithms can be used to detect outliers. The usual strategy is to train a one-class model on only the normal data points in the training dataset. Once the model is set up, any data point that is not accepted by the model is marked as an outlier.
The OCSVM (One-Class SVM) algorithm projects the input data into a high-dimensional feature space and, in the process, iteratively finds the maximum-margin hyperplane. The hyperplane, defined in a Gaussian reproducing kernel Hilbert space, best separates the training data from the origin. With \nu \in (0, 1] acting as an upper bound on the fraction of outliers in the training data, the solution of OCSVM can be represented by the solution of the following optimization problem (subject to \langle w, \Phi(x_i) \rangle \ge \rho - \xi_i and \xi_i \ge 0):

\min_{w,\,\xi,\,\rho} \; \frac{1}{2}\|w\|^2 + \frac{1}{\nu n}\sum_{i=1}^{n}\xi_i - \rho
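As a minimal illustration (not part of the book's bundled code), a one-class SVM of this kind can be fitted in R with the e1071 package; the toy data and the nu and gamma values below are assumptions chosen only for this sketch:

library(e1071)

# Train a one-class SVM on "normal" points only; any point the model
# does not accept is treated as an outlier (nu and gamma are illustrative)
set.seed(1)
normal <- matrix(rnorm(200 * 2), ncol = 2)                  # normal training data
test   <- rbind(matrix(rnorm(20 * 2), ncol = 2),            # new normal points
                matrix(rnorm(5 * 2, mean = 4), ncol = 2))   # obvious outliers

ocsvm <- svm(normal, type = "one-classification",
             kernel = "radial", nu = 0.05, gamma = 0.5)

accepted <- predict(ocsvm, test)   # TRUE for points accepted by the model
which(!accepted)                   # indices of the points marked as outliers

Here, nu plays the role of the bound on the fraction of outliers described above; raising it makes the model reject more points.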
This algorithm is based on the k-Nearest Neighbor (kNN) algorithm, extended with a few formulas for scoring outliers.
The local density is denoted as follows:

\hat{\rho}(x) = \frac{1}{n \, V\!\left(\|x - NN^{tr}(x)\|\right)}

Here, NN^{tr}(x) is the nearest neighbor of x in the training set, n is the number of training objects, and V(r) is the volume of a hypersphere with radius r.

The distance between the test object, x, and its nearest neighbor in the training set, NN^{tr}(x), is defined like this:

d_1(x) = \|x - NN^{tr}(x)\|

The distance between this nearest neighbor (NN^{tr}(x)) and its own nearest neighbor in the training set (NN^{tr}(NN^{tr}(x))) is defined as follows:

d_2(x) = \|NN^{tr}(x) - NN^{tr}(NN^{tr}(x))\|

A data object is marked as an outlier once d_1(x) / d_2(x) > \theta for a chosen threshold \theta (commonly \theta = 1), or, in another format, x is marked as an outlier when

\frac{\|x - NN^{tr}(x)\|}{\|NN^{tr}(x) - NN^{tr}(NN^{tr}(x))\|} > \theta
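The following is a minimal sketch of this nearest-neighbor distance ratio in R, using the FNN package; the toy data and the threshold of 1 are assumptions made only for illustration:

library(FNN)

set.seed(2)
train <- matrix(rnorm(300 * 2), ncol = 2)       # training (normal) data
test  <- rbind(matrix(rnorm(10 * 2), ncol = 2),
               c(6, 6))                         # one clear outlier

# d1: distance from each test object to its nearest neighbor in the training set
nn1 <- get.knnx(train, test, k = 1)
d1  <- nn1$nn.dist[, 1]

# d2: distance from that training neighbor to its own nearest training neighbor
nn_train <- get.knn(train, k = 1)
d2 <- nn_train$nn.dist[nn1$nn.index[, 1], 1]

# Mark a test object as an outlier when d1/d2 exceeds the threshold (here 1)
ratio <- d1 / d2
which(ratio > 1)

A larger threshold makes the detector more conservative, flagging only objects that are much farther from the training data than the training data is from itself.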
Look up the R code file, ch_07_classification_based.R, in the bundle of R code for the previously mentioned algorithms. The code can be tested with the following command:

> source("ch_07_classification_based.R")
Web server performance measurements are very important, both to the business and to operating system management. These measurements can take the form of CPU usage, network bandwidth, storage, and so on.
The dataset can come from various sources, such as benchmark data and logs. The types of outliers that appear during the monitoring of a web server are point outliers, contextual outliers, and collective outliers.
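As a hypothetical illustration of the point-outlier case (the simulated CPU-usage values and parameters below are invented for this sketch, not taken from any real monitoring dataset), the one-class SVM approach from this section could be applied like this:

library(e1071)

set.seed(3)
cpu_normal <- matrix(rnorm(500, mean = 35, sd = 5), ncol = 1)             # typical load, in %
cpu_new    <- matrix(c(rnorm(20, mean = 35, sd = 5), 95, 99), ncol = 1)   # two sudden spikes

model    <- svm(cpu_normal, type = "one-classification", nu = 0.02)
accepted <- predict(model, cpu_new)
cpu_new[!accepted, 1]   # measurements rejected by the model: the point outliers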