Outlier detection based on clustering algorithms focuses on the relation between data objects and clusters.
Hierarchical clustering to detect outliers
Outlier detection that uses the hierarchical clustering algorithm is based on the k-Nearest Neighbor graph. The input parameters are the input dataset, DATA, of n data points, each with k variables; the distance measure function (d); one hierarchical algorithm (h); a threshold (t); and the cluster number (nc).
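The text lists the inputs but not the procedure itself, so the following is only a minimal sketch under a stated assumption: the dataset is cut into nc clusters, and members of any cluster holding less than a fraction t of the data are treated as outlier candidates. The helper name hclust_outliers, the small-cluster rule, and the toy data are illustrative assumptions, not the book's method; only base R functions (dist, hclust, cutree) are used.

```r
# Hypothetical helper (not from the book's code bundle): flag the members of
# very small clusters as outlier candidates.
# Assumption: "small" means the cluster holds less than a fraction t of the rows.
hclust_outliers <- function(data, d = "euclidean", h = "single", t = 0.05, nc = 3) {
  dm <- dist(data, method = d)        # pairwise distances with measure d
  tree <- hclust(dm, method = h)      # hierarchical algorithm h
  labels <- cutree(tree, k = nc)      # cut the tree into nc clusters
  sizes <- table(labels)              # number of points per cluster
  small <- as.integer(names(sizes)[as.vector(sizes) / nrow(data) < t])
  which(labels %in% small)            # row indices of the candidates
}

# Toy data: two tight groups of 50 points and one distant point (row 101).
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 5, sd = 0.3), ncol = 2),
           c(20, 20))
out_idx <- hclust_outliers(x, t = 0.05, nc = 3)
```

Single linkage is chosen here because it tends to leave isolated points in clusters of their own; other values of h may need a different t or nc.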
The k-means-based algorithm
The process of outlier detection based on the k-means algorithm is illustrated in the following diagram:
The pseudocode for outlier detection using the k-means algorithm is summarized as follows:
Phase 1 (Data preparation):
The target observations and attributes should be aligned to improve the accuracy of the result and performance of the k-means clustering algorithm.
If the original dataset has missing data, it must be handled first. The missing entries are filled with the maximum-likelihood values estimated by the EM algorithm.
Phase 2 (The outlier detection procedure):
The k value should be determined in order to run the k-means clustering algorithm. The proper k value is decided by referring to the value of the Cubic Clustering Criterion.
The k-means clustering algorithm runs with the chosen k value. On completion, the expert checks the external and internal outliers in the clustering results. If eliminating the outliers leaves the other groups meaningful, the expert stops the procedure. If the other groups need to be recalculated, the expert runs the k-means clustering algorithm again, this time without the detected outliers.
Phase 3 (Review and validation):
The outliers found in the previous phase are only candidates for this phase. The true outliers are identified by applying domain knowledge to these candidates.
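Phase 2 above can be sketched in base R as follows. This is an illustration under assumptions, not the book's implementation: k is fixed by hand instead of being chosen with the Cubic Clustering Criterion, the expert's check is replaced by an assumed rule (distance to the assigned centroid more than 5 MADs above the median), and the helper centroid_dist and the toy data are hypothetical.

```r
# Hypothetical helper: Euclidean distance of every row to its assigned centroid.
centroid_dist <- function(data, fit) {
  unname(sqrt(rowSums((data - fit$centers[fit$cluster, , drop = FALSE])^2)))
}

# Toy data: two tight groups of 50 points and one distant point (row 101).
set.seed(42)
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 5, sd = 0.3), ncol = 2),
           c(2.5, 12))

k <- 2                                  # assumed; the text chooses k via the CCC
fit <- kmeans(x, centers = k, nstart = 10)
dist_c <- centroid_dist(x, fit)

# Assumed stand-in for the expert's check: unusually large centroid distance.
cutoff <- median(dist_c) + 5 * mad(dist_c)
candidates <- which(dist_c > cutoff)

# Re-run k-means without the candidates so they no longer pull the centroids.
keep <- setdiff(seq_len(nrow(x)), candidates)
refit <- kmeans(x[keep, , drop = FALSE], centers = k, nstart = 10)
```

The rows in candidates remain only candidates; as Phase 3 states, they still need to be reviewed with domain knowledge.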
The ODIN algorithm
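The ODIN (Outlier Detection using Indegree Number) algorithm is based on the k-Nearest Neighbor graph: an object is flagged as an outlier when its indegree in that graph, that is, the number of objects that count it among their k nearest neighbors, is at most a given threshold.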
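A minimal sketch of the indegree idea in base R is given below. The helper name knn_indegree, the choice k = 5, the threshold of 0, and the toy data are assumptions made for illustration; the code bundle referenced by the text may organize this differently.

```r
# Hypothetical helper: indegree of every point in the directed kNN graph.
knn_indegree <- function(data, k) {
  dm <- as.matrix(dist(data))           # full pairwise distance matrix
  diag(dm) <- Inf                       # a point is not its own neighbour
  # Each column of nn holds the indices of one point's k nearest neighbours.
  nn <- apply(dm, 1, function(row) order(row)[seq_len(k)])
  tabulate(as.vector(nn), nbins = nrow(dm))   # incoming edges per point
}

# Toy data: one tight group of 50 points and one distant point (row 51).
set.seed(7)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2), c(10, 10))
indeg <- knn_indegree(x, k = 5)

# Assumed threshold: points that no other point lists as a neighbour.
odin_candidates <- which(indeg <= 0)
```

Points on the edge of a dense group can also have a low indegree, so the threshold and k usually need tuning for the dataset at hand.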
The R implementation
Look up the R code file, ch_07_clustering_based.R, in the bundle of R code for the previously mentioned algorithms. The code can be tested with the following command: