In this chapter, you will learn how to write R codes to detect outliers in real-world cases. Generally speaking, outliers arise for various reasons, such as the dataset being compromised with data from different classes and data measurement system errors.
As per their characteristics, outliers differ dramatically from the usual data in the original dataset. Versatile solutions are developed to detect them, which include model-based methods, proximity-based methods, density-based methods, and so on.
In this chapter, we will cover the following topics:
Here is a diagram illustrating a classification of outlier detection methods:
The output of an outlier detection system can be categorized into two groups: one is the labeled result and the other is the scored result (or an ordered list).
One major solution to detect outliers is the model-based method or statistical method. The outlier is defined as the object not belonging to the model that is used to represent the original dataset. In other words, that model does not generate the outlier.
Among the accurate models to be adopted for the specific dataset, there are many choices available such as Gaussian and Poisson. If the wrong model is used to detect outliers, the normal data point may wrongly be recognized as an outlier. In addition to applying the single distribution model, the mixture of distribution models is practical too.
The log-likelihood function is adopted to find the estimation of parameters of a model:
Look up the file of R codes, ch_07_lboutlier_detection.R
, from the bundle of R codes for the previously mentioned algorithm. The codes can be tested with the following command:
> source("ch_07_lboutlier_detection.R")
Fraud denotes the criminal activities that happen in various commercial companies, such as credit card, banking, or insurance companies. For credit card fraud detection, two major applications are covered, fraudulent application of credit card and fraudulent usage of credit card. The fraud represents behavior anomalous to the average usage of credit cards to certain users, that is, transaction records of the users.
This kind of outlier statistically denotes credit card theft, which deviates from the normal nature of criminal activities. Some examples of outliers in this case are high rate of purchase, very high payments, and so on.
The location of payment, the user, and the context are possible attributes in the dataset. The clustering algorithms are the possible solutions.
18.226.187.55