Chapter 7. Outlier Detection

In this chapter, you will learn how to write R code to detect outliers in real-world cases. Outliers arise for various reasons, such as a dataset being contaminated with data from a different class, or errors in the data measurement system.

By definition, outliers differ markedly from the bulk of the data in the original dataset. Various approaches have been developed to detect them, including model-based, proximity-based, and density-based methods.

In this chapter, we will cover the following topics:

  • Credit card fraud detection and statistical methods
  • Activity monitoring—the detection of fraud of mobile phones and proximity-based methods
  • Intrusion detection and density-based methods
  • Intrusion detection and clustering-based methods
  • Monitoring the performance of network-based and classification-based methods
  • Detecting novelty in text, topic detection, and mining contextual outliers
  • Collective outliers on spatial data
  • Outlier detection in high-dimensional data

Here is a diagram illustrating a classification of outlier detection methods:

[Figure: classification of outlier detection methods]

The output of an outlier detection system can be categorized into two groups: one is the labeled result and the other is the scored result (or an ordered list).
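As a small illustration of the two output types, the following sketch starts from a vector of anomaly scores and derives both an ordered list and a set of labels. The scores, the object names, and the threshold are hypothetical values chosen for this example only.

```r
# Hypothetical anomaly scores for five objects (higher = more anomalous).
scores <- c(a = 0.10, b = 0.35, c = 2.90, d = 0.20, e = 1.70)

# Scored result: objects ordered from most to least anomalous.
ranked <- sort(scores, decreasing = TRUE)

# Labeled result: a threshold (an assumption here) turns each score into a label.
threshold <- 1.5
labels <- ifelse(scores > threshold, "outlier", "normal")

print(ranked)
print(labels)
```

A scored result leaves the choice of cutoff to the analyst, whereas a labeled result fixes that choice inside the detector.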

Credit card fraud detection and statistical methods

One major approach to detecting outliers is the model-based, or statistical, method. An outlier is defined as an object that does not fit the model used to represent the original dataset. In other words, the model is unlikely to have generated the outlier.

Many models are available to fit a specific dataset, such as the Gaussian and Poisson distributions. If the wrong model is used, a normal data point may be wrongly identified as an outlier. Besides a single distribution model, a mixture of distributions is also practical.
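The following minimal sketch shows the single-distribution case, assuming the data are roughly Gaussian. The data are synthetic, and the cutoff of three standard deviations is a common convention rather than a value taken from this chapter.

```r
set.seed(42)

# Synthetic data: 200 draws from a Gaussian plus two injected extreme values.
x <- c(rnorm(200, mean = 50, sd = 5), 95, 120)

# Fit a single Gaussian model by estimating its mean and standard deviation.
mu <- mean(x)
sigma <- sd(x)

# Standardized distance of each point from the fitted model.
z <- (x - mu) / sigma

# Points that the fitted Gaussian is very unlikely to generate (cutoff is an assumption).
flagged <- which(abs(z) > 3)
print(flagged)
```

If the data actually followed a skewed or heavy-tailed distribution, the same cutoff would flag many legitimate points, which is the risk of choosing the wrong model.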


The parameters of a model are estimated by maximizing the log-likelihood function:

[Equations: log-likelihood function]
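To make parameter estimation by log-likelihood concrete, here is a minimal sketch for a single Gaussian model on synthetic data. The function name loglik and the starting values are assumptions made for this example; the closed-form estimates are compared with a numerical maximization using optim().

```r
set.seed(1)
x <- rnorm(500, mean = 10, sd = 2)   # synthetic sample

# Log-likelihood of a Gaussian with parameters par = (mu, sigma).
loglik <- function(par, x) {
  sum(dnorm(x, mean = par[1], sd = par[2], log = TRUE))
}

# Closed-form maximum-likelihood estimates for the Gaussian.
mu_hat <- mean(x)
sigma_hat <- sqrt(mean((x - mu_hat)^2))

# Numerical maximization (optim minimizes, so the sign is flipped).
fit <- optim(
  par = c(median(x), mad(x)),          # rough starting values
  fn = function(p) -loglik(p, x),
  method = "L-BFGS-B",
  lower = c(-Inf, 1e-6)                # keep sigma strictly positive
)

print(c(mu_hat, sigma_hat))
print(fit$par)
```

The two sets of estimates should agree closely; for models without closed-form estimates, only the numerical route is available.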

The likelihood-based outlier detection algorithm

The summarized pseudocode of the likelihood-based outlier detection algorithm is as follows:

[Figure: pseudocode of the likelihood-based outlier detection algorithm]
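The following is a hedged sketch of one common formulation of likelihood-based outlier detection, not necessarily the exact procedure in the book's code bundle. It assumes a two-component mixture: normal points follow a Gaussian and anomalies follow a uniform distribution over the data range. The function name lb_outliers, the mixing proportion lambda, and the threshold are hypothetical. Each point is tentatively moved from the normal set M to the anomaly set A, and the move is kept only if the total log-likelihood rises by more than the threshold.

```r
lb_outliers <- function(x, lambda = 0.05, threshold = 1) {
  in_m <- rep(TRUE, length(x))       # TRUE: the point is in the normal set M
  log_pa <- -log(diff(range(x)))     # log-density of the uniform anomaly model

  # Log-likelihood of the whole dataset under the two-component mixture.
  total_ll <- function(in_m) {
    m <- x[in_m]
    n_a <- sum(!in_m)
    sigma <- sqrt(mean((m - mean(m))^2))   # Gaussian refitted on M
    length(m) * log(1 - lambda) +
      sum(dnorm(m, mean(m), sigma, log = TRUE)) +
      n_a * (log(lambda) + log_pa)
  }

  ll <- total_ll(in_m)
  for (i in seq_along(x)) {
    in_m[i] <- FALSE                 # tentatively move x[i] from M to A
    ll_new <- total_ll(in_m)
    if (ll_new - ll > threshold) {
      ll <- ll_new                   # keep the move: x[i] is treated as an outlier
    } else {
      in_m[i] <- TRUE                # undo the move
    }
  }
  which(!in_m)                       # indices of detected outliers
}

set.seed(7)
x <- c(rnorm(100), 8, -9)            # synthetic data with two injected outliers
print(lb_outliers(x))
```

The result depends on the order in which points are visited and on the values of lambda and threshold, so these should be treated as tuning parameters.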

The R implementation

The R code for the algorithm described above is in the file ch_07_lboutlier_detection.R in the book's code bundle. It can be tested with the following command:

> source("ch_07_lboutlier_detection.R")

Credit card fraud detection

Fraud refers to criminal activity that targets commercial organizations such as credit card, banking, or insurance companies. Credit card fraud detection covers two major cases: fraudulent applications for credit cards and fraudulent usage of credit cards. Fraud appears as behavior that is anomalous relative to a user's typical credit card usage, that is, relative to the user's transaction records.

Statistically, an outlier of this kind may indicate credit card theft, because the thief's activity deviates from the cardholder's normal behavior. Examples of such outliers include a high rate of purchases and very high payments.
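As a rough illustration of comparing each payment with the same user's own history, the sketch below computes a robust z-score per user. The transaction records, column names, and the cutoff of 3.5 are invented for this example and do not come from a real dataset.

```r
# Synthetic transaction records; user IDs and amounts are invented for illustration.
tx <- data.frame(
  user   = c("u1", "u1", "u1", "u1", "u1", "u2", "u2", "u2", "u2", "u2"),
  amount = c(20, 25, 22, 24, 900, 300, 310, 295, 305, 290)
)

# Robust z-score: distance from the median, scaled by the median absolute deviation.
robust_z <- function(v) (v - median(v)) / mad(v)

# Score each amount against the same user's other transactions.
tx$score <- ave(tx$amount, tx$user, FUN = robust_z)

# Flag payments far from the user's typical amount (cutoff is an assumption).
tx$suspicious <- abs(tx$score) > 3.5
print(tx)
```

Note that a payment of 300 is ordinary for user u2 but would be extreme for user u1, which is why the score is computed per user rather than over all transactions.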

Possible attributes in the dataset include the location of the payment, the user, and the context. Clustering algorithms are one possible solution.
