12.3 Statistical Approaches

As with statistical methods for clustering, statistical methods for outlier detection make assumptions about data normality. They assume that the normal objects in a data set are generated by a stochastic process (a generative model). Consequently, normal objects occur in regions of high probability for the stochastic model, and objects in the regions of low probability are outliers.

The general idea behind statistical methods for outlier detection is to learn a generative model fitting the given data set, and then identify those objects in low-probability regions of the model as outliers. However, there are many different ways to learn generative models. In general, statistical methods for outlier detection can be divided into two major categories: parametric methods and nonparametric methods, according to how the models are specified and learned.

A parametric method assumes that the normal data objects are generated by a parametric distribution with parameter $\Theta$. The probability density function of the parametric distribution, $f(x \mid \Theta)$, gives the probability that object x is generated by the distribution. The smaller this value, the more likely x is an outlier.

A nonparametric method does not assume an a priori statistical model. Instead, a nonparametric method tries to determine the model from the input data. Note that most nonparametric methods do not assume that the model is completely parameter-free. (Such an assumption would make learning the model from the data nearly impossible.) Instead, nonparametric methods often take the position that the number and nature of the parameters are flexible and not fixed in advance. Examples of nonparametric methods include histograms and kernel density estimation.

12.3.1 Parametric Methods

In this subsection, we introduce several simple yet practical parametric methods for outlier detection. We first discuss methods for univariate data based on normal distribution. We then discuss how to handle multivariate data using multiple parametric distributions.

Detection of Univariate Outliers Based on Normal Distribution

Data involving only one attribute or variable are called univariate data. For simplicity, we often choose to assume that data are generated from a normal distribution. We can then learn the parameters of the normal distribution from the input data, and identify the points with low probability as outliers.

Let’s start with univariate data. We will try to detect outliers by assuming the data follow a normal distribution.

Example 12.8

Univariate outlier detection using maximum likelihood

Suppose a city’s average temperature values in July in the last 10 years are, in value-ascending order, 24.0° C, 28.9° C, 28.9° C, 29.0° C, 29.1° C, 29.1° C, 29.2° C, 29.2° C, 29.3° C, and 29.4° C. Let’s assume that the average temperature follows a normal distribution, which is determined by two parameters: the mean, μ, and the standard deviation, σ.

We can use the maximum likelihood method to estimate the parameters μ and σ. That is, we maximize the log-likelihood function

$$\ln \mathcal{L}(\mu, \sigma^2) = \sum_{i=1}^{n} \ln f(x_i \mid \mu, \sigma^2) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln \sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2 \qquad (12.1)$$

where n is the total number of samples, which is 10 in this example.

Taking derivatives with respect to μ and σ² and solving the resulting system of first-order conditions leads to the following maximum likelihood estimates:

$$\hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad (12.2)$$

$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 \qquad (12.3)$$

In this example, we have

$$\hat{\mu} = \frac{24.0 + 28.9 + 28.9 + 29.0 + 29.1 + 29.1 + 29.2 + 29.2 + 29.3 + 29.4}{10} = 28.61$$

and

$$\hat{\sigma}^2 = \frac{(24.0 - 28.61)^2 + (28.9 - 28.61)^2 + \cdots + (29.4 - 28.61)^2}{10} \approx 2.38.$$

Accordingly, we have $\hat{\sigma} \approx \sqrt{2.38} \approx 1.54$.

The most deviating value, 24.0° C, is 4.61° C away from the estimated mean. We know that the $\mu \pm 3\sigma$ region contains 99.7% of the data under the assumption of a normal distribution. Because $4.61 / 1.54 \approx 3.0$, the value 24.0° C lies about three standard deviations below the estimated mean, so the probability that it was generated by the fitted normal distribution is less than 0.15%, and it can thus be identified as an outlier.

Example 12.8 illustrates a simple yet practical outlier detection method: it labels any object as an outlier if the object lies more than 3σ away from the mean of the estimated distribution, where σ is the standard deviation.
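The following is a minimal sketch, in Python with NumPy, of how the 3σ rule of Example 12.8 can be applied. The data are the temperatures from the example; the variable names are illustrative only.

```python
import numpy as np

# July temperature readings from Example 12.8 (degrees Celsius).
temps = np.array([24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4])

# Maximum likelihood estimates of the normal distribution's parameters
# (Eqs. 12.2 and 12.3): the sample mean and the divide-by-n standard deviation.
mu = temps.mean()
sigma = temps.std()              # ddof=0 by default, i.e., the MLE

# Distance of each reading from the estimated mean, in standard deviations.
z = np.abs(temps - mu) / sigma
for t, zi in zip(temps, z):
    print(f"{t:5.1f}  z = {zi:4.2f}")

# The reading 24.0 comes out at roughly z = 3, while every other reading has
# z < 0.6; the 3-sigma rule flags exactly this kind of extreme deviation.
```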

Such straightforward methods for statistical outlier detection can also be used in visualization. For example, the boxplot method (described in Chapter 2) plots the univariate input data using a five-number summary (Figure 12.3): the smallest nonoutlier value (Min), the lower quartile (Q1), the median (Q2), the upper quartile (Q3), and the largest nonoutlier value (Max). The interquartile range (IQR) is defined as Q3 − Q1. Any object that is more than 1.5 × IQR smaller than Q1 or 1.5 × IQR larger than Q3 is treated as an outlier because the region between Q1 − 1.5 × IQR and Q3 + 1.5 × IQR contains 99.3% of the objects. The rationale is similar to using 3σ as the threshold for a normal distribution.


Figure 12.3 Using a boxplot to visualize outliers.
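The boxplot fences can be computed directly; the sketch below applies NumPy's percentile function to the temperature data of Example 12.8 (variable names are illustrative).

```python
import numpy as np

temps = np.array([24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4])

# Quartiles of the five-number summary and the interquartile range.
q1, q3 = np.percentile(temps, [25, 75])
iqr = q3 - q1

# Boxplot rule: anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is an outlier.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = temps[(temps < lower) | (temps > upper)]
print(f"fences: [{lower:.2f}, {upper:.2f}]   outliers: {outliers}")
```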

Another simple statistical method for univariate outlier detection using a normal distribution is Grubbs' test (also known as the maximum normed residual test). For each object x in a data set, we define a z-score as

$$z = \frac{|x - \bar{x}|}{s} \qquad (12.4)$$

where $\bar{x}$ is the mean, and s is the standard deviation of the input data. An object x is an outlier if

$$z \geq \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{\alpha/(2N),\,N-2}}{N - 2 + t^2_{\alpha/(2N),\,N-2}}} \qquad (12.5)$$

where $t_{\alpha/(2N),\,N-2}$ is the value taken by a t-distribution with $N-2$ degrees of freedom at a significance level of $\alpha/(2N)$, and N is the number of objects in the data set.
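A small sketch of Grubbs' test follows, assuming SciPy is available for the t-distribution quantile; the helper name grubbs_outlier is made up for illustration.

```python
import numpy as np
from scipy import stats

def grubbs_outlier(data, alpha=0.05):
    """Return the most deviating value if Grubbs' test (Eqs. 12.4-12.5)
    flags it at significance level alpha, otherwise None."""
    x = np.asarray(data, dtype=float)
    n = len(x)
    z = np.abs(x - x.mean()) / x.std(ddof=1)   # z-scores (Eq. 12.4), s = sample std
    # Critical value built from the t-distribution quantile (Eq. 12.5).
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    critical = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    i = np.argmax(z)
    return x[i] if z[i] > critical else None

temps = [24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4]
print(grubbs_outlier(temps))    # reports 24.0 as an outlier at alpha = 0.05
```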

Detection of Multivariate Outliers

Data involving two or more attributes or variables are multivariate data. Many univariate outlier detection methods can be extended to handle multivariate data. The central idea is to transform the multivariate outlier detection task into a univariate outlier detection problem. Here, we use two examples to illustrate this idea.

Example 12.9

Multivariate outlier detection using the Mahalanobis distance

For a multivariate data set, let $\bar{\mathbf{o}}$ be the mean vector. For an object, o, in the data set, the Mahalanobis distance from o to $\bar{\mathbf{o}}$ is

$$MDist(\mathbf{o}, \bar{\mathbf{o}}) = (\mathbf{o} - \bar{\mathbf{o}})^T \, S^{-1} \, (\mathbf{o} - \bar{\mathbf{o}}) \qquad (12.6)$$

where S is the covariance matrix.

$MDist(\mathbf{o}, \bar{\mathbf{o}})$ is a univariate variable, and thus Grubbs' test can be applied to this measure. Therefore, we can transform the multivariate outlier detection task as follows (a brief code sketch is given after the list):

1. Calculate the mean vector from the multivariate data set.

2. For each object o, calculate $MDist(\mathbf{o}, \bar{\mathbf{o}})$, the Mahalanobis distance from o to $\bar{\mathbf{o}}$.

3. Detect outliers in the transformed univariate data set, $\{MDist(\mathbf{o}, \bar{\mathbf{o}})\}$.

4. If $MDist(\mathbf{o}, \bar{\mathbf{o}})$ is determined to be an outlier, then o is regarded as an outlier as well.
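The following is a minimal NumPy sketch of steps 1–3; the toy data set and the helper name mahalanobis_distances are invented for illustration. Grubbs' test from earlier would then be run on the resulting distances (step 4).

```python
import numpy as np

def mahalanobis_distances(X):
    """Squared Mahalanobis distance from each row of X to the mean vector,
    following Eq. (12.6)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)                       # step 1: the mean vector
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - mean
    # step 2: (o - mean)^T S^{-1} (o - mean) for every object o
    return np.einsum('ij,jk,ik->i', diff, S_inv, diff)

# A toy 2-D data set in which the last object deviates strongly.
X = np.array([[1.0, 2.0], [1.1, 2.1], [0.9, 1.9], [1.2, 2.2],
              [1.0, 2.1], [0.8, 1.8], [5.0, 9.0]])
d = mahalanobis_distances(X)                    # step 3: a univariate data set
print(np.round(d, 2))                           # the last distance dominates
```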

Our second example uses the χ²-statistic to measure the distance between an object and the mean of the input data set.

Example 12.10

Multivariate outlier detection using the χ²-statistic

The χ²-statistic can also be used to capture multivariate outliers under the assumption of a normal distribution. For an object, o, the χ²-statistic is

$$\chi^2 = \sum_{i=1}^{n} \frac{(o_i - E_i)^2}{E_i} \qquad (12.7)$$

where $o_i$ is the value of o on the ith dimension, $E_i$ is the mean of the ith dimension among all objects, and n is the dimensionality. If the χ²-statistic is large, the object is an outlier.
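A corresponding sketch of Eq. (12.7), again on an invented toy data set; the function name chi_square_scores is illustrative.

```python
import numpy as np

def chi_square_scores(X):
    """Chi-square statistic of Eq. (12.7) for every object (row) of X."""
    X = np.asarray(X, dtype=float)
    E = X.mean(axis=0)                     # E_i: mean of the ith dimension
    return ((X - E) ** 2 / E).sum(axis=1)

X = np.array([[1.0, 2.0], [1.1, 2.1], [0.9, 1.9], [1.2, 2.2], [5.0, 9.0]])
print(np.round(chi_square_scores(X), 2))   # the last object scores far higher
```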

Using a Mixture of Parametric Distributions

Assuming that the data are generated by a single normal distribution works well in many situations. However, this assumption may be overly simplistic when the actual data distribution is complex. In such cases, we instead assume that the data were generated by a mixture of parametric distributions.

Example 12.11

Multivariate outlier detection using multiple parametric distributions

Consider the data set in Figure 12.4. There are two big clusters, C1 and C2. Assuming that the data are generated by a single normal distribution would not work well here. The estimated mean would be located between the two clusters and not inside either cluster. The objects between the two clusters could not be detected as outliers because they are close to the mean.


Figure 12.4 A complex data set.

To overcome this problem, we can instead assume that the normal data objects are generated by multiple normal distributions, two in this case. That is, we assume two normal distributions, $\Theta_1(\mu_1, \sigma_1)$ and $\Theta_2(\mu_2, \sigma_2)$. For any object, o, in the data set, the probability that o is generated by the mixture of the two distributions is given by

$$\Pr(o \mid \Theta_1, \Theta_2) = f_{\Theta_1}(o) + f_{\Theta_2}(o),$$

where $f_{\Theta_1}$ and $f_{\Theta_2}$ are the probability density functions of $\Theta_1$ and $\Theta_2$, respectively. We can use the expectation-maximization (EM) algorithm (Chapter 11) to learn the parameters $\mu_1, \sigma_1, \mu_2, \sigma_2$ from the data, as we do in mixture models for clustering. Each cluster is represented by a learned normal distribution. An object, o, is detected as an outlier if it does not belong to any cluster, that is, the probability is very low that it was generated by the combination of the two distributions.
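As a sketch of this idea, the fragment below fits a two-component Gaussian mixture with EM using scikit-learn (assumed to be available) on synthetic data loosely resembling Figure 12.4; the 2% cutoff is an arbitrary illustrative choice, not part of the method.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two dense clusters plus a few scattered points, loosely like Figure 12.4.
C1 = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2))
C2 = rng.normal(loc=[6.0, 6.0], scale=0.5, size=(200, 2))
noise = rng.uniform(low=-2.0, high=8.0, size=(10, 2))
X = np.vstack([C1, C2, noise])

# Learn the two normal distributions with the EM algorithm.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# score_samples gives the log-density of each object under the fitted mixture;
# objects with very low density do not belong to either cluster.
log_density = gmm.score_samples(X)
threshold = np.percentile(log_density, 2)        # flag the lowest 2%
print(X[log_density < threshold])
```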

Example 12.12

Multivariate outlier detection using multiple clusters

Most of the data objects shown in Figure 12.4 are in either C1 or C2. Other objects, representing noise, are uniformly distributed in the data space. A small cluster, C3, is highly suspicious because it is not close to either of the two major clusters, C1 and C2. The objects in C3 should therefore be detected as outliers.

Note that identifying the objects in C3 as outliers is difficult, whether we assume that the given data follow a single normal distribution or a mixture of multiple distributions. This is because the probability of the objects in C3 will be higher than that of some of the noise objects, such as o in Figure 12.4, due to the higher local density in C3.

To tackle the problem demonstrated in Example 12.12, we can assume that the normal data objects are generated by a normal distribution, or a mixture of normal distributions, whereas the outliers are generated by another distribution. Heuristically, we can add constraints on the distribution that is generating outliers. For example, it is reasonable to assume that this distribution has a larger variance if the outliers are distributed in a larger area. Technically, we can assign $\sigma_{\text{outlier}} = k\sigma$, where k is a user-specified parameter and σ is the standard deviation of the normal distribution generating the normal data. Again, the EM algorithm can be used to learn the parameters.

12.3.2 Nonparametric Methods

In nonparametric methods for outlier detection, the model of “normal data” is learned from the input data, rather than assuming one a priori. Nonparametric methods often make fewer assumptions about the data, and thus can be applicable in more scenarios.

Example 12.13

Outlier detection using a histogram

AllElectronics records the purchase amount for every customer transaction. Figure 12.5 uses a histogram (refer to Chapters 2 and 3) to graph these amounts as percentages of all transactions. For example, 60% of the transaction amounts fall between $0.00 and $1000.


Figure 12.5 Histogram of purchase amounts in transactions.

We can use the histogram as a nonparametric statistical model to capture outliers. For example, a transaction in the amount of $7500 can be regarded as an outlier because only a very small percentage of transactions have an amount higher than $5000. On the other hand, a transaction amount of $385 can be treated as normal because it falls into the bin (or bucket) holding 60% of the transactions.

As illustrated in the previous example, the histogram is a frequently used nonparametric statistical model that can be used to detect outliers. The procedure involves the following two steps.

Step 1: Histogram construction. In this step, we construct a histogram using the input data (training data). The histogram may be univariate as in Example 12.13, or multivariate if the input data are multidimensional.
Note that although nonparametric methods do not assume any a priori statistical model, they often do require user-specified parameters to learn models from data. For example, to construct a good histogram, a user has to specify the type of histogram (e.g., equal width or equal depth) and other parameters (e.g., the number of bins in the histogram or the size of each bin). Unlike parametric methods, these parameters do not specify types of data distribution (e.g., Gaussian).

Step 2: Outlier detection. To determine whether an object, o, is an outlier, we can check it against the histogram. In the simplest approach, if the object falls in one of the histogram’s bins, the object is regarded as normal. Otherwise, it is considered an outlier.
For a more sophisticated approach, we can use the histogram to assign an outlier score to the object. In Example 12.13, we can let an object's outlier score be the inverse of the fraction of transactions falling in the bin that contains the object. For example, the outlier score for a transaction amount of $7500 is the inverse of the very small fraction of transactions in its bin, whereas the score for a transaction amount of $385 is only 1/60% ≈ 1.67. The scores indicate that the transaction amount of $7500 is much more likely to be an outlier than that of $385.
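A minimal sketch of the two-step histogram method follows, using made-up transaction amounts rather than the AllElectronics data; the bin width and the score definition simply follow the discussion above.

```python
import numpy as np

rng = np.random.default_rng(1)
# Made-up transaction amounts: most are small, a handful are very large.
amounts = np.concatenate([rng.uniform(0, 1000, 600),
                          rng.uniform(1000, 5000, 390),
                          [7500.0, 385.0]])

# Step 1: construct an equal-width histogram from the data.
counts, edges = np.histogram(amounts, bins=10, range=(0, 10000))
freq = counts / counts.sum()

def outlier_score(x):
    """Inverse of the fraction of data in x's bin; empty bins score infinity."""
    i = min(np.searchsorted(edges, x, side='right') - 1, len(freq) - 1)
    return np.inf if freq[i] == 0 else 1.0 / freq[i]

# Step 2: score objects against the histogram.
print(outlier_score(385.0), outlier_score(7500.0))   # the $7500 score is far larger
```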

A drawback to using histograms as a nonparametric model for outlier detection is that it is hard to choose an appropriate bin size. On the one hand, if the bin size is set too small, many normal objects may end up in empty or rare bins, and thus be misidentified as outliers. This leads to a high false positive rate and low precision. On the other hand, if the bin size is set too high, outlier objects may infiltrate into some frequent bins and thus be “disguised” as normal. This leads to a high false negative rate and low recall.

To overcome this problem, we can adopt kernel density estimation to estimate the probability density distribution of the data. We treat an observed object as an indicator of high probability density in the surrounding region. The probability density at a point depends on the distances from this point to the observed objects. We use a kernel function to model the influence of a sample point within its neighborhood. A kernel K() is a non-negative real-valued integrable function that satisfies the following two conditions:

■ $\int_{-\infty}^{+\infty} K(u)\, du = 1$.

■ $K(-u) = K(u)$ for all values of u.

A frequently used kernel is a standard Gaussian function with mean 0 and variance 1:

$$K(u) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{u^2}{2}} \qquad (12.8)$$

Let $x_1, x_2, \ldots, x_n$ be an independent and identically distributed sample drawn from a distribution with unknown density function f. The kernel density approximation of the probability density function is

$$\hat{f}_h(x) = \frac{1}{n}\sum_{i=1}^{n} K_h(x - x_i) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right) \qquad (12.9)$$

where K() is a kernel and h is the bandwidth serving as a smoothing parameter.

Once the probability density function of a data set is approximated through kernel density estimation, we can use the estimated density function $\hat{f}$ to detect outliers. For an object, o, $\hat{f}(o)$ gives the estimated probability density at o under the stochastic process that generated the data. If $\hat{f}(o)$ is high, then the object is likely normal. Otherwise, o is likely an outlier. This step is often similar to the corresponding step in parametric methods.
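The fragment below sketches this approach with scikit-learn's KernelDensity estimator (assumed to be available); the bandwidth of 0.5 and the 1% cutoff are arbitrary illustrative choices, and the data are synthetic.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(2)
# Two well-populated regions plus one far-away value at 12.0.
X = np.concatenate([rng.normal(0.0, 1.0, 300),
                    rng.normal(5.0, 0.5, 300),
                    [12.0]]).reshape(-1, 1)

# Gaussian-kernel density estimate (Eqs. 12.8-12.9); h is the bandwidth.
kde = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(X)

# score_samples returns log f_hat(o); very low values indicate outliers.
log_density = kde.score_samples(X)
threshold = np.percentile(log_density, 1)
print(X[log_density < threshold].ravel())     # includes the value 12.0
```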

In summary, statistical methods for outlier detection learn models from data to distinguish normal data objects from outliers. An advantage of using statistical methods is that the outlier detection may be statistically justifiable. Of course, this is true only if the statistical assumption made about the underlying data meets the constraints in reality.

The data distribution of high-dimensional data is often complicated and hard to fully understand. Consequently, statistical methods for outlier detection on high-dimensional data remain a big challenge. Outlier detection for high-dimensional data is further addressed in Section 12.8.

The computational cost of statistical methods depends on the models. When simple parametric models are used (e.g., a Gaussian), fitting the parameters typically takes linear time. When more sophisticated models are used (e.g., mixture models, where the EM algorithm is used in learning), approximating the best parameter values often takes several iterations. Each iteration, however, is typically linear with respect to the data set’s size. For kernel density estimation, the model learning cost can be up to quadratic. Once the model is learned, the outlier detection cost is often very small per object.
