Chapter 12

Other Classification Algorithms

The problem of classification can be approached in many different ways. The support vector machine essentially solves a geometric problem of separating points in feature space. Probability never enters into the design or training of the SVM; probabilities are only addressed when we evaluate the classifier by testing it on new, unseen objects.

Many of the early methods from statistical steganalysis are designed as hypothesis tests and follow the framework of sampling theory within statistics. A null hypothesis H0 is formed in such a way that we have a known probability distribution for image data assuming that H0 is true. For instance, ‘image I is a clean image’ would be a possible null hypothesis. The principle of sampling theory is that the experiment is defined before looking at the observations, and we control the risk of falsely rejecting H0 (false positives); this gives precise control of the false positive rate. The drawback of this approach is that the resulting p-value is completely unrelated to the probability that H0 is true, and thus the test does not give us any quantitative information about that probability.

The other main school of statistics, alongside sampling theory, is known as Bayesian statistics or Bayesian inference. In terms of classification, Bayesian inference aims to determine the probability of a given image I belonging to a given class C a posteriori, that is, after considering the observed image I.

It is debatable whether SVM and statistical methods are learning algorithms. Some authors restrict the concept of machine learning to iterative and often distributed methods, such as neural networks. Starting with a random model, these algorithms inspect one training object at a time, adjusting the model for every observation. Very often the algorithm has to pass through the training set multiple times before the model converges. Such systems learn gradually over time, and in some cases they can continue to learn even after deployment; at the very least, new training objects can be introduced without discarding previous learning. The SVM algorithm, in contrast, starts with no model and considers the entire training set at once to calculate a model. We did not find room to cover neural networks in this book, but interested readers can find accessible introductions in books such as Negnevitsky (2005) or Luger (2008).

In this chapter, we will look at an assortment of classifier algorithms. Bayesian classifiers, developed directly from Bayesian statistics, fit well into the framework already considered, of supervised learning and classification. Regression analysis is also a fundamental statistical technique, but instead of predicting a discrete class label, it predicts a continuous dependent variable, such as the length of the embedded message. Finally, we will consider unsupervised learning, where the machine is trained without knowledge of the true class labels.

12.1 Bayesian Classifiers

We consider an observed image I, and two possible events: C_S, that I is a steganogram, and C_C, that I is a clean image. As before, we are able to observe some feature vector F(I) = (F_1(I), … , F_N(I)). The goal is to estimate the a posteriori probability P(C | F(I)) for C = C_C, C_S. This can be written, using Bayes' law, as

(12.1)   P(C | F(I)) = P(F(I) | C) · P(C) / P(F(I)).

The a priori probability P(C_S) is the probability that I is a steganogram, as well as we can gauge it, before we have actually looked at I or calculated the features F(I). The a posteriori probability P(C_S | F(I)), similarly, is the probability after we have looked at I. The final two elements, P(F(I) | C) and P(F(I)), describe the probability distribution of the features over a single class and over all images respectively. Note that (12.1) is valid whether P(F(I) | C) is a PMF or a PDF.

The Bayesian classifier is as simple as estimating P(C | F(I)) using (12.1), and concluding that the image is a steganogram if P(C_S | F(I)) > P(C_C | F(I)), or equivalently if

(12.2)   P(F(I) | C_S) · P(C_S) > P(F(I) | C_C) · P(C_C).

This, however, is not necessarily simple. The probability distribution P(F(I) | C) must be estimated, and we will discuss some relevant methods below. The estimation of the a priori probability P(C_S) is more controversial; it depends heavily on the application scenario and is likely to be subjective.

The explicit dependence on P(C_S) is a good thing. It means that the classifier is tuned to the context. If we employ a steganalyser to monitor random images on the Internet, P(C_S) must be expected to be very small, and very strong evidence from F(I) is then required to conclude that the image is a steganogram. If the steganalyser is used where there is a particular reason to suspect a steganogram, P(C_S) will be larger, and less evidence is required to conclude that I is more likely a steganogram than not.

But how do we know what P(C_S) is for a random image? This will have to rely on intuition and guesswork. It is customary to assume a uniform distribution, P(C_S) = P(C_C) = 1/2, if no information is available, but in steganalysis we will often find ourselves in the position where we know that P(C_S) = 1/2 is far off, yet have no inkling as to which much smaller value is closer to the mark.
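To make the decision rule (12.2) concrete, the following minimal sketch compares the two weighted likelihoods for a one-dimensional feature. The class-conditional densities are stand-ins (two hypothetical normal distributions); in practice they would be estimated from training data as discussed in Section 12.2, and the prior p_stego is the subjective guess discussed above.

from scipy.stats import norm

def bayes_decision(x, p_stego=0.5):
    # Assumed class-conditional densities P(F(I) | C_S) and P(F(I) | C_C);
    # in a real steganalyser these would be estimated from training features.
    p_x_given_stego = norm.pdf(x, loc=1.0, scale=1.0)
    p_x_given_clean = norm.pdf(x, loc=0.0, scale=1.0)
    # Predict 'stego' if the weighted likelihood for C_S is the larger one (12.2)
    return p_x_given_stego * p_stego > p_x_given_clean * (1 - p_stego)

print(bayes_decision(1.5, p_stego=0.5))    # True: equal priors
print(bayes_decision(1.5, p_stego=0.01))   # False: a small prior demands stronger evidence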

12.1.1 Classification Regions and Errors

The classification problem ultimately boils down to identifying disjoint decision regions R_S and R_C in feature space, where we can assume that I is a steganogram for all F(I) ∈ R_S and a clean image for all F(I) ∈ R_C. It is possible to allow points outside R_S ∪ R_C, which would mean that both class labels are equally likely for such an image I and we refuse to make a prediction. For simplicity we assume that R_S and R_C together cover the entire feature space, so that a prediction will be made for any image I.

In Figure 12.1 we show an example plot of P(x | C_C) P(C_C) and P(x | C_S) P(C_S) for a one-dimensional feature vector. The decision regions R_S and R_C will be subsets of the horizontal axis. The Bayes decision rule would define R_S and R_C so that every point x in the feature space is associated with the class corresponding to the higher curve. In the plot, that means one decision region is the segment to the left of the vertical line and the other is the segment to the right.

Figure 12.1 Feature probability distribution for different classes


The shaded area represents the classification error probability P_e, or Bayes' error, which can be defined as the integral

(12.3)   P_e = ∫_{R_C} P(x | C_S) P(C_S) dx + ∫_{R_S} P(x | C_C) P(C_C) dx.

If we move the boundary between the decision regions to the right, we will get fewer false positives, but more false negatives. Similarly, if we move it to the left we get fewer false negatives at the expense of false positives. Either way, the total error probability will increase. One of the beauties of the Bayes classifier is its optimality, in the sense that it minimises the probability P_e of classification error. This is not too hard to prove, as we shall see below.


Theorem 12.1.1
The Bayes classification rule minimises the probability P_e of classification error.

 


Proof
The decision rule in (12.2) defines two decision regions R_S and R_C, so that F(I) ∈ R_S when (12.2) is satisfied and F(I) ∈ R_C otherwise. To minimise the expression for P_e in (12.3), R_S and R_C must be chosen such that every point x appears in the integral with the smaller integrand, that is

x ∈ R_S if P(x | C_S) P(C_S) > P(x | C_C) P(C_C), and x ∈ R_C otherwise,

which is exactly the Bayes decision rule.

12.1.2 Misclassification Risk

Risk captures both the probability of a bad event and the degree of badness. When different types of misclassification carry different costs, risk can be a useful concept to capture all error types in one measure. False positives and false negatives are two different bad events which often carry different costs. We are familiar with the error probabilities p_FP and p_FN, and we now introduce the corresponding costs c_FP and c_FN.

The risk associated with an event is commonly defined as the product of the probability and cost. In particular, the risk of a false positive can be defined as

(12.4)   R_FP = c_FP · p_FP · P(C_C),

where C_C denotes the event that we intercept a clean image. Similarly, the risk of a false negative would be

(12.5)   R_FN = c_FN · p_FN · P(C_S).

The total risk of a classifier naturally is the sum of the risks associated with different events, i.e.

(12.6)   R = R_FP + R_FN = c_FP · p_FP · P(C_C) + c_FN · p_FN · P(C_S).

In this formulation, we have assumed for simplicity that there is zero cost (or benefit) associated with correct classifications. It is often useful to add costs or benefits c_TP and c_TN in the event of correct positive and negative classification, respectively. Often c_TP and c_TN will be negative.

It is important to note that the cost does not have to be a monetary value, and there are at least two reasons why it might not be. Sometimes non-monetary cost measures are used because it is too difficult to assign a monetary value. Other times one may need to avoid a monetary measure because it does not accurately model the utility or benefit perceived by the user. Utility is an economic concept used to measure subjective value to an agent, that is, Wendy in our case. One important observation is that utility does not always scale linearly in money. Quite commonly, individuals will think that losing £1000 is more than a hundred times worse than losing £10. Such a sub-linear relationship between money and utility makes the user risk averse: quite happy to accept an increased probability of losing a small amount if it also reduces the risk of a major loss.

In order to use risk as an evaluation heuristic for classifiers, we need to estimate c_FP and c_FN so that they measure the badness experienced by Wendy in the event of misclassification. The misclassification risk can be written as

images/c12_I0068.gif

assuming zero cost for correct classification.

We can define an a posteriori risk as well:

images/c12_I0069.gif

It is straightforward to rework the Bayesian decision rule to minimise the a posteriori risk instead of the error probability.
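For instance, assuming zero cost for correct classification as above, minimising the a posteriori risk amounts to predicting C_S whenever

c_FN · P(C_S | F(I)) > c_FP · P(C_C | F(I)),

and C_C otherwise; with c_FP = c_FN this reduces to the ordinary Bayes decision rule.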

12.1.3 The Naïve Bayes Classifier

The Bayes classifier depends on the probability distribution of feature vectors. For one-dimensional feature vectors, this is not much of a problem. Although density estimation can be generalised for arbitrary dimension in theory, it is rarely practical to do so. The problem is that accurate estimation in high dimensions requires a very large sample, and the requirement grows exponentially.

The naïve Bayes classifier is a way around this problem, by assuming that the features are statistically independent. In other words, we assume that

(12.7)   P(F(I) | C) = P(F_1(I) | C) · P(F_2(I) | C) ⋯ P(F_N(I) | C).

This allows us to simplify (12.1) to

(12.8)   P(C | F(I)) = P(C) · P(F_1(I) | C) ⋯ P(F_N(I) | C) / P(F(I)).

Thus, we can estimate the probability distribution of each feature separately, using one-dimensional estimation techniques.

Of course, there is a reason that the classifier is called naïve Bayes. The independence assumption (12.7) will very often be false. However, the classifier has proved very robust to violations of this assumption, and in practice it has been very effective on many real-world data sets reported in the pattern recognition literature.
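As an illustration, the following sketch implements a naïve Bayes classifier in which every feature is modelled by a one-dimensional normal distribution per class, following (12.8). The class and method names are ours, and the Gaussian assumption is just one convenient choice of one-dimensional model.

import numpy as np

class GaussianNaiveBayes(object):
    def fit(self, X_stego, X_clean, p_stego=0.5):
        # X_stego and X_clean are (n_samples, n_features) arrays of training features
        self.p_stego = p_stego
        self.mu_s, self.sd_s = X_stego.mean(axis=0), X_stego.std(axis=0)
        self.mu_c, self.sd_c = X_clean.mean(axis=0), X_clean.std(axis=0)
        return self
    def _loglik(self, x, mu, sd):
        # Sum of one-dimensional log densities = log of the product in (12.7)
        return np.sum(-0.5*np.log(2*np.pi*sd**2) - (x - mu)**2/(2*sd**2))
    def predict(self, x):
        log_s = self._loglik(x, self.mu_s, self.sd_s) + np.log(self.p_stego)
        log_c = self._loglik(x, self.mu_c, self.sd_c) + np.log(1 - self.p_stego)
        return 'stego' if log_s > log_c else 'clean'

Working with log probabilities avoids numerical underflow when the number of features is large; the common factor P(F(I)) in (12.8) cancels and need not be estimated.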

12.1.4 A Security Criterion

The Bayes classifier illustrates the fundamental criterion for classification to be possible. If the probability distribution of the feature vector is the same for both classes, that is, if

(12.9)   P(F(I) | C_S) = P(F(I) | C_C),

the Bayes decision rule will simply select the a priori most probable class, and predict the same class label regardless of the input. If there is no class skew, the output will be arbitrary.

The difference between the two conditional probabilities in (12.9) is a measure of security; the greater the difference, the more detectable is the steganographic embedding. A common measure of the difference between two probability distributions is the relative entropy.


Definition 12.1.2
The relative entropy (aka. Kullback–Leibler divergence) between two (discrete) probability distributions P and Q is defined as

D(P ‖ Q) = Σ_x P(x) log( P(x) / Q(x) ).

Note that the definition is only valid if Q(x) > 0 for all x such that P(x) > 0.

We have D(P ‖ Q) ≥ 0, with equality if and only if the two distributions are equal. The relative entropy is often thought of as a distance, but it is not a distance (or metric) in the mathematical sense. This can easily be seen by noting that D(P ‖ Q) is not in general equal to D(Q ‖ P).
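A small sketch of the relative entropy for discrete distributions given as probability arrays; the base-2 logarithm is assumed here, and any other base only changes the unit.

import numpy as np

def relative_entropy(P, Q):
    # D(P || Q); only valid where Q > 0 wherever P > 0
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    mask = P > 0                      # terms with P(x) = 0 contribute nothing
    return np.sum(P[mask] * np.log2(P[mask] / Q[mask]))

P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]
print(relative_entropy(P, Q))   # non-negative
print(relative_entropy(Q, P))   # generally different, so D is not a metric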

Cachin (1998) suggested using the relative entropy to quantify the security of a stego-system. Since security is an absolute property, it should not depend on any particular feature vector, and we have to consider the probability distribution of clean images and steganograms.


Definition 12.1.3 (Cachin, 1998)
Consider a stego-system, and let c be a cover image and s a steganogram. Both c and s are considered as stochastic variables drawn from some set of possible images, with probability distributions P_c and P_s respectively. We say that the stego-system is ε-secure against a passive warden if

D(P_c ‖ P_s) ≤ ε.

If ε = 0, the stego-system is perfectly secure.

 


Theorem 12.1.4
If a stego-system is ε-secure, any steganalytic attack with false positive probability α and false negative probability β will satisfy

(12.10)   ε ≥ α log( α / (1 − β) ) + (1 − α) log( (1 − α) / β ).


 


Proof
The steganalysis attack can be considered as a mapping of one random variable, namely the image, into a Boolean stochastic variable which is either ‘stego’ or ‘clean’. A standard result in information theory says that such deterministic processing cannot increase the relative entropy. Hence

D(P′_c ‖ P′_s) ≤ D(P_c ‖ P_s) ≤ ε,

where P′_c and P′_s are the probability distributions of the predicted class label for clean images and steganograms respectively.
The right-hand side of (12.10) is just the binary relative entropy D(P′_c ‖ P′_s), sometimes denoted D(X ‖ Y), where X and Y are binary stochastic variables with probability distributions (α, 1 − α) and (1 − β, β) respectively.
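The bound in Theorem 12.1.4 is easy to evaluate numerically. The sketch below computes the binary relative entropy between the detector's output distributions; base-2 logarithms are assumed, and the numbers are purely illustrative.

import numpy as np

def binary_relative_entropy(alpha, beta):
    # The output is 'stego' with probability alpha for clean images and
    # 1 - beta for steganograms; this is the right-hand side of (12.10).
    p, q = alpha, 1.0 - beta
    term1 = p * np.log2(p / q) if p > 0 else 0.0
    term2 = (1 - p) * np.log2((1 - p) / (1 - q)) if p < 1 else 0.0
    return term1 + term2

# A detector with 5% false positives and 20% false negatives can only exist
# against a stego-system whose epsilon is at least this value:
print(binary_relative_entropy(0.05, 0.20))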

Cachin's security criterion highlights a core point in steganography and steganalysis. Detectability is directly related to the difference between the probability distributions of clean images and steganograms. This difference can be measured in different ways; as an alternative to the Kullback–Leibler divergence, Filler and Fridrich (2009a,b) used Fisher information. In theory, Alice's problem can be described as the design of a cipher or stego-system which mimics the distribution of clean images, and Wendy's problem is elementary hypothesis testing.

In practice, the information-theoretic research has hardly led to better steganalysis or steganography. The problem is that we are currently not able to model the probability distributions of images. If Alice cannot model the distribution of clean images, she cannot design an ε-secure stego-system. On the other hand, if Wendy cannot model it either, Alice probably does not need a theoretically ε-secure stego-system.

12.2 Estimating Probability Distributions

There are at least three major approaches to estimating probability density functions based on empirical data. Very often, we can make reasonable assumptions about the shape of the PDF, and we only need to estimate parameters. For instance, we might assume the distribution is normal, in which case we will only have to estimate the mean μ and variance σ² in order to get an estimated probability distribution. This is easily done using the sample mean and sample variance.

If we do not have any assumed model for the PDF, there are two approaches to estimating the distribution. The simplest approach is to discretise the distribution by dividing the sample space into a finite number of disjoint bins and using the histogram. The alternative is to use a non-parametric, continuous model. The two most well-known non-parametric methods are kernel density estimators, which we will discuss below, and k nearest neighbours.

12.2.1 The Histogram

Let X be a stochastic variable drawn from some probability distribution on some continuous space, which we divide into disjoint bins B_1, B_2, …. We can discretise the distribution by defining a discrete random variable X̂ as

X̂ = i   when X ∈ B_i.

The probability mass function of X̂ is easily estimated by the histogram h(i) of the observations of X.

We can view the histogram as a density estimate, by assuming a PDF f̂(x) which is constant within each bin:

f̂(x) = h(i) / (n · w_i)   for x ∈ B_i,

where n is the number of observations and w_i is the width (or volume) of the bin B_i. If the sample size n is sufficiently large, and the bin widths are wisely chosen, then f̂ will indeed be a good approximation of the true PDF of X. The challenge is to select a good bin width. If the bins are too wide, any high-frequency variation in the true PDF will be lost. If the bins are too narrow, each bin will have very few observations, and f̂ will be dominated by noise caused by the randomness of observations. In the extreme case, most bins will have zero or one element, with many empty bins. The result is that f̂ is mainly flat, with a spike wherever there happens to be an observation.

There is no gospel to give us the optimal bin size, and it very often boils down to trial and error with some subjective evaluation of the alternatives. However, some popular rules of thumb exist. Most popular is probably Sturges' rule, which recommends using 1 + log2 n bins, where n is the sample size. Interested readers can see Scott (2009) for an evaluation of this and other bin size rules.
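A minimal sketch of a histogram density estimate with Sturges' rule for the bin count; the sample S is synthetic and the normalisation follows the formula for f̂(x) above.

import numpy as np

S = np.random.randn(200)                      # synthetic one-dimensional sample
bins = int(np.ceil(1 + np.log2(len(S))))      # Sturges' rule
counts, edges = np.histogram(S, bins=bins)
widths = np.diff(edges)
pdf_hat = counts / (float(len(S)) * widths)   # piecewise-constant PDF estimate
print(np.sum(pdf_hat * widths))               # integrates to 1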

12.2.2 The Kernel Density Estimator

Finally, we shall look at one non-parametric method, namely the popular kernel density estimator (KDE). The KDE is sometimes called Parzen window estimation, and the modern formulation is very close to the one proposed by Parzen (1962). The principles and much of the theory predate Parzen; see for instance Rosenblatt (1956), who is sometimes co-credited with the invention of KDE.

The fundamental idea in KDE is that every observation x_i is evidence of some density in a certain neighbourhood around x_i. Thus, x_i should make a positive contribution to f̂(x) for x within this neighbourhood and a zero contribution for x outside it. The KDE adds together the contributions of all the observations x_1, …, x_n.

The contribution made by each observation x_i can be modelled as a kernel function K, so that the contribution to the estimate at a point x is

K(x − x_i).

Summing over all the observations, the KDE estimate of the PDF is defined as

(12.11)   f̂(x) = (1/n) Σ_{i=1}^{n} K(x − x_i).

To be a valid PDF, f̂ has to integrate to 1, and to ensure this, we require that the kernel also integrates to one, that is

∫ K(x) dx = 1.

It is also customary to require that the kernel is symmetric around the origin, that is K(−x) = K(x). It is easy to verify that if K is non-negative and integrates to 1, then so does f̂ as defined in (12.11). Hence, f̂ is a valid probability density function. Note that the kernel functions used here are not necessarily compatible with the kernels used for the kernel trick in Chapter 11, although they may sometimes be based on the same elementary functions, such as the Gaussian.

The simplest kernel is a hypercube, also known as a uniform kernel, defined as

K(x) = 1 if |x_j| ≤ 1/2 for all j = 1, …, d, and K(x) = 0 otherwise,

where d is the dimensionality of x. The uniform kernel, which assigns a constant contribution to the PDF within the neighbourhood, will make a PDF looking like a staircase. To make a smoother PDF estimate, one would use a kernel function which is diminishing in the distance from the origin. Alongside the uniform kernel, the Gaussian kernel is the most commonly used. It is given by

K(x) = (2π)^(−d/2) · exp( −‖x‖² / 2 ),

and will give a much smoother PDF estimate.

The kernel density estimate is sensitive to scaling. This is easy to illustrate with the uniform kernel. If the sample space is scaled so that the variance of the observations x_i increases, then fewer observations will fall within the neighbourhood defined by the kernel. We would get the same effect as with a histogram with a very small bin width. Most bins or neighbourhoods are empty, and we get a short spike in f̂(x) wherever there is an observation. In the opposite extreme, where the observations have very low variance, the PDF estimate would have absolutely no high-frequency information. In order to get a realistic estimate we thus need to scale either the kernel or the data in a way similar to tuning the bin width of a histogram.

The KDE analogue of the bin width is known as the bandwidth. The scaling can be incorporated into the kernel function, by introducing the so-called scaled kernel

K_h(x) = (1/h^d) · K(x/h),

where the bandwidth h scales the argument to K. It is easily checked that if K integrates to 1, then so does K_h, thus K_h can replace K in the KDE definition (12.11). In the previous discussion, we related the scaling to the variance. It follows that the bandwidth should be proportional to the standard deviation so that the argument x/h to the kernel function has fixed variance.


Code Example 12.1 : Code example to plot the KDE estimate with Scott's bandwidth as shown in Figure 12.2
from scipy.stats import gaussian_kde as gkde
import numpy as np
import matplotlib.pyplot as plt
S = np.random.randn(100)                 # 100 points from a standard normal
X = [ i*0.1 for i in range(-40,40) ]     # evaluation grid on [-4, 4)
kde = gkde(S)    # Instantiate estimator (Scott's bandwidth by default)
Y = kde(X)       # Evaluate estimated PDF
plt.plot( X, Y, "k-." )
plt.show()

The effect of changing bandwidth is illustrated in Figure 12.2. Scott's rule dictates a bandwidth of

h = σ̂ · n^(−1/(d+4)),

where σ̂ is the sample standard deviation. We can see that this gives a rather good estimate in the figure. If the bandwidth is reduced to a quarter, we see significant noise, and if it is four times wider, then the estimated PDF becomes far too flat. Code Example 12.1 shows how a KDE can be computed using the standard scipy library in Python. Only the Gaussian kernel is supported, and unfortunately, the bandwidth can only be changed by modifying the source code.

Figure 12.2 Examples of Gaussian KDE estimates from 100 pseudo-random points from a standard normal distribution


In the multivariate case, x ∈ ℝ^d, there is no reason to expect the same bandwidth to be optimal in all dimensions. To scale the different axes separately, we use a bandwidth matrix H instead of the scalar bandwidth h, and the scaled kernel can be written as

(12.12)   K_H(x) = |H|^(−1/2) · K( H^(−1/2) x ),

where |H| denotes the determinant of H. For instance, the scaled Gaussian kernel can be written as

K_H(x) = (2π)^(−d/2) · |H|^(−1/2) · exp( −‖H^(−1/2) x‖² / 2 ).

In the scalar case, we argued that the bandwidth should be proportional to the standard deviation. The natural multivariate extension is to choose H proportional to the sample covariance matrix Σ̂, i.e.

H = c² · Σ̂,

for some scaling factor c. When we use the Gaussian kernel, it turns out that it suffices to calculate H^(−1); we do not need H^(−1/2) directly. Because H is symmetric, we can rewrite the scaled Gaussian kernel as

K_H(x) = (2π)^(−d/2) · |H|^(−1/2) · exp( −x^T H^(−1) x / 2 ).

Two common choices for the scalar factor c are

c = n^(−1/(d+4))   (Scott's factor)   and   c = ( n(d+2)/4 )^(−1/(d+4))   (Silverman's factor).

As we can see, the multivariate KDE is straightforward to calculate for arbitrary dimension d in principle. In practice, however, one can rarely get good PDF estimates in high dimension, because the sample set becomes too sparse. This phenomenon is known as the curse of dimensionality, and affects all areas of statistical modelling and machine learning. We will discuss it further in Chapter 13.


Code Example 12.2 : Implementation of KDE. An object is instantiated with a data sample (training set) and optionally a bandwidth matrix. The default bandwidth matrix uses the covariance matrix and Scott's factor. The evaluate() method is used to evaluate the estimated PDF at arbitrary points
from scipy import linalg
import numpy as np
class KDE(object):
    def __init__(self, dataset, bandwidth=None):
        self.dataset = np.atleast_2d(dataset)
        self.d, self.n = self.dataset.shape
        self.setBandwidth( bandwidth )
    def evaluate(self, points):
        d, m = points.shape
        result = np.zeros((m,), points.dtype)
        for i in range(m):
            D = self.dataset - points[:,i,np.newaxis]   # differences x_j - x
            T = np.dot( self.inv_bandwidth, D )
            E = np.sum( D*T, axis=0 ) / 2.0             # Gaussian exponents
            result[i] = np.sum( np.exp(-E), axis=0 )
        norm = np.sqrt(linalg.det(2*np.pi*self.bandwidth)) * self.n
        return result / norm
    def setBandwidth(self,bandwidth=None):
        if bandwidth is not None: self.bandwidth = bandwidth
        else:
            factor = np.power(self.n, -1.0/(self.d+4)) # Scott's factor
            covariance = np.atleast_2d(
                       np.cov(self.dataset, rowvar=1, bias=False) )
            self.bandwidth = covariance * factor * factor
        self.inv_bandwidth = linalg.inv(self.bandwidth)

An implementation of the KDE is shown in Code Example 12.2. Unlike the standard implementation in scipy.stats, this version allows us to specify a custom bandwidth matrix, a feature which will be useful in later chapters. The evaluation considers one evaluation point x at a time. Matrix algebra allows us to process every term of the sum in (12.11) with a single matrix operation. Thus, D and T have one column per term. The vector E, with one element per term, contains the exponents for the Gaussian, and norm is the scalar factor n·√(det(2πH)).

12.3 Multivariate Regression Analysis

Regression denotes the problem of predicting one stochastic variable Y, called the response variable, by means of observations of some other stochastic variable X, called the predictor variable or independent variable. If X is a vector variable, for instance a feature vector, we talk about multivariate regression analysis. Where classification aims to predict a class label from some discrete set, regression predicts a continuous variable Y.

Let Y ∈ ℝ and X ∈ ℝ^m. The underlying assumption for regression is that Y is approximately determined by X, that is, there is some function f so that

(12.13)   Y = f(X) + ε,

where ε is some small, random variable. A regression method is designed to calculate a function f̂ which estimates f. Thus, f̂(x) should be an estimator for Y, given an observation x of X.

Multivariate regression can apply to steganalysis in different ways. Let the predictor variable X be a feature vector from some object which may be a steganogram. The response variable Y is some unobservable property of the object. If we let Y be the length of the embedded message, regression can be used for quantitative steganalysis. Pevný et al. (2009b) tested this approach using support vector regression. Regression can also be used for binary classification, by letting Y be a numeric class label, such as ±1. Avcibaş et al. (2003) used this technique for basic steganalysis.

12.3.1 Linear Regression

The simplest regression methods use a linear regression function f̂, i.e.

(12.14)   f̂(X) = β_0 X_0 + β_1 X_1 + β_2 X_2 + ⋯ + β_m X_m,

where we have fixed X_0 = 1 for the sake of brevity. Obviously, we need to choose the β_i to get the best possible prediction, but what do we mean by best possible?

We have already seen linear regression in action, as it was used to calculate the prediction error image for Farid's features in Section 7.3.2. The assumption was that each wavelet coefficient can be predicted using a function f̂ of other coefficients, including adjacent coefficients, corresponding coefficients from adjacent wavelet components and corresponding coefficients from the other colour channels. In the regression formula, Y denotes the coefficients of the image being tested, f̂(X) is the prediction image and Y − f̂(X) the prediction error image.

The most common criterion for optimising the regression function is the method of least squares. It aims to minimise the squared error ε² in (12.13). The calculation of the regression function can be described as a training phase. Although the term is not common in statistics, the principle is the same as in machine learning. The input to the regression algorithm is a sample of observations y_1, …, y_n of the response variable Y, and corresponding observations x_1, …, x_n of the predictor variable. The assumption is that y_i depends on x_i.

The linear predictor can be written in matrix form as

ŷ = x β,

where the row vector x = (x_0, x_1, …, x_m) includes the observation x_0 = 1 of the deterministic dummy variable X_0, and β = (β_0, β_1, …, β_m)^T is the column vector of coefficients. Let y be a column vector containing all the observed response variables, and let X be an n × (m + 1) matrix containing the observed feature vectors as rows. The prediction errors on the training set can be calculated simultaneously using a matrix equation:

e = y − X β.

Thus the least-squares regression problem boils down to solving the optimisation problem

min_β J(β),

where

J(β) = (y − X β)^T (y − X β) = y^T y − 2 β^T X^T y + β^T X^T X β.

The minimum is the point where the derivative vanishes. Thus we need to solve

dJ/dβ = −2 X^T y + 2 X^T X β = 0.

Hence, we get the solution given as

β = (X^T X)^(−1) X^T y,

and the regression function (12.14) reads

f̂(x) = x β = x (X^T X)^(−1) X^T y.

This is the same formula we used for the linear predictor of Farid's features in Chapter 7.
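A minimal numpy sketch of the least-squares solution above; the data are synthetic, and numpy.linalg.lstsq is used instead of forming (X^T X)^(−1) explicitly, which is numerically preferable but computes the same β.

import numpy as np

n, m = 200, 3
X = np.random.randn(n, m)                            # synthetic predictor observations
y = 1.5 + X.dot([2.0, -1.0, 0.5]) + 0.1*np.random.randn(n)

X1 = np.hstack([np.ones((n, 1)), X])                 # prepend the dummy variable X_0 = 1
beta = np.linalg.lstsq(X1, y, rcond=None)[0]         # least-squares estimate of beta
print(beta)                                          # approx. [1.5, 2.0, -1.0, 0.5]
y_hat = X1.dot(beta)                                 # predictions on the training set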

12.3.2 Support Vector Regression

There is also a regression algorithm based on support vector machines. Schölkopf et al. (2000) even argue that support vector regression (SVR) is simpler than the classification algorithm, and that it is the natural first stage.

Note that the usual SVM algorithm results in a function of the form f(x) = Σ_i α_i y_i K(x_i, x) + b, which may look like a regression function. The only problem with the standard SVM is that it is not optimised for regression, only for classification. When we formulate support vector regression, the function will have the same structure:

(12.15)   f(x) = Σ_i α_i K(x_i, x) + b,

but the coefficients α_i are tuned differently for classification and for regression.

A crucial property of SVM is that a relatively small number of support vectors are needed to describe the boundary hyperplanes. In order to carry this sparsity property over to regression, Vapnik (2000) introduced the ε-insensitive loss function:

(12.16)   |y − f(x)|_ε = max( 0, |y − f(x)| − ε ).

In other words, prediction errors |y − f(x)| less than ε are accepted and ignored. The threshold ε is a free variable which can be optimised.

This leads to a new optimisation problem, which is similar to those we saw in Chapter 11:

minimise   (1/2) ‖w‖² + C ( ν ε + (1/n) Σ_i (ξ_i + ξ*_i) )
subject to   (w · x_i + b) − y_i ≤ ε + ξ_i,
             y_i − (w · x_i + b) ≤ ε + ξ*_i,
             ξ_i, ξ*_i ≥ 0,   ε ≥ 0.

The constants C and ν are design parameters which are used to control the trade-off between minimising the slack variables ξ_i, ξ*_i, the model complexity ‖w‖² and the sensitivity threshold ε. The constraints simply say that the regression error should be bounded by ε plus some slack, and that the threshold and slack variables should be non-negative.

Using the standard techniques as with other SVM methods, we end up with a dual and simplified problem as follows:

maximise   Σ_i (α*_i − α_i) y_i − (1/2) Σ_i Σ_j (α*_i − α_i)(α*_j − α_j) K(x_i, x_j),

subject to

Σ_i (α*_i − α_i) = 0,   0 ≤ α_i, α*_i ≤ C/n,   Σ_i (α_i + α*_i) ≤ C ν.

The solutions α_i and α*_i are used to form the regression function

f(x) = Σ_i (α*_i − α_i) K(x_i, x) + b.

The most significant difference between this regression function and that from least-squares linear regression is of course that the kernel trick makes it non-linear. Hence much more complex dependencies can be modelled. There is support for SVM regression in libSVM.
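As a hedged usage sketch, scikit-learn's NuSVR class (a wrapper around libSVM) exposes roughly the ν-formulation above; the parameter names nu and C are scikit-learn's, the data are synthetic, and in quantitative steganalysis y would be, for instance, the embedded message length.

import numpy as np
from sklearn.svm import NuSVR

X = np.random.randn(200, 5)                          # feature vectors as rows
y = X[:, 0] - 2*X[:, 1] + 0.1*np.random.randn(200)   # stand-in response variable

svr = NuSVR(kernel='rbf', C=1.0, nu=0.5)             # the RBF kernel makes f non-linear
svr.fit(X, y)
y_hat = svr.predict(X[:10])                          # predicted responses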

12.4 Unsupervised Learning

The classifiers we have discussed so far employ supervised learning. That is, the input to the training algorithm is feature vectors with class labels, and the model is fitted to the known classes. In unsupervised learning, no labels are available to the training algorithm. When we use unsupervised learning for classification, the system will have to identify its own classes, or clusters, of objects which are similar. This is known as cluster analysis or clustering. Similarity often refers to Euclidean distance in feature space, but other similarity criteria could be used. This may be useful when no classification is known, and we want to see if there are interesting patterns to be recognised.

In general, supervised learning can be defined as learning in the presence of some error or reward function. The training algorithm receives feedback based on what the correct solution should be, and it adjusts the learning accordingly. In unsupervised learning, there is no such feedback, and the learning is a function of the data alone.

The practical difference between a ‘cluster’ as defined in cluster analysis and a ‘class’ as considered in classification is that a class is strictly defined whereas a cluster is more loosely defined. For instance, an image is either clean or a steganogram, and even though we cannot necessarily tell the difference by inspecting the image, the class is strictly determined by how the image was created. Clusters are defined only through the appearance of the object, and very often there will be no objective criteria by which to determine cluster membership. Otherwise clustering can be thought of, and is often used, as a method of classification.

We will only give one example of a statistical clustering algorithm. Readers who are interested in more hard-core learning algorithms should look into the self-organising maps of Kohonen (1990).

12.4.1 K-means Clustering

The K-means clustering algorithm is a standard textbook example of clustering, and it is fairly easy to understand and implement. The symbol K just denotes the number of clusters. A sample of N objects, or feature vectors, x_1, …, x_N is partitioned into K clusters C_1, …, C_K.

Each cluster C_i is identified by its mean μ_i, and each object x is assigned to the cluster corresponding to the closest mean, that is, to minimise ‖x − μ_i‖. The solution is found iteratively. When the means are calculated, the cluster assignments may have to change to keep the distances minimised, and when the cluster assignments change, the means will suddenly be wrong and require updating. Thus we alternate between updating the means and the cluster assignments until the solution converges. To summarise, we get the following algorithm:

1. Choose K random points μ_1, …, μ_K as initial cluster means.
2. Form K disjoint sets C_1, …, C_K, so that x ∈ C_i if i minimises ‖x − μ_i‖.
3. Update each μ_i to be the sample mean of C_i, i.e.

μ_i = (1/|C_i|) Σ_{x ∈ C_i} x.

4. Repeat from Step 2 until the solution converges.

It can be shown that the solution always converges.


Code Example 12.3 : K-means clustering in Python; the input X is an N × m array with the N feature vectors as rows. The output is a pair (C, M), where C is an integer array whose jth element is the index of the cluster containing x_j, and M is a K × m array whose ith row is the mean of cluster i
import numpy as np

def kmeans(X, K=2, eps=0.1):
    (N, m) = X.shape
    M = np.random.randn(K, m)           # initial cluster means, chosen randomly
    while True:
        # Squared distance from every point to every cluster mean (one row per cluster)
        L = [ np.sum( (X - M[[i],:])**2, axis=1 ) for i in range(K) ]
        A = np.vstack( L )
        C = np.argmin( A, axis=0 )      # index of the closest mean for each point
        # Recalculate each cluster mean from its assigned points
        L = [ np.mean( X[(C==i),:], axis=0 ) for i in range(K) ]
        M1 = np.vstack( L )
        if np.sum((M-M1)**2) < eps: return (C, M1)
        else: M = M1

Code Example 12.3 shows how this can be done in Python. The example is designed to take advantage of efficient matrix algebra in numpy, at the expense of readability. The rows of the input X are the feature vectors. Each row of M is the mean of one cluster, and is initialised randomly. The matrix A is K × N, where the (i, j) element is the squared distance ‖x_j − μ_i‖². Then C is calculated so that the jth element is the index of the cluster containing x_j, which corresponds to the row index of the minimal element in each column of A. Finally, M1 is calculated as the new sample mean of each cluster, to replace M. If the squared Euclidean distance between M and M1 is smaller than a preset threshold we return; otherwise we reiterate.

Criticism and Drawbacks

The K-means algorithm is very intuitive and relatively fast. One may say it is too simplistic, in that each cluster is defined only by its sample mean, without any notion of shape, density or volume. This can have some odd effects when the cluster shapes are far from spherical. Figure 12.3, for example, shows two elongated clusters, and human observers would probably focus on the gap between the upper and lower clusters. The K-means algorithm, implicitly looking for spherical clusters, instead divides the data into a left-hand and a right-hand cluster, each containing about half of each of the elongated clusters. Similar problems can occur if the clusters have very different sizes.

Figure 12.3 Elongated clusters with K-means. Two sets, indicated by squares and circles, were generated using a Gaussian distribution with different variances and means. The colours, black and white, indicate the clustering found by K-means for K = 2


Another question concerns the distance function, which is somewhat arbitrary. There is no objective reason to choose any one distance function over another. Euclidean distance is common, because it is the most well-known distance measure. Other distance functions may give very different solutions. For example, we can apply the kernel trick, and calculate the distance as

(12.17)   d(x, y)² = k(x, x) − 2 k(x, y) + k(y, y),

for some kernel function k.
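When the kernel trick is used with K-means, the cluster mean has no explicit representation in the kernel-induced feature space, but the squared distance from a point to that implicit mean can still be computed from kernel evaluations alone. A hedged sketch of this computation, with an RBF kernel and function names of our own choosing:

import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    return np.exp(-gamma * np.sum((np.asarray(a) - np.asarray(b))**2))

def kernel_distance2(x, cluster, k=rbf_kernel):
    # Squared distance from x to the implicit mean of the cluster:
    # k(x,x) - (2/n) sum_j k(x,z_j) + (1/n^2) sum_j sum_l k(z_j,z_l)
    term1 = k(x, x)
    term2 = np.mean([k(x, z) for z in cluster])
    term3 = np.mean([k(z, w) for z in cluster for w in cluster])
    return term1 - 2*term2 + term3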

12.5 Summary

The fundamental classification problem on which we focused until the start of this chapter forms part of a wider discipline of pattern recognition. The problem has been studied in statistics since long before computerised learning became possible. Statistical methods, such as Bayesian classifiers in this chapter and Fisher linear discriminants in the previous one, tend to be faster than more modern algorithms like SVM.

We have seen examples of the wider area of pattern recognition, in unsupervised learning and in regression. Regression is known from applications in quantitative steganalysis, and unsupervised learning was used as an auxiliary function as part of the one-class steganalyser with multiple hyperspheres of Lyu and Farid (2004) (see Section 11.5.3). Still, we have just scratched the surface.
