Chapter 13

Feature Selection and Evaluation

Selecting the right features for classification is a major task in all areas of pattern recognition and machine learning, and a very difficult one. In practice, adding a new feature to an existing feature vector may increase or decrease performance depending on the features already present. The search for the optimal feature vector is an NP-complete problem. In this chapter, we will discuss some common techniques that can be adopted with relative ease.

13.1 Overfitting and Underfitting

In order to get optimal classification accuracy, the model must have just the right level of complexity. Model complexity is determined by many factors, one of which is the dimensionality of the feature space. The more features we use, the more degrees of freedom we have to fit the model, and the more complex it becomes.

To understand what happens when a model is too complex or too simple, it is useful to study both the training error rate images/c13_I0001.gif and the testing error rate images/c13_I0002.gif. We are familiar with the testing error rate from Chapter 10, where we defined the accuracy as images/c13_I0003.gif, while the training error rate images/c13_I0004.gif is obtained by testing the classifier on the training set. Obviously, it is only the testing error rate which gives us any information about the performance on unknown data.

If we start with a simple model with few features and then extend it by adding new features, we will typically see that images/c13_I0005.gif decreases. The more features, the better we can fit the model to the training data, and given enough features and thus degrees of freedom, we can fit it perfectly to get images/c13_I0006.gif. This is illustrated in Figure 13.1, where we see the monotonically decreasing training error rate.

Figure 13.1 Training error rate and testing error rate for varying dimensionality (for illustration)


Now consider images/c13_I0007.gif in the same way. In the beginning we would tend to have images/c13_I0008.gif, where both decrease with increasing complexity. This situation is known as underfitting. The model is too simple to fit the data very well, and therefore we get poor classification on the training and test set alike. After a certain point, images/c13_I0009.gif will tend to rise while images/c13_I0010.gif continues to decline. It is natural that images/c13_I0011.gif declines as long as degrees of freedom are added. When images/c13_I0012.gif starts to increase, so that images/c13_I0013.gif, it means that the model no longer generalises for unknown data. This situation is known as overfitting, where the model has become too complex. Overfitting can be seen as support for Occam's Razor, the principle that, other things being equal, the simpler theory is to be preferred over the more complex one. An overfitted solution is inferior because of its complexity.

It may be counter-intuitive that additional features can sometimes give a decrease in performance. After all, more features means more information, and thus, in theory, the classifier should be able to make a better decision. This phenomenon is known as the curse of dimensionality. High-dimensional feature vectors give the classifier many degrees of freedom, so that the model can be fitted perfectly to the training data. Thus, the model will not only capture statistical properties of the different classes, but also random noise elements which are peculiar to each individual object. Such a model generalises very poorly for unseen objects, leading to the discrepancy images/c13_I0014.gif. Another way to view the curse of dimensionality is to observe that the distance between points in space increases as the dimension of the space increases. Thus the observations we make become more scattered and it seems reasonable that less can be learnt from them.

There are other causes of overfitting, besides the curse of dimensionality. In the case of SVM, for instance, tuning the parameter images/c13_I0015.gif (or images/c13_I0016.gif) is critical, as it controls the weighting of the two goals of minimising the (training) errors and maximising the margin. If images/c13_I0017.gif is too large, the algorithm will prefer a very narrow margin to avoid errors, and a few outliers which are atypical for the distribution can have an undue impact on the classifier. This would push images/c13_I0018.gif up and images/c13_I0019.gif down, indicating another example of overfitting. Conversely, if images/c13_I0020.gif is too small, the importance of errors is discounted, and the classifier will accept a fairly high images/c13_I0021.gif giving evidence of underfitting. Thus, the problems with under- and overfitting make up one of the motivations behind the grid search and cross-validation that we introduced in Section 3.3.2.

13.1.1 Feature Selection and Feature Extraction

In order to avoid the curse of dimensionality, it is often necessary to reduce the dimension of the feature space. There are essentially two approaches to this, namely feature selection and feature extraction.

Feature extraction is the most general approach, aiming to map feature vectors in images/c13_I0022.gif into some smaller space images/c13_I0023.gif (images/c13_I0024.gif). Very often, almost all of the sample feature vectors are approximately contained in some images/c13_I0025.gif-dimensional subspace of images/c13_I0026.gif. If this is the case, we can project all the feature vectors into images/c13_I0027.gif without losing any structure, except for the odd outlier. Even though the mapping images/c13_I0028.gif may in general sacrifice information, when the sample feature vectors really span a space of dimension greater than images/c13_I0029.gif, the resulting feature vectors in images/c13_I0030.gif may still give superior classification by avoiding overfitting. There are many available techniques for feature extraction, with principal component analysis (PCA) being one of the most common examples. In steganalysis, Xuan et al. (2005b) used a feature extraction method of Guorong et al. (1996), based on the Bhattacharyya distance. The details are beyond the scope of this book, and we will focus on feature selection in the sequel.
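The details of feature extraction are beyond the scope of this book, but the idea can be illustrated with a minimal PCA sketch using only numpy. This is a generic illustration and not the Bhattacharyya-based method of Guorong et al.; the data matrix X below is hypothetical, with one row per object and one column per feature.

import numpy as np

def pca_extract(X, M):
    # Project N-dimensional feature vectors (rows of X) onto the
    # M principal directions of the sample.
    Xc = X - X.mean(axis=0)                  # centre each feature
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:M].T                             # top-M principal directions
    return np.dot(Xc, W)                     # M-dimensional feature vectors

# Hypothetical usage: reduce 400 features to 20
# Z = pca_extract(X, 20)

PCA keeps the directions of greatest sample variance and ignores the class labels; class-aware extraction methods, such as the one used by Xuan et al., may do better for classification.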

Feature selection is a special and limited case of feature extraction. Writing a feature vector as images/c13_I0031.gif, we are only considering maps images/c13_I0032.gif which can be written as

images/c13_I0033.gif

for some subset images/c13_I0034.gif. In other words, we select individual features from the feature vector images/c13_I0035.gif, ignoring the others completely.

In addition to reducing the dimensionality to avoid overfitting, feature selection is also a valuable aid in interpreting the problem. A classifier using a 1000-dimensional feature vector to achieve a negligible error rate may well solve the steganalyst's practical problem. However, it is almost impossible to analyse, and although the machine can learn a lot, the user learns rather little.

If we can identify a low-dimensional feature vector with decent classification performance, we can study each feature and find out how and why it is affected by the embedding. Thus a manageable number of critical artifacts can be identified, and the design of new steganographic systems can focus on eliminating these artifacts. Using feature selection in this context, we may very well look for a feature vector giving inferior classification accuracy. The goal is to find one which is small enough to be manageable and yet explains most of the detectability. Because each feature is either used or ignored, feature selection gives easy-to-use information about the artifacts detected by the classifier.

13.2 Scalar Feature Selection

Scalar feature selection aims to evaluate individual, scalar features independently. Although such an approach disregards dependencies between the different features in a feature vector, it still has its uses. An important benefit is that the process scales linearly in the number of candidate features. There are many heuristics and approaches which can be used. We will only have room for ANOVA, which was used to design one of the pioneering feature vectors in steganalysis.

13.2.1 Analysis of Variance

Analysis of variance (ANOVA) is a technique for hypothesis testing, aiming to determine if a series of sub-populations have the same mean. In other words, if a population is made up of a number of disjoint classes images/c13_I0036.gif and we observe a statistic images/c13_I0037.gif, we want to test the null hypothesis:

$H_0: \mu_1 = \mu_2 = \cdots = \mu_N$

where images/c13_I0039.gif is the mean of images/c13_I0040.gif when drawn from images/c13_I0041.gif. A basic requirement for any feature in machine learning is that it differs between the different classes images/c13_I0042.gif. Comparing the means of each class is a reasonable criterion.

Avcibaş et al. (2003) started with a large set of features which they had studied in a different context and tested them on a steganalytic problem. They selected the ten features with the lowest images/c13_I0043.gif-value in the hypothesis test to form their IQM feature vector (see Section 6.2). It is true that a feature may be discarded because it has the same mean for all classes and still have some discriminatory power. However, the ANOVA test does guarantee that those features which are selected will have discriminatory power.

There are many different ANOVA designs, and entire books are devoted to the topic. We will give a short presentation of one ANOVA test, following Johnson and Wichern (1988), but using the terminology of steganalysis and classification. We have a feature images/c13_I0044.gif representing some property of an object images/c13_I0045.gif which is drawn randomly from a population consisting of a number of disjoint classes images/c13_I0046.gif, for images/c13_I0047.gif. Thus images/c13_I0048.gif is a stochastic variable. We have a number of observations images/c13_I0049.gif, where images/c13_I0050.gif and images/c13_I0051.gif is drawn from images/c13_I0052.gif.

There are three assumptions underlying ANOVA:

1. The observations images/c13_I0053.gif are independent.
2. All the classes have the same variance images/c13_I0054.gif.
3. Each class is normally distributed.

The first assumption should be no surprise. We depend on this whenever we test a classifier to estimate error probabilities too, and it amounts to selecting media objects randomly and independently. The third assumption, on the distribution, can be relaxed if the sample is sufficiently large, due to the Central Limit Theorem.

The second assumption, that the variances are equal, gives cause for concern. A priori, there is no reason to assume that steganographic embedding affects only the mean; it may just as well affect the variance, instead of or in addition to the mean. Hence, one will need to validate this assumption if the quantitative conclusions made from the ANOVA test are important.

To understand the procedure, it is useful to decompose the statistic or feature into different effects. Each observation can be written as a sum

$x_{i,j} = \mu + \tau_i + \epsilon_{i,j}$

where images/c13_I0056.gif is the mean which is common for the entire population and images/c13_I0057.gif is a so-called class effect which is the difference between the class mean and the population mean, i.e. images/c13_I0058.gif. The last term can be called a random error images/c13_I0059.gif, that is the effect due to the individuality of each observation.

With the notation of the decomposition, the null hypothesis may be rephrased as

$H_0: \tau_1 = \tau_2 = \cdots = \tau_N = 0$

In practice the means are unknown, but we can make an analogous decomposition using sample means as follows:

$x_{i,j} = \bar{x} + (\bar{x}_i - \bar{x}) + (x_{i,j} - \bar{x}_i)$   (13.1)

for images/c13_I0062.gif and images/c13_I0063.gif. This can be written as a vector equation,

(13.2)

where images/c13_I0065.gif is an all-one vector, and

images/c13_I0066.gif

We are interested in a decomposition of the sample variance images/c13_I0067.gif, where squaring denotes an inner product of a vector and itself. We will show that

(13.3)

This equation follows from the fact that the three vectors on the right-hand side of (13.2) are orthogonal. In other words, images/c13_I0069.gif, images/c13_I0070.gif and images/c13_I0071.gif regardless of the actual observations. To demonstrate this, we will give a detailed derivation of (13.3).

Subtracting images/c13_I0072.gif on both sides of (13.1) and squaring gives

images/c13_I0073.gif

and summing over images/c13_I0074.gif, while noting that images/c13_I0075.gif, gives

images/c13_I0076.gif

Next, summing over images/c13_I0077.gif, we get

images/c13_I0078.gif

which is equivalent to (13.3).

The first term, images/c13_I0079.gif, is the component of variance which is explained by the class. The other component, images/c13_I0080.gif, is unexplained variance. If images/c13_I0081.gif is true, the unexplained variance images/c13_I0082.gif should be large compared to the explained variance images/c13_I0083.gif. A standard images/c13_I0084.gif-test can be used to compare the two.

It can be shown that images/c13_I0085.gif has images/c13_I0086.gif degrees of freedom, while images/c13_I0087.gif has images/c13_I0088.gif degrees of freedom, where images/c13_I0089.gif. Thus the images/c13_I0090.gif statistic is given as

$F = \dfrac{\text{explained variance}/(N-1)}{\text{unexplained variance}/(n-N)}$

which is images/c13_I0092.gif distributed with images/c13_I0093.gif and images/c13_I0094.gif degrees of freedom. The images/c13_I0095.gif-value is the probability that a random images/c13_I0096.gif-distributed variable is greater than or equal to the observed value images/c13_I0097.gif as defined above. This probability can be found in a table or by using statistics software. We get the same ordering whether we rank features by images/c13_I0098.gif-value or by images/c13_I0099.gif-value.
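To illustrate how such a ranking can be computed in practice, the sketch below uses scipy.stats.f_oneway, which performs exactly this one-way ANOVA and returns the F statistic and the p-value for each feature. The arrays X and Y are hypothetical numpy arrays, and this is not the original procedure of Avcibaş et al.; it merely mimics the idea of keeping the features with the lowest p-values.

import numpy as np
from scipy.stats import f_oneway

def anova_rank(X, Y):
    # Rank features (columns of X) by the p-value of a one-way ANOVA
    # across the classes in Y.  Lowest p-value first.
    X = np.asarray(X)
    Y = np.asarray(Y)
    pvalues = []
    for i in range(X.shape[1]):
        groups = [ X[Y == c, i] for c in np.unique(Y) ]
        F, p = f_oneway(*groups)
        pvalues.append(p)
    return np.argsort(pvalues)

# Hypothetical usage: keep the ten features with the lowest p-value
# best10 = anova_rank(X, Y)[:10]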

13.3 Feature Subset Selection

Given a set of possible features images/c13_I0100.gif, feature selection aims to identify a subset images/c13_I0101.gif which gives the best possible classification. The features in the optimal subset images/c13_I0102.gif do not have to be optimal individually. An extreme and well-known case is the XOR example from Section 11.2.1. We have two features images/c13_I0103.gif and images/c13_I0104.gif and a class label images/c13_I0105.gif. Let images/c13_I0106.gif with a 50/50 probability distribution, and let both images/c13_I0107.gif and images/c13_I0108.gif be independent of images/c13_I0109.gif, that is

images/c13_I0110.gif

Thus, neither feature images/c13_I0111.gif is individually useful for predicting images/c13_I0112.gif. Collectively, in contrast, the joint variable images/c13_I0113.gif gives complete information about images/c13_I0114.gif, giving a perfect classifier.

In practice, we do not expect such an extreme situation, but it is indeed very common that the usefulness of some features depends heavily on which other features are included. Two types of dependence are possible. In the XOR example, we see that features reinforce each other: the usefulness of the group is more than the sum of its components. Another common situation is where features are redundant, in the sense that they are highly correlated and contain the same information about the class label. Then the usefulness of the group will be less than the sum of its components.

It is very hard to quantify the usefulness of features and feature vectors, and thus examples necessarily become vague. However, we can note that Cartesian calibration as discussed in Chapter 9 aims to construct features which reinforce each other. The calibrated features are often designed to have the same value for a steganogram and the corresponding cover image, and will then have no discriminatory power at all. Such features are invented only to reinforce the corresponding uncalibrated features. We also have reason to expect many of the features in the well-known feature vectors to be largely redundant. For instance, when we consider large feature vectors where all the features are calculated in a similar way, like SPAM or Shi et al.'s Markov features, it is reasonable to expect the individual features to be highly correlated. However, it is not easy to identify which or how many can safely be removed.

Subset selection consists of two sub-problems. Firstly, we need a method to evaluate the usefulness of a given candidate subset, and we will refer to this as subset evaluation. Secondly, we need a method for subset search to traverse possible subsets looking for the best one we can find. We will consider the two sub-problems in turn.

13.3.1 Subset Evaluation

The most obvious way to evaluate a subset of features is to train and test a classifier and use a performance measure such as the accuracy, as the heuristic of the feature subset. This gives rise to so-called wrapper search; the feature selection algorithm wraps around the classifier. This approach has the advantage of ranking feature vectors based exactly on how they would perform with the chosen classifier. The down-side is that the training/testing cycle can be quite expensive.

The alternative is so-called filter search, where we use some evaluation heuristic independent of the classifier. Normally, such heuristics will be faster than training and testing a classifier, but it may also be an advantage to have a ranking criterion which is independent of the classifier to be used.

The main drawback of wrapper search is the cost of training a classifier for every candidate subset. This is definitely a problem for state-of-the-art classifiers like SVM, but there are many classifiers which are faster to train at the expense of accuracy. Such fast classifiers can be used to design a filter search. Thus we would use the evaluation criterion of a wrapper search, but since the intention is to use the selected features with arbitrary classifiers, it is properly described as a filter search. Miche et al. (2006) propose a methodology for feature selection in steganalysis, where they use a images/c13_I0115.gif nearest neighbours (K-NN) classifier for the filter criterion and SVM for the classification.

The images/c13_I0116.gif nearest neighbour algorithm takes no time at all to train; all the calculations are deferred until classification. The entire training set is stored, with a set of feature vectors images/c13_I0117.gif and corresponding labels images/c13_I0118.gif. To predict the class of a new feature vector images/c13_I0119.gif, the Euclidean distance images/c13_I0120.gif is calculated for each images/c13_I0121.gif, and the images/c13_I0122.gif nearest neighbours images/c13_I0123.gif are identified. The predicted label is taken as the most frequent class label among the images/c13_I0124.gif nearest neighbours images/c13_I0125.gif.

Clearly, the classification time with images/c13_I0126.gif nearest neighbour depends heavily on the size of the training set. The distance calculation is images/c13_I0127.gif, where images/c13_I0128.gif is the number of features and images/c13_I0129.gif is the number of training vectors. SVM, in contrast, may be slow to train, but the classification time is independent of images/c13_I0130.gif. Thus images/c13_I0131.gif nearest neighbour is only a good option if the training set can be kept small. We will present a couple of alternative filter criteria later in this chapter, and interested readers can find many more in the literature.
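A minimal K nearest neighbour classifier is easily written in numpy, which may make the trade-off clearer. This is only a sketch of the generic algorithm, not the implementation of Miche et al., and no attempt is made at computational efficiency; the arrays are hypothetical.

import numpy as np

def knn_predict(Xtrain, Ytrain, x, K=5):
    # Predict the class of the feature vector x by a majority vote
    # among its K nearest neighbours in the training set.
    d = np.sum( (Xtrain - x)**2, axis=1 )        # squared Euclidean distances
    nearest = np.argsort(d)[:K]                  # indices of the K nearest
    labels, counts = np.unique(Ytrain[nearest], return_counts=True)
    return labels[np.argmax(counts)]             # most frequent label wins

Used as a filter criterion, one would estimate the error rate of such a classifier on a held-out subset for every candidate feature set.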

13.3.2 Search Algorithms

Subset search is a general algorithmic problem which appears in many different contexts. Seeking a subset images/c13_I0132.gif maximising some heuristic images/c13_I0133.gif, there are images/c13_I0134.gif potential solutions. It is clear that we cannot search them all except in very small cases. Many different algorithms have been designed to search only a subset of the possible solutions in such a way that a near-optimal solution can be expected.

In sequential forward selection, we build our feature set images/c13_I0135.gif, starting with an empty set and adding one feature per round. Considering round images/c13_I0136.gif, let images/c13_I0137.gif be the features already selected. For each candidate feature images/c13_I0138.gif, we calculate our heuristic for images/c13_I0139.gif. The images/c13_I0140.gifth feature images/c13_I0141.gif is chosen as the one maximising the heuristic.
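The search itself is independent of the evaluation method. The sketch below assumes a user-supplied function heuristic(S) which scores a candidate list of feature indices, with higher being better; in a wrapper search it would train and test a classifier, in a filter search it could be an information-theoretic criterion.

def forward_selection(candidates, heuristic, k):
    # Greedy sequential forward selection of k features.
    selected = []
    remaining = list(candidates)
    for _ in range(k):
        scores = [ (heuristic(selected + [f]), f) for f in remaining ]
        best_score, best = max(scores)
        selected.append(best)
        remaining.remove(best)
    return selected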

Where forward selection starts with an empty set which is grown to the desired size, backward selection starts with a large set of features which is pruned by removing one feature every round. Let images/c13_I0142.gif be the set of all features under consideration in round images/c13_I0143.gif. To form the feature set for round images/c13_I0144.gif, we find images/c13_I0145.gif by solving

images/c13_I0146.gif

and set images/c13_I0147.gif.

Sequential methods suffer from the fact that once a given feature has been accepted (in forward search) or rejected (in backward search), this decision is final. There is no way to correct a bad choice later. A straightforward solution to this problem is to alternate between forward and backward search, for instance the classic ‘plus images/c13_I0148.gif, take away images/c13_I0149.gif’ search, where we alternately make images/c13_I0150.gif steps forward and then images/c13_I0151.gif steps backward. If we start with an empty set, we need images/c13_I0152.gif. The backward steps can cancel bad choices in previous forward steps and vice versa.

The simple ‘plus images/c13_I0153.gif, take away images/c13_I0154.gif’ algorithm can be made more flexible by making images/c13_I0155.gif and images/c13_I0156.gif dynamic and data dependent. In the classic floating search of Pudil et al. (1994), a backward step should only be performed if it leads to an improvement. It works as follows:

1. Given a current set images/c13_I0157.gif of images/c13_I0158.gif features, add the most significant feature images/c13_I0159.gif and set images/c13_I0160.gif.
2. Conditionally remove the least significant feature images/c13_I0161.gif, to form images/c13_I0162.gif.
3. If the current subset images/c13_I0163.gif is the best one of size images/c13_I0164.gif seen so far, we let images/c13_I0165.gif and continue removing features by returning to Step 2.
4. Otherwise, we return the conditionally removed feature to the set, and continue adding features by returning to Step 1. (In this case, images/c13_I0166.gif and images/c13_I0167.gif are unchanged since the previous completion of Step 1.)

The search continues until the heuristic converges.
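The following sketch of the floating search uses the same hypothetical heuristic(S) interface as the forward selection sketch above. It follows the four steps described, keeping track of the best score seen for every subset size; a round limit is added purely as a safeguard against cycling, and the code is not Pudil et al.'s original algorithm in every detail.

def floating_selection(candidates, heuristic, k, max_rounds=1000):
    # Sequential floating forward selection in the spirit of Pudil et al.
    # Returns the best subset of size k found.
    selected = []
    best = {}             # best score seen for each subset size
    best_subset = {}      # the corresponding subsets
    for _ in range(max_rounds):
        if len(selected) >= k: break
        # Step 1: add the most significant feature
        remaining = [ f for f in candidates if f not in selected ]
        score, f = max( (heuristic(selected + [f]), f) for f in remaining )
        selected.append(f)
        if score >= best.get(len(selected), float('-inf')):
            best[len(selected)] = score
            best_subset[len(selected)] = list(selected)
        # Steps 2-4: keep removing the least significant feature as long
        # as that improves on the best subset of the resulting size
        while len(selected) > 2:
            score, f = max( (heuristic([ g for g in selected if g != f ]), f)
                            for f in selected )
            if score > best.get(len(selected) - 1, float('-inf')):
                selected.remove(f)
                best[len(selected)] = score
                best_subset[len(selected)] = list(selected)
            else:
                break
    return best_subset.get(k, selected)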

13.4 Selection Using Information Theory

One of the 20th-century milestones in engineering was Claude Shannon's invention of a quantitative measure of information (Shannon, 1948). This led to an entirely new field of research, namely information theory. The original application was in communications, but it has come to be used in a range of other areas as well, including machine learning.

13.4.1 Entropy

Consider two stochastic variables images/c13_I0168.gif and images/c13_I0169.gif. Information theory asks, how much information does images/c13_I0170.gif contain about images/c13_I0171.gif? In communications, images/c13_I0172.gif may be a transmitted message and images/c13_I0173.gif is a received message, distorted by noise. In machine learning, it is interesting to ask how much information a feature or feature vector images/c13_I0174.gif contains about the class label images/c13_I0175.gif.

Let's first define the uncertainty or entropy images/c13_I0176.gif of a discrete stochastic variable images/c13_I0177.gif, as follows:

$H(X) = -\sum_{x} P(X = x)\,\log P(X = x)$

Note that the base of the logarithm matters little. Changing the base would only scale the expression by a constant factor. Using base 2, we measure the entropy in bits. With the natural logarithm, we measure it in nats.
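As a small illustration, the entropy of a discrete distribution can be computed directly from its probability vector. The function below is a straightforward sketch; choosing base 2 gives the answer in bits and base e in nats.

import numpy as np

def entropy(p, base=2.0):
    # Entropy of a discrete distribution given as a probability vector p.
    # Zero probabilities contribute nothing to the sum.
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p)) / np.log(base)

# A uniform binary variable:
# entropy([0.5, 0.5])        gives 1.0 (bits)
# entropy([0.5, 0.5], np.e)  gives 0.693... (nats)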

Shannon's definition is clearly inspired by the concept of entropy from thermodynamics, but the justification is based on a number of intuitive axioms about uncertainty, or ‘lack of information’. Shannon (1948) motivated the definition with three key properties or axioms:

  • The entropy is continuous in images/c13_I0179.gif for each images/c13_I0180.gif.
  • If the distribution is uniform, then images/c13_I0181.gif is a monotonically increasing function in images/c13_I0182.gif.
  • ‘If a choice be broken down into two successive choices, the original images/c13_I0183.gif should be the weighted sum of the individual values of images/c13_I0184.gif.’

The last of these properties may be a little harder to understand than the first two. In order to ‘break down’ the entropy into successive choices, we need to extend the concept of entropy for joint and conditional distributions. Joint entropy is straightforward. Given two random variables images/c13_I0185.gif and images/c13_I0186.gif, we can consider the joint random variable images/c13_I0187.gif. Using the joint probability distribution, the joint entropy is given as

$H(X,Y) = -\sum_{x}\sum_{y} P(X=x, Y=y)\,\log P(X=x, Y=y)$

If the two variables images/c13_I0189.gif and images/c13_I0190.gif are independent, it is trivial to break down the images/c13_I0191.gif into successive choices:

$H(X,Y) = H(X) + H(Y)$   (13.4)

giving us a sum of individual entropies as required by the axiom.

In the example above, the remaining entropy after observing images/c13_I0193.gif is given as images/c13_I0194.gif, but this is only because images/c13_I0195.gif and images/c13_I0196.gif are independent. Regardless of what we find from observing images/c13_I0197.gif, the entropy of images/c13_I0198.gif is unchanged. In general, however, the entropy of images/c13_I0199.gif will change, and vary depending on what images/c13_I0200.gif is. This is determined by the conditional probability distribution images/c13_I0201.gif. Every possible value of images/c13_I0202.gif gives a different distribution for images/c13_I0203.gif, each with an entropy given as

images/c13_I0204.gif

where we can see the weighted sum of constituent entropies for different observations of the first variable images/c13_I0205.gif. We define the conditional entropy images/c13_I0206.gif as the expected value of images/c13_I0207.gif when images/c13_I0208.gif is drawn at random from the corresponding probability distribution. In other words,

images/c13_I0209.gif

It is straightforward, following the lines of (13.4), to see that

$H(X,Y) = H(X) + H(Y \mid X)$

as a Bayes law for entropy. This shows how a choice in general can be broken down into successive choices. The total entropy is given as the unconditional entropy of the first choice and the conditional entropy of the second choice. Moreover, the conditional entropy is the weighted sum (average) of the possible entropies corresponding to different outcomes of the first choice, as Shannon stipulated.

We can also see that a deterministic variable, that is one with a single possible outcome images/c13_I0211.gif with images/c13_I0212.gif, has zero entropy as there is no uncertainty in this case. Given a fixed size images/c13_I0213.gif we see that images/c13_I0214.gif is maximised for the uniform distribution. This is logical because then everything is equally likely, and any attempt at prediction would be pure guesswork, or in other words, maximum uncertainty.

13.4.2 Mutual Information

When the entropy of one variable images/c13_I0215.gif changes, from our viewpoint, as a result of observing a different variable images/c13_I0216.gif, it is reasonable to say that images/c13_I0217.gif gives us information about images/c13_I0218.gif. Entropy gives us the means to quantify this information.

The information images/c13_I0219.gif gives about images/c13_I0220.gif can be defined as the change in the entropy of images/c13_I0221.gif as a result of observing images/c13_I0222.gif, or formally

$I(X;Y) = H(Y) - H(Y \mid X)$   (13.5)

The quantity images/c13_I0224.gif is called the mutual information between images/c13_I0225.gif and images/c13_I0226.gif, and it is indeed mutual, or symmetric, because

$I(X;Y) = H(Y) - H(Y \mid X) = H(X) - H(X \mid Y) = H(X) + H(Y) - H(X,Y)$

This equation is commonly illustrated with the Venn diagram in Figure 13.2. We can also rewrite the definition of images/c13_I0228.gif purely in terms of probabilities, as follows:

$I(X;Y) = \sum_{x}\sum_{y} P(X=x, Y=y)\,\log \dfrac{P(X=x, Y=y)}{P(X=x)\,P(Y=y)}$   (13.6)

Figure 13.2 Venn diagram showing the various entropies and mutual information of two random variables images/c13_I0444.gif and images/c13_I0445.gif


We will also need the conditional mutual information. The definition of images/c13_I0230.gif follows directly from the above, replacing the conditional probability distributions of images/c13_I0231.gif and images/c13_I0232.gif given images/c13_I0233.gif. The conditional mutual information is given as the weighted average, in the same way as we defined conditional entropy:

$I(X;Y \mid Z) = \sum_{z} P(Z=z)\, I(X;Y \mid Z=z)$

The entropy and mutual information are defined from the probability distribution, which is a theoretical and unobservable concept. In practice, we will have to estimate the quantities. In the discrete case, this is straightforward and uncontroversial. We simply use relative frequencies to estimate probabilities, and calculate the entropy and mutual information substituting estimates for the probabilities.


Example 13.4.1 (The XOR problem)
Recall the XOR example where we have three random variables images/c13_I0235.gif, with images/c13_I0236.gif and images/c13_I0237.gif independent and images/c13_I0238.gif. Thus each variable is given as the XOR of the other two. Assume that each variable is uniformly distributed. The three variables have the same entropy,

$H(X_1) = H(X_2) = H(Y) = \log 2$, i.e. one bit.

Each pair images/c13_I0240.gif, images/c13_I0241.gif and images/c13_I0242.gif is uniformly distributed on the set images/c13_I0243.gif. Thus, the joint entropy is given as

$H(X_1, X_2) = H(X_1, Y) = H(X_2, Y) = \log 4 = 2\log 2$

It follows that the mutual information is given as

$I(X_1; Y) = H(X_1) + H(Y) - H(X_1, Y) = \log 2 + \log 2 - 2\log 2 = 0$

In other words, images/c13_I0246.gif contains zero information about images/c13_I0247.gif, confirming what we argued previously, that images/c13_I0248.gif is not individually a useful feature to predict images/c13_I0249.gif.
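The example is easily checked numerically. The sketch below estimates entropies and mutual information from relative frequencies, as described above, and applies it to a simulated XOR sample; the variable names are hypothetical.

from collections import Counter
import numpy as np

def entropy_est(samples, base=2.0):
    # Entropy estimate of a discrete variable, using relative
    # frequencies as probability estimates.
    counts = np.array( list(Counter(samples).values()), dtype=float )
    p = counts / counts.sum()
    return -np.sum(p * np.log(p)) / np.log(base)

def mi_est(x, y):
    # Estimate I(X;Y) = H(X) + H(Y) - H(X,Y) from paired samples.
    xy = list(zip(x, y))
    return entropy_est(x) + entropy_est(y) - entropy_est(xy)

# The XOR example: X1, X2 uniform and independent, C = X1 xor X2
rng = np.random.RandomState(0)
x1 = rng.randint(0, 2, 10000)
x2 = rng.randint(0, 2, 10000)
c = x1 ^ x2
# mi_est(x1, c) is close to 0 bits, while the joint variable (X1,X2)
# gives mi_est(list(zip(x1, x2)), c) close to 1 bit.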

There is a strong relationship between mutual information and entropy on the one hand, and classifier performance on the other.


Theorem 13.4.2
The Bayes error of a classifier is bounded as images/c13_I0250.gif, where images/c13_I0251.gif is the predicted class for a given feature vector images/c13_I0252.gif.

Note that the entropy images/c13_I0253.gif is the regular, discrete entropy of images/c13_I0254.gif. It is only the conditional variable images/c13_I0255.gif which is continuous, but images/c13_I0256.gif is still well-defined as the expected value of images/c13_I0257.gif.


Proof
The Bayes error is

images/c13_I0258.gif

where images/c13_I0259.gif is the set of images/c13_I0260.gif observations that would be classified as class images/c13_I0261.gif. The entropy, in contrast, is given as

images/c13_I0262.gif

To prove the theorem, we need to show that

images/c13_I0263.gif

where images/c13_I0264.gif for images/c13_I0265.gif and images/c13_I0266.gif otherwise. It is sufficient to show for all images/c13_I0267.gif that

images/c13_I0268.gif

Let images/c13_I0269.gif such that images/c13_I0270.gif, i.e. images/c13_I0271.gif is the posterior probability that the predicted class be correct. Recall that images/c13_I0272.gif is defined so that images/c13_I0273.gif is minimised, and therefore

images/c13_I0274.gif

The entropy is bounded as

images/c13_I0275.gif

since images/c13_I0276.gif is the largest conditional probability images/c13_I0277.gif for any images/c13_I0278.gif. It can easily be shown that images/c13_I0279.gif (images/c13_I0280.gif), completing the proof.

13.4.3 Multivariate Information

To present the general framework for feature selection using information theory, we will need multivariate mutual information. This can be defined in many different ways. Yeung (2002) develops information theory in terms of measure theory, where entropy, mutual information and multivariate information are special cases of the same measure. As beautiful as this framework is, there is no room for the mathematical detail in this book. We give just the definition of multivariate information, as follows.


Definition 13.4.3
Multivariate mutual information is defined for any set images/c13_I0281.gif of stochastic variables, as follows:

images/c13_I0282.gif


We can clearly see traces of the inclusion/exclusion theorem applied to some measure images/c13_I0283.gif. It is easily checked that mutual information is a special case where images/c13_I0284.gif by noting that

images/c13_I0285.gif

We note that the definition also applies to singleton and empty sets, where images/c13_I0286.gif and images/c13_I0287.gif. The first author to generalise mutual information for three or more variables was McGill (1954), who called it interaction information. His definition is quoted below, but it can be shown that interaction information and multivariate mutual information are equivalent (Yeung, 2002). We will use both definitions in the sequel.


Definition 13.4.4 (Interaction information (McGill, 1954))
The interaction information of a set of random variables images/c13_I0288.gif for images/c13_I0289.gif is defined recursively as

images/c13_I0290.gif

where the conditional form is found simply by marginalising over the distribution of images/c13_I0291.gif. For images/c13_I0292.gif, we define images/c13_I0293.gif.

Note that images/c13_I0294.gif is not in any way considered as a joint stochastic variable images/c13_I0295.gif. Considering a single joint variable images/c13_I0296.gif gives the interaction information

images/c13_I0297.gif

which is the joint entropy, while images/c13_I0298.gif leads to

images/c13_I0299.gif

which is the mutual information. Considering three variables, the interaction information is given as

images/c13_I0300.gif

which does not coincide with the mutual information images/c13_I0301.gif, nor with images/c13_I0302.gif. The relationship is visualised in Figure 13.3.

Figure 13.3 Venn diagram showing interaction information for three random variables. The dark grey area is the third-order interaction information images/c13_I0446.gif. The total grey area (dark and light) is the mutual information images/c13_I0447.gif


Example 13.4.5
Consider again the XOR example (Example 13.4.1). The multivariate information is given as

images/c13_I0303.gif

Using McGill's definition, we see that when images/c13_I0304.gif is known, images/c13_I0305.gif gives complete information about images/c13_I0306.gif, whereas with images/c13_I0307.gif unknown we get no information. Thus

images/c13_I0308.gif

which is the same result, as expected.

Our interest in the multivariate mutual information comes from the fact that we can use it to estimate the pairwise mutual information of joint stochastic variables. The following theorem is taken from Brown (2009).


Theorem 13.4.6
Given a set of features images/c13_I0309.gif and a class label images/c13_I0310.gif, their mutual information can be expanded as

images/c13_I0311.gif


We omit the proof, which is straightforward but tedious. An interested reader can verify it by inserting the definition of interaction information into each term on the right-hand side.

Clearly, images/c13_I0312.gif is a suitable heuristic for feature selection. A good feature set images/c13_I0313.gif should have high mutual information with the class label. For large images/c13_I0314.gif, the curse of dimensionality makes it hard to obtain a robust estimate for images/c13_I0315.gif. The theorem gives a series expansion of the mutual information into a sum of multivariate mutual information terms, and it is reasonable and natural to approximate the mutual information by truncating the series. Thus we may be left with terms of sufficiently low order to allow robust estimation.

The first-order terms are the heuristics images/c13_I0316.gif of each individual feature images/c13_I0317.gif, and they are non-negative. Including only the first-order terms would tend to overestimate images/c13_I0318.gif because dependencies between the features are ignored. The second-order terms images/c13_I0319.gif may be positive or negative, and measure pairwise relationships between features. Negative terms are the result of pair-wise redundancy between features. A positive second-order term, as for instance in the XOR example, indicates that two features reinforce each other. Higher-order terms measure relationships within larger groups of features.

13.4.4 Information Theory with Continuous Sets

In most cases, our features are real or floating point numbers, and the discrete entropy we have discussed so far is not immediately applicable. One common method is to divide the range of the features into a finite collection of bins. Identifying each feature value with its bin, we get a discrete distribution, and discrete entropy and mutual information can be used. We call this binning. Alternatively, it is possible to extend the definition of entropy and information for continuous random variables, using integrals in lieu of sums. This turns out very well for mutual information, but not as well for entropy.

The differential entropy is defined as a natural continuous set analogue of the discrete entropy:

$h(X) = -\int f(x)\,\log f(x)\, dx$   (13.7)

where images/c13_I0321.gif is the probability density function. Although this definition is both natural and intuitive, it does not have the same nice properties as the discrete entropy. Probabilities are bounded between zero and one and guarantee that the discrete entropy is non-negative. Probability densities, in contrast, are unbounded above, and this may give rise to a negative differential entropy.

Another problem is that the differential entropy is sensitive to scaling of the sample space. Replacing images/c13_I0322.gif by images/c13_I0323.gif for images/c13_I0324.gif will make the distribution flatter, so the density images/c13_I0325.gif will be lower and the entropy will be larger. The same problem is seen with binned variables, where the entropy will increase with the number of bins. Intuitively, one would think that the uncertainty and the information should be independent of the unit of measurement, so this sensitivity to scaling may make differential entropy unsuitable.

Unlike differential entropy, continuous mutual information can be shown to retain the interpretations of discrete mutual information. It can be derived either from images/c13_I0326.gif using the definition of differential entropy (13.7), or equivalently as a natural analogue of the discrete formula (13.6). Thus we can write

$I(X;Y) = \int\!\!\int f(x,y)\,\log \dfrac{f(x,y)}{f(x)\,f(y)}\, dx\, dy$   (13.8)

It is possible to show that this definition is the limit of discrete mutual information with binned variables as the bin size decreases. One can see this intuitively, by noting that the scaling factors in the numerator and denominator cancel out. Thus the continuous mutual information images/c13_I0328.gif between a feature images/c13_I0329.gif and the class label images/c13_I0330.gif is a potentially good heuristic for the classification ability, even if continuous entropy is not well-defined.

13.4.5 Estimation of Entropy and Information

The problem of estimating entropy and information is closely related to that of estimating probability density functions (Section 12.2). The simplest approach is to use binning to discretise the random variable as addressed above. This corresponds to using the histogram to estimate the PDF. This approach is dominant in the literature on machine learning and it is rarely subject to discussion. The challenge is, both for density estimation and for mutual information, to choose the right bin width. We can use the same advice on bin width selection for entropy estimation as for density estimation. Sturges' rule using images/c13_I0331.gif bins is a good starting point, but in the end it boils down to trial and error to find the ideal bin width.
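As an illustration of the binning approach, the sketch below estimates the mutual information between a continuous feature and a discrete class label using numpy histograms, with the number of bins chosen by Sturges' rule. The same bin boundaries are used for the marginal and the class-conditional histograms, a point we return to below; the arrays are hypothetical.

import numpy as np

def binned_mi(x, y):
    # Estimate I(X;Y) = H(X) - H(X|Y) for a continuous feature x and a
    # discrete class label y, by binning x and using relative frequencies.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    nbins = int(np.ceil(np.log2(len(x)) + 1))     # Sturges' rule
    _, edges = np.histogram(x, bins=nbins)        # common bin boundaries
    def H(sample):
        counts, _ = np.histogram(sample, bins=edges)
        p = counts[counts > 0] / float(len(sample))
        return -np.sum(p * np.log2(p))
    HX = H(x)
    classes, counts = np.unique(y, return_counts=True)
    HXgivenY = sum( (n / float(len(x))) * H(x[y == c])
                    for (c, n) in zip(classes, counts) )
    return HX - HXgivenY                          # estimate in bits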

For the purpose of feature selection, our main interest is in the mutual and interaction information, and not in the entropy per se. We can estimate information using any of the rules

$\hat{I}(X;Y) = \hat{H}(X) + \hat{H}(Y) - \hat{H}(X,Y)$   (13.9)

$\hat{I}(X;Y) = \hat{H}(X) - \hat{H}(X \mid Y)$   (13.10)

$\hat{I}(X;Y) = \int\!\!\int \hat{f}(x,y)\,\log \dfrac{\hat{f}(x,y)}{\hat{f}(x)\,\hat{f}(y)}\, dx\, dy$   (13.11)

where images/c13_I0333.gif and images/c13_I0334.gif are estimates of the probability density and the entropy respectively. In principle, they all give the same results but they may very well have different error margins.

Bin size selection is a particularly obvious problem in entropy estimation, as the entropy will increase as a function of the number of bins. Thus, entropy estimates calculated with different bin sizes may not be comparable. Estimating mutual information, one should use the same bin boundaries on each axis for all the estimates used, although images/c13_I0335.gif and images/c13_I0336.gif may be binned differently. We will investigate this further in the context of continuous information.

There are several methods to estimate differential entropy and continuous mutual information. The one which is intuitively easiest to understand is the entropy estimate of Ahmad and Lin (1976). Then either (13.9) or (13.10) can be used, depending on whether the joint or conditional entropy is easiest to estimate. For the purpose of this book, we will be content with this simple approach, which relies on the kernel density estimators discussed in Section 12.2.2. Interested readers may consult Kraskov et al. (2004) for a more recent take on mutual information estimation, and an estimator based on images/c13_I0337.gifth nearest neighbour estimation.


Code Example 13.1 : The Ahmad–Lin estimator in Python, using the Gaussian KDE from scipy
import numpy as np
from scipy.stats import gaussian_kde as kde
def ahmadlin(X):
    f = kde(X)
    y = np.log( f.evaluate(X) )
    return -sum(y)/len(y)

The Ahmad–Lin estimator is given as

$\hat{H}(X) = -\dfrac{1}{n} \sum_{i=1}^{n} \log \hat{f}(x_i)$

where images/c13_I0339.gif is the set of observations of images/c13_I0340.gif and images/c13_I0341.gif is the kernel density estimate. An implementation is shown in Code Example 13.1. The term images/c13_I0342.gif in the definition of differential entropy does not occur explicitly in the estimator. However, we sample images/c13_I0343.gif at random points images/c13_I0344.gif drawn according to images/c13_I0345.gif, and thus the images/c13_I0346.gif term is represented implicitly as the sum.
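Combined with a rule of the form (13.10), the estimator gives a simple estimate of the mutual information between a continuous feature and the class label. The sketch below assumes a balanced two-class problem and hypothetical 1-D arrays xc and xs of feature values from clean images and steganograms; it reuses the ahmadlin function of Code Example 13.1.

import numpy as np

def mi_ahmadlin(xc, xs):
    # I(X;Y) = H(X) - H(X|Y) for a two-class problem with no class skew,
    # estimated with the Ahmad-Lin entropy estimator.
    x = np.concatenate([xc, xs])
    return ahmadlin(x) - 0.5*( ahmadlin(xc) + ahmadlin(xs) )

For the estimates to be strictly comparable, the three kernel density estimates should use the same bandwidth, a point we return to in the next section.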

13.4.6 Ranking Features

We have already argued that images/c13_I0347.gif is a useful evaluation heuristic for a set of features images/c13_I0348.gif and a class label images/c13_I0349.gif. It is a very intuitive measure, as it quantifies the information about images/c13_I0350.gif contained in the features. The larger the mutual information, the better classification accuracy can be expected. Now we turn to the computational problem of estimating this information.

For large images/c13_I0351.gif, the curse of dimensionality makes it impractical to calculate images/c13_I0352.gif. We can approximate it by a truncation of the series expansion in Theorem 13.4.6. It is common to use the second-order approximation

$I(X_S ; Y) \approx \sum_{i \in S} I(X_i ; Y) + \sum_{\{i,j\} \subseteq S} I(\{X_i, X_j, Y\})$   (13.12)

At worst, only three-dimensional sample spaces have to be calculated. This can be used as a filter criterion as it stands, but it is still computationally expensive for large images/c13_I0354.gif. In sequential forward selection, it is possible to simplify the expression by reusing computations from previous rounds.

In the images/c13_I0355.gif-th round we compare subsets where images/c13_I0356.gif have been fixed by the previous round, and only images/c13_I0357.gif is to be selected. If we write images/c13_I0358.gif for the ‘winning’ heuristic from round images/c13_I0359.gif, we can rewrite (13.12) as

(13.13)

Since images/c13_I0361.gif is constant, the interesting heuristic to evaluate a single candidate feature can be written as

(13.14)

where the last equality follows from McGill's definition (Definition 13.4.4). Brown (2009) calls this the first-order utility (FOU). Note that only images/c13_I0363.gif and images/c13_I0364.gif need to be estimated in round images/c13_I0365.gif. All the other terms have already been calculated in previous rounds.

In order to estimate images/c13_I0366.gif, we need to estimate both unconditional and conditional mutual information. Recall that images/c13_I0367.gif is discrete and usually drawn from a small set of class labels, thus we can write

images/c13_I0368.gif

where images/c13_I0369.gif can be calculated as the estimated mutual information considering the samples from class images/c13_I0370.gif only.

Many of the feature selection heuristics in the literature are variations of the FOU formula. A great many of them can be written in the form

(13.15)

by varying the parameters images/c13_I0372.gif. For images/c13_I0373.gif we obviously get the FOU. Brown (2009) recommends using joint mutual information (JMI), which we define as

images/c13_I0374.gif

In other words, we use images/c13_I0375.gif in (13.15). For images/c13_I0376.gif, we let images/c13_I0377.gif.

The JMI criterion was originally introduced by Yang and Moody (1999), who defined the JMI heuristic as

$J_{\mathrm{JMI}}(X_k) = \sum_{j \in S} I(X_j, X_k ; Y)$   (13.16)

Note that even though images/c13_I0379.gif, they both give the same ranking of features for images/c13_I0380.gif. In fact,

images/c13_I0381.gif

Both the last term and the multiplication factor are constant for all features images/c13_I0382.gif considered in round images/c13_I0383.gif. For images/c13_I0384.gif, images/c13_I0385.gif cannot rank the features, and one would use images/c13_I0386.gif as the criterion in the first round.

In order to calculate images/c13_I0387.gif empirically, we note that the first images/c13_I0388.gif terms images/c13_I0389.gif in the sum are the score from the previous round. We only need to calculate

images/c13_I0390.gif

For a two-class problem with no class skew, this gives

images/c13_I0391.gif

Some care must be taken to ensure that the entropy estimates are compatible. A working approach based on the Ahmad–Lin estimator is to scale all the features to have unit variance and make sure to use the same bandwidth matrix for the KDE for all three entropy estimates.

An implementation of JMI selection is shown in Code Example 13.2. An object is instantiated with a 2-D array of feature vectors T and a 1-D array of class labels Y. The object has a state maintaining a list of features selected thus far, and a list of evaluation scores from the previous round. The _getRankData() method evaluates a single feature, and next() returns the best feature for the current round and proceeds to the next round. Note that we use the KDE class from Code Example 12.2, so that we can explicitly use the same bandwidth matrix for all three KDEs. The entropy h2 and conditional entropy c2 are calculated with Ahmad and Lin's formula as before. The computational cost is dominated by the construction and evaluation of three KDEs for each candidate feature in every round.

13.5 Boosting Feature Selection

Boosting feature selection (BFS) was introduced by Tieu and Viola (2004), and Dong et al. (2008) proposed using it in the context of steganalysis. BFS builds on a previous idea of boosting, which is used to design a classifier as a linear combination of multiple constituent classifiers. To use boosting for feature selection, we consider each individual feature images/c13_I0392.gif as a classification heuristic, where negative and positive values of images/c13_I0393.gif are taken to indicate different classes. The BFS algorithm combines this idea with forward sequential search.

The BFS classifier after images/c13_I0394.gif rounds of forward search is written as

images/c13_I0395.gif

for any point images/c13_I0396.gif in feature space, where images/c13_I0397.gif has at most images/c13_I0398.gif elements not equal to 0. Note that before the first round, the above formula gives the ‘empty’ classifier images/c13_I0399.gif. In round images/c13_I0400.gif we seek to add one new feature to the classifier, so that the new classifier will be given as

images/c13_I0401.gifwhere images/c13_I0402.gif and images/c13_I0403.gif are to be optimised.

Let images/c13_I0404.gif for images/c13_I0405.gif be the training feature vectors, and let images/c13_I0406.gif be the associated class labels. In each round, we start with a classifier images/c13_I0407.gif and define a new classifier images/c13_I0408.gif by selecting a new feature images/c13_I0409.gif and updating the associated weight images/c13_I0410.gif. Each feature is selected by minimising the squared error of the classifier images/c13_I0411.gif, i.e.

$(\hat{b}, \hat{\gamma}) = \arg\min_{b,\gamma} \sum_{i=1}^{n} \bigl( y_i - F_{k-1}(\mathbf{x}_i) - b\, x_{i,\gamma} \bigr)^2$   (13.17)

where images/c13_I0413.gif is the images/c13_I0414.gifth feature of the images/c13_I0415.gifth training vector. Let images/c13_I0416.gif be the solution of the minimisation problem. The index images/c13_I0417.gif identifies a feature to be included in the selection, whereas images/c13_I0418.gif represents an update to the classifier. The classifier images/c13_I0419.gif for the next round is defined by updating the weight vector images/c13_I0420.gif, setting images/c13_I0421.gif. The process can be repeated until the squared error converges.


Code Example 13.2 : A class to do JMI feature selection using Yang and Moody's formula. Note that KDE is the class shown in Code Example 12.2. The class interface is explained in the text
import numpy as np
class JMI(object):
    def __init__(self, Y, T):
        (self.dim,self.size) = T.shape
        self.score = None
        self.selected = []
        (self.Y, self.T) = (Y,T)
    def next(self,*a,**kw):
        R = map( self._getRankData, xrange( self.dim ) )
        if len(self.selected) > 0: self.score = [ y for (x,y) in R ]
        R.sort( cmp=lambda x,y : cmp(y[1],x[1]) )   # Sort by score, descending
        ( gamma, score ) = R[0]        # Pick the highest-scoring feature
        self.selected.append( gamma )
        return gamma
    def _getRankData( self, gamma ):
        idx = [gamma]
        if len(self.selected) > 0: idx.append( self.selected[-1] )
        # Get score from previous round:
        if self.score != None: R = self.score[gamma]
        else: R = 0
        # Calculate kernel density estimates
        C = self.T[:,(self.Y == 0)]   # Select clean images
        S = self.T[:,(self.Y == 1)]   # Select steganograms
        K = KDE( self.T[idx,:] )
        KC = KDE( C[idx,:], bandwidth=K.bandwidth )
        KS = KDE( S[idx,:], bandwidth=K.bandwidth )
        # Calculate empirical entropies (Ahmad-Lin estimates)
        (d,nC) = C.shape
        (d,nS) = S.shape
        h2 = - sum( np.log( K.evaluate( self.T[idx,:] ) ) ) / (nC+nS)
        HC = - sum( np.log( KC.evaluate( C[idx,:] ) ) ) / nC
        HS = - sum( np.log( KS.evaluate( S[idx,:] ) ) ) / nS
        c2 = ( nC*HC + nS*HS ) / (nC+nS)
        # Return the feature index and its score
        return ( gamma, R + h2 - c2 )

 


Code Example 13.3 : A class to do BFS feature selection. The interface mimics that from Code Example 13.2
import numpy as np
class BFS(object):
    def __init__(self, Y, T):
         (self.dim,self.size) = T.shape
         self.beta = np.zeros( self.dim )
         (self.Y, self.T) = (Y,T)
    def next(self):
         R = map( self._getRankData, xrange( self.dim ) )
         R.sort( cmp=lambda x,y : cmp(x[1],y[1]) )
         selected = R[0]
         self.beta[selected[0]] += selected[2]
         return selected[0]
    def _getRankData( self, gamma ):
         F = np.dot( self.beta, self.T )         # Prediction
          W = self.Y - F                             # Error
         b = self.T[gamma,:]                      # current feature
         beta = sum( W * b ) / sum( b**2 ) # New beta value
          T = W - beta * self.T[gamma,:]     # New error
         E = sum( T**2 ) / len(self.Y)         # Total squared error
         return ( gamma, E, beta )

The cost function images/c13_I0422.gif to be optimised in (13.17) is obviously not continuous in the discrete variable images/c13_I0423.gif, so the easiest way to optimise it is to solve

images/c13_I0424.gif

for each images/c13_I0425.gif. This gives a heuristic images/c13_I0426.gif ranking the features, and we use the feature images/c13_I0427.gif with the lowest heuristic value as the images/c13_I0428.gifth feature for our classifier. The corresponding weighting images/c13_I0429.gif is the weighting of images/c13_I0430.gif in the updated BFS classifier.

Code Example 13.3 shows how this can be done in Python using roughly the same interface as in the previous JMI implementation. A BFS object is instantiated by providing the class labels as a 1-D numpy array, and the features as the rows of a 2-D numpy array, so that each column holds the feature vector of one object. Initially the BFS object is in round images/c13_I0432.gif. The next() method returns the next feature to be selected, and the state proceeds to round images/c13_I0433.gif, including updating the internal classifier images/c13_I0434.gif.
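A usage sketch may be helpful; the arrays below are hypothetical, and the JMI class additionally requires the KDE class of Code Example 12.2 to be available.

import numpy as np

# Hypothetical data: 50 features, 100 clean images and 100 steganograms,
# stored with one row per feature and one column per object.
T = np.random.randn(50, 200)
Y = np.array( [0]*100 + [1]*100 )

bfs = BFS(Y, T)
bfsFeatures = [ bfs.next() for i in range(20) ]   # 20 rounds of BFS

jmi = JMI(Y, T)
jmiFeatures = [ jmi.next() for i in range(20) ]   # 20 rounds of JMI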


Remark 13.1
Note that BFS can be used both as a filter method for feature selection for an arbitrary classifier algorithm, and as a classifier in its own right. The function images/c13_I0435.gif, or rather images/c13_I0436.gif to be precise, constitutes the trained classifier. To use it just for feature selection, we select the features images/c13_I0437.gif that have non-zero weighting images/c13_I0438.gif.

13.6 Applications in Steganalysis

There are a few, but only a few, examples of feature selection being used in steganalysis. Pevný et al. (2009a) even state that there is no known case in steganalysis where feature selection improves the classifier. In a massive effort to break the HUGO system, Fridrich et al. (2011) ended up with a 24 993-dimensional feature vector, and they argue that such high dimensionality poses no harm.

Part of the problem is that the size of the training set must correspond to the feature dimensionality. In the attack on HUGO, Fridrich et al. used a training set of 30000–90000 images. This poses two challenges. Firstly, the task of collecting a diverse and representative image base of tens of thousands of images is difficult and time-consuming, especially if multiple classes of images are going to be considered and compared. Secondly, the computational cost of training the classifier increases with the size of the training set. In fact, Fridrich et al. gave up SVM in favour of faster algorithms when they took up using tens of thousands of features.

Most authors use both shorter feature vectors and smaller training sets. If we try to combine many features from different authors, without having the opportunity to increase the training set correspondingly, then we will quite likely have a case for feature selection. A priori, there is no reason to expect the published feature vectors to be optimal selections. It is possible that better classifiers are available by combining existing features into new feature vectors. This idea is known as feature-level fusion. An interesting exercise is to fuse all known feature vectors into one, and then use feature selection techniques to find an optimal subset. Pioneering work on feature-level fusion in steganalysis was done by Dong et al. (2008), who combined it with BFS.

The possible use of feature selection for the purpose of interpretability has been explored by Miche et al. (2009). They described the objective as a reverse engineering exercise, where the most useful features are used to identify artifacts caused by embedding and thus help in reverse engineering the stego-system. A number of stego-systems were tested, identifying the features which are sensitive to each system. It seems that their work only scratched the surface, and they did not cover all of the current stego-systems. It should be interesting to see more research along this line.

13.6.1 Correlation Coefficient

Pevný et al. (2009a) included experiments with feature selection from the SPAM features. Their approach goes a bit beyond the framework that we have discussed so far, in that their heuristic is based not on the class label, but on the number of coefficients changed by the embedding. In a sense, this tunes the feature selection for quantitative steganalysis using regression, rather than for binary classification. Still, features which are useful for one problem are likely to be useful for the other.

Let images/c13_I0439.gif denote the number of embedding changes in the intercepted image. For each feature images/c13_I0440.gif, the correlation coefficient is defined as

images/c13_I0441.gif

The sample correlation coefficient can be calculated using the sample means in lieu of the expectations, and this is used to rank the features and select those with the largest correlation coefficient. Evidently, one can replace images/c13_I0442.gif by a numeric class label, but we have not seen this approach tested.
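A sketch of this ranking is easily written with numpy. The feature matrix X (one row per object, one column per feature) and the vector q of embedding-change counts are hypothetical; the absolute value of the sample correlation coefficient is used here so that strongly negatively correlated features also rank highly, which may differ in detail from the original procedure.

import numpy as np

def rank_by_correlation(X, q):
    # Rank features (columns of X) by the absolute value of the sample
    # correlation coefficient with the number of embedding changes q.
    X = np.asarray(X, dtype=float)
    q = np.asarray(q, dtype=float)
    Xc = X - X.mean(axis=0)
    qc = q - q.mean()
    cov = np.dot(Xc.T, qc) / (len(q) - 1)
    rho = cov / ( Xc.std(axis=0, ddof=1) * qc.std(ddof=1) )
    return np.argsort(-np.abs(rho))

# e.g. keep the 200 top-ranked features:
# best200 = rank_by_correlation(X, q)[:200]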

Pevný et al. (2009a) report that the accuracy increases up to about 200 features and then stabilises. In other words, they find no evidence of overfitting, and thus no case where feature selection would increase the accuracy of the SPAM features. Furthermore, the optimal features depend on the data set used. Across the four image databases considered, 114 out of the 200 best features are shared by all image sets. Thus one might conclude that the full SPAM feature vector should be used for a universal steganalyser.

13.6.2 Optimised Feature Vectors for JPEG

In an attempt to make the optimal feature vector for JPEG, based on all the different features we have discussed, we have run a number of experiments using feature selection. We used the same training and test sets as we have used throughout, with 1000 training images and 4000 test images from the BOWS collection. To do the feature selection, we used the training set with 500 clean images JPEG compressed at QF75 and 500 steganograms with 512 byte messages embedded with F5. A total of about 6700 features had been implemented and used. Our experiments closely follow those reported by Schaathun (2011), except for a few minor modifications and bug fixes.

We have selected features using FOU, JMI and BFS. Figure 13.4 shows the relationship between dimensionality and testing accuracy, where the best features have been used according to each of the selection criteria. It is interesting to see how quickly the curve flattens, showing that decent accuracy can be achieved with less than 50 features, and little is gained by adding features beyond 100. Recalling that NCPEV-219 and CCPEV-438 had accuracies of 86.3% and 86.7% respectively, we note that automated feature selection does not improve the accuracy compared to state-of-the-art feature vectors.

Figure 13.4 Accuracy for SVM using automatically selected features


We note that there is no evidence of the curse of dimensionality kicking in within the 700 dimensions we have covered. In fact, both the training and testing error rates appear to stabilise. We also ran a comparison test using all the JPEG-based features (CCPEV-438, FRI-23, Markov-243, HCF-390 and CCCP-54) to see how much trouble the dimensionality can cause. It gave an accuracy of 87.2%, which is better than CCPEV-438, but not significantly so.

Even though automated feature selection does not improve the accuracy, improvements may be found by studying which features are selected. In Table 13.1 we have listed the first 20 features in the order they were selected with the JMI measure. The DCM-243 feature vector is Shi et al.'s Markov-based features subject to difference calibration. As we see, different variations of Markov-based features dominate the list, with a good number of co-occurrence and global histogram features. The single dual histogram feature is the odd one out. Continuing down the list, Markov-based features continue to dominate with local histogram features as a clear number two.

Table 13.1 Top 20 features in the selection according to JMI


There is no obvious pattern to which individual features are selected from each feature group. Since all of these features are representative of fairly large groups, it is tempting to see if these groups form good feature vectors in their own right. The results are shown in Table 13.2. We note that just the Markov-based features from NCPEV-219 or CCPEV-438 give decent classification, and combining the Markov and co-occurrence features from CCPEV-438, HGS-212 gives a statistically significant improvement in the accuracy with less than half the dimensionality.

Table 13.2 Accuracies for some new feature vectors. Elementary subsets from known feature vectors are listed in the first part. New combinations are listed in the lower part

Name   Accuracy (%)   Comprises
CCPM-162 81.9 Markov features from CCPEV-438
NCPM-81 81.0 Markov features from NCPEV-219
CCCM-50 74.5 Co-occurrence matrix from CCPEV-438
DCCM-25 69.5 Co-occurrence matrix from PEV-219
NCCM-25 68.5 Co-occurrence matrix from NCPEV-219
HGS-162 90.3 images/c13_I0468.gif from CCPEV-438
HGS-178 90.1 images/c13_I0469.gif images/c13_I0470.gif (CCPEV-438)
HGS-160 90.1 images/c13_I0471.gif images/c13_I0472.gif (NCPEV-219)
HGS-180 89.9 images/c13_I0473.gif from CCPEV-438
HGS-212 89.4 images/c13_I0474.gif
HGS-142 88.7 images/c13_I0475.gif histogram from PEV-219
HGS-131 88.2 images/c13_I0476.gif

The accuracy can be improved further by including other features as shown in the table. The local histogram features used in HGS-160 and HGS-178 are the images/c13_I0443.gif bins from each of the nine frequencies used, taken from NCPEV-219 and CCPEV-438, respectively. Although automated feature selection does not always manage to improve on the accuracy, we have demonstrated that feature selection methods can give very useful guidance to manual selection of feature vectors. Schaathun (2011) also reported consistent results when training and testing the same feature selection with Outguess 0.1 at quality factor 75 and with F5 at quality factor 85.
