Chapter 10

Simulation and Evaluation

In Chapter 3 we gave a simple framework for testing and evaluating a classifier using accuracy, and through the subsequent chapters we have built up a sizeable toolbox of steganalysers. There is more to testing and evaluation than just accuracy and standard errors, and this chapter will dig into this problem in more depth. We will look at simulation and evaluation methodology, and also at different performance heuristics which can give a more detailed picture than the accuracy does.

10.1 Estimation and Simulation

One of the great challenges of science and technology is incomplete information. Seeking knowledge and understanding, we continuously have to make inferences from a limited body of data. Evaluating a steganalyser, for instance, we desire information about the performance of the steganalyser on the complete population of images that we might need to test in the future. The best we can do is to select a sample (test set) of already known images to check. This sample is limited both by the availability of images and by the time available for testing.

Statistics is the discipline concerned with quantitative inferences drawn from incomplete data. We will discuss some of the basics of statistics while highlighting concepts of particular relevance. Special attention will be paid to the limitations of statistical inference, to enable us to take a critical view on the quantitative assessments which may be made of steganalytic systems. For a more comprehensive introduction to statistics, we recommend Bhattacharyya and Johnson (1977).

10.1.1 The Binomial Distribution

Let us start by formalising the concept of accuracy and the framework for estimating it. In testing a classifier, we are performing a series of independent experiments or trials. Each trial considers a single image, and it produces one out of two possible outcomes, namely success (meaning that the object was correctly classified) or failure (meaning that the object was misclassified).

Such experiments are known as Bernoulli trials, defined by three properties:

1. Each trial has two possible outcomes, often labelled ‘success’ ($S$) and ‘failure’ ($F$).
2. Every trial in the series has the same probabilities $p = P(S)$ and $q = P(F) = 1 - p$.
3. The trials are independent, so that the probabilities $p$ and $q$ are invariant, and do not change as a function of previous outcomes.

Evaluating a classifier according to the framework from Chapter 3, we are interested in the probability of success, or correct classification. Thus the accuracy is equal to the success probability, i.e. $A = p$.

A Bernoulli trial can be any kind of independent trial with a binary outcome. The coin toss, which has ‘head’ ($S$) or ‘tail’ ($F$) as the two possible outcomes, is the fundamental textbook example. Clearly the probability is $p = 1/2$, unless the coin is bent or otherwise damaged. In fact, we tend to refer to any Bernoulli trial with $p = 1/2$ as a coin toss. Other examples include:

  • testing a certain medical drug on an individual and recording if the response is positive ($S$) or negative ($F$);
  • polling an individual citizen and recording if he is in favour of ($S$) or against ($F$) a closer fiscal union in the EU;
  • polling an individual resident and recording whether he/she is unemployed ($S$) or not ($F$);
  • catching a salmon and recording whether it is infected with a particular parasite ($S$) or not ($F$).

Running a series of $n$ Bernoulli trials and recording the number of successes $X$ gives rise to the binomial distribution. The stochastic variable $X$ can take any value in the range $0, 1, \ldots, n$, and the probability that we get exactly $x$ successes is given as

$$P(X = x) = \binom{n}{x} p^x (1-p)^{n-x},$$

and the expected number of successes is

$$E(X) = n \cdot p.$$

The test phase in the development of a steganalyser is such a series of Bernoulli trials. With a test set of $n$ images, we expect to get $n \cdot A$ correct classifications on average, but the outcome is random so we can get more and we can get less. Some types of images may be easier to classify than others, but since we do not know which ones are easy, this is just an indistinguishable part of the random process.
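
To make these quantities concrete, here is a small sketch using scipy.stats; the accuracy $p = 0.8$ and the test set size $n = 1000$ are made-up numbers for illustration only:

# A small sketch (not from the book): binomial probabilities for a test run,
# assuming a hypothetical accuracy p = 0.8 and a test set of n = 1000 images.
from scipy.stats import binom

n, p = 1000, 0.80
print(binom.mean(n, p))      # expected number of correct classifications: 800.0
print(binom.pmf(800, n, p))  # probability of exactly 800 correct classifications
print(binom.cdf(780, n, p))  # probability of at most 780 correct classifications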

10.1.2 Probabilities and Sampling

Much of our interest centres around probabilities; the most important one so far being the accuracy, or probability of correct classification. We have used the word probability several times, relying on the reader's intuitive grasp of the concept. An event which has a high probability is expected to happen often, while one with low probability is expected to be rare. In reality, however, probability is an unusually tricky term to define, and definitions will vary depending on whether you are a mathematician, statistician, philosopher, engineer or layman.

A common confusion is to mistake probabilities and rates. Consider some Bernoulli trial which is repeated $n$ times and let $X$ denote the number of successes. The success rate is defined as $X/n$. In other words, the rate is the outcome of an experiment. Since $X$ is a random variable, we will most likely get a different value every time we run the experiment.

It is possible to interpret probability in terms of rates. The success probability $p$ can be said to be the hypothetical success rate one would get by running an infinite number of trials. In mathematical terms, we can say that

$$p = \lim_{n \to \infty} \frac{X}{n}.$$

Infinity is obviously a point we will never reach, and thus no experiment will ever let us measure the exact probability. Statistics is necessary when we seek information about a population of objects which is so large that we cannot possibly inspect every object, be it infinite or large but finite. If we could inspect every object, we would merely count them, and we could measure precisely whatever we needed to know.
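
The following minimal sketch, with an assumed success probability of $p = 0.8$, illustrates the distinction: the observed rate fluctuates around $p$ and only stabilises for very large $n$, without ever measuring $p$ exactly.

# Sketch: the success rate X/n fluctuates around the (assumed) probability p = 0.8
# and stabilises as the number of trials n grows, but never equals p exactly.
import numpy as np

rng = np.random.default_rng(0)
p = 0.8
for n in (10, 100, 10000, 1000000):
    X = rng.binomial(n, p)   # number of successes in n Bernoulli trials
    print(n, X / n)          # the observed success rate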

In steganalysis, the population of interest is the set $\mathcal{I}$ of all images we might intercept. All the images that Alice sends and Wendy inspects are drawn randomly from $\mathcal{I}$ according to some probability distribution $P$, which depends on Alice's tastes and interests. The accuracy $A$ is the probability that the classifier will give the correct prediction when it is presented with an image drawn from $\mathcal{I}$.

Trying to estimate $A$, or any other parameter of $\mathcal{I}$ or $P$, the best we can do is to study a sample $T \subset \mathcal{I}$. Every parameter of the population has an analogous parameter defined on the sample space. We have the population accuracy or true accuracy, and we have a sample accuracy that we use as an estimate for the population accuracy. We can define these parameters in terms of the sample and the population. Let $h$ be a steganalyser. For any image $I$, we write $h(I)$ for the predicted class of $I$ and $c(I)$ for the true class. We can define the population accuracy as

$$A = \sum_{I \in \mathcal{I}} P(I) \cdot [h(I) = c(I)],$$

where $[s] = 1$ if the statement $s$ is true and $[s] = 0$ otherwise. The sample accuracy is defined as

$$\hat{A} = \frac{1}{|T|} \sum_{I \in T} [h(I) = c(I)].$$

If the sample is representative of the probability distribution on the population, then $\hat{A} \approx A$. But here lies one of the major open problems in steganalysis. We know almost nothing about what makes a representative sample of images. Making a representative sample in steganalysis is particularly hard because the probability distribution $P$ is under adversarial control. Alice can choose images which she knows to be hard to classify.

10.1.3 Monte Carlo Simulations

It would be useful if we could test the steganalyser in the field, using real steganograms exchanged between real adversaries. Unfortunately, it is difficult to get enough adversaries to volunteer for the experiment. The solution is to use simulations, where we aim to simulate, or mimic, Alice's behaviour.

Simulation is not a well-defined term, and it can refer to a range of activities and techniques with rather different goals. Our interest is in simulating random processes, so that we can synthesise a statistical sample for analysis. Instead of collecting real statistical data in the field, we generate pseudo-random data which just simulates the real data. This kind of simulation is known as Monte Carlo simulation, named in honour of classic applications in simulating games.

Our simulations have a lot in common with applications in communications theory and error control coding, where Monte Carlo simulations are well established and well understood as the principal technique to assess error probabilities (Jeruchim et al., 2000). The objective, both in error control coding and in steganalysis, is to estimate the error probability of a system, and we get the estimate by simulating the operating process and counting error events.

Monte Carlo simulations are designed to generate pseudo-random samples from some probability distribution, and this is commonly used to calculate statistics of the distribution. The probability distribution we are interested in is that of classification errors in the classifier, that is, a distribution on the sample space $\mathcal{I}$ of images, but without knowledge of the error probability we cannot sample it directly. What we can do is use Monte Carlo processes to generate input (images) for the classifier, run the classifier to predict a class label and then compare it to the true label. A sample Monte Carlo system is shown in Figure 10.1.

Figure 10.1 Monte Carlo simulator system for steganalysis. All the white boxes are either random processes or processes which we cannot model deterministically


The input to the classifier is the product of several random processes, or unknown processes which can only be modelled as random processes. Each of those random processes poses a Monte Carlo problem; we need some way to generate pseudo-random samples. The key will usually be a random bit string, so that poses no challenge. The message would also need to be random or random-looking, assuming that Alice compresses and/or encrypts it, but the message length is not, and we know little about the distribution of message lengths. Then we have Alice's choice of whether to transmit a clean image or a steganogram, which gives rise to the class skew. This is another unknown parameter. The cover image comes from some cover source chosen by Alice, and different sources have different distributions.

The figure illustrates the limitations of straightforward Monte Carlo simulations. We still have five input processes to simulate, and with the exception of the key, we have hardly any information about their distributions. By pseudo-randomly simulating the five input processes for a given message length, cover source, stego algorithm and class skew, Monte Carlo simulations can give us error probabilities for the assumed parameters. We would require real field data to learn anything about those input distributions, and no one has attempted that research yet.
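
The overall structure of such a simulation might be sketched as follows. The callables draw_cover, embed and classify are hypothetical placeholders for the cover source, the stego-system and the steganalyser; they are not part of any particular library, and must be supplied by the experimenter.

# Schematic Monte Carlo loop for steganalysis. draw_cover, embed and classify are
# hypothetical placeholders passed in by the caller; classify is expected to
# return either 'stego' or 'clean'.
import random

def monte_carlo(n_trials, p_stego, msg_len, draw_cover, embed, classify):
    errors = 0
    for _ in range(n_trials):
        cover = draw_cover()                    # pseudo-random cover image
        if random.random() < p_stego:           # class skew: embed with probability p_stego
            key = random.getrandbits(128)       # pseudo-random key
            msg = bytes(random.getrandbits(8) for _ in range(msg_len))  # random-looking message
            image, truth = embed(cover, msg, key), 'stego'
        else:
            image, truth = cover, 'clean'
        if classify(image) != truth:            # count classification errors
            errors += 1
    return errors / float(n_trials)             # estimated error probability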

10.1.4 Confidence Intervals

Confidence intervals provide a format to report estimates of some parameter $\theta$, such as an accuracy or error probability. The confidence interval is a function $I(X)$ of the observed data $X$, and it has an associated confidence level $\alpha$. The popular, but ambiguous, interpretation is that the true value of $\theta$ is contained in the confidence interval with probability $\alpha$. Mathematically, we write it as

$$P(\theta \in I(X)) = \alpha.$$

The problem with this interpretation is that it does not specify the probability distribution considered. The definition of a confidence interval is that the a priori probability of the confidence interval containing $\theta$, when the observation $X$ is drawn randomly from the appropriate probability distribution, is $\alpha$. Thus both $X$ and $I(X)$ are considered as random variables.

A popular misconception occurs when the confidence interval $I(x)$ is calculated for a given observation $x$, and the confidence level is thought of as a posterior probability. The a posteriori probability

$$P(\theta \in I(x) \mid X = x)$$

may or may not be approximately equal to the a priori probability $\alpha$. Very often the approximation is good enough, but for instance MacKay (2003) gives instructive examples to show that it is not true in general. In order to relate the confidence level to probabilities, it has to be in relation to the probability distribution of the observations $X$.

The confidence interval is not unique; any interval $I(X)$ which satisfies the probability requirement is a confidence interval. The most common way to calculate a confidence interval coincides with the definition of error bounds as introduced in Section 3.2. Given an estimator $\hat{\theta}$ of $\theta$, we define an error bound with confidence level $\alpha$ as

$$e = z \cdot \sigma,$$

where $\sigma$ is the standard deviation of $\hat{\theta}$ and $z$ is the point such that $P(-z \le Z \le z) = \alpha$ for a standard normally distributed variable $Z$. Quite clearly, $[\hat{\theta} - e, \hat{\theta} + e]$ satisfies the definition of a confidence interval with a confidence level of $\alpha$, if $\hat{\theta}$ is normally distributed, which is the case if the underlying sample of observations is large, according to the central limit theorem. In practice, we do not know $\sigma$, and replacing $\sigma$ by some estimate, we get an approximate confidence interval.

Error bounds imply that some estimator has been calculated as well, so that the estimate $\hat{\theta}$ is quoted with the bounds. A confidence interval, in contrast, is quoted without any associated estimator. Confidence intervals and error bounds can usually be interpreted in the same way. Confidence intervals provide a way to disclose the error bounds, without explicitly disclosing the estimator.
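
As a small illustration, an approximate confidence interval for an estimated accuracy could be computed as in the following sketch; the counts (870 correct out of 1000) are made up, and the estimated standard deviation replaces the unknown true one.

# Sketch: approximate confidence interval for an estimated accuracy. The counts
# (870 correct out of 1000) and the 95% confidence level are assumptions.
from math import sqrt
from scipy.stats import norm

n, correct = 1000, 870
p_hat = correct / float(n)
alpha = 0.95                           # confidence level
z = norm.ppf((1 + alpha) / 2)          # z such that P(-z <= Z <= z) = alpha
e = z * sqrt(p_hat * (1 - p_hat) / n)  # error bound, using the estimated standard deviation
print(p_hat - e, p_hat + e)            # approximate 95% confidence interval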

10.2 Scalar Measures

We are familiar with the accuracy $A$ and the error rate $1 - A$, as well as their empirical estimators. As performance indicators they are limited, and hide a lot of detail about the actual performance of our classifier. Our first step is to see that not all errors are equal.

10.2.1 Two Error Types

Presented with an image $I$, our steganalyser will return a verdict of either ‘stego’ or ‘non-stego’. Thus we have two possible predicted labels, and also two possible true labels, for a total of four possible outcomes of the test, as shown in the so-called confusion matrix (Fawcett, 2006) in Figure 10.2. The columns represent what $I$ actually is, while the rows represent what the steganalyser thinks it is. Both the upper-right and lower-left cells are errors:

False positives (FP), where a natural image (negative) is incorrectly classified as stego (positive).
False negatives (FN), where a steganogram is incorrectly classified as a natural image.

Figure 10.2 The confusion matrix


In many situations, these two error types may have different associated costs. Criminal justice, for instance, has a very low tolerance for accusing the innocent. In other words, the cost of false positives is considered to be very high compared to that of false negatives.

Let $p_{\mathrm{FP}}$ and $p_{\mathrm{FN}}$ denote the probability, respectively, of a false positive and a false negative. These are conditional probabilities:

$$p_{\mathrm{FP}} = P(h(I) = \text{stego} \mid I \in \mathcal{C}), \qquad p_{\mathrm{FN}} = P(h(I) = \text{clean} \mid I \in \mathcal{S}),$$

where $\mathcal{C}$ and $\mathcal{S}$ are, respectively, the set of all cover images and all steganograms. The detection probability $p_{\mathrm{D}} = 1 - p_{\mathrm{FN}}$, as a performance measure, is equivalent to the false negative probability, but is sometimes more convenient. As is customary in statistics, we will use the hat ($\hat{\ }$) to denote empirical estimates. Thus, $\hat{p}_{\mathrm{FP}}$ and $\hat{p}_{\mathrm{FN}}$ denote the false positive and false negative rates, respectively, as calculated in the test phase.

The total error probability of the classifier is given as

$$p_e = p_{\mathrm{FN}} \cdot P(I \in \mathcal{S}) + p_{\mathrm{FP}} \cdot P(I \in \mathcal{C}). \qquad (10.1)$$

In other words, the total error probability depends, in general, not only on the steganalyser, but also on the probability of intercepting a steganogram. This is known as the class skew. When nothing is known about the class skew, one typically assumes that there is none, i.e. $P(I \in \mathcal{S}) = P(I \in \mathcal{C}) = 1/2$. However, there is no reason to think that this would be the case in steganalysis.
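
A small sketch of Equation (10.1), with assumed error rates $p_{\mathrm{FP}} = 0.05$ and $p_{\mathrm{FN}} = 0.20$, shows how strongly the total error probability depends on the class skew:

# Sketch of Equation (10.1): the total error probability under different class
# skews, for assumed error rates p_fp = 0.05 and p_fn = 0.20.
def total_error(p_fp, p_fn, p_stego):
    return p_fn * p_stego + p_fp * (1.0 - p_stego)

for p_stego in (0.5, 0.1, 0.001):
    print(p_stego, total_error(0.05, 0.20, p_stego))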


Example 10.2.1
Provos and Honeyman (2003) ran an experiment to check if image steganography is in active use on eBay and on Usenet news. They used a web crawler to harvest images, and every image found was steganalysed. They found up to 2.1% positives (including false positives) testing for JP Hide on Usenet, and 1% on eBay. These rates are comparable to the false positive rates for the steganalyser, and thus the initial test gives no information as to whether steganography is being used.
In order to complete the test, they ran a secondary attack, consisting of dictionary attacks aiming to recover the password used by the stego-systems. Such password attacks are time-consuming, and thus constitute a cost of false positives. If the initial steganalyser has a high false positive rate, it will slow down the secondary attack, because more candidate images must be tested.
In such a scientific experiment, where the aim was to verify or quantify the use of steganography in the wild, a high false negative rate is not an issue. Unless the number of steganograms in the sample set is negligibly small, we will find a certain number of steganograms, and if we multiply by $1/(1 - p_{\mathrm{FN}})$ we get a good estimate of the total number of steganograms. The FN rate just needs to be so low that we identify a statistically significant number of steganograms.

 


Example 10.2.2
Continuing the example above, one could imagine Provos and Honeyman's (2003) system being used by law enforcement or intelligence to identify organised crime applications of steganography. Now, the goal is not to quantify the use, but rather to identify all users, and the FN rate becomes an issue. The FP rate still carries a cost, as false positives will require resources for further investigation, and this cost must be weighed against the risk of missing an important adversary.

The examples are undoubtedly cases of extreme class skew. Most Internet users have probably not even heard of steganography, let alone considered using it. If the probability of intercepting a steganogram is minuscule, it is trivial to make a classifier with a correspondingly small error probability: we just hypothesise ‘non-stego’ regardless of the input, and we only have an error in the rare case that $I$ is a steganogram. In such a case, the combined error probability $p_e$ is irrelevant, and we would clearly want to estimate $p_{\mathrm{FP}}$ and $p_{\mathrm{FN}}$ separately.

10.2.2 Common Scalar Measures

Besides $p_{\mathrm{FP}}$ and $p_{\mathrm{FN}}$, there is a large number of other measures used in the literature. It is useful to be aware of these, even though most of them depend very much on the class skew.

In practice, all measures are based on empirical statistics from the testing; i.e. we find $N$ distinct natural images and create steganograms out of $N_{\mathrm{S}}$ of them, leaving $N_{\mathrm{C}} = N - N_{\mathrm{S}}$ clean images. We run the classifier on each image and count the numbers of false positives ($\mathit{FP}$), true positives ($\mathit{TP}$), false negatives ($\mathit{FN}$) and true negatives ($\mathit{TN}$). Two popular performance indicators are the precision and the recall. We define them as follows:

$$\text{precision} = \frac{\mathit{TP}}{\mathit{TP} + \mathit{FP}}, \qquad \text{recall} = \frac{\mathit{TP}}{\mathit{TP} + \mathit{FN}}.$$

The precision is the fraction of the predicted steganograms which are truly steganograms, whereas the recall is the fraction of actual steganograms which are correctly identified. Very often, a harmonic average of the precision and recall is used. This is defined as follows:

$$F = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}},$$

which can also be weighted by a constant $\beta$. Both $\beta = 1$ and $\beta = 2$ are common:

$$F_\beta = \frac{(1 + \beta^2) \cdot \text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}.$$

Finally, we repeat the familiar accuracy:

$$A = \frac{\mathit{TP} + \mathit{TN}}{\mathit{TP} + \mathit{TN} + \mathit{FP} + \mathit{FN}},$$

which is the probability of correct classification. Of all of these measures, only the recall is robust to class skew. All the others use numbers from both columns of the confusion matrix, and therefore they will change when we change the class sizes $N_{\mathrm{C}}$ and $N_{\mathrm{S}}$ relative to each other.
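
For reference, the measures above can be computed directly from the four confusion-matrix counts, as in the following sketch (the counts in the final line are made-up numbers):

# Sketch: the scalar measures computed from confusion-matrix counts.
def scalar_measures(tp, fp, fn, tn, beta=1.0):
    precision = tp / float(tp + fp)
    recall    = tp / float(tp + fn)   # unaffected by the ratio of clean to stego images
    f_beta    = ((1 + beta ** 2) * precision * recall
                 / (beta ** 2 * precision + recall))
    accuracy  = (tp + tn) / float(tp + fp + fn + tn)
    return precision, recall, f_beta, accuracy

print(scalar_measures(tp=80, fp=10, fn=20, tn=90))   # made-up counts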

10.3 The Receiver Operating Curve

So far we have discussed the evaluation of discrete classifiers which make a hard decision about the class prediction. Many classifiers return, or can return, soft information, in terms of a classification score $s(I)$ for any object (image) $I$. The hard decision classifier will then consist of two components: a model, or heuristic function $s(\cdot)$, which returns the classification score, and a threshold $t$. The hard decision rule assigns one class for $s(I) < t$ and the other for $s(I) \ge t$. Classification scores close to the threshold indicate doubt. If $|s(I) - t|$ is large, the class prediction can be made with confidence.

The threshold $t$ can be varied to control the trade-off between the false negative rate $p_{\mathrm{FN}}$ and the false positive rate $p_{\mathrm{FP}}$. When we move the threshold, one will increase and the other will decrease. This means that a soft decision classifier defined by $s(\cdot)$ leaves us some flexibility, and we can adjust the error rates by varying the threshold. How can we evaluate just the model $s(\cdot)$ universally, without fixing the threshold?

Figure 10.3 shows an example of the trade-off controlled by the threshold $t$. We have plotted the detection rate $p_{\mathrm{D}}$ on the $y$-axis and the false positive rate $p_{\mathrm{FP}}$ on the $x$-axis. Every point on the curve is a feasible point, by choosing a suitable threshold. Thus we can freely choose the false positive rate, and the curve tells us the resulting detection rate, or vice versa. This curve is known as the receiver operating characteristic (ROC). We will discuss it further below, starting with how to calculate it empirically.

Figure 10.3 Sample ROC curves for the PEV-219 feature vector against 512 bytes F5 embedding. The test set is the same as the one used in Chapter 8


10.3.1 The libSVM API for Python

Although the command line tools include everything that is needed to train and test a classifier, more advanced analysis is easier by calling libSVM through a general-purpose programming language. The libSVM distribution includes two SVM modules, svm and svmutil.


Code Example 10.1 : Example of training and testing of a classifier using the libSVM API for Python. The training set is given as a list of labels labels1 and a list of feature vectors fvectors1, where each feature vector is a list of numeric values. The test set is similarly given by labels2 and fvectors2
import svm, svmutil            # modules shipped in the libSVM python directory

# Build the training problem from the labels and feature vectors
train   = svm.svm_problem( labels1, fvectors1 )
param   = svm.svm_parameter( )
# Train on the training set, then predict labels for the test set
model   = svmutil.svm_train( train, param )
(L,A,V) = svmutil.svm_predict( labels2, fvectors2, model )

Code Example 10.1 shows how training and testing of an SVM classifier can be done within Python. The output is a tuple $(L, A, V)$. We will ignore $A$, which includes the accuracy and other performance measures. Our main interest is the list of classification scores, $V$. The list $L$ of predicted labels is calculated from $V$, with a threshold $t = 0$. Thus a negative classification score represents one class, and a positive score the other.

It may be instructive to see the classification scores visualised as a histogram. Figure 10.4 gives an example of classification scores from two different classifiers using Shi et al.'s Markov features. Steganograms (black) and natural images (grey) are plotted separately in the histogram. A threshold is chosen as a point on the $x$-axis, so that scores to the left will be interpreted as negatives (natural images) and scores to the right as positives (steganograms). Grey bars to the right of the threshold point are false positives, while black bars to the left are false negatives.

Figure 10.4 Distribution of classification scores for an SVM steganalyser using Shi et al.'s Markov model features: (a) default libSVM parameters; (b) optimised SVM parameters


Clearly, we have a trade-off between false positives and false negatives. The further to the left we place our threshold, the more positives will be hypothesised. This results in more detections, and thus a lower false negative rate, but it also gives more false positives. In the histogram of Figure 10.4(b), it is fairly easy to choose a good threshold, almost separating the steganograms from the natural images. The steganalyser with default parameters, in Figure 10.4(a), has poor separation; regardless of where we place the threshold, there will be a lot of instances on the wrong side. In this case the trade-off is very important, and we need to prioritise between the two error rates.


Code Example 10.2 : Generating the ROC plot in Python. The input parameters are a list of classification labels ($\pm 1$) Y and a list of classification scores V
import numpy as np
import matplotlib.pyplot

def rocplot(Y,V):
   # Separate the classification scores by true class label
   V1 = [ V[i] for i in range(len(V)) if Y[i] == -1 ]
   V2 = [ V[i] for i in range(len(V)) if Y[i] == +1 ]
   # 250 equally spaced bins covering the whole range of scores
   (mn,mx) = ( min(V), max(V) )
   bins = [ mn + (mx-mn)*i/250 for i in range(251) ]
   (H1,b) = np.histogram( V1, bins )
   (H2,b) = np.histogram( V2, bins )
   # Relative, cumulative histograms for each class
   H1 = np.cumsum( H1.astype( float ) ) / len(V1)
   H2 = np.cumsum( H2.astype( float ) ) / len(V2)
   matplotlib.pyplot.plot( H1, H2 )

The ROC curve is also a visualisation of the classification scores. Let $V_1$ and $V_2$ be the sets of classification scores in testing of clean images and steganograms, respectively. We calculate the relative, cumulative histograms

$$H_1(t) = \frac{\#\{v \in V_1 : v \le t\}}{|V_1|}, \qquad H_2(t) = \frac{\#\{v \in V_2 : v \le t\}}{|V_2|}.$$

The ROC plot is then defined by plotting the points $(H_1(t), H_2(t))$ for varying values of $t$. Thus each point on the curve represents some threshold value $t$, and it displays the two corresponding empirical probability estimates on the respective axes. Code Example 10.2 illustrates the algorithm in Python.

It may be necessary to tune the bin boundaries to make a smooth plot. Some authors suggest a more brute-force approach to create the ROC plot. In our code example, we generate 250 points on the curve. Fawcett (2006) generates one plot point for every distinct classification score present in the training set. The latter approach will clearly give the smoothest curve possible, but it causes a problem with vector graphics output (e.g. PDF) when the training set is large, as every point has to be stored in the graphics file. Code Example 10.2 is faster, which hardly matters since the generation of the data set Y and V would dominate, and more importantly, the PDF output is manageable regardless of the length of V.

10.3.2 The ROC Curve

The ROC curve visualises the trade-off between detection rate and false positives. By changing the classification threshold, we can move up and down the curve. Lowering the threshold will lead to more positive classifications, increasing the true positives (detections) and the false positives alike.

ROC stands for ‘receiver operating characteristic’, emphasising its origin in communications research. The acronym is widely used within different domains, and the origin in communications is not always relevant to the context. It is not unusual to see the C interpreted as ‘curve’. The application of ROC curves in machine learning can be traced back at least to Swets (1988) and Spackman (1989), but it has seen an increasing popularity in the last decade. Ker (2004b) interpreted ROC as region of confidence, which may be a more meaningful term in our applications. The region under the ROC curve describes the confidence we draw from the classifier.

The ROC space represents every possible combination of FP and TP probabilities. Every discrete classifier corresponds to a point $(p_{\mathrm{FP}}, p_{\mathrm{D}})$ in the ROC space. The perfect, infallible classifier would be at the point $(0, 1)$. A classifier which always returns ‘stego’ regardless of the input would sit at $(1, 1)$, and one which always returns ‘clean’ at $(0, 0)$.

The diagonal line, from $(0, 0)$ to $(1, 1)$, corresponds to a naïve classifier which returns ‘stego’ with a certain probability independently of the input. This is the worst possible steganalyser. If we had a classifier $X$ operating under the main diagonal, i.e. where $p_{\mathrm{D}} < p_{\mathrm{FP}}$, we could easily convert it into a better classifier $X'$. We would simply run $X$ and let $X'$ return ‘stego’ if $X$ returns ‘non-stego’ and vice versa. The resulting error probabilities $p'_{\mathrm{FP}}$ and $p'_{\mathrm{FN}}$ of $X'$ would be given as

$$p'_{\mathrm{FP}} = 1 - p_{\mathrm{FP}}, \qquad p'_{\mathrm{FN}} = 1 - p_{\mathrm{FN}}.$$

This conversion reflects any point on the ROC curve around the main diagonal, and allows us to assume that any classifier of interest lies above the main diagonal.

This is illustrated in Figure 10.5, where the worthless classifier is shown as the dashed line. Two classifiers, with and without grid search, are compared, and we clearly see that the one with grid search is superior for all values. We have also shown the reflection of the inferior ROC curve around the diagonal. The size of the grey area can be seen as its improvement over the worthless classifier.

Figure 10.5 Sample ROC curves for the PEV-219 feature vector against 512 bytes F5 embedding. The test set is the same as the one used in Chapter 8. The upper curve is from libSVM with grid search optimised parameters, whereas the lower curve is for default parameters. The shaded area represents the Gini coefficient of the non-optimised steganalyser. The features have been scaled in both cases


Where discrete classifiers correspond to a point in ROC space, classifiers with soft output, such as SVM, correspond to a curve. Every threshold corresponds to a point in ROC space, so varying the threshold will draw a curve, from $(0, 0)$, corresponding to a threshold of $+\infty$, to $(1, 1)$, which corresponds to a threshold of $-\infty$.

The ROC curve for a soft output classifier is good at visualising the performance independently of the selected threshold. Figure 10.6 shows an example comparing different feature vectors. As we can see, we cannot consistently rank them based on the ROC curve. Two of the curves cross, so that different feature vectors excel at different FP probabilities. Which feature vector we prefer may depend on the use case.

Figure 10.6 ROC plot comparing different feature vectors on a 10% LSB$\pm$ embedding


One of the important properties of the ROC curve is that it is invariant under class skew. This follows because it is the collection of feasible points $(p_{\mathrm{FP}}, p_{\mathrm{D}})$, and both $p_{\mathrm{FP}}$ and $p_{\mathrm{D}}$ measure the performance for a single class of classifier input and consequently are invariant. We also note that classification scores are considered only as ordinal measures, that is, they rank objects in order of how plausibly they could be steganograms. The numerical values of the classification scores are irrelevant to the ROC plots, and thus we can compare ROC plots of different classifiers even if the classification scores are not comparable. This is important, as it allows us to construct ROC plots for arbitrary classifier functions $s(\cdot)$ without converting the scores into some common scale such as likelihoods or $p$-values. Any score which defines the same order will do.

10.3.3 Choosing a Point on the ROC Curve

There is a natural desire to search for one scalar heuristic to rank all classifiers according to their performance. What the ROC plot tells us is that there is no one such scalar. In Figure 10.6 we clearly see that one classifier may be superior at one point and inferior at another. Every point on the curve matters. Yet sometimes we need a scalar, and different authors have selected different points on the ROC curve for comparison. The following suggestions have been seen in the steganalysis literature, each with good theoretical reasons (Westfeld, 2007):

1. Fix $p_{\mathrm{FP}} = p_{\mathrm{FN}}$ (the equal error rate), and use this combined error probability as the performance heuristic.
2. Fix $p_{\mathrm{FP}}$ at some small value, and use $p_{\mathrm{D}}$ as the heuristic.
3. Fix $p_{\mathrm{D}} = 50\%$, and use $p_{\mathrm{FP}}$ as the heuristic.

Using equal error rates $p_{\mathrm{FP}} = p_{\mathrm{FN}}$ is very common in, for instance, biometrics, and has the advantage that the total error probability in (10.1) becomes independent of class skew.

Aiming at a low false positive probability is natural when one targets applications where it is more serious to falsely accuse the innocent than to fail to detect the guilty, which is the case in criminal law in most countries. Some early papers in steganalysis used zero false positives as their target point on the ROC curve. This would of course be great, with all innocent users being safe, if we could calculate it, but it is infeasible for two reasons. From a practical viewpoint, we should have realised long ago that no system is perfect, so even though we can get $p_{\mathrm{FP}}$ arbitrarily close to zero, we will never get equality. The theoretical viewpoint is even more absolute. Estimating probabilities close to 0 requires very large samples; as the probability $p_{\mathrm{FP}}$ to be estimated tends to zero, the required sample size will tend to infinity. The chosen point on the ROC curve has to be one which can be estimated with a reasonably small error bound. The particular small value chosen for $p_{\mathrm{FP}}$ is quite arbitrary, but it is a reasonable trade-off between a low FP probability and feasibility of estimation.

The last proposal, with $p_{\mathrm{D}} = 50\%$, advocated by Ker (2004a), may sound strange. Intuitively, we may find 50% detection unacceptable for most applications. However, if the secret communications extend over time, and the steganalyst acts repeatedly, the steganography will eventually be detected. The advantage of $p_{\mathrm{D}} = 50\%$ is that the threshold $t$ is then equal to the median of the classifier soft output $s(I)$ for steganograms, at which point the detection probability is independent of the variance of $s(I)$. This removes one unknown from the analysis, and less information is required about the statistical properties of steganograms.
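
A minimal sketch of this operating point: if we assume that high scores indicate steganograms, the threshold giving 50% detection is simply the median stego score, and the corresponding false positive rate can be read off directly.

# Sketch: the p_D = 50% operating point. Assuming that high scores indicate
# steganograms, the threshold is the median stego score, and the false positive
# rate is the fraction of clean scores above that threshold.
import numpy as np

def fp_at_50_percent_detection(clean_scores, stego_scores):
    t = np.median(stego_scores)                  # threshold giving 50% detection
    return np.mean(np.array(clean_scores) > t)   # empirical false positive rate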

10.3.4 Confidence and Variance

The ROC curves we learnt to draw in Section 10.3.1 are mere snapshots of performance on one single test set, just as a single observation of the sample accuracy $\hat{A}$ would be. There has been a tendency in the machine learning literature to overstate observations from such ROC plots (Fawcett, 2003), by treating better as better without distinguishing between approximately equal and significantly better. In order to make confident conclusions, we need a notion of standard errors or confidence intervals.

Variance and confidence intervals are defined for scalar quantities, and there is no single canonical way to generalise them to curves. However, since each point visualises two scalars, $\hat{p}_{\mathrm{FP}}$ and $\hat{p}_{\mathrm{D}}$, we can calculate confidence intervals or error bounds along each of the axes. This is shown in Figure 10.7, where the vertical and horizontal bars show error bounds for $\hat{p}_{\mathrm{D}}$ and $\hat{p}_{\mathrm{FP}}$ respectively. The error bounds are calculated using the standard estimator for the standard error of the binomial parameter, namely

$$\hat{\sigma} = \sqrt{\frac{\hat{p}(1 - \hat{p})}{N}},$$

where $N$ is the sample size. Note that there is no correspondence between points on the different curves. Each point corresponds to a given classifier threshold, but since classification scores are not comparable between different soft output classifiers, the thresholds are not comparable either.

Figure 10.7 Example of ROC plot marking confidence intervals for $\hat{p}_{\mathrm{FP}}$ and $\hat{p}_{\mathrm{D}}$ for some thresholds


PythonTip 10.1
The plot is generated using the matplotlib.pyplot.errorbar function.
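
A minimal sketch of how such error bars might be added, assuming arrays of false positive and detection rates together with the number of clean images and steganograms used to estimate them:

# Sketch: ROC points with error bars along both axes. fp and det are arrays of
# estimated false positive and detection rates; n_clean and n_stego are the
# sample sizes used to estimate them.
import numpy as np
import matplotlib.pyplot as plt

def roc_errorbars(fp, det, n_clean, n_stego):
    fp, det = np.asarray(fp), np.asarray(det)
    xerr = np.sqrt(fp * (1 - fp) / n_clean)      # binomial standard error of p_FP
    yerr = np.sqrt(det * (1 - det) / n_stego)    # binomial standard error of p_D
    plt.errorbar(fp, det, xerr=xerr, yerr=yerr)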

In the example, we can observe that the confidence intervals are narrow and far from reaching the other curve, except towards the extreme ends of the curve. This method of plotting error bars gives a quick and visual impression of the accuracy of the ROC plot, and is well worth including.

Several authors have studied confidence intervals for ROC curves, and many methods exist. Fawcett (2003) gives a broad discussion of the problem and of the importance of considering the variance. Macskassy and Provost (2004) discuss methods to calculate confidence bands for the ROC curve. It is common to use cross-validation or bootstrapping to generate multiple ROC curves, rather than the simple method we described above, but a further survey of this topic is beyond the scope of this book.

10.3.5 The Area Under the Curve

There is one well-known alternative to picking a single point on the ROC curve, namely measuring the area under the curve (AUC), that is, integrating the ROC curve over the interval $[0, 1]$:

$$\mathrm{AUC} = \int_0^1 y(x)\, \mathrm{d}x,$$

where we consider the detection rate $y$ as a function of the false positive rate $x$, as visualised by the ROC curve. The worst possible classifier, namely the random guessing represented by a straight-line ROC curve from $(0, 0)$ to $(1, 1)$, has an AUC of $1/2$. No realistic classifier should be worse than this. Hence the AUC ranges from $1/2$ to $1$.

It may seem more natural to scale the heuristic so that the worst possible classifier gets a value of 0. A popular choice is the so-called Gini coefficient (Fawcett, 2003), which is equivalent to the reliability measure that Fridrich (2005) defined in the context of steganalysis. The Gini coefficient can be defined as

$$G = 2 \cdot \mathrm{AUC} - 1.$$

The $G$ measure corresponds to the shaded area in Figure 10.5. Clearly, the perfect classifier has $G = 1$ and random guessing has $G = 0$.

The AUC also has another important statistical property. If we draw a random steganogram $I_{\mathrm{s}}$ and a random clean image $I_{\mathrm{c}}$, and let $s(I_{\mathrm{s}})$ and $s(I_{\mathrm{c}})$ denote their respective classification scores, then

$$\mathrm{AUC} = P(s(I_{\mathrm{s}}) > s(I_{\mathrm{c}})).$$

In other words, the AUC gives us the probability of correct pairwise ranking of objects from different classes.
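
This property can be checked directly by pairwise comparison of the scores, as in the following sketch (ties are counted as one half, a common convention not discussed above):

# Sketch: estimating the AUC by pairwise comparison of classification scores,
# counting ties as one half.
def pairwise_auc(clean_scores, stego_scores):
    wins = sum(1.0 for s in stego_scores for c in clean_scores if s > c)
    ties = sum(0.5 for s in stego_scores for c in clean_scores if s == c)
    return (wins + ties) / (len(stego_scores) * len(clean_scores))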

Westfeld (2007) compared the stability of the four scalar heuristics we have discussed; that is, he sought the measure with the lowest variation when the same simulation is repeated. He used synthetic data to simulate the soft output from the classifier, assuming either a Gaussian or a Cauchy distribution. He used 1000 samples each for stego and non-stego, and repeated the experiment to produce 100 000 ROC curves. Westfeld concluded that the error rate at the equal error rate point ($p_{\mathrm{FP}} = p_{\mathrm{FN}}$) and the Gini coefficient $G$ were very close in stability, one being the most stable for the Cauchy distribution and the other for the Gaussian distribution. The false positive rate $p_{\mathrm{FP}}$ at 50% detection was only slightly inferior to the first two. However, the detection rate at a fixed, low false positive rate was significantly less stable. This observation is confirmed by our Figure 10.7, where we can see a high variance for the detection rate at low false positive rates, and a low variance for $p_{\mathrm{FP}}$ when the detection rate is high.


Code Example 10.3 : Calculating the AUC. The input parameters are lists of the coordinates of the points on the curve, that is the H1 and H2 variables calculated in Code Example 10.2
def auc(H1,H2):
   # Trapezoidal rule: average height of each segment times its width
   heights = [ (a+b)/2 for (a,b) in zip(H1[1:],H1[:-1]) ]
   widths  = [ (a-b) for (a,b) in zip(H2[1:],H2[:-1]) ]
   areas   = [ h*w for (h,w) in zip(heights,widths) ]
   return sum(areas)

The AUC can be calculated using the trapezoidal rule for numerical integration. For each pair of adjacent points on the ROC curve, $(x_i, y_i)$ and $(x_{i+1}, y_{i+1})$, we calculate the area of the trapezoid with corners at $(x_i, 0)$, $(x_i, y_i)$, $(x_{i+1}, y_{i+1})$ and $(x_{i+1}, 0)$. The area under the curve is (roughly) made up of all the disjoint trapezoids, and we simply add them together. This is shown in Code Example 10.3.

10.4 Experimental Methodology

Experiments with machine learning can easily turn into a massive data management problem. Evaluating a single feature vector, a single stego-system and one cover source is straightforward and gives little reason to think about methodology. A couple of cases can be done one by one manually, but once one starts to vary the cover source, the stego-system and the message length, and to introduce numerous variations of the feature vectors, manual work becomes both tedious and error-prone. There are many tests to be performed, reusing the same data in different combinations. A robust and systematic framework to generate, store and manage the many data sets is required.

Feature extraction is the most time-consuming part of the job, and we clearly do not want to redo it unnecessarily. Thus, a good infrastructure for performing the feature extraction and storing feature data is essential. This is a complex problem, and very few authors have addressed it. Below, we will discuss the most important challenges and some of the solutions which have been reported in the literature.

10.4.1 Feature Storage

The first question is how to store the feature vectors. The most common approach is to use text files, like the sparse format used by the libSVM command line tools. Text files are easy to process and portable between programs. However, text files easily become cumbersome when one takes an interest in individual features and sets out to test variations of feature vectors. Normally, one will only be able to refer to an individual feature by its index, which may change when new features are added to the system, and it is easy to get confused about which index refers to which feature. A data structure that can capture metadata and give easy access to sub-vectors and individual features is therefore useful.

In the experiments conducted in preparation for this book, the features were stored as Python objects, serialised to binary files using the pickle module. This solution is also very simple to program, and the class can be furnished with any kind of useful metadata, including names and descriptions of individual features. It is relatively straightforward to support extraction of sub-vectors. Thus one object can contain all known features, and the different feature vectors of interest can be extracted when needed.
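
A minimal sketch of this kind of storage, with the metadata reduced to an image name and a list of feature names; the record layout here is an illustration, not the class actually used for the book's experiments.

# Sketch: storing one image's features together with descriptive metadata
# using pickle. The record layout is an illustration only.
import pickle

def save_features(filename, image_name, feature_names, feature_values):
    record = {'image': image_name, 'names': feature_names, 'values': feature_values}
    with open(filename, 'wb') as f:
        pickle.dump(record, f)

def load_features(filename):
    with open(filename, 'rb') as f:
        return pickle.load(f)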

There are problems with this approach. Firstly, data objects which are serialised and stored using standard libraries, like pickle, will often give very large files, and even though disk space is cheap, reading and writing to such files may be slow. Secondly, when new features are added to the system, all the files need to be regenerated.

Ker (2004b) proposed a more refined and flexible solution. He built a framework using an SQL database to store all the features. This would tend to take more time and/or skill to program, but gives the same flexibility as the Python objects, and on top of that, it scales better. Ker indicated that more than several hundred million feature vector calculations were expected. Such a full-scale database solution also makes it easy to add additional information about the different objects, and one can quickly identify missing feature vectors and have them calculated.

It may be necessary to store the seeds used for the pseudo-random number generators, to be able to reproduce every object of a simulation exactly. It is clearly preferable to store just the seeds that are used to generate random keys and random messages, rather than storing every steganogram, because the image data would rapidly grow out of hand when the experiments are varied.

10.4.2 Parallel Computation

The feature extraction in the tests we made for Chapters 4–9 took about two minutes per spatial image, and more for JPEG. With 5000 images, this amounts to nearly a CPU-week or more. Admittedly, the code could be made more efficient, but programmer time is more expensive than CPU time. The solution is to parallelise the computation and use more CPUs to reduce the running time.

Computer jobs can be parallelised in many different ways, but it always comes down to one general principle. The work has to be broken into small chunks, so that multiple chunks can be done independently on different CPU cores. Many chunks will depend on others and will only be able to start when they have received data from other, completed chunks. The differences between paradigms stem from differences in the chunk sizes and the amount of data which must be exchanged when each parallel set of chunks is complete.

For a steganalysis system, a chunk could be the feature extraction from one image. The chunks are clearly independent, so there is no problem in doing them in parallel, and we are unlikely to have more CPU cores than images, so there is no gain in breaking the problem into smaller chunks. This makes parallelism very simple. We can just run each feature extraction as an independent process, dumping the result to file. Thus the computation can easily be distributed over a large network of computers. Results can be accumulated either by using a networked file system, or by connecting to a central database server to record them.
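
On a single multi-core machine, the same per-image parallelism can be sketched with the standard multiprocessing module; extract_features is a placeholder for whatever extraction routine is actually used.

# Sketch: distributing feature extraction over the cores of a single machine.
# extract_features is a placeholder, taking an image file name and returning
# a feature vector.
from multiprocessing import Pool

def extract_all(extract_features, image_list, processes=4):
    pool = Pool(processes)
    try:
        return pool.map(extract_features, image_list)   # one independent job per image
    finally:
        pool.close()
        pool.join()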

The simplest way to run a large series of jobs on multiple computers is to use a batch queuing system like PBS, Torque or Condor. The different jobs, for instance one feature extraction job per image, can be generated and submitted, and the system will distribute the work between computers so that the load is optimally balanced. Batch queue systems are most commonly found on clusters and supercomputers tailored for number crunching, but they can also run on heterogeneous networks. It is quite feasible to install a batch queue system in a student lab or office network to take advantage of idle time out of hours.

There may be good reasons to create a special-purpose queuing system for complex experiments. This can be done using a client–server model, where the server keeps track of which images have been processed and the clients prompt the server for parameters to perform feature extraction. Ker (2004b) recommended a system in which queued jobs were stored in the same MySQL database where the results were stored. This has the obvious advantage that the same system is used to identify missing data and to queue the jobs to generate them.

It is quite fortunate that feature extraction is both the most time-consuming part of the experiment and also simple to parallelise across independent processes. Thus, most of the reward can be reaped with the simple steps above. It is possible to parallelise other parts of the system, but it may require more fine-grained parallelism, using either shared memory (e.g. OpenMP) or message passing (e.g. MPI). Individual matrix operations may be speeded up by running constituent scalar operations in parallel, and this would allow us both to speed up feature extraction for an individual image for faster interactive use, and also to speed up SVM or other classifiers. Unless one can find a ready-made parallel-processing API to do the job, this will be costly in man hours, but when even laptop CPUs have four cores or more, this might be the way the world is heading. It can be very useful to speed up interactive trial-and-error exercises.

Not all programming languages are equally suitable for parallelisation. The C implementation of Python currently employs a global interpreter lock (GIL), which prevents multiple threads within one Python process from executing bytecode simultaneously; thread-based computation therefore cannot be parallelised, although separate processes are unaffected. The justification for the GIL is that it avoids the overhead of preventing parallel code segments from accessing the same memory. The Java implementation, Jython, does not have this limitation, but then it does not have the same libraries, such as numpy.

10.4.3 The Dangers of Large-scale Experiments

There are some serious pitfalls when one relies on complex series of experiments. Most importantly, if you run enough experiments you will always get some interesting, surprising – and false – results. Consider, for instance, the simple case of estimating the accuracy, where we chose to quote twice the standard error to get approximately 95% error bounds. Running one test, we feel rather confident that the true accuracy will be within the bound, and in low-risk applications this is quite justified.

However, if we run 20 tests, we would expect to miss the 95% error bound almost once on average. Running a thousand tests, we should miss the error bound 46 times on average. Most of them will be a little off, but on average one will fall outside three times the standard error. Even more extreme results exist, and if we run thousands of tests looking for a surprising result, we will find it, even if no such result is correct. When many tests are performed in the preparation of an analysis, one may require a higher level of confidence in each test. Alternatively, one can confirm results of particular interest by running additional experiments using a larger sample.
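
The effect is easy to simulate. The following sketch assumes a true accuracy of $p = 0.8$ and test sets of 1000 images, and counts how often the two-standard-error bound misses the truth over 1000 repeated, independent tests:

# Sketch: how often a two-standard-error bound misses the true accuracy,
# for an assumed true accuracy p = 0.8, test sets of n = 1000 images and
# 1000 repeated, independent tests.
import numpy as np

rng = np.random.default_rng(1)
p, n, tests = 0.8, 1000, 1000
p_hat = rng.binomial(n, p, size=tests) / float(n)
se = np.sqrt(p_hat * (1 - p_hat) / n)
misses = np.sum(np.abs(p_hat - p) > 2 * se)
print(misses)   # typically in the region of 45 out of 1000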

The discussion above, of course, implicitly assumes that the tests are independent. In practice, we would typically reuse the same image database for many tests, creating dependencies between the tests. The dependencies will make the analysis less transparent. Quite possibly, there are enough independent random elements to make the analysis correct. Alternatively, it is possible that extreme results will not occur as often, but will occur in larger numbers when they occur at all.

To conclude, when large suites of experiments are used, care must be taken to ensure the level of confidence. The safest simple rule of thumb is probably to validate surprising results with extra, independent tests with a high significance level.

10.5 Comparison and Hypothesis Testing

In Chapter 3 we gave some rough guidelines on how different two empirical accuracies have to be in order to be considered different. We suggested that accuracies within two standard errors of each other should be considered (approximately) equal.

A more formal framework to compare two classifiers would be to use a statistical hypothesis test. We remember that we also encountered hypothesis testing for the pairs-of-values test in Chapter 2, and it has been used in other statistical steganalysers as well. It is in fact quite common and natural to phrase the steganalysis problem as a statistical hypothesis test. In short, there are many good reasons to spend some time on hypothesis testing as a methodology.

10.5.1 The Hypothesis Test

A common question to ask in steganalysis research is this: does steganalyser $X$ have a better accuracy than steganalyser $Y$? This is a question which is well suited for a hypothesis test. We have two hypotheses:

$$H_0: p_X = p_Y \qquad \text{and} \qquad H_1: p_X \ne p_Y,$$

where $p_X$ and $p_Y$ denote the accuracy of steganalyser $X$ and $Y$, respectively. Very often the experimenter hopes to demonstrate that a new algorithm $X$ is better than the current state of the art. Thus $H_0$ represents the default or negative outcome, while $H_1$ denotes, in a sense, the desired or positive outcome. The hypothesis $H_0$ is known as the null hypothesis.

Statistical hypothesis tests are designed to give good control of the probability of a false positive. We want to ensure that the evidence is solid before we promote the new algorithm. Hence, we require a known probability model under the condition that $H_0$ is true, so that we can calculate the probability of getting a result leading us to reject $H_0$ when we should not. It is less important to have a known probability model under $H_1$, although we would need it if we also want to calculate the false negative probability.

The test is designed ahead of time, before we look at the data, in such a way that the probability of falsely rejecting $H_0$ is bounded. The test is staged as a contest between the two hypotheses, with clearly defined rules. Once the contest is started there is no further room for reasoning or discussion. The winning hypothesis is declared according to the rules.

In order to test the hypothesis, we need some observable random variable which will have different distributions under $H_0$ and $H_1$. Usually, that is an estimator for the parameters which occur in the hypothesis statements. In our case the hypothesis concerns the accuracies $p_X$ and $p_Y$, and we will observe the empirical accuracies $\hat{p}_X$ and $\hat{p}_Y$. If $\hat{p}_X \approx \hat{p}_Y$, then $H_0$ is plausible. If $\hat{p}_X \gg \hat{p}_Y$, it is reasonable to reject $H_0$. The question is, how much larger ought $\hat{p}_X$ to be before we can conclude that $X$ is a better steganalyser?

10.5.2 Comparing Two Binomial Proportions

The accuracy is the success probability of a Bernoulli trial in the binomial distribution, also known as the binomial proportion. There is a standard test to compare the proportions of two different binomial distributions, such as those of two different classifiers $X$ and $Y$. To formulate the test, we rephrase the null hypothesis as

$$H_0: p_X - p_Y = 0.$$

Thus we can consider the single random variable $\hat{p}_X - \hat{p}_Y$, rather than two variables.

If the test set is at least moderately large, then $\hat{p}_X - \hat{p}_Y$ is approximately normally distributed. The expectation and variance of $\hat{p}_X - \hat{p}_Y$ are given as follows:

$$E(\hat{p}_X - \hat{p}_Y) = p_X - p_Y, \qquad \mathrm{Var}(\hat{p}_X - \hat{p}_Y) = \frac{p_X(1 - p_X)}{N_X} + \frac{p_Y(1 - p_Y)}{N_Y},$$

where $N_X$ and $N_Y$ are the sizes of the test sets used for $X$ and $Y$. Under $H_0$, where $p_X = p_Y = p$, the variance can be estimated as

$$\widehat{\mathrm{Var}}(\hat{p}_X - \hat{p}_Y) = \hat{p}(1 - \hat{p}) \left( \frac{1}{N_X} + \frac{1}{N_Y} \right), \qquad \text{where } \hat{p} = \frac{N_X \hat{p}_X + N_Y \hat{p}_Y}{N_X + N_Y}.$$

Thus we have a known probability distribution under $H_0$, as required. Under $H_0$ we expect $\hat{p}_X - \hat{p}_Y \approx 0$, while a large value of $|\hat{p}_X - \hat{p}_Y|$ indicates that $H_0$ is less plausible. For instance, we know that the probability that $|\hat{p}_X - \hat{p}_Y|$ exceeds twice its standard deviation is less than $5\%$. In other words, if we decide to reject $H_0$ whenever $|\hat{p}_X - \hat{p}_Y|$ is more than two standard deviations from zero, we know that the probability of making a false positive is less than $5\%$. The probability of falsely rejecting the null hypothesis is known as the significance level of the hypothesis test, and it is the principal measure of the reliability of the test.

We can transform the random variable $\hat{p}_X - \hat{p}_Y$ into a variable with standard normal distribution under $H_0$, by dividing by the standard deviation. We define

$$Z = \frac{\hat{p}_X - \hat{p}_Y}{\sqrt{\widehat{\mathrm{Var}}(\hat{p}_X - \hat{p}_Y)}}, \qquad (10.2)$$

which has zero mean and unit variance under the null hypothesis. The standard normal distribution makes the observed variable easier to handle, as it can be compared directly to tables for the normal distribution.
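
A small sketch of the computation, using the pooled variance estimate given above; the function and its arguments are ours, not part of libSVM.

# Sketch: the test statistic of (10.2) from two empirical accuracies, using the
# pooled variance estimate under the null hypothesis.
from math import sqrt

def z_statistic(acc_x, acc_y, n_x, n_y):
    pooled = (n_x * acc_x + n_y * acc_y) / float(n_x + n_y)
    var = pooled * (1 - pooled) * (1.0 / n_x + 1.0 / n_y)
    return (acc_x - acc_y) / sqrt(var)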

So far we have focused on the null hypothesis $H_0$, and its negation has implicitly become the alternative hypothesis:

$$H_1: p_X \ne p_Y.$$

This hypothesis makes no statement as to which of $X$ or $Y$ might be the better, and we will reject $H_0$ regardless of which algorithm is better, as long as they do not have similar performance.

Very often, we want to demonstrate that a novel classifier $X$ is superior to an established classifier $Y$. If $\hat{p}_X \ll \hat{p}_Y$, it may be evidence that $H_0$ is implausible, but it hardly proves the point we want to make. In such a case, the alternative hypothesis should naturally be

$$H_1: p_X > p_Y.$$

Which alternative hypothesis we make has an impact on the decision criterion. We discussed earlier that in a test of $H_0$ against $H_1: p_X \ne p_Y$ at a significance level of about $5\%$, we would reject the null hypothesis if $|Z| > 2$. If we test $H_0$ against $H_1: p_X > p_Y$ however, we do not reject the null hypothesis for negative $Z$, no matter how large $|Z|$ is. At a significance level of $5\%$ we should reject $H_0$ if $Z > 1.64$, since $P(Z > 1.64) \approx 5\%$.


Example 10.5.1
Suppose we want to check if Cartesian calibration, as discussed in Chapter 9, improves the performance of the Pevný features. That is, we want to compare CCPEV-438 (classifier $X$) with NCPEV-219 (classifier $Y$). The hypotheses are given as

$$H_0: p_X = p_Y \qquad \text{and} \qquad H_1: p_X > p_Y.$$

We observe the random variable $Z$ as defined in (10.2). Let's choose a significance level of $5\%$ and test the algorithms on a sample of 4000 images. The experiment returns the two empirical accuracies, and we calculate

images/c10_I0319.gif

We can look up the value $z$ such that $P(Z \le z) = 95\%$ for a standard normally distributed variable $Z$ in Python:
In [34]: import scipy.stats

In [35]: N = scipy.stats.norm()

In [36]: N.ppf( [ 0.95 ] )
Out[36]: array([ 1.64485363])

Here N is defined as a standard normal distribution and ppf() is the inverse cumulative distribution function. Thus the critical value is $z \approx 1.645$; since the observed $Z$ exceeds it, we can reject $H_0$ at $5\%$ significance. We conclude that Cartesian calibration improves the Pevný features.

10.6 Summary

A much too frequent pitfall in steganalysis is to exaggerate conclusions based on overly simplified heuristics. It would be useful to re-evaluate many well-known stego-systems, aiming to get a more balanced, complete and confident assessment. Most importantly, a notion of confidence, or statistical significance, is required. This is well understood in statistics, and in many disciplines it would be unheard of to publish results without an assessment of error bounds and significance. In steganalysis there are still open problems where the original authors have cut corners.

Furthermore, one needs to be aware that different applications will require different trade-offs between false positives and false negatives, and knowledge of the full ROC curve is often necessary. We also know that classifier performance will vary depending on message lengths and cover sources. To get a good picture of field performance, more than just a few combinations must be tried. We will return to these questions in Chapter 14.
