© Umberto Michelucci 2018
Umberto Michelucci, Applied Deep Learning, https://doi.org/10.1007/978-1-4842-3790-8_6

6. Metric Analysis

Umberto Michelucci1 
(1)
toelt.ai, Dübendorf, Switzerland
 

Let’s consider the problem we analyzed in Chapter 3, for which we performed classification on the Zalando dataset. While doing all our work, we made a strong assumption without explicitly saying it: we assumed that all the observations were correctly labeled. We cannot say that with certainty. To perform the labeling, some manual intervention was needed, and, therefore, a certain number of images were surely wrongly classified, as humans are not perfect. This is an important realization. Consider the following scenario: in Chapter 3, we achieved roughly 90% accuracy with our model. One could try to get better and better accuracy, but when is it sensible to stop trying? If your labels are wrong in 10% of the cases, your model, as sophisticated as it may be, will never be able to generalize to new data with very high accuracy, because it will have learned the wrong classes for many images. We spent quite some time checking and preparing the training data, normalizing it, for example, but we never spent any time checking the labels themselves. We also assumed that all classes have similar characteristics. (I will discuss later in this chapter what exactly this means; for the moment, an intuitive understanding of the concept will suffice.) What if the quality of the images for specific classes is worse than for others? What if the number of pixels whose gray value differs from zero is dramatically different for different classes? We also did not check if some images are completely blank. What happens in that case? As you can imagine, we cannot check all the images manually to detect such issues. If we have millions of images, a manual analysis is surely not possible.

We need a new weapon in our arsenal to be able to spot such cases and to be able to tell how a model is doing. This new weapon is the focus of this chapter, and it is what I call “metric analysis.” Very often, people in the field refer to this array of methods as “error analysis.” I find that this name is very confusing, especially for beginners. Error may refer to too many things: Python code bugs, errors in the methods, in algorithms, errors in the choice of optimizers, and so on. You will see in this chapter how to obtain fundamental information on how your model is doing and how good your data is. We will do this by evaluating your optimizing metric on a set of different datasets that you can derive from your data.

You have already seen a basic example previously. You will remember that we discussed, with regard to regression, how in the case of MSEtrain ≪ MSEdev, we are in a regime of overfitting. Our metric is the MSE (mean squared error), and evaluating it on two datasets, training and dev, and comparing the two values, can inform you whether the model is overfitting. I will expand on this methodology in this chapter, to allow you to extract much more information from your data and model.

Human-Level Performance and Bayes Error

In most of the datasets that we use for supervised learning, someone must have labeled the observations. Take, for example, a dataset in which we have images that are classified. If we ask people to classify all images (imagine this being possible, regardless of the number of images), the accuracy obtained will never be 100%. Some images may be too blurry to be classified correctly, and people make mistakes. If, for example, 5% of the images are too blurry to be classified correctly, we must expect that the maximum accuracy people can reach will be at most 95%.

Let’s consider a classification problem. First, let’s define what we mean by the word error. In this chapter, the word error will be used to indicate the following quantity, represented by ϵ:
$$ \epsilon \equiv 1 - Accuracy $$

For example, if, with a model, we achieve an accuracy of 95%, we will have ϵ = 1 − 0.95 = 0.05 or, expressed as a percent, ϵ = 5%.

A useful concept to understand is human-level performance, which can be defined as follows:

Human-level performance (definition 1): The lowest value for the error ϵ that can be achieved by a person performing the classification task. We will indicate it with ϵhlp.

Let’s devise a concrete example. Suppose we have a set of 100 images. Now let’s suppose we ask three people to classify the 100 images. Imagine that they obtain 95%, 93%, and 94% accuracy. In this case, the human-level performance error will be ϵhlp = 1 − 0.95 = 5%. Note that someone else may be much better at this task, and, therefore, it is important to consider that the value of ϵhlp we get is always an estimate and should only serve as a guideline.

Now let’s complicate things a bit. Suppose we are working on a problem in which doctors classify MRI scans in two classes: with signs of cancer and without. Now let’s suppose we calculate ϵhlp from the results of untrained students obtaining 15%, from doctors with a few years of experience obtaining 8%, from experienced doctors obtaining 2%, and from experienced groups of doctors obtaining 0.5%. What then is ϵhlp? You should always choose the lowest value you can get, for reasons I will discuss later.

We can now expand the definition of ϵhlp with a second definition.

Human-level performance (definition 2): The lowest value for the error ϵ that can be achieved by people or groups of people performing the classification task.

Note

You don’t have to decide which definition is right. Just use the one that gives you the lowest value of ϵhlp.

Now I’ll talk a bit about why we must choose the lowest value we can get for ϵhlp. Suppose that of the 100 images, 9 are too blurry to be correctly classified. This means that the lowest error any classifier will be able to reach is 9%. The lowest error that can be reached by any classifier is called the Bayes error . We will indicate this with ϵBayes. In this example, ϵBayes = 9%. Usually, ϵhlp is very close to ϵBayes, at least in tasks at which humans excel, such as image recognition. It is commonly said that human-level performance error is a proxy for the Bayes error. Normally it is impossible or very hard to know ϵBayes, and, therefore, practitioners use ϵhlp assuming the two are close, because the latter is easier (relatively) to estimate.

Keep in mind that it makes sense to compare the two values and assume that ϵhlp is a proxy for ϵBayes only if persons (or groups of persons) perform classification in the same way as the classifier. For example, it is OK if both use the same images to do classification. But, in our cancer example, if the doctors use additional scans and analysis to diagnose cancer, the comparison is no longer fair, because human-level performance will not be a proxy for a Bayes error anymore. Doctors, having more data at their disposal, clearly will be more accurate than the model that has as input only the images at its disposal.

Note

ϵhlp and ϵBayes are close to each other only in cases in which the classification is done in the same way by humans and by the model. So, always check that this is the case before assuming that human-level performance is a proxy for the Bayes error.

Something else that you will notice when working on models is that with relatively little effort, you can achieve a quite low rate of error and often (almost) reach ϵhlp. After passing human-level performance (and, in several cases, that is possible), progress tends to be very, very slow, as is illustrated in Figure 6-1.
../images/463356_1_En_6_Chapter/463356_1_En_6_Fig1_HTML.jpg
Figure 6-1

Typical values of accuracy that can be achieved vs. amount of time invested. At the beginning, it is relatively easy to achieve quite a good accuracy with machine learning, often reaching ϵhlp. This is intuitively indicated by the line in the plot. After that point, the progress tends to be very slow.

As long as the error of your algorithm is bigger than ϵhlp, you can use the following techniques to get better results:
  • Get better labels from humans or groups, for example, from groups of doctors, as in the case of medical data in our example.

  • Get more labeled data from humans or groups.

  • Do a good metric analysis to determine the best strategy for getting better results. You will learn how to do this in this chapter.

As soon as your algorithm exceeds human-level performance, you cannot rely on those techniques anymore. So, it is important to get an idea of those numbers, to decide what to do to obtain better results. Taking our example of MRI scans, we could get better labels by relying on sources that are not related to humans, for example, checking diagnoses a few years after the date of the MRI, when it is usually clear whether a patient has developed cancer. Or, for example, in the case of image classification, you may decide to take a few thousand images of specific classes yourself. This is not usually possible, but I wanted to make the concept clear: you can get labels by means other than by asking humans to perform the same kind of task that your algorithm is performing.

Note

Human-level performance is a good proxy for Bayes error for tasks at which humans excel, such as image recognition. For tasks that humans are very bad at, performance can be very far from the Bayes error.

A Short Story About Human-Level Performance

I want to tell you a story about the work that Andrej Karpathy has done while trying to estimate human-level performance in a specific case. You can read the entire story on his blog post (a long post, but one that I suggest you read) at https://goo.gl/iqCbC0 . Let me summarize what he did, since it is extremely instructive concerning what human-level performance really is. Karpathy was involved in the ILSVRC (ImageNet Large Scale Visual Recognition Challenge) in 2014 ( https://goo.gl/PCHWMJ ). The task was made up of 1.2 million images (training set) classified in 1000 categories, including animals, abstract objects such as a spiral, scenes, and many more. Results were evaluated on a dev dataset. GoogLeNet (a model developed by Google) reached an astounding 6.7% error. Karpathy wondered how humans would compare.

The question is a lot more complicated than it may seem at first sight. Because the images were all classified by humans, shouldn’t ϵhlp = 0%? Well, actually, no. In fact, the images were first obtained with a web search, then they were filtered and labeled by asking people binary questions, for example, Is this a hook or not? The images were collected, as Karpathy mentions in his blog post, in a binary way. People were not asked to assign to each image a class, choosing from the 1000 available, as the algorithms were doing. You may think that this is a technicality, but the difference in how the labeling occurs makes the correct evaluation of ϵhlp quite a complicated matter. So, Karpathy set to work and developed a web interface that consisted of an image on the left, and the 1000 classes with examples on the right. You can see an example of the interface in Figure 6-2. You can try the interface (and I suggest you do so) at https://goo.gl/Rh8S6g , to understand how complicated such a task is. People trying the interface kept missing classes and making mistakes. The best error that was reached was about 15%. So, Karpathy did what every scientist at some point in his/her career must do: he bored himself to death and did a careful annotation himself, sometimes requiring 20 minutes for a single image. As he states in his blog post, he did it only #forscience. He was able to reach a stunning ϵhlp = 5.1%, 1.7% better than the best algorithm at the time. He listed sources of errors to which GoogLeNet is more susceptible than humans, such as problems with multiple objects in an image, and sources of errors to which humans are more susceptible than GoogLeNet, such as problems with classes with a huge granularity (dogs are classified in 120 different subclasses, for example).
../images/463356_1_En_6_Chapter/463356_1_En_6_Fig2_HTML.png
Figure 6-2

Web interface developed by Karpathy. Not everyone would find it amusing to look at 120 breeds of dogs, to try to classify the dog on the left (which, by the way, is a Tibetan mastiff).

If you have a few hours to spare, I suggest you try. You will gain a whole new appreciation of the difficulties of evaluating human-level performance. Defining and evaluating human-level performance is a very tricky task. It is important to understand that ϵhlp is dependent on how humans approach the classification task, which is dependent on the time invested, the patience of the persons performing the task, and on many factors that are difficult to quantify. The main reason for it being so important, apart from the philosophical aspect of knowing when a machine becomes better than humans, is that it is often taken as a proxy for the Bayes error, which gives a lower limit of our possibilities.

Human-Level Performance on MNIST

Before moving on to the next subject, I would like to give you another example of human-level performance on a dataset we have analyzed together: the MNIST dataset. Human-level performance has been widely analyzed, and it has been found that ϵhlp = 0.2%. (You can read a good review on the subject by Dan Cireşan: “Multi-column Deep Neural Networks for Image Classification,” Technical Report No. IDSIA-04-12, Dalle Molle Institute for Artificial Intelligence, https://goo.gl/pEHZVB .) Now you may wonder why a human cannot achieve 100% accuracy classifying simple digits; see Figure 6-3 and attempt to identify which digits are in the images. I surely cannot. You may, therefore, better understand why ϵhlp = 0% is not possible and why a person cannot achieve 100% accuracy. Other reasons may be related to the culture people come from. In some countries, for example, the digit seven is written in a way very similar to a one, so mistakes can easily be made. In other countries, the digit seven has a small dash along the vertical bar, making it easier to distinguish from a one.
../images/463356_1_En_6_Chapter/463356_1_En_6_Fig3_HTML.jpg
Figure 6-3

A set of digits from the MNIST dataset that are almost impossible to recognize. Such examples are one of the reasons why ϵhlp cannot be zero.

Bias

Now let’s start with a metric analysis: a set of procedures that will give you information on how your model is doing and how good or bad your data is, by evaluating your optimizing metric on different datasets.

Note

Metric analysis consists of a set of procedures that will give you information on how your model is doing and how good or bad your data is, by evaluating your optimizing metric on different datasets.

To start, we must first define a third error: the one evaluated on the training dataset, indicated with ϵtrain.

The first question we want to answer is whether our model is flexible or complex enough to reach human-level performance. In other words, we want to know if our model has a high bias with respect to human-level performance.

To answer the previous question, we can do the following: calculate the error of our model on the training dataset, ϵtrain, and then calculate |ϵtrain − ϵhlp|. If this number is not small (bigger than a few percent), we are in the presence of bias (sometimes called avoidable bias); that is, our model is too simple to capture the real subtleties of our data.

Let’s define the following quantity
$$ \Delta\epsilon_{Bias} = \left| \epsilon_{train} - \epsilon_{hlp} \right| $$
The bigger ΔϵBias is, the more bias our model has. In this case, you want to do better on the training set, because you know you can do better on your training data. (We will look at the problem of overfitting in a moment.) The following techniques work to reduce bias:
  • Bigger networks (more layers or neurons)

  • More complex architectures (convolutional neural networks, for example)

  • Training your model longer (for more epochs)

  • Using better optimizers (such as Adam)

  • Doing a better hyperparameter search (covered in Chapter 7)

There is something else you need to understand. Knowing ϵhlp and reducing the bias to reach it are two very different things. Suppose you know the ϵhlp for your problem. This does not mean that you have to reach it. It may well be that you are using the wrong architecture, but you may not have the skills required to develop a more sophisticated network. It may even be that the effort required to achieve the desired error level would be prohibitive (in terms of hardware or infrastructure). Always keep in mind what your problem requirements are. Always try to understand what is good enough. For an application that recognizes cancer, you may want to invest as much as possible to achieve the highest accuracy possible: you don’t want to send someone home only to discover the presence of cancer months later. On the other hand, if you build a system to recognize cats from web images, you may find a higher error than ϵhlp completely acceptable.

Metric Analysis Diagram

In this chapter, we look at different problems that you will encounter when developing your models and how to spot them. We have looked at the first one: bias, sometimes also called avoidable bias. We have seen how this can be spotted by calculating ΔϵBias. At the end of this chapter, you will have a few of those quantities that you can calculate to spot problems. To make understanding them easier, I use what I like to call the metric analysis diagram (MAD). It is simply a bar diagram, in which each bar represents a problem. Let’s start to build one with (for the moment) the only quantity we have discussed: bias. You can see it in Figure 6-4. At the moment, it is a pretty dumb diagram, but you will see how useful it is to keep things under control when you have several problems at the same time.
../images/463356_1_En_6_Chapter/463356_1_En_6_Fig4_HTML.png
Figure 6-4

Metric analysis diagram (MAD) with only one of the quantities we will encounter in this chapter: ΔϵBias

Training Set Overfitting

Another problem we have discussed at length in the previous chapters is overfitting of training data. You will remember that in Chapter 5, while performing regression, we saw an extreme case of overfitting, in which MSEtrain ≪ MSEdev. The same applies in classification problems. Let’s indicate with ϵtrain the error our model has on our training dataset and with ϵdev the one on the dev dataset. We can then say we are overfitting the training set if ϵtrain ≪ ϵdev. Let’s define a new quantity
$$ \Delta\epsilon_{overfitting\ train} = \left| \epsilon_{train} - \epsilon_{dev} \right| $$

With this quantity, we can say that we are overfitting the training dataset if Δϵoverfitting train is bigger than a few percent.

Let’s summarize what we have defined and discussed so far. We have three errors:
  • ϵtrain: The error of our classifier on the training dataset

  • ϵhlp: Human-level performance (as discussed in the previous sections)

  • ϵdev: The error of our classifier on the dev dataset

With those three quantities, we have defined
  • ΔϵBias = |ϵtrain − ϵhlp|: Measuring how much “bias” we have between the training dataset and human-level performance

  • Δϵoverfitting train = |ϵtrain − ϵdev|: Measuring the amount of overfitting of the training dataset

In addition, up to now, we have used two datasets
  • Training dataset: The dataset that we use to train our model (you should know it by now)

  • Dev dataset: A second dataset that we use to check the overfitting on the training dataset

Now let’s suppose our model has bias and is slightly overfitting the training dataset, meaning we have ΔϵBias = 6% and Δϵoverfitting train = 4%. Our MAD now becomes what is depicted in Figure 6-5.
../images/463356_1_En_6_Chapter/463356_1_En_6_Fig5_HTML.jpg
Figure 6-5

MAD diagram for our two problems: bias and overfitting of training dataset

As you can see in Figure 6-5, you can have a quick overview of the relative gravity of the problems we have, and you may decide which one you want to address first.
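If you want to draw such a diagram yourself, a few lines of matplotlib are enough. The following is a minimal sketch of how such a bar chart could be produced, using the example values ΔϵBias = 6% and Δϵoverfitting train = 4% given above; the variable names are arbitrary.
import matplotlib.pyplot as plt
problems = ["Bias", "Overfitting train"]   # one bar per problem
values = [6.0, 4.0]                        # the example values above, in percent
plt.bar(range(len(values)), values)
plt.xticks(range(len(values)), problems)
plt.ylabel("Delta epsilon (%)")
plt.title("Metric analysis diagram (MAD)")
plt.show()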

Overfitting the training dataset is commonly known as a variance problem. When this happens, you can try the following techniques to minimize it:
  • Get more data for your training set

  • Use regularization (review Chapter 5 for a complete discussion of the subject)

  • Try data augmentation (for example, if you are working with images, you can try rotating them, shifting them, etc.)

  • Try “simpler” network architectures

As usual, there are no fixed rules, and you must test which techniques work best on your problem.

Test Set

I would like to quickly mention another problem you may encounter. We will look at it in detail in Chapter 7, because it is related to hyperparameter search. Recall how you choose the best model in a machine-learning project (this is not specific to deep learning, by the way). Let’s suppose we are working on a classification problem. First, we decide which optimizing metric we want; let’s suppose we decide to use accuracy. Then we build an initial system, feed it with training data, and see how it is doing on the dev dataset, to check if we are overfitting our training data. You will remember that in previous chapters, we have talked often about hyperparameters—parameters that are not influenced by the learning process. Examples of hyperparameters are the learning rate, the regularization parameter, etc. We have seen many of them in the previous chapters. Let’s say you are working with a specific neural network architecture. You need to search for the best values of the hyperparameters, to see how good your model can get. To do this, you train several models with different values of the hyperparameters and check their performance on the dev dataset. What can happen is that your models work well on the dev dataset but don’t generalize at all, because you selected the best values using only the dev dataset. You incur the risk of overfitting the dev dataset by choosing specific values for your hyperparameters. To check if this is the case, you create a third dataset, called the test dataset, by cutting a portion of the observations from your starting dataset, which you use to check the performance of your models.

We must define a new quantity
$$ \Delta\epsilon_{overfitting\ dev} = \left| \epsilon_{dev} - \epsilon_{test} \right| $$
where ϵtest is the error evaluated on the test set. We can add it to our MAD diagram (Figure 6-6).
../images/463356_1_En_6_Chapter/463356_1_En_6_Fig6_HTML.jpg
Figure 6-6

The MAD diagram for the three problems we may encounter: bias, overfitting of training data, overfitting of dev data

Note that if you are not doing any hyperparameter search, you will not need a test dataset. It is only useful when you are doing extensive searches; otherwise, in most cases, it is useless and takes away observations that you may use for training. What we discussed so far assumes that your dev and test set observations have the same characteristics. For example, if you are working on an image recognition problem and you decide to use images from a smartphone with high resolution for training and the dev dataset, and images from the Web in low resolution for your test dataset, you may see a big |ϵdev − ϵtest|, but that will probably be owing to the differences in the images and not to an overfitting problem. I will discuss later in the chapter what can happen when different sets come from different distributions (another way of saying that the observations have different characteristics), what exactly this means, and what you can do about it.

How to Split Your Dataset

Now I would like to discuss briefly how to split your data in both a general and deep-learning context.

But what exactly does “split” mean? Well, as discussed in the previous section, you will require a set of observations to make the model learn, which you call your training set. You also will need a set of observations that will constitute your dev set, and a final set called the test set. Typically, you would see splits such as 60% of observations for the training set, 20% of observations for the dev set, and 20% of observations for the test set. Usually, these kinds of splits are indicated in the following form: 60/20/20, where the first number (60) refers to the percentage of the entire dataset that makes up the training set, the second (20) to the percentage of the entire dataset that makes up the dev set, and the last (20) to the percentage that makes up the test set. In books, blogs, or articles, you may encounter sentences such as “We will split our dataset 80/10/10.” You now have an explanation of what this means.
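To make the notation concrete, here is a minimal sketch of how a 60/20/20 split could be done with NumPy; the arrays X and y below are hypothetical placeholders for your features and labels, created only to make the sketch runnable.
import numpy as np
np.random.seed(42)                       # for reproducibility
X = np.random.rand(1000, 784)            # hypothetical features, one row per observation
y = np.random.randint(0, 10, 1000)       # hypothetical labels
m = X.shape[0]                           # total number of observations
perm = np.random.permutation(m)          # shuffle the indices
train_idx = perm[:int(0.6*m)]            # first 60% -> training set
dev_idx = perm[int(0.6*m):int(0.8*m)]    # next 20% -> dev set
test_idx = perm[int(0.8*m):]             # last 20% -> test set
X_train, y_train = X[train_idx], y[train_idx]
X_dev, y_dev = X[dev_idx], y[dev_idx]
X_test, y_test = X[test_idx], y[test_idx]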

Usually, in the deep-learning field, you will deal with big datasets. For example, if we have m = 10^6, we could use a split such as 98/1/1. Keep in mind that 1% of 10^6 is 10^4, which is already a big number! Remember that the dev/test set must be big enough to give high confidence to the performance of the model, but not unnecessarily big. Additionally, you will want to save as many observations as possible for your training set.

Note

When deciding how to split your dataset, if you have a big number of observations (for example, 10^6 or even more), you can split your dataset 98/1/1 or 90/5/5. As soon as your dev and test datasets reach a reasonable size (depending on your problem), you can stop increasing them. In other words, when deciding how to split your dataset, keep in mind how big your dev/test sets must be in absolute terms.

Now remember that, as you may know, size is not everything. Your dev and test datasets should be representative of your training dataset and problem. Let’s create an example. Let’s consider the ImageNet challenge described earlier. There, you want to classify images in 1000 different classes. To know how your model is performing on your dev and test datasets, you will require enough images for each class in each set. If you decide to take only 1000 observations for the dev or test dataset, you are not going to get any reasonable result, because, even if all classes are represented in the dev set, you will have only one observation per class on average. You could instead build your dev and test datasets by choosing, for example, at least 100 images for each class, obtaining two datasets (dev and test), each containing 10^5 observations in total (remember we have 1000 classes). In this case, it would not be sensible to go below this number. This is not only relevant in a deep-learning context but in machine learning in general. You should always try to build a dev/test dataset reflecting the same distribution of observations you have in your training set. To understand what I mean, take the MNIST dataset, for example. Let’s load the dataset (as we have done before) with the following code:
import numpy as np
from sklearn.datasets import fetch_mldata
mnist = fetch_mldata('MNIST original')
X,y = mnist["data"], mnist["target"]
Then we can check how often (in %) each digit appears in the dataset.
for i in range(10):
    print ("digit", i, "makes", np.around(np.count_nonzero(y == i)/70000.0*100.0, decimals=1), "% of the 70000 observations")
This gives us the result
digit 0 makes 9.9 % of the 70000 observations
digit 1 makes 11.3 % of the 70000 observations
digit 2 makes 10.0 % of the 70000 observations
digit 3 makes 10.2 % of the 70000 observations
digit 4 makes 9.7 % of the 70000 observations
digit 5 makes 9.0 % of the 70000 observations
digit 6 makes 9.8 % of the 70000 observations
digit 7 makes 10.4 % of the 70000 observations
digit 8 makes 9.8 % of the 70000 observations
digit 9 makes 9.9 % of the 70000 observations
Not every digit appears the same number of times in the dataset. When building our dev and test datasets, we should check that our distributions reflect this one; otherwise, when applying our model to the dev or test dataset, we could get a result that does not make much sense, because the model has learned from a different class distribution. You may remember that in Chapter 5, we created a dev dataset with a code like this one:
np.random.seed(42)
rnd = np.random.rand(len(y)) < 0.8
train_y = y[rnd]
dev_y = y[~rnd]
In this case, for the sake of clarity, I just split the labels, to see how the algorithm is working. In real life, you also would have to split the features, of course. Because our original distribution is almost uniform, you should expect a result that is very similar to the original one. Let’s check it with the following code:
for i in range(10):
    print ("digit", i, "makes", np.around(np.count_nonzero(train_y == i)/56056.0*100.0, decimals=1), "% of the 56056 observations")
This gives us the result
digit 0 makes 9.9 % of the 56056 observations
digit 1 makes 11.3 % of the 56056 observations
digit 2 makes 9.9 % of the 56056 observations
digit 3 makes 10.1 % of the 56056 observations
digit 4 makes 9.8 % of the 56056 observations
digit 5 makes 9.0 % of the 56056 observations
digit 6 makes 9.8 % of the 56056 observations
digit 7 makes 10.4 % of the 56056 observations
digit 8 makes 9.8 % of the 56056 observations
digit 9 makes 9.9 % of the 56056 observations
You can compare these results with those from the entire dataset. You will notice that they are very close—not the same (compare, for example, digit 2), but close enough. In this case, I would simply proceed without worries. But let’s create a slightly different example. Suppose that instead of choosing the observations randomly to create your training and dev datasets, you decide to take the first 80% of the observations and assign them to the training set, and the last 20% and assign them to the dev set, because you assume that your observations are randomly distributed in your original NumPy arrays. Let’s try and see what happens. First, let’s build our train and dev datasets, using the first 56,000 (0.8*70000) observations for the training set and the rest for the dev set.
srt = np.zeros_like(y,  dtype=bool)
np.random.seed(42)
srt[0:56000] = True
train_y = y[srt]
dev_y = y[~srt]
We can again check how many digits we have with the following code:
for i in range(10):
    print ("class", i, "makes", np.around(np.count_nonzero(train_y == i)/56000.0*100.0, decimals=1), "% of the 56000 observations")
This gives us the result
class 0 makes 8.5 % of the 56000 observations
class 1 makes 9.6 % of the 56000 observations
class 2 makes 8.5 % of the 56000 observations
class 3 makes 8.8 % of the 56000 observations
class 4 makes 8.3 % of the 56000 observations
class 5 makes 7.7 % of the 56000 observations
class 6 makes 8.5 % of the 56000 observations
class 7 makes 9.0 % of the 56000 observations
class 8 makes 8.4 % of the 56000 observations
class 9 makes 2.8 % of the 56000 observations

Do you notice anything different? The biggest difference is that now, class 9 appears in only 2.8% of the cases. Before, it appeared in 9.9% of the cases. Apparently, our hypothesis that the classes are distributed according to a random uniform distribution was not right. This can be quite dangerous, both when checking how the model is doing and because your model may end up learning from a so-called unbalanced class distribution.
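A simple way to avoid this problem, if you don’t want to build the split by hand, is to let scikit-learn do a stratified split for you. The following is a minimal sketch, using the X and y arrays loaded above; the 80/20 proportion and the random seed are just examples.
from sklearn.model_selection import train_test_split
# stratify=y forces the train and dev sets to have (almost) the same
# class proportions as the original dataset
X_train, X_dev, y_train, y_dev = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)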

Note

Usually, an unbalanced class distribution in a dataset refers to a classification problem in which one or more classes appear a different number of times than others. Generally, this becomes a problem in the learning process when the difference is significant. A few percent difference is often not an issue.

If you have a dataset with three classes, for example, where you have 1000 observations in each class, then the dataset has a perfectly balanced class distribution, but if you have only 100 observations in class 1, 10,000 in class 2, and 5000 in class 3, then we talk about an unbalanced class distribution. You should not think that this is a rare occurrence. Suppose you have to build a model that recognizes fraudulent credit card transactions. It is safe to assume that those transactions are a very small percentage of all the transactions that you will have at your disposal.

Note

When splitting your dataset, you must pay great attention not only to the number of observations you have in each dataset but also to which observations go in each dataset. Note that this problem is not specific to deep learning but is important generally in machine learning.

To go into details on how to deal with unbalanced datasets would be beyond the scope of this book, but it is important to understand what kind of consequences they may have. In the next section, I will show you what can happen if you feed an unbalanced dataset to a neural network, so that you gain a concrete understanding of the possibility. At the end of the section, I will offer a few hints on what to do in such a case.

Unbalanced Class Distribution: What Can Happen

Because we are talking about how to split our dataset to perform metric analysis, it is important to grasp the concept of unbalanced class distribution and how to deal with it. In deep learning, you will find yourself very often splitting datasets, and you should be aware of the problems you may encounter if you do this in the wrong way. Let me give you a concrete example of how bad things can go if you do it wrongly.

We will use the MNIST dataset, and we will do basic logistic regression (as we did in Chapter 2) with a single neuron. Let’s look very quickly again at how to load and prepare the data. We will do it in a similar way as in Chapter 2, apart from some modifications that I will point out to you. First, we load the data
import numpy as np
from sklearn.datasets import fetch_mldata
from sklearn.metrics import confusion_matrix
import tensorflow as tf
mnist = fetch_mldata('MNIST original')
Xinput,yinput = mnist["data"], mnist["target"]
Here comes the important part. We create a new label in this way: we assign to all observations for the digit zero the label 0, and to all other digits (1, 2, 3 ,4, 5, 6, 7, 8, and 9) the label 1, with the code
y_ = np.zeros_like(yinput)
y_[np.any([yinput == 0], axis = 0)] = 0
y_[np.any([yinput > 0], axis = 0)] = 1
Now the array y_ will contain the new labels. Note that now the dataset is heavily unbalanced. Label 0 appears in roughly 10% of the cases, while label 1 appears in 90% of the cases. Let’s split the data randomly into a train and a dev dataset.
np.random.seed(42)
rnd = np.random.rand(len(y_)) < 0.8
X_train = Xinput[rnd,:]
y_train = y_[rnd]
X_dev = Xinput[~rnd,:]
y_dev = y_[~rnd]
We then normalize the training data.
X_train_normalised = X_train/255.0
We then transpose and prepare the tensors.
X_train_tr = X_train_normalised.transpose()
y_train_tr = y_train.reshape(1,y_train.shape[0])
Then we assign proper names to the variables.
Xtrain = X_train_tr
ytrain = y_train_tr
Then we build our network with one single neuron, exactly as we did in Chapter 2.
tf.reset_default_graph()
n_dim = Xtrain.shape[0]   # number of features (784 pixels per image)
X = tf.placeholder(tf.float32, [n_dim, None])
Y = tf.placeholder(tf.float32, [1, None])
learning_rate = tf.placeholder(tf.float32, shape=())
W = tf.Variable(tf.zeros([1, n_dim]))
b = tf.Variable(tf.zeros(1))
init = tf.global_variables_initializer()
y_ = tf.sigmoid(tf.matmul(W,X)+b)
cost = - tf.reduce_mean(Y * tf.log(y_)+(1-Y) * tf.log(1-y_))
training_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
If you don’t understand the code, review Chapter 2 for more details. I expect that you now understand this simple model well, as we have seen it several times. Next, we define the function to run the model (you have seen it several times in the previous chapters).
def run_logistic_model(learning_r, training_epochs, train_obs, train_labels, debug = False):
    sess = tf.Session()
    sess.run(init)
    cost_history = np.empty(shape=[0], dtype = float)
    for epoch in range(training_epochs+1):
        sess.run(training_step, feed_dict = {X: train_obs, Y: train_labels, learning_rate: learning_r})
        cost_ = sess.run(cost, feed_dict={ X:train_obs, Y: train_labels, learning_rate: learning_r})
        cost_history = np.append(cost_history, cost_)
        if (epoch % 10 == 0) & debug:
            print("Reached epoch",epoch,"cost J =", str.format('{0:.6f}', cost_))
    return sess, cost_history
Let’s run the model with the code
sess, cost_history = run_logistic_model(learning_r = 0.01,
                                training_epochs = 100,
                                train_obs = Xtrain,
                                train_labels = ytrain,
                                debug = True)
and check the accuracy with the following code (explained at length in Chapter 2, if you don’t remember):
correct_prediction=tf.equal(tf.greater(y_, 0.5), tf.equal(Y,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
print(sess.run(accuracy, feed_dict={X:Xtrain, Y: ytrain, learning_rate: 0.05}))
We get an incredible 91.2% accuracy. Not bad, right? But are we sure that the result is that good? Now let’s check the confusion matrix for our labels with the code
ypred = sess.run(tf.greater(y_, 0.5), feed_dict={X:Xtrain, Y: ytrain, learning_rate: 0.05}).flatten().astype(int)
confusion_matrix(ytrain.flatten(), ypred)
When you run the code, you get the following result:
array([[ 659, 4888],
       [ 6, 50503]], dtype=int64)
Slightly more nicely formatted and with some explanatory information, the matrix looks like Table 6-1.
Table 6-1

Confusion Matrix for the model described in the text

 

                    Predicted Class 0    Predicted Class 1
Real class 0        659                  4888
Real class 1        6                    50503

How should we read the table? In the column “Predicted Class 0,” you will see the number of observations that our model predicts as being of class 0 for each real class. 659 is the number of observations our model predicts as being of class 0 that are really in class 0. 6 is the number of observations that our model predicts in class 0 that are really in class 1.

It should be easy to see now that our model predicts effectively almost all observations to be in class 1 (a total of 4888 + 50,503 = 55,391). The number of correct classified observations is 659 (for class 0) and 50,503 (for class 1), for a total of 51,162 observations. Because we have a total of 56,056 observations in our training set, we get an accuracy of 51162/56056 = 0.912, as our TensorFlow code above told us. This is not because our model is good; it is simply because it has classified effectively all observations in class 1. In this case, we don’t need a neural network to achieve this accuracy. What happens is that our model sees observations belonging to class 0 so rarely that it almost doesn’t influence the learning, which is dominated by the observations in class 1.
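As a quick sanity check, the accuracy can be recomputed directly from the confusion matrix. A minimal sketch with the numbers obtained above:
cm = np.array([[659, 4888],
               [6, 50503]])            # the confusion matrix from above
accuracy = np.trace(cm) / cm.sum()     # (659 + 50503) / 56056, roughly 0.912
print(accuracy)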

What at the beginning seemed a nice result turns out to be a really bad one. This is an example of how bad things can go, if you don’t pay attention to the distributions of your classes. This, of course, applies not only when splitting your dataset, but in general, when you approach a classification problem, regardless of the classifier you want to train (it does not apply only to neural networks).

Note

When splitting your dataset in complex problems, you must pay special attention not only to the number of observations you have in your datasets but also to what observations you choose and to the distribution of the classes.

To conclude this section, let me give you a few hints on how to deal with unbalanced datasets (a short code sketch of the resampling options follows this list).
  • Change your metric: In the preceding example, you may want to use something other than accuracy, because it can be misleading. You could try using the confusion matrix, for example, or other metrics, such as precision, recall, or F1. Another important way of checking how your model is doing, and one that I strongly suggest you learn, is the ROC curve, which will help you tremendously.

  • Work with an undersampled dataset. If, for example, you have 1000 observations in class 1 and 100 in class 2, you may create a new dataset with 100 random observations from class 1 and the 100 you have in class 2. The problem with this method, however, is that you usually will have a lot less data to feed to your model to train it.

  • Work with an oversampled dataset. You may try to do the opposite. You may take the 100 observations in class 2 mentioned above and simply replicate them 10 times, to end up with 1000 observations in class 2 (sometimes called sampling with replacement).

  • Try to get more data for the class with fewer observations: This is not always possible. In the case of fraudulent credit card transactions, you cannot go around and generate new data, unless you want to go to jail…
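Here is a minimal sketch of the undersampling and oversampling ideas, using plain NumPy and the Xinput and yinput arrays loaded earlier in this chapter; the seed and the variable names are arbitrary.
labels = np.zeros_like(yinput)
labels[yinput > 0] = 1                 # binary labels, as built at the beginning of this section
np.random.seed(42)
idx_min = np.where(labels == 0)[0]     # minority class (the digit zero, roughly 10% of the data)
idx_maj = np.where(labels == 1)[0]     # majority class (all the other digits)
# undersampling: keep only as many majority observations as there are minority ones
idx_under = np.concatenate([idx_min,
                            np.random.choice(idx_maj, size=len(idx_min), replace=False)])
# oversampling: replicate minority observations (sampling with replacement)
idx_over = np.concatenate([idx_maj,
                           np.random.choice(idx_min, size=len(idx_maj), replace=True)])
# for example, Xinput[idx_under] and labels[idx_under] form a balanced (but smaller) dataset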

Precision, Recall, and F1 Metrics

Let’s look at some other metrics that are very useful when dealing with unbalanced datasets. Consider the following example. Suppose we are doing some tests to determine if a subject has a certain disease or not. Imagine that we have 250 test results. Consider the following confusion matrix (you should know what that is from our previous discussion):
$$ \begin{array}{ccc} & Prediction: NO & Prediction: YES \\ True\ value: NO & 75 & 15 \\ True\ value: YES & 10 & 150 \end{array} $$
We will indicate with N the total number of test results, in this case N = 250. We will use the following terminology:
  • True positives (tp): Tests that predicted yes, and the subjects really have the disease

  • True negatives (tn): Tests that predicted no, and the subjects do not have the disease

  • False positives (fp): Tests that predicted yes, but the subjects do not have the disease

  • False negatives (fn): Tests that predicted no, but the subjects do have the disease

This translates visually to the following:
$$ \begin{array}{ccc} & Prediction: NO & Prediction: YES \\ True\ value: NO & TRUE\ NEGATIVES & FALSE\ POSITIVES \\ True\ value: YES & FALSE\ NEGATIVES & TRUE\ POSITIVES \end{array} $$
Let’s also indicate with ty the number of patients who really have the disease, in this example, ty = 10 + 150 = 160, and with tno the number of patients who don’t have the disease, in this example, tno = 75 + 15 = 90. In this example, we would have
$$ tp=150 $$
$$ tn=75 $$
$$ fp=15 $$
$$ fn=10 $$
We can express several metrics as functions of the previously discussed terms. For example:
  • Accuracy: (tp + tn)/N, how often our test is right

  • Misclassification rate: (fp + fn)/N, how often our test is wrong. Note that this is equal to 1 − accuracy.

  • Sensitivity/Recall: tp/ty, how often the test really predicts yes when the subjects have the disease

  • Specificity: tn/tno, when the subjects have no disease, how often our test predicts no

  • Precision: tp/(tp + fp), the portion of positive test results that correctly identify subjects who have the disease

All those quantities can be used as metrics, depending on your problem. Let’s create an example. Suppose your test should predict if a person has cancer or not. In this case, what you want is the highest sensitivity possible, because it is important to detect the disease: there is nothing worse than sending someone home without treatment when it is needed. But at the same time, you also want the specificity to be high, so that healthy people are not wrongly diagnosed and treated.
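To make the definitions concrete, here is a minimal sketch that computes all these quantities for the confusion matrix of the example above (tp = 150, tn = 75, fp = 15, fn = 10):
tp, tn, fp, fn = 150, 75, 15, 10     # values from the example above
N = tp + tn + fp + fn                # 250 test results
ty = tp + fn                         # subjects who really have the disease (160)
tno = tn + fp                        # subjects who do not (90)
accuracy = (tp + tn) / N             # 0.90
misclassification = (fp + fn) / N    # 0.10, i.e., 1 - accuracy
recall = tp / ty                     # ~0.94 (sensitivity)
specificity = tn / tno               # ~0.83
precision = tp / (tp + fp)           # ~0.91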

Let’s look a bit closer at precision and recall. Having a high precision means that when you say someone is sick, you are right. But you don’t know how many people really have the sickness, since the quantity is defined only by the result of your test. Precision is a measure of how your test is doing. Having a high recall means that you can identify all the sick people in your sample. Let me give you another example, to make the point even clearer. Suppose we have 1000 people. Only 10 are sick and 990 are healthy. Let’s suppose we want to identify healthy people (this is important), and we build a test that returns yes if someone is healthy and always predicts that people are healthy. The confusion matrix would look like this:
$$ \begin{array}{ccc} & Prediction: NO\ (Sick) & Prediction: YES\ (Healthy) \\ True: NO\ (Sick) & 0 & 10 \\ True: YES\ (Healthy) & 0 & 990 \end{array} $$
We would have
$$ tp=990 $$
$$ tn=0 $$
$$ fp=10 $$
$$ fn=0 $$
That means that
  • Accuracy would be 99%.

  • The misclassification rate would be 10/1000 or, in other words, 1%.

  • Recall would be 990/990 or 100%.

  • Specificity would be 0%.

  • Precision would be 99%.

This looks good, right? If we want to find healthy people, this test would be great. The only problem is that it is a lot more important to identify sick people! Let’s recalculate the preceding quantities, but this time, considering that a positive result means that someone is sick. In this case, the confusion matrix would look like this:
$$ \begin{array}{ccc} & Prediction: NO\ (Healthy) & Prediction: YES\ (Sick) \\ True: NO\ (Healthy) & 990 & 0 \\ True: YES\ (Sick) & 10 & 0 \end{array} $$
because this time, a yes result means that someone is sick and not, as before, that someone is healthy. Let’s calculate the quantities above again.
$$ tp=0 $$
$$ tn=990 $$
$$ fp=0 $$
$$ fn=10 $$
Therefore,
  • Accuracy would still be 99%.

  • The misclassification rate would still be 10/1000 or, in other words, 1%.

  • Recall would now be 0/10 or 0%.

  • Specificity would be 990/990 or 100%.

  • Precision would be 0% (in fact, tp + fp = 0, so precision is formally undefined; it is commonly taken as 0 in this case).

Note how the accuracy remains the same. If you look only at that, you would not be able to understand how your model is doing. We have simply changed what we want to predict; if we use only accuracy, we cannot say anything about the performance of our model. But look at how recall and precision changed. See the matrix below for a comparison.
$$ \begin{array}{ccc} & Predicting\ healthy\ people & Predicting\ sick\ people \\ Recall & 100\% & 0\% \\ Precision & 99\% & 0\% \end{array} $$

Now we have quantities that change and that can give us enough information, depending on the question we pose. Note that changing what we want to predict changes how the confusion matrix looks. Looking at the preceding matrix, we can immediately say that our model, which predicts that everyone is healthy, works very well when predicting healthy people (not very useful) but fails miserably when trying to predict sick people.

There is another metric that is important to know, and that is the F1 score. It is defined as
$$ F1 = \frac{2}{\frac{1}{Precision} + \frac{1}{Recall}} = 2\,\frac{Precision \cdot Recall}{Precision + Recall} $$
An intuitive understanding is difficult to get, but it is basically the harmonic average of precision and recall. The example we created was a bit extreme: with 0% for recall or precision, we would not be able to calculate a meaningful F1. Let’s suppose that our model is bad at predicting sick people, but not that bad. Let’s suppose we have the following confusion matrix:
$$ \begin{array}{ccc} & Prediction: NO\ (Healthy) & Prediction: YES\ (Sick) \\ True: NO\ (Healthy) & 985 & 5 \\ True: YES\ (Sick) & 9 & 1 \end{array} $$
In this case, we would have (I leave the calculation to you)
$$ Precision: 54.5\% $$
$$ Recall: 10\% $$
We would have
$$ F1 = 2\cdot\frac{0.545 \cdot 0.1}{0.545 + 0.1} = 2\cdot\frac{0.0545}{0.645} = 0.169 \to 16.9\% $$

This quantity will give you information keeping precision (the portion of tests predicting correctly the subject having the disease with respect to all positive results obtained) and recall (how often the test really predicts yes when the subjects have the disease) in consideration. For some problems, you want to maximize precision, and for others, you want to maximize recall. If that is the case, simply choose the right metric. The F1 score will be the same for two cases in which you have Precision = 32% and Recall = 45% and one with Precision = 45% and Recall = 32%. Be aware of this fact. Use F1 score, if you want to find a balance between Precision and Recall.

Note

The F1 score is used when you want to maximize the harmonic average of Precision and Recall, or, in other words, when you don’t want to maximize either Precision or Recall alone, but you want to find the best balance between the two.

If we calculate F1 when predicting healthy people, as we did at first, we would have
$$ F1 = 2\cdot\frac{1.0 \cdot 0.99}{1.0 + 0.99} = 2\cdot\frac{0.99}{1.99} = 0.995 \to 99.5\% $$

This tells us that the model is quite good at predicting healthy people.

The F1 score is usually useful because, as a metric, you normally want one single number, and in this way, you don’t have to choose between precision and recall, as both are useful. Remember that the values of the metrics discussed will always depend on the question you are asking (what yes and no mean for you), and so will their interpretation.
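In practice, you rarely need to compute these quantities by hand: scikit-learn already provides them. The following is a minimal sketch with two hypothetical label arrays (here, 1 is taken as the positive class, which, as discussed above, is a choice that changes all the values):
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score
# hypothetical true and predicted labels (1 = positive class, e.g., "sick")
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])
print(confusion_matrix(y_true, y_pred))
print(precision_score(y_true, y_pred))   # tp / (tp + fp)
print(recall_score(y_true, y_pred))      # tp / (tp + fn)
print(f1_score(y_true, y_pred))          # harmonic average of precision and recall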

Note

Remember that when calculating your metric, whatever it may be, changing your question will change the results. You must be very clear at the beginning about what you want to predict and then choose the right metric. In the case of highly unbalanced datasets, it is always a good idea to use not accuracy but other metrics, such as recall, precision, or, even better, F1, which balances precision and recall.

Datasets with Different Distributions

Now I would like to discuss another terminology issue, which will lead you to understand a common problem in the deep-learning world. Very often, you will hear sentences such as “The sets come from different distributions.” This sentence is not always easy to understand. Take, for example, two datasets: one formed by images taken with a professional DSLR and a second one made up of images taken with a dodgy smartphone. In the deep-learning world, we would characterize those two sets as coming from different distributions. But what is the real meaning of this sentence? The two datasets differ for various reasons: resolution of images, blurriness resulting from different quality lenses, amount of colors, quality of the focus, and possibly more. All these differences are what is usually meant by different distributions. Let’s look at another example. We could consider two datasets: one made of images of white cats and one made of images of black cats. Also, in this case, we are talking about different distributions. This becomes a problem when you train a model on one set and want to apply it to the other. For example, if you train a model on a set of images of white cats, you probably are not going to do very well on the dataset of black cats, because your model has never seen black cats during training.

Note

When talking about datasets coming from different distributions, it is usually meant that the observations have different characteristics in the two datasets: black and white cats, high and low-resolution images, speech recorded in Italian and German, and so on.

Because data is so precious, people often try to create different datasets (train, dev, etc.) from different sources. For example, you may decide to train your model on a set made of images taken from the Web and check how good it is with a set made of images you’ve taken with your smartphone. It may seem like a good idea to be able to use as much data as possible, but this may give you many headaches. Let’s see what happens in a real case, so that you may get a feeling of the consequences of doing something similar.

Let’s consider the subset of the MNIST dataset that we have used in Chapter 2, made of the two digits: 1 and 2. We will build a dev dataset coming from a different distribution, shifting a subset of the images 10 pixels to the right. We will train our model on the images as they are in the original dataset and apply the model to images shifted 10 pixels to the right and see what happens. Let’s first load the data (you can check Chapter 2 for more details).
import numpy as np
from sklearn.datasets import fetch_mldata
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
from random import *
import tensorflow as tf   # needed for the model we build later in this section
mnist = fetch_mldata('MNIST original')
Xinput,yinput = mnist["data"], mnist["target"]
We will do the data preparation exactly as in Chapter 2. First, let’s select only digits 1 and 2.
X_ = Xinput[np.any([yinput == 1, yinput == 2], axis = 0)]
y_ = yinput[np.any([yinput == 1, yinput == 2], axis = 0)]
We have 14,867 observations in our dataset. Now let’s create a train and a dev dataset with our random selection (as we have done before), as in this case, we have roughly the same number of ones and twos.
np.random.seed(42)
rnd_train = np.random.rand(len(y_)) < 0.8
X_train = X_[rnd_train,:]
y_train = y_[rnd_train]
X_dev = X_[~rnd_train,:]
y_dev = y_[~rnd_train]
Then we normalize the features.
X_train_normalized = X_train/255.0
X_dev_normalized = X_dev/255.0
And then we transform the matrices to have them with the right dimensions.
X_train_tr = X_train_normalized.transpose()
y_train_tr = y_train.reshape(1,y_train.shape[0])
n_dim = X_train_tr.shape[0]
dim_train = X_train_tr.shape[1]
X_dev_tr = X_dev_normalized.transpose()
y_dev_tr = y_dev.reshape(1,y_dev.shape[0])
Finally, we shift the labels to have 0 and 1 (if you don’t remember why, you can quickly review Chapter 2).
y_train_shifted = y_train_tr - 1
y_dev_shifted = y_dev_tr - 1
Now let’s give the arrays reasonable names.
Xtrain = X_train_tr
ytrain = y_train_shifted
Xdev = X_dev_tr
ydev = y_dev_shifted
We can check the sizes of the arrays with the code
print(Xtrain.shape)
print(Xdev.shape)
This gives us
(784, 11893)
(784, 2974)
We have 11,893 observations in our training set and 2974 in the dev set. Now let’s duplicate the dev dataset and shift each image to the right by 10 pixels. We can do it quickly with the following code:
Xtraindev = np.zeros_like(Xdev)
for i in range(Xdev.shape[1]):
    tmp = Xdev[:,i].reshape(28,28)
    tmp_shifted = np.zeros_like(tmp)
    tmp_shifted[:,10:28] = tmp[:,0:18]
    Xtraindev[:,i] = tmp_shifted.reshape(784)
ytraindev = ydev
To make the shift easy, I first reshaped each image into a 28 × 28 matrix, then shifted the columns with tmp_shifted[:,10:28] = tmp[:,0:18], and then reshaped the image back into a one-dimensional array of 784 elements. The labels remain the same. In Figure 6-7, you can see a random image from the dev dataset on the left and its shifted version on the right.
../images/463356_1_En_6_Chapter/463356_1_En_6_Fig7_HTML.png
Figure 6-7

One random image from the dataset (left) and its shifted version (right)
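If you want to produce a figure like Figure 6-7 yourself, a minimal sketch (using the matplotlib import from the beginning of this section and the Xdev and Xtraindev arrays built above; the column index is arbitrary) is the following:
idx = 0                                          # any column of the dev set will do
fig, ax = plt.subplots(1, 2)
ax[0].imshow(Xdev[:, idx].reshape(28, 28), cmap='gray')
ax[0].set_title('original')
ax[1].imshow(Xtraindev[:, idx].reshape(28, 28), cmap='gray')
ax[1].set_title('shifted')
plt.show()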

Now let’s build a network with a single neuron and see what happens. We build the model as we have in Chapter 2.
tf.reset_default_graph()
X = tf.placeholder(tf.float32, [n_dim, None])
Y = tf.placeholder(tf.float32, [1, None])
learning_rate = tf.placeholder(tf.float32, shape=())
W = tf.Variable(tf.zeros([1, n_dim]))
b = tf.Variable(tf.zeros(1))
init = tf.global_variables_initializer()
y_ = tf.sigmoid(tf.matmul(W,X)+b)
cost = - tf.reduce_mean(Y * tf.log(y_)+(1-Y) * tf.log(1-y_))
training_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
To train the model, we will use the same function you have already seen
def run_logistic_model(learning_r, training_epochs, train_obs, train_labels, debug = False):
    sess = tf.Session()
    sess.run(init)
    cost_history = np.empty(shape=[0], dtype = float)
    for epoch in range(training_epochs+1):
        sess.run(training_step, feed_dict = {X: train_obs, Y: train_labels, learning_rate: learning_r})
        cost_ = sess.run(cost, feed_dict={ X:train_obs, Y: train_labels, learning_rate: learning_r})
        cost_history = np.append(cost_history, cost_)
        if (epoch % 10 == 0) & debug:
            print("Reached epoch",epoch,"cost J =", str.format('{0:.6f}', cost_))
    return sess, cost_history
and we will train the model with the code
sess, cost_history = run_logistic_model(learning_r = 0.01,
                                training_epochs = 100,
                                train_obs = Xtrain,
                                train_labels = ytrain,
                                debug = True)
This gives us the output
Reached epoch 0 cost J = 0.678501
Reached epoch 10 cost J = 0.562412
Reached epoch 20 cost J = 0.482372
Reached epoch 30 cost J = 0.424058
Reached epoch 40 cost J = 0.380005
Reached epoch 50 cost J = 0.345703
Reached epoch 60 cost J = 0.318287
Reached epoch 70 cost J = 0.295878
Reached epoch 80 cost J = 0.277208
Reached epoch 90 cost J = 0.261400
Reached epoch 100 cost J = 0.247827
Next, let’s calculate the accuracy of the three datasets: Xtrain, Xdev, and Xtraindev, with the code
correct_prediction=tf.equal(tf.greater(y_, 0.5), tf.equal(Y,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
print(sess.run(accuracy, feed_dict={X:Xtrain, Y: ytrain, learning_rate: 0.05}))
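For the dev and train-dev sets, the evaluation is identical; only the arrays fed through feed_dict change. A quick sketch, reusing the arrays defined above:
print(sess.run(accuracy, feed_dict={X:Xdev, Y: ydev, learning_rate: 0.05}))
print(sess.run(accuracy, feed_dict={X:Xtraindev, Y: ytraindev, learning_rate: 0.05}))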
We simply use the right feed_dict for each dataset and get the following results after 100 epochs:
  • For the training dataset, we get 96.8%.

  • For the dev dataset, we get 96.7%.

  • For the train-dev set (you will see later why it has this name), the one with the shifted images, we get 46.7%. A very bad result.

What has happened is that the model has learned from a dataset in which all images are centered in the box and, therefore, could not generalize well to shifted images that are no longer centered.

When training a model on a dataset, usually you will get good results for observations that are like the ones in the training set. But how can you find out if you have such a problem? There is a relatively easy way of doing that: expanding our MAD diagram. Let's see how to do it.

Suppose you have a training dataset and a dev dataset in which the observations have different characteristics (come from different distributions). What you do is create a small subset from the training set, called the train-dev dataset, ending up with three datasets: a training and a train-dev from the same distribution (the observations have the same characteristics) and a dev set, for which the observations are somehow different, as I have discussed previously. What you do now is train your model on your training set and then evaluate your error ϵ on the three datasets: ϵtrain, ϵdev, and ϵtrain − dev. If your train and dev sets come from the same distributions, so does the train-dev set. In this case, you should expect ϵdev ≈ ϵtrain − dev. If we define
$$ \Delta \epsilon_{train-dev} = \epsilon_{train-dev} - \epsilon_{dev} $$
we should expect Δϵtrain − dev ≈ 0. If the train (and train-dev) and the dev set come from different distributions (the observations have different characteristics), we should expect Δϵtrain − dev to be big. If we consider the MNIST example we have created before, we have, in fact, Δϵtrain − dev = 0.437, or 43.7%, which is a huge difference.
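As a quick numeric check of this definition, the following sketch computes Δϵtrain − dev from accuracies (an illustration only: acc_dev and acc_traindev are hypothetical variables holding the dev and train-dev accuracies obtained earlier).
eps_dev = 1.0 - acc_dev                        # errors are 1 - accuracy
eps_train_dev = 1.0 - acc_traindev
delta_eps_train_dev = eps_train_dev - eps_dev
print(delta_eps_train_dev)                     # a large value signals a data-mismatch problem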
Let's recap what you should do to determine whether your training and your dev (or test) dataset have observations with different characteristics (come from different distributions).
  1. Split your training set in two: a larger part that you will use for training (the train set) and a smaller part that you will call the train-dev set.

  2. Train your model on the train set.

  3. Evaluate your error ϵ on the three sets: train, dev, and train-dev.

  4. Calculate the quantity Δϵtrain − dev. If it is big, this provides strong evidence that the original training and dev sets come from different distributions.

In Figure 6-8, you can see an example of the MAD diagram with the added problem just discussed. Don’t look at the numbers; they are there only for illustrative purposes (read: I just put them there).
Figure 6-8: Example of the MAD diagram with the data mismatch problem added. The numbers are there strictly for illustrative purposes.

The MAD diagram in Figure 6-8 can tell us the following things. (I highlight in the bulleted list only a few items. For a more complete list, review the previous sections.)
  • The bias (between training and human-level performance) is quite small, so we are not that far from the best we can achieve (let’s assume here that human-level performance is a proxy for the Bayes error). Here, you could try bigger networks, better optimizers, and so on.

  • We are overfitting the datasets, so we could try regularization or get more data.

  • We have a strong problem with data mismatch (sets coming from different distributions) between train and dev. At the end of this section, I suggest what you could do to solve this problem.

  • We are also slightly overfitting the dev dataset, during our hyperparameter search.

Note that you don’t need to create the bar plot, as I have done here. Technically, you require only the four numbers, to draw the same conclusions.

Note

Once you have your MAD diagram (or simply the numbers), interpreting it will give you hints on what you should try to get better results, for example, higher accuracy.

You can try the following techniques to address data mismatch between sets:
  • You can conduct manual error analysis, to understand the difference between the sets, and then decide what to do (in the last section of the chapter, I will give you an example). This is time-consuming and usually quite difficult, because once you know what the difference is, it may be very difficult to find a solution.

  • You could try to make the training set more like your dev/test sets. For example, if you are working with images and the test/dev sets have a lower resolution, you may decide to lower the resolution of the images in the training set (see the sketch at the end of this section).

As usual, there are no fixed rules. Just be aware of the problem and think about the following: your model will learn the characteristics from your training data, so when applied to completely different data, it (usually) won’t do well. Always get training data that reflect the data you want your model to work on, not vice versa.
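For the second point, here is a minimal sketch of what lowering the resolution could look like, under the assumption that the dev/test images have half the resolution of the training images (lower_resolution and Xtrain_lowres are names I introduce for illustration):
# Downsample each 28 x 28 training image to 14 x 14 by averaging 2 x 2 blocks,
# then upsample back to 28 x 28 so the shape still matches the model input.
def lower_resolution(flat_img):
    img = flat_img.reshape(28, 28)
    small = img.reshape(14, 2, 14, 2).mean(axis=(1, 3))
    return np.repeat(np.repeat(small, 2, axis=0), 2, axis=1).reshape(784)

Xtrain_lowres = np.apply_along_axis(lower_resolution, 0, Xtrain)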

K-Fold Cross-Validation

Now I would like to finish this chapter with another technique that is very powerful and should be known by any machine-learning practitioner (not only in the deep-learning world): k-fold cross-validation. The technique is a way of finding a solution to the following two problems:
  • What to do when your dataset is too small to split it into a train and a dev/test set

  • How to get information on the variance of your metric

Let’s describe the idea with pseudo-code.
  1. Partition your complete dataset into k equally big subsets: f1, f2, …, fk. The subsets are also called folds. Normally the folds are not overlapping, that is, each observation appears in one and only one fold.

  2. For i going from 1 to k:

     • Train your model on all the folds except fi.

     • Evaluate your metric on the fold fi. The fold fi will be the dev set in iteration i.

  3. Evaluate the average and variance of your metric on the k results.

A typical value for k is 10, but that depends on the size of your dataset and the characteristics of your problem.
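In plain numpy, the pseudo-code could be sketched as follows. This is only a generic skeleton: n_observations is the size of your dataset and train_and_evaluate is a hypothetical function that trains a model on the given indices and returns the metric on the dev fold; the from-scratch implementation for our MNIST subset follows later in this section.
k = 10
indices = np.arange(n_observations)
np.random.shuffle(indices)
folds = np.array_split(indices, k)           # k non-overlapping folds of indices

metrics = []
for i in range(k):
    dev_idx = folds[i]                       # fold i is the dev set in iteration i
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    metrics.append(train_and_evaluate(train_idx, dev_idx))

print(np.mean(metrics), np.var(metrics))     # average and variance of the metric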

Remember that the earlier discussion on how to split a dataset also applies here.

Note

When you are creating your folds, you must take care that they reflect the structure of your original dataset. For example, if your original dataset has 10 classes, you must make sure that each of your folds contains all 10 classes, in the same proportions.

Although this may seem a very attractive technique for dealing with datasets of less than optimal size, it may be quite complex to implement. But, as you will see shortly, checking your metric on the different folds will give you important information on possible overfitting of your training dataset.

Let’s try it on a real dataset and see how to implement it. Note that you can implement k-fold cross-validation easily in with the sklearn library, but I will develop it from scratch, to show you what is happening in the background. Everyone (well, almost) can copy code from the Web to implement k-fold cross-validation in sklearn, but not many can explain how it works or understand it, therefore being able to choose the right sklearn method or parameters. As a dataset, we will use the same we used in Chapter 2: the reduced MNIST dataset containing only digits 1 and 2. We will perform a simple logistic regression with one neuron, to make the code easy to understand and to let us concentrate on the cross-validation part and not on other implementation details that are not relevant here. The goal of this section is to let you understand how k-fold cross-validation works and why it is useful, not on how to implement it with the smallest number of lines of code possible.

Let’s import the necessary libraries, as usual.
import numpy as np
from sklearn.datasets import fetch_mldata
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
from random import *
Then let’s import the MNIST dataset.
mnist = fetch_mldata('MNIST original')
Xinput_,yinput_ = mnist["data"], mnist["target"]
Remember that the dataset has 70,000 observations and is made of grayscale images, each 28 × 28 pixels in size. You can again check Chapter 2 for a detailed discussion. Then let’s select only digits 1 and 2 and rescale the labels, to make sure that digit 1 has label 0 and digit 2 has label 1. You will remember from Chapter 2 that the cost function we will use for logistic regression expects the two labels to be 0 and 1.
Xinput = Xinput_[np.any([yinput_ == 1,yinput_ == 2], axis = 0)]
yinput = yinput_[np.any([yinput_ == 1,yinput_ == 2], axis = 0)]
yinput = yinput - 1
We can check the number of observations with the code
Xinput.shape[0]
We have 14,867 observations (images). Now we perform a small trick. To keep the code simple, we want each fold to have the same number of observations. Technically speaking, this is not required, and you will often end up with the last fold having a number of observations that is smaller than the others. In this case, if we want 10 folds, we cannot have in each fold the same number of observations, because 14,867 is not a multiple of 10. To make things easier, let’s simply remove the last seven images from the dataset. (From an aesthetic point of view, this is horrible, but it will make our code much easier to understand and write.)
Xinput = Xinput[:-7,:]
yinput = yinput[:-7]
Now let’s create 10 arrays, each containing a list of indexes that we will use to select images.
foldnumber = 10
idx = np.arange(0,Xinput.shape[0])
np.random.shuffle(idx)
al = np.array_split(idx,foldnumber)
In each fold, we will have, as expected, 1486 images. Now let’s create the arrays containing the images.
Xinputfold = []
yinputfold = []
for i in range(foldnumber):
    tmp = Xinput[al[i],:]
    Xinputfold.append(tmp)
    ytmp = yinput[al[i]]
    yinputfold.append(ytmp)
Xinputfold = np.asarray(Xinputfold)
yinputfold = np.asarray(yinputfold)
If you think this code is convoluted, you are right. There are faster ways of doing it with the sklearn library, but it is very instructive to see how to do it manually, step by step. I am convinced that the preceding code, in which each step is isolated, makes understanding it easier. We first create two empty lists: Xinputfold and yinputfold. Each element of the list will be a fold, that is, an array of images or labels. So, if we want to get all images in fold 2, we will simply use Xinputfold[1]. (Remember: In Python, indexes start from zero.) Those lists, converted with the last two lines into numpy arrays, will have three dimensions, as you can easily see with the statements
print(Xinputfold.shape)
print(yinputfold.shape)
This gives us
(10, 1486, 784)
(10, 1486)
In Xinputfold, the first dimension indicates the fold number, the second the observation, and the third the gray values of the pixels. In yinputfold, the first dimension indicates the fold number and the second the label. For example, to get an image with index 1234 from fold 0, you would have to use the following code:
Xinputfold[0][1234,:]
Remember: You should check that you still have a balanced dataset in each fold or, in other words, that you have as many ones as twos. Let’s check for fold 0 (you can do the same check for the others).
for i in range(0,2,1):
    print ("label", i, "makes", np.around(np.count_nonzero(yinputfold[0] == i)/1486.0*100.0, decimals=1), "% of the 1486 observations")
This gives us
label 0 makes 51.2 % of the 1486 observations
label 1 makes 48.8 % of the 1486 observations
That, for our purposes, is balanced enough. Now we need to normalize the features (as we did in Chapter 2).
Xinputfold_normalized = np.zeros_like(Xinputfold, dtype = float)
for i in range (foldnumber):
    Xinputfold_normalized[i] = Xinputfold[i]/255.0
You could normalize the data in one shot, but I would like to make it evident to the reader that we are dealing with folds. Now let's reshape the arrays as we need them.
X_train = []
y_train = []
for i in range(foldnumber):
    tmp = Xinputfold_normalized[i].transpose()
    ytmp = yinputfold[i].reshape(1,yinputfold[i].shape[0])
    X_train.append(tmp)
    y_train.append(ytmp)
X_train = np.asarray(X_train)
y_train = np.asarray(y_train)
The code is written in the easiest way possible, for instructive purposes, not in the most optimized way. Now we can check the dimensions of the final arrays with
print(X_train.shape)
print(y_train.shape)
This gives us
(10, 784, 1486)
(10, 1, 1486)
Exactly what we need. Now we are ready to build our network. We will use a one-neuron network for logistic regression, with the sigmoid activation function.
import tensorflow as tf
tf.reset_default_graph()
X = tf.placeholder(tf.float32, [n_dim, None])     # inputs: one column per observation (n_dim = 784 pixels)
Y = tf.placeholder(tf.float32, [1, None])         # labels: 0 or 1
learning_rate = tf.placeholder(tf.float32, shape=())
#W = tf.Variable(tf.zeros([1, n_dim]))            # alternative: initialize the weights to zero
W = tf.Variable(tf.random_normal([1, n_dim], stddev= 2.0 / np.sqrt(2.0*n_dim)))
b = tf.Variable(tf.zeros(1))
y_ = tf.sigmoid(tf.matmul(W,X)+b)
cost = - tf.reduce_mean(Y * tf.log(y_)+(1-Y) * tf.log(1-y_))
training_step = tf.train.AdamOptimizer(learning_rate = learning_rate, beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8).minimize(cost)
init = tf.global_variables_initializer()
Here, we have used the Adam optimizer, but gradient descent would work as well; this is a very easy case. We will use our well-known function to train the model.
def run_logistic_model(learning_r, training_epochs, train_obs, train_labels, debug = False):
    sess = tf.Session()
    sess.run(init)
    cost_history = np.empty(shape=[0], dtype = float)
    for epoch in range(training_epochs+1):
        sess.run(training_step, feed_dict = {X: train_obs, Y: train_labels, learning_rate: learning_r})
        cost_ = sess.run(cost, feed_dict={ X:train_obs, Y: train_labels, learning_rate: learning_r})
        cost_history = np.append(cost_history, cost_)
        if (epoch % 200 == 0) & debug:
            print("Reached epoch",epoch,"cost J =", str.format('{0:.6f}', cost_))
    return sess, cost_history
At this point, we have to iterate through the folds. Remember the pseudo-code at the beginning? Select one fold as the dev set, train the model on all the other folds concatenated, and proceed in this way for all the folds. The code could look like the following. (It is a bit long, so take a few minutes to understand it.) In the code, I have added comments indicating the steps; a corresponding numbered list of explanations follows the code.
train_acc = []
dev_acc = []
for i in range (foldnumber): # Step 1
    # Prepare the folds - Step 2
    lis = []
    ylis = []
    for k in np.delete(np.arange(foldnumber), i):
        lis.append(X_train[k])
        ylis.append(y_train[k])
    # Concatenate the remaining folds once, after collecting them all
    X_train_ = np.concatenate(lis, axis = 1)
    y_train_ = np.concatenate(ylis, axis = 1)
    X_train_ = np.asarray(X_train_)
    y_train_ = np.asarray(y_train_)
    X_dev_ = X_train[i]
    y_dev_ = y_train[i]
    
    # Step 3
    print('Dev fold is', i)
    sess, cost_history = run_logistic_model(learning_r = 5e-4,
                                training_epochs = 600,
                                train_obs = X_train_,
                                train_labels = y_train_,
                                debug = True)
    # Step 4
    correct_prediction=tf.equal(tf.greater(y_, 0.5), tf.equal(Y,1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    print('Train accuracy:',sess.run(accuracy, feed_dict={X:X_train_, Y: y_train_, learning_rate: 5e-4}))
    train_acc = np.append( train_acc, sess.run(accuracy, feed_dict={X:X_train_, Y: y_train_, learning_rate: 5e-4}))
    correct_prediction=tf.equal(tf.greater(y_, 0.5), tf.equal(Y,1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    print('Dev accuracy:',sess.run(accuracy, feed_dict={X:X_dev_, Y: y_dev_, learning_rate: 5e-4}))
    dev_acc = np.append( dev_acc, sess.run(accuracy, feed_dict={X:X_dev_, Y: y_dev_, learning_rate: 5e-4}))
    sess.close()
The code follows these steps:
  1. Do a loop over all the folds (in this case, from 1 to 10), iterating with the variable i from 0 to 9.

  2. For each i, use the fold i as the dev set, and concatenate all other folds and use the result as train set.

  3. For each i, train the model.

  4. For each i, evaluate the accuracy on the two datasets (train and dev) and save the values in the two lists: train_acc and dev_acc.

If you run this code, you will get output that looks like the following for each of the 10 folds:
Dev fold is 0
Reached epoch 0 cost J = 0.766134
Reached epoch 200 cost J = 0.169536
Reached epoch 400 cost J = 0.100431
Reached epoch 600 cost J = 0.074989
Train accuracy: 0.987289
Dev accuracy: 0.984522
You will notice that each fold gives slightly different accuracy values. It is very instructive to study how the accuracy values are distributed. Because we have 10 folds, we have 10 values to study. In Figure 6-9, you can see the distribution of the values for the train set (left) and for the dev set (right).
Figure 6-9: Distribution of the accuracy values for the train set (left) and for the dev set (right). Note that the two plots use the same scale on both axes.

The figure is quite instructive. You can see that the accuracy values for the training set are quite concentrated around the average, while the ones evaluated on the dev set are much more spread out. This shows how the model behaves less well on new data than on the data it was trained on. The standard deviation for the training data is 5.4 · 10−4 and for the dev set 2.4 · 10−3, roughly 4.5 times larger than the value on the train set. In this way, you also get an estimate of the variance of your metric when applied to new data, and of how well the model generalizes. If you are interested in learning how to do this quickly with sklearn, you can check the official documentation for the KFold method here: https://goo.gl/Gq1Ce4 . When you are dealing with datasets with many classes (remember the discussion on how to split your sets?), you must pay attention and do what is called stratified sampling.2 sklearn provides a method for that too, StratifiedKFold, which can be accessed here: https://goo.gl/ZBKrdt .
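For reference, here is a minimal sketch of how the folds could be produced with sklearn (this assumes a version of sklearn that includes the model_selection module; the variable names are mine):
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)   # preserves the class proportions in each fold
for train_idx, dev_idx in skf.split(Xinput, yinput):
    X_tr, y_tr = Xinput[train_idx], yinput[train_idx]
    X_dv, y_dv = Xinput[dev_idx], yinput[dev_idx]
    # ... train on X_tr / y_tr and evaluate on X_dv / y_dv, as in the loop above ...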

You can now easily find averages and standard deviations. For the training set, we have an average accuracy of 98.7% and a standard deviation of 0.054%, while for the dev set, we have an average of 98.6% with a standard deviation of 0.24%. So, now you can even give an estimate of the variance of your metric. Pretty cool!
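With the train_acc and dev_acc arrays collected in the loop above, these numbers take only two lines (a minimal sketch):
print("Train: mean =", np.mean(train_acc), "std =", np.std(train_acc))
print("Dev:   mean =", np.mean(dev_acc), "std =", np.std(dev_acc))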

Manual Metric Analysis: An Example

I mentioned earlier that sometimes it is useful to do a manual analysis of your data, to check whether the results (or the errors) you are getting are plausible. I would like to give you a basic example here, to give you a concrete idea of what is involved and how complicated it can be. Consider the following: our very simple model (remember, we are using only one neuron) reaches 98% accuracy. Is the problem of recognizing digits really that easy? Let's try to find out. First, note that our training set does not even contain the two-dimensional information of the images. If you remember, each image is converted into a one-dimensional array of values: the gray values of each pixel, starting at the top left and going row by row from top to bottom. Are the ones and the twos so easy to recognize in this form? Let's check how the real input to our model looks, starting with the digit 1 and an example from fold 0. In Figure 6-10, you can see the image on the left and, on the right, a bar plot of the gray values of its 784 pixels, as they are seen by our model. Remember that, as observations, we have a one-dimensional array of the 784 gray values of the pixels of the image.
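A sketch of how such a figure could be produced with matplotlib (the particular example chosen is arbitrary; idx and img are names I introduce here):
idx = np.where(yinputfold[0] == 0)[0][0]          # first image in fold 0 with label 0, i.e., a digit 1
img = Xinputfold[0][idx,:]                        # the 784 gray values, as the model sees them
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.imshow(img.reshape(28, 28), cmap='gray_r')    # the two-dimensional image
ax1.set_title('28 x 28 image')
ax2.bar(np.arange(784), img)                      # bar plot of the one-dimensional representation
ax2.set_title('784 gray values')
plt.show()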
Figure 6-10: Example from fold 0 for the digit 1. The image is shown on the left; on the right is a bar plot of the gray values of its 784 pixels, as they are seen by our model (a one-dimensional array of 784 values).

Remember that we reshape our 28 × 28 pixel images into one-dimensional arrays, so when reshaping the digit 1 in Figure 6-10, we will find black points roughly every 28 pixels, because the 1 is almost a vertical column of black points. In Figure 6-11, you can see other examples of the digit 1, and you will notice how, when reshaped as one-dimensional arrays, they all look the same: several bars roughly equally spaced. Now that you know what to look for, you can easily tell that all the images in Figure 6-11 are of the digit 1.
Figure 6-11: Four examples of the digit 1, reshaped as one-dimensional arrays. All look the same: a number of bars roughly equally spaced.

Now let’s look at the digit 2. In Figure 6-12, you can see an example, similar to what we had in Figure 6-10.
Figure 6-12: Example from fold 0 for the digit 2. The image is shown on the left; on the right is a bar plot of the gray values of its 784 pixels, as they are seen by our model.

Now things look different. We have two regions in which the bars are much denser, seen in the plot on the right in Figure 6-12. This is the case between pixels 100 and 200 and especially after pixel 500. Why? Well, the two areas correspond to the two horizontal parts of the image. In Figure 6-13, I have highlighted how different parts look when reshaped as one-dimensional arrays.
Figure 6-13: How different parts of the image look when reshaped as one-dimensional arrays. Horizontal parts are labeled (A) and (B), and the more vertical part is labeled (C).

Horizontal parts (A) and (B) are clearly different from part (C) when reshaped as a one-dimensional array. The vertical part (C) looks like the digit 1, with many equally spaced bars, as you can see in the lower right bar plot labeled (C), while the more horizontal parts appear as many bars clustered in groups, as can be seen in the upper right and lower left bar plots labeled (A) and (B). So, when reshaped, if you find those clusters of bars, you are looking at a 2. If you see only equally spaced small groups of bars, as in plot (C) in Figure 6-13, you are looking at a 1. You don't even have to see the two-dimensional image, if you know what to look for. Note that this pattern is very consistent. In Figure 6-14, you can see four examples of the digit 2, and you can clearly see the wider clusters of bars.
Figure 6-14: Four examples of the digit 2, reshaped as one-dimensional arrays. The wider clusters of bars can be seen clearly.

As you can imagine, this is an easy pattern for an algorithm to spot, so it is to be expected that our model works well. Even a human can classify these images, even when reshaped, without any effort. Such a detailed analysis would not be necessary in a real-life project, but it is instructive to see what you can learn from your data. Understanding the characteristics of your data may help you in designing your model or in understanding why it is not working. Advanced architectures, such as convolutional networks, will be able to learn exactly those two-dimensional features in a very efficient way.

Let’s also check how the network learned to recognize digits. You will remember that the output of our neuron is
$$ \widehat{y} = \sigma(z) = \sigma \left( w_1 x_1 + w_2 x_2 + \dots + w_{n_x} x_{n_x} + b \right) $$
where σ is the sigmoid function, xi for i = 1, …, 784 are the gray values of the pixels of the image, wi for i = 1, …, 784 are the weights, and b is the bias. Remember that when $$ \widehat{y} > 0.5 $$, we classify the image in class 1 (so, digit 2), and if $$ \widehat{y} < 0.5 $$, we classify the image in class 0 (so, digit 1). Now, from the discussion of the sigmoid function in Chapter 2, you will remember that σ(z) ≥ 0.5 when z ≥ 0 and σ(z) < 0.5 for z < 0. This means that our network should learn the weights in such a way that for all the ones, we have z < 0, and for all the twos, z ≥ 0. Let's see if that is really the case. In Figure 6-15, you can see a plot for a digit 1, showing the weights wi (solid line) of our trained network after 600 epochs (after reaching an accuracy of 98%) and the gray values of the pixels xi, rescaled to have a maximum of 0.5 (dashed line). Note how, each time xi is big, wi is negative, and when wi > 0, the xi are almost zero. Clearly, the result $$ w_1 x_1 + w_2 x_2 + \dots + w_{n_x} x_{n_x} + b $$ will be negative, and, therefore, σ(z) < 0.5, and the network will identify the image as a 1. In the image, I zoomed in to make this behavior more evident.
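If you want to look at the trained weights yourself, a sketch like the following could be used (it assumes you keep the session of one fold open, that is, you skip the sess.close() call for that run; the index selection and variable names are mine):
W_, b_ = sess.run([W, b])                         # trained weights, shape (1, 784), and bias
idx = np.where(y_dev_[0,:] == 0)[0][0]            # a digit 1 (label 0) in the dev fold
x1 = X_dev_[:,idx]

plt.plot(W_[0,:], label='weights w_i')
plt.plot(0.5 * x1 / x1.max(), '--', label='pixels x_i (rescaled to max 0.5)')
plt.legend()
plt.show()

z = np.dot(W_[0,:], x1) + b_[0]
print(z)    # expected to be negative, so that sigma(z) < 0.5 and the image is classified as a 1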
Figure 6-15: Plot for a digit 1, showing the weights wi (solid line) of our trained network after 600 epochs (accuracy 98%) and the gray values of the pixels xi, rescaled to have a maximum of 0.5 (dashed line)

In Figure 6-16, you can see the same plot for a digit 2. You will remember from the previous discussion that for a 2, we can see many bars clustered together in groups up to pixel 250 (roughly). Let's check what the weights in that region look like. You will see that where the pixel gray values are big, the weights are positive, giving a positive value of z and, therefore, σ(z) ≥ 0.5, so the image is classified as a 2.
Figure 6-16: Plot for a digit 2, showing the weights wi (solid line) of our trained network after 600 epochs (accuracy 98%) and the gray values of the pixels xi, rescaled to have a maximum of 0.5 (dashed line)

As an additional check, I plotted wi · xi for all values of i for a digit 1, shown in Figure 6-17. You can see how almost all points lie below zero. Note also that b = −0.16 in this case.
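Reusing W_ and x1 from the previous sketch, the corresponding plot takes only a couple of lines:
plt.plot(W_[0,:] * x1, 'o', markersize=2)   # w_i * x_i for i = 1, ..., 784
plt.axhline(0.0, linewidth=1)
plt.show()
print(b_[0])                                # the bias b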
Figure 6-17: wi · xi for i = 1, …, 784 for a digit 1. Almost all values lie below zero; the thick line at zero is made of all the points i for which wi · xi = 0.

As you can see, in very easy cases, it is possible to understand how a network learns and, therefore, it is much easier to debug strange behaviors. But don’t expect this to be possible when dealing with much more complex cases. The analysis we have done would not be so easy, for example, if you tried to do the same with digits 3 and 8, instead of 1 and 2.
