4 Structuring DL projects and hyperparameter tuning

This chapter covers

  • Defining performance metrics
  • Designing baseline models
  • Preparing training data
  • Evaluating a model and improving its performance

This chapter concludes the first part of this book, providing a foundation for deep learning (DL). In chapter 2, you learned how to build a multilayer perceptron (MLP). In chapter 3, you learned about a neural network architecture that is very commonly used in computer vision (CV) problems: convolutional neural networks (CNNs). In this chapter, we will wrap up this foundation by discussing how to structure your machine learning (ML) project from start to finish. You will learn strategies to quickly and efficiently get your DL systems working, analyze the results, and improve network performance.

As you might have already noticed from the previous projects, DL is a very empirical process. It relies on running experiments and observing model performance more than having one go-to formula for success that fits all problems. We often have an initial idea for a solution, code it up, run the experiment to see how it did, and then use the outcome of this experiment to refine our ideas. When building and tuning a neural network, you will find yourself making many seemingly arbitrary decisions:

  • What is a good architecture to start with?

  • How many hidden layers should you stack?

  • How many hidden units or filters should go in each layer?

  • What is the learning rate?

  • Which activation function should you use?

  • Which yields better results, getting more data or tuning hyperparameters?

In this chapter, you will learn the following:

  • Defining the performance metrics for your system --In addition to model accuracy, you will use other metrics like precision, recall, and F-score to evaluate your network.

  • Designing a baseline model --You will choose an appropriate neural network architecture to run your first experiment.

  • Getting your data ready for training --In real-world problems, data comes in messy, not ready to be fed to a neural network. In this section, you will massage your data to get it ready for learning.

  • Evaluating your model and interpreting its performance --When training is complete, you analyze your model’s performance to identify bottlenecks and narrow down improvement options. This means diagnosing which of the network components are performing worse than expected and identifying whether poor performance is due to overfitting, underfitting, or a defect in the data.

  • Improving the network and tuning hyperparameters --Finally, we will dive deep into the most important hyperparameters to help develop your intuition about which hyperparameters you need to tune. You will use tuning strategies to make incremental changes based on your diagnosis from the previous step.

TIP With more practice and experimentation, DL engineers and researchers build their intuition over time as to the most effective ways to make improvements. My advice is to get your hands dirty and try different architectures and approaches to develop your hyperparameter-tuning skills.

Ready? Let’s get started!

4.1 Defining performance metrics

Performance metrics allow us to evaluate our system. When we develop a model, we want to find out how well it is working. The simplest way to measure the “goodness” of our model is by measuring its accuracy. The accuracy metric measures how many times our model made the correct prediction. So, if we test the model with 100 input samples, and it made the correct prediction 90 times, this means the model is 90% accurate.

Here is the equation used to calculate model accuracy:

accuracy = number of correct predictions / total number of predictions

4.1.1 Is accuracy the best metric for evaluating a model?

We have been using accuracy as a metric for evaluating our model in earlier projects, and it works fine in many cases. But let's consider the following problem: you are designing a medical diagnosis test for a rare disease. Suppose that only one in every million people has this disease. Without any training, or even building a system at all, if you hardcode the output to always be negative (no disease found), your system will achieve 99.9999% accuracy. Is that good? The accuracy might sound fantastic, but the system will never catch a single patient who has the disease. This means the accuracy metric is not suitable for measuring the "goodness" of this model. We need other evaluation metrics that measure different aspects of the model's prediction ability.

4.1.2 Confusion matrix

To set the stage for other metrics, we will use a confusion matrix: a table that describes the performance of a classification model. The confusion matrix itself is relatively simple to understand, but the related terminology can be a little confusing at first. Once you understand it, you'll find that the concept is really intuitive and makes a lot of sense. Let's go through it step by step.

The goal is to describe model performance from different angles other than prediction accuracy. For example, suppose we are building a classifier to predict whether a patient is sick or healthy. The expected classifications are either positive (the patient is sick) or negative (the patient is healthy). We run our model on 1,000 patients and enter the model predictions in table 4.1.

Table 4.1 Running our model to predict healthy vs. sick patients

                                 Predicted sick (positive)        Predicted healthy (negative)
Sick patients (positive)         100 (true positives, TP)         30 (false negatives, FN)
Healthy patients (negative)      70 (false positives, FP)         800 (true negatives, TN)

Let’s now define the most basic terms, which are whole numbers (not rates):

  • True positives (TP) --The model correctly predicted yes (the patient has the disease).

  • True negatives (TN) --The model correctly predicted no (the patient does not have the disease).

  • False positives (FP) --The model falsely predicted yes, but the patient actually does not have the disease (in some literature known as a Type I error or error of the first kind).

  • False negatives (FN) --The model falsely predicted no, but the patient actually does have the disease (in some literature known as a Type II error or error of the second kind).

The patients that the model predicts are negative (no disease) are the ones the model believes are healthy, and we can send them home without further care. The patients that the model predicts are positive (have the disease) are the ones we will send for further investigation. Which mistake would we rather make? Mistakenly diagnosing someone as positive (has the disease) and sending them for more tests is not as bad as mistakenly diagnosing someone as negative (healthy) and sending them home at risk to their life. So for this problem, we care most about keeping the number of false negatives (FN) low: we want to find all the sick people, even if the model accidentally classifies some healthy people as sick. The metric that captures this is called recall.

4.1.3 Precision and recall

Recall (also known as sensitivity) tells us how many of the truly sick patients our model caught: out of all sick patients, how many did it correctly diagnose as sick? Every time the model incorrectly diagnoses a sick patient as negative (a false negative, FN), recall goes down. Recall is calculated by the following equation:

recall = TP / (TP + FN)

Precision (also known as positive predictive value) complements recall. It tells us how many of the patients our model diagnosed as sick are actually sick: out of all positive predictions, how many were correct? Every time the model incorrectly diagnoses a healthy patient as positive (a false positive, FP), precision goes down. Precision is calculated by the following equation:

precision = TP / (TP + FP)

Identifying an appropriate metric

It is important to note that although in the example of health diagnostics we decided that recall is a better metric, other use cases require different metrics, like precision. To identify the most appropriate metric for your problem, ask yourself which of the two possible false predictions is more consequential: false positive or false negative. If your answer is FP, then you are looking for precision. If FN is more significant, then recall is your answer.

Consider a spam email classifier, for example. Which of the two false predictions would you care about more: falsely classifying a non-spam email as spam, in which case it gets lost, or falsely classifying a spam email as non-spam, after which it makes its way to the inbox folder? I believe you would care more about the former. You don’t want the receiver to lose an email because your model misclassified it as spam. We want to catch all spam, but it is very bad to lose a non-spam email. In this example, precision is a suitable metric to use.

In some applications, you care about both precision and recall at the same time. In that case, you can combine them into a single metric called the F-score, as explained next.

4.1.4 F-score

In many cases, we want to summarize the performance of a classifier with a single metric that represents both recall and precision. To do so, we can convert precision (p) and recall (r) into a single F-score metric. In mathematics, this is called the harmonic mean of p and r:

F-score = 2 × (p × r) / (p + r)

The F-score gives a good overall representation of how your model is performing. Let’s take a look at the health-diagnostics example again. We agreed that this is a high-recall model. But what if the model is doing really well on the FN and giving us a high recall score, but it’s performing poorly on the FP and giving us a low precision score? Doing poorly on FP means, in order to not miss any sick patients, it is mistakenly diagnosing a lot of patients as sick, to be on the safe side. So, while recall might be more important for this problem, it is good to look at the model from both scores--precision and recall--together:

                 Precision    Recall    F-score
Classifier A     95%          90%       92.4%
Classifier B     98%          85%       91%
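To make these definitions concrete, here is a quick sketch in plain Python that computes all four metrics for the confusion matrix in table 4.1 (TP = 100, FN = 30, FP = 70, TN = 800):

TP, FN, FP, TN = 100, 30, 70, 800                           # counts from table 4.1

accuracy = (TP + TN) / (TP + TN + FP + FN)                  # 0.90
recall = TP / (TP + FN)                                     # 0.769: how many of the sick patients we caught
precision = TP / (TP + FP)                                  # 0.588: how many "sick" predictions were correct
f_score = 2 * precision * recall / (precision + recall)     # 0.667

print(accuracy, recall, precision, f_score)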

NOTE Defining the model evaluation metric is a necessary step because it will guide your approach to improving the system. Without clearly defined metrics, it can be difficult to tell whether changes to a ML system result in progress or not.

4.2 Designing a baseline model

Now that you have selected the metrics you will use to evaluate your system, it is time to establish a reasonable end-to-end system for training your model. Depending on the problem you are solving, you need to design the baseline to suit your network type and architecture. In this step, you will want to answer questions like these:

  • Should I use an MLP or CNN network (or RNN, explained later in the book)?

  • Should I use other object detection techniques like YOLO or SSD (explained in later chapters)?

  • How deep should my network be?

  • Which activation type will I use?

  • What kind of optimizer do I use?

  • Do I need to add any other regularization layers like dropout or batch normalization to avoid overfitting?

If your problem is similar to another problem that has been studied extensively, you will do well to first copy the model and algorithm already known to perform the best for that task. You can even use a model that was trained on a different dataset for your own problem without having to train it from scratch. This is called transfer learning and will be discussed in detail in chapter 6.

For example, in the last chapter's project, we used the architecture of the popular AlexNet as a baseline model. Figure 4.1 shows the architecture of the AlexNet deep CNN, with the dimensions of each layer. The input layer is followed by five convolutional layers (CONV1 through CONV5); the output of the fifth convolutional layer, after a final pooling layer, is fed into two fully connected layers (FC6 and FC7), and the output layer is a fully connected layer (FC8) with a softmax function:

INPUT ⇒ CONV1 ⇒ POOL1 ⇒ CONV2 ⇒ POOL2 ⇒ CONV3 ⇒ CONV4 ⇒ CONV5 ⇒ POOL3 ⇒ FC6 ⇒ FC7 ⇒ SOFTMAX_8

Figure 4.1 The AlexNet architecture consists of five convolutional layers and three FC layers.

Looking at the AlexNet architecture, you will find all the network hyperparameters that you need to get started with your own model (a minimal Keras sketch follows the list):

  • Network depth (number of layers): 5 convolutional layers plus 3 fully connected layers

  • Layers’ depth (number of filters): CONV1 = 96, CONV2 = 256, CONV3 = 384, CONV4 = 384, CONV5 = 256

  • Filter size: 11 × 11, 5 × 5, 3 × 3, 3 × 3, 3 × 3

  • ReLU as the activation function in the hidden layers (CONV1 all the way to FC7)

  • Max pooling layers after CONV1, CONV2, and CONV5

  • FC6 and FC7 with 4,096 neurons each

  • FC8 with 1,000 neurons, using a softmax activation function
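Here is a minimal Keras sketch of an AlexNet-style baseline built from the hyperparameters listed above. The strides, padding, and the 227 × 227 input size are assumptions based on common AlexNet implementations, and the original's local response normalization and dropout layers are omitted for brevity:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential()
# CONV1 => POOL1
model.add(Conv2D(96, kernel_size=11, strides=4, activation='relu', input_shape=(227, 227, 3)))
model.add(MaxPooling2D(pool_size=3, strides=2))
# CONV2 => POOL2
model.add(Conv2D(256, kernel_size=5, padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=3, strides=2))
# CONV3 => CONV4 => CONV5 => POOL3
model.add(Conv2D(384, kernel_size=3, padding='same', activation='relu'))
model.add(Conv2D(384, kernel_size=3, padding='same', activation='relu'))
model.add(Conv2D(256, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=3, strides=2))
# FC6 => FC7 => FC8 with softmax
model.add(Flatten())
model.add(Dense(4096, activation='relu'))
model.add(Dense(4096, activation='relu'))
model.add(Dense(1000, activation='softmax'))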

NOTE In the next chapter, we will discuss some of the most popular CNN architectures along with their code implementations in Keras. We will look at networks like LeNet, AlexNet, VGG, ResNet, and Inception that will build your understanding of what architecture works best for different problems and perhaps inspire you to invent your own CNN architecture.

4.3 Getting your data ready for training

We have defined the performance metrics that we will use to evaluate our model and have built the architecture of our baseline model. Let’s get our data ready for training. It is important to note that this process varies a lot based on the problem and data you have. Here, I’ll explain the basic data-massaging techniques that you need to perform before training your model. I’ll also help you develop an instinct for what “ready data” looks like so you can determine which preprocessing techniques you need.

4.3.1 Splitting your data for train/validation/test

When we train a ML model, we split the data into train and test datasets (figure 4.2). We use the training dataset to train the model and update the weights, and then we evaluate the model against the test dataset that it hasn’t seen before. The golden rule here is this: never use the test data for training. The reason we should never show the test samples to the model while training is to make sure the model is not cheating. We show the model the training samples to learn their features, and then we test how it generalizes on a dataset that it has never seen, to get an unbiased evaluation of its performance.

Figure 4.2 Splitting the data into training and testing datasets

What is the validation dataset?

After each epoch during the training process, we need to evaluate the model’s accuracy and error to see how it is performing and tune its parameters. If we use the test dataset to evaluate the model during training, we will break our golden rule of never using the testing data during training. The test data is only used to evaluate the final performance of the model after training is complete. So we make an additional split called a validation dataset to evaluate and tune parameters during training (figure 4.3). Once the model has completed training, we test its final performance over the test dataset.

Figure 4.3 An additional split called a validation dataset to evaluate the model during training while keeping the test subset for the final test after training

Take a look at this pseudo code for model training:

for each epoch
    for each training data instance
        propagate error through the network
        adjust the weights
        calculate the accuracy and error over training data
    for each validation data instance
        calculate the accuracy and error over the validation data

As we saw in the project in chapter 3, when we train the model, we get train_loss, train_acc, val_loss, and val_acc after each epoch (figure 4.4). We use this data to analyze the network’s performance and diagnose overfitting and underfitting, as you will see in section 4.4.

Figure 4.4 Training results after each epoch

What is a good train/validation/test data split?

Traditionally, an 80/20 or 70/30 split between train and test datasets is used in ML projects. When we add the validation dataset, the split typically becomes 60/20/20 or 70/15/15. But those ratios date from when an entire dataset was just tens of thousands of samples. With the huge amounts of data we have now, sometimes 1% each for the validation and test sets is enough. For example, if our dataset contains 1 million samples, 10,000 samples is very reasonable for each of the test and validation sets; it doesn't make sense to hold back several hundred thousand samples when that data could be used for model training.

So, to recap, if you have a relatively small dataset, the traditional ratios might be okay. But if you are dealing with a large dataset, then it is fine to set your train and validation sets to much smaller values.
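For example, assuming your samples and labels are already loaded into arrays X and y, a quick sketch of a 70/15/15 split using two calls to scikit-learn's train_test_split looks like this:

from sklearn.model_selection import train_test_split

# First hold out 30% of the data, then split that holdout half-and-half into validation and test sets
train_X, tmp_X, train_y, tmp_y = train_test_split(X, y, test_size=0.30, random_state=42)
val_X, test_X, val_y, test_y = train_test_split(tmp_X, tmp_y, test_size=0.50, random_state=42)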

Be sure datasets are from the same distribution

An important thing to be aware of when splitting your data is to make sure the train/validation/test datasets come from the same distribution. Suppose you are building a car classifier that will be deployed on cell phones to detect car models. Keep in mind that DL networks are data-hungry, and the common rule of thumb is that the more data you have, the better your model will perform. So, to source your data, you decide to crawl the internet for car images, all of which are high-quality, professionally framed shots. You train and tune your model, achieve satisfying results on your test dataset, and are ready to release the model to the world--only to discover that it performs poorly on real-life images taken by phone cameras. This happens because your model has been trained and tuned to do well on high-quality images, so it fails to generalize to real-life images that may be blurry, lower resolution, or otherwise different in character.

In more technical words, your training and validation datasets are composed of high-quality images, whereas the production (real-life) images are lower quality. It is therefore very important to include images like the ones your model will see in production (for example, lower-quality phone photos) in your training and validation datasets. In short, the train/validation/test datasets should all come from the same distribution, and that distribution should reflect the data the model will face in the real world.

4.3.2 Data preprocessing

Before you feed your data to the neural network, you will need to do some data cleanup and processing to get it ready for your learning model. There are several preprocessing techniques to choose from, based on the state of your dataset and the problem you are solving. The good news about neural networks is that they require minimal data preprocessing. When given a large amount of training data, they are able to extract and learn features from raw data, unlike the other traditional ML techniques.

With that said, preprocessing might still be required to improve performance or to satisfy the network's input requirements. Common techniques include converting images to grayscale, image resizing, data normalization, and data augmentation. In this section, we'll go through these preprocessing concepts; we'll see their code implementations in the project at the end of the chapter.

Image grayscaling

We talked in chapter 3 about how color images are represented in three matrices versus only one matrix for grayscale images; color images add computational complexity with their many parameters. You can make a judgment call about converting all your images to grayscale, if your problem doesn’t require color, to save on the computational complexity. A good rule of thumb here is to use the human-level performance rule: if you are able to identify the object with your eyes in grayscale images, then a neural network will probably be able to do the same.

Image resizing

One limitation for neural networks is that they require all images to be the same shape. If you are using MLPs, for example, the number of nodes in the input layer must be equal to the number of pixels in the image (remember how, in chapter 3, we flattened the image to feed it to the MLP). The same is true for CNNs. You need to set the input shape of the first convolutional layer. To demonstrate this, let’s look at the Keras code to add the first CNN layer:

model.add(Conv2D(filters=16, kernel_size=2, padding='same', activation='relu', input_shape=(32, 32, 3)))

As you can see, we have to define the shape of the image at the first convolutional layer. For example, if we have three images with dimensions of 32 × 32, 28 × 28, and 64 × 64, we have to resize all the images to one size before feeding them to the model.
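For example, assuming the images are already loaded as NumPy arrays in a hypothetical list called images and that OpenCV is available, a quick way to bring them all to one size is:

import cv2

# Resize every image to 32 x 32 so it matches the input_shape of the first layer
resized_images = [cv2.resize(image, (32, 32)) for image in images]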

Data normalization

Data normalization is the process of rescaling your data to ensure that each input feature (pixel, in the image case) has a similar data distribution. Often, raw images are composed of pixels with varying scales (ranges of values). For example, one image may have a pixel value range from 0 to 255, and another may have a range of 20 to 200. Although not required, it is preferred to normalize the pixel values to the range of 0 to 1 to boost learning performance and make the network converge faster.

To make learning faster for your neural network, your data should have the following characteristics:

  • Small values --Typically, most values should be in the [0, 1] range.

  • Homogeneous --All pixels should have values in the same range.

A common way to normalize image data is to subtract the mean from each pixel and then divide the result by the standard deviation. The distribution of such data resembles a Gaussian curve centered at zero. To demonstrate the normalization process, figure 4.5 illustrates the operation in a scatterplot.

Figure 4.5 To normalize data, we subtract the mean from each pixel and divide the result by the standard deviation.

TIP Make sure you normalize your training and test data by using the same mean and standard deviation, because you want your data to go through the same transformation and rescale exactly the same way. You will see how this is implemented in the project at the end of this chapter.
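Here is a minimal sketch of this kind of normalization, assuming train_X and test_X are NumPy arrays of images. Note that the mean and standard deviation are computed from the training data only and then reused to transform the test data:

import numpy as np

train_X = train_X.astype('float32')
test_X = test_X.astype('float32')

mean = np.mean(train_X)                      # statistics computed on the training set only
std = np.std(train_X)

train_X = (train_X - mean) / (std + 1e-7)    # small epsilon avoids division by zero
test_X = (test_X - mean) / (std + 1e-7)      # apply the same transformation to the test set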

In non-normalized data, the cost function will likely look like a squished, elongated bowl. After you normalize your features, your cost function will look more symmetric. Figure 4.6 shows the cost function of two features, F1 and F2.

Figure 4.6 Normalized features help the GD algorithm go straight forward toward the minimum error, thereby reaching it quickly (left). With non-normalized features, the GD oscillates toward the direction of the minimum error and reaches the minimum more slowly (right).

As you can see, for normalized features, the GD algorithm goes straight forward toward the minimum error, thereby reaching it quickly. But for non-normalized features, it oscillates toward the direction of the minimum error and ends with a long march down the error mountain. It will eventually reach the minimum, but it will take longer to converge.

TIP Why does GD oscillate for non-normalized features? If we don't normalize our data, the range of values will likely differ from one feature to another, so the same learning rate causes corrections of very different magnitudes in each dimension. This forces GD to oscillate on its way toward the minimum error and to take a longer path down the error curve.

Image augmentation

Data augmentation will be discussed in more detail later in this chapter, when we cover regularization techniques. But it is important for you to know that this is another preprocessing technique that you have in your toolbelt to use when needed.

4.4 Evaluating the model and interpreting its performance

After the baseline model is established and the data is preprocessed, it is time to train the model and measure its performance. After training is complete, you need to determine if there are bottlenecks, diagnose which components are performing poorly, and determine whether the poor performance is due to overfitting, underfitting, or a defect in the training data.

One of the main criticisms of neural networks is that they are “black boxes.” Even when they work very well, it is hard to understand why they work so well. Many efforts are being made to improve the interpretability of neural networks, and this field is likely to evolve rapidly in the next few years. In this section, I’ll show you how to diagnose neural networks and analyze their behavior.

4.4.1 Diagnosing overfitting and underfitting

After running your experiment, you want to observe its performance, determine if bottlenecks are impacting its performance, and look for indicators of areas you need to improve. The main cause of poor performance in ML is either overfitting or underfitting the training dataset. We talked about overfitting and underfitting in chapter 3, but now we will dive a little deeper to understand how to detect when the system is fitting the training data too much (overfitting) and when it is too simple to fit the data (underfitting):

  • Underfitting means the model is too simple: it fails to learn the training data, so it performs poorly even on the training data. One example of underfitting is using a single perceptron (a straight line) to separate the two classes of shapes in figure 4.7. As you can see, a straight line does not split the data accurately.

    Figure 4.7 An example of underfitting

  • Overfitting is when the model is too complex for the problem at hand. Instead of learning features that fit the training data, it actually memorizes the training data. So it performs very well on the training data, but it fails to generalize when tested with new data that it hasn’t seen before. In figure 4.8, you see that the model fits the data too well: it splits the training data, but this kind of fitting will fail to generalize.

    Figure 4.8 An example of overfitting

  • We want to build a model that is just right for the data: not so complex that it overfits, and not so simple that it underfits. In figure 4.9, you see that the model misclassifies one sample of the O shape, but it looks much more likely to generalize to new data.

    Figure 4.9 A model that is just right for the data and will generalize

TIP The analogy I like to use to explain overfitting and underfitting is a student studying for an exam. Underfitting is when the student doesn’t study very well and so fails the exam. Overfitting is when the student memorizes the book and can answer correctly when asked questions from the book, but answers poorly when asked questions from outside the book. The student failed to generalize. What we want is a student to learn from the book (training data) well enough to be able to generalize when asked questions related to the book material.

To diagnose underfitting and overfitting, the two values to focus on while training are the training error and the validation error:

  • If the model is doing very well on the training set but relatively poorly on the validation set, then it is overfitting. For example, if train_error is 1% and val_error is 10%, it looks like the model has memorized the training dataset but is failing to generalize on the validation set. In this case, you might consider tuning your hyperparameters to avoid overfitting and iteratively train, test, and evaluate until you achieve an acceptable performance.

  • If the model is performing poorly on the training set, then it is underfitting. For example, if the train_error is 14% and val_error is 15%, the model might be too simple and is failing to learn the training set. You might want to consider adding more hidden layers or training longer (more epochs), or try different neural network architectures.

In the next section, we will discuss several hyperparameter-tuning techniques to avoid overfitting and underfitting.

Using human-level performance to identify a Bayes error rate

We talked about achieving a satisfying performance, but how can we know whether performance is good or not? We need a realistic baseline to compare the training and validation errors to, in order to know whether we are improving. Ideally, a 0% error rate is great, but it is not a realistic target for all problems and may even be impossible. That is why we need to define a Bayes error rate.

A Bayes error rate represents the best possible error our model can achieve (theoretically). Since humans are usually very good with visual tasks, we can use human-level performance as a proxy to measure Bayes error. For example, if you are working on a relatively simple task like classifying dogs and cats, humans are very accurate. The human error rate will be very low: say, 0.5%. Then we want to compare the train_error of our model with this value. If our model accuracy is 95%, that’s not satisfying performance, and the model might be underfitting. On the other hand, suppose we are working on a more complex task for humans, like building a medical image classification model for radiologists. The human error rate could be a little higher here: say, 5%. Then a model that is 95% accurate is actually doing a good job.

Of course, this is not to say that DL models can never surpass human performance: on the contrary. But it is a good way to draw a baseline to gauge whether a model is doing well. (Note that the example error percentages are just arbitrary numbers for the sake of the example.)

4.4.2 Plotting the learning curves

Instead of looking at the training verbose output and comparing the error numbers, one way to diagnose overfitting and underfitting is to plot your training and validation errors throughout the training, as you see in figure 4.10.

Figure 4.10A shows that the network improves the loss value (aka learns) on the training data but fails to generalize on the validation data. The validation loss improves for the first couple of epochs and then flattens out, and may even start getting worse. This is a form of overfitting. Note that this graph shows that the network is actually learning on the training data, a good sign that training is happening. So you don't need to add more hidden units, nor do you need to build a more complex model. If anything, your network is too complex for your data: it is learning so much that it is memorizing the training data and failing to generalize to new data. In this case, your next step might be to collect more data or apply techniques to avoid overfitting.

Figure 4.10B shows that the network performs poorly on both training and validation data. In this case, your network is not learning. You don’t need more data, because the network is too simple to learn from the data you already have. Your next step is to build a more complex model.

Figure 4.10 (a) The network improves the loss value on the training data but fails to generalize on the validation data. (b) The network performs poorly on both the training and validation data. (c) The network learns the training data and generalizes to the validation data.

Figure 4.10C shows that the network is doing a good job of learning the training data and generalizing to the validation data. This means there is a good chance that the network will have good performance out in the wild on test data.

4.4.3 Exercise: Building, training, and evaluating a network

Before we move on to hyperparameter tuning, let’s run a quick experiment to see how we split the data and build, train, and visualize the model results. You can see an exercise notebook for this at www.manning.com/books/deep-learning-for-vision-systems or www.computervisionbook.com.

In this exercise, we will do the following:

  • Create toy data for our experiment

  • Split the data into 80% training and 20% testing datasets

  • Build the MLP neural network

  • Train the model

  • Evaluate the model

  • Visualize the results

Here are the steps:

  1. Import the dependencies:

from sklearn.datasets import make_blobs      # scikit-learn function to generate sample data
from keras.utils import to_categorical       # converts a class vector to a binary class matrix (one-hot encoding)
from keras.models import Sequential          # neural network model library
from keras.layers import Dense               # network layers library
from matplotlib import pyplot                # visualization library

  2. Use make_blobs from scikit-learn to generate a toy dataset with only two features and three label classes:

    X, y = make_blobs(n_samples=1000, centers=3, n_features=2, 
        cluster_std=2, random_state=2)
  3. Use to_categorical from Keras to one-hot-encode the label:

    y = to_categorical(y)
  4. Split the dataset into 80% training data and 20% test data. Note that we did not create a validation dataset in this example, for simplicity:

    n_train = 800
    train_X, test_X = X[:n_train, :], X[n_train:, :]
    train_y, test_y = y[:n_train], y[n_train:]
    print(train_X.shape, test_X.shape)
     
    >> (800, 2) (200, 2)
  5. Develop the model architecture--here, a very simple, two-layer MLP network (figure 4.11 shows the model summary):

    model = Sequential()
    model.add(Dense(25, input_dim=2, activation='relu'))    # two input dimensions (two features); ReLU activation in the hidden layer
    model.add(Dense(3, activation='softmax'))                # softmax output layer with three nodes, one per class
    model.compile(loss='categorical_crossentropy',           # cross-entropy loss (explained in chapter 2)
        optimizer='adam', metrics=['accuracy'])              # adam optimizer (explained in the next section)
    model.summary()

    Figure 4.11 Model summary

  6. Train the model for 1,000 epochs:

    history = model.fit(train_X, train_y, validation_data=(test_X, test_y),
        epochs=1000, verbose=1)
  7. Evaluate the model:

    _, train_acc = model.evaluate(train_X, train_y)
    _, test_acc = model.evaluate(test_X, test_y)
    print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
     
    >> Train: 0.825, Test: 0.819
  8. Plot the learning curves of model accuracy (figure 4.12):

    pyplot.plot(history.history['accuracy'], label='train')
    pyplot.plot(history.history['val_accuracy'], label='test')
    pyplot.legend()
    pyplot.show()
     

    Figure 4.12 The learning curves: both train and test curves fit the data with similar behavior.

Let's evaluate the network. Looking at the learning curve in figure 4.12, you can see that both the train and test curves fit the data with similar behavior. This means the network is not overfitting, which would show up as the train curve doing well while the test curve lags behind. But could the network be underfitting? Maybe: 82% accuracy on a very simple dataset like this is poor performance. To improve this neural network, I would try building a more complex network and experimenting with other techniques for addressing underfitting.

4.5 Improving the network and tuning hyperparameters

After you run your training experiment and diagnose for overfitting and underfitting, you need to decide whether it is more effective to spend your time tuning the network, cleaning up and processing your data, or collecting more data. The last thing you want to do is to spend a few months working in one direction only to find out that it barely improves network performance. So, before discussing the different hyperparameters to tune, let’s answer this question first: should you collect more data?

4.5.1 Collecting more data vs. tuning hyperparameters

We know that deep neural networks thrive on lots of data. With that in mind, ML novices often throw more data at the learning algorithm as their first attempt to improve its performance. But collecting and labeling more data is not always a feasible option and, depending on your problem, can be very costly. Plus, it might not even be that effective.

NOTE While efforts are being made to automate some of the data-labeling process, at the time of writing, most labeling is done manually, especially in CV problems. By manually, I mean that actual human beings look at each image and label them one by one (this is called human in the loop). Here is another layer of complexity: if you are labeling lung X-ray images to detect a certain tumor, for example, you need qualified physicians to diagnose the images. This will cost a lot more than hiring people to classify dogs and cats. So collecting more data might be a good solution for some accuracy issues and increase the model’s robustness, but it is not always a feasible option.

In other scenarios, it is much better to collect more data than to improve the learning algorithm. So it would be nice if you had quick and effective ways to figure out whether it is better to collect more data or tune the model hyperparameters.

The process I use to make this decision is as follows:

  1. Determine whether the performance on the training set is acceptable as-is.

  2. Visualize and observe the performance of these two metrics: training accuracy (train_acc) and validation accuracy (val_acc).

  3. If the network yields poor performance on the training dataset, this is a sign of underfitting. There is no reason to gather more data, because the learning algorithm is not using the training data that is already available. Instead, try tuning the hyperparameters or cleaning up the training data.

  4. If performance on the training set is acceptable but is much worse on the test dataset, then the network is overfitting your training data and failing to generalize to the validation set. In this case, collecting more data could be effective.

TIP When evaluating model performance, the goal is to categorize the high-level problem. If it’s a data problem, spend more time on data preprocessing or collecting more data. If it’s a learning algorithm problem, try to tune the network.

4.5.2 Parameters vs. hyperparameters

Let's not get parameters confused with hyperparameters. Parameters are the variables that the network learns and updates on its own during training, with no direct manipulation from us. In neural networks, the parameters are the weights and biases, which are optimized automatically during backpropagation to produce the minimum error. Hyperparameters, in contrast, are not learned by the network: they are set by the ML engineer before training and then tuned. They define the network structure and determine how the network is trained. Examples of hyperparameters include the learning rate, batch size, number of epochs, and number of hidden layers, along with others discussed in the next section.

Turning the knobs

Think of hyperparameters as knobs on a closed box (the neural network). Our job is to set and tune the knobs to yield the best performance.

4.5.3 Neural network hyperparameters

DL algorithms come with several hyperparameters that control many aspects of the model’s behavior. Some hyperparameters affect the time and memory cost of running the algorithm, and others affect the model’s prediction ability.

The challenge with hyperparameter tuning is that there are no magic numbers that work for every problem. This is related to the no free lunch theorem that we referred to in chapter 1. Good hyperparameter values depend on the dataset and the task at hand. Choosing the best hyperparameters and knowing how to tune them require an understanding of what each hyperparameter does. In this section, you will build your intuition about why you would want to nudge a hyperparameter one way or another, and I’ll propose good starting values for some of the most effective hyperparameters.

Generally speaking, we can categorize neural network hyperparameters into three main categories:

  • Network architecture
    • Number of hidden layers (network depth)
    • Number of neurons in each layer (layer width)
    • Activation type
  • Learning and optimization
    • Learning rate and decay schedule
    • Mini-batch size
    • Optimization algorithms
    • Number of training iterations or epochs (and early stopping criteria)
  • Regularization techniques to avoid overfitting
    • L2 regularization
    • Dropout layers
    • Data augmentation

We discussed all of these hyperparameters in chapters 2 and 3 except the regularization techniques. Next, we will cover them quickly with a focus on understanding what happens when we tune each knob up or down and how to know which hyperparameter to tune.

4.5.4 Network architecture

First, let’s talk about the hyperparameters that define the neural network architecture:

  • Number of hidden layers (representing the network depth)

  • Number of neurons in each layer, also known as hidden units (representing the network width)

  • Activation functions

Depth and width of the neural network

Whether you are designing an MLP, CNN, or other neural network, you need to decide on the number of hidden layers in your network (depth) and the number of neurons in each layer (width). The number of hidden layers and units describes the learning capacity of the network. The goal is to set the number large enough for the network to learn the data features. A smaller network might underfit, and a larger network might overfit. To know what is a “large enough” network, you pick a starting point, observe the performance, and then tune up or down.

The more complex the dataset, the more learning capacity the model will need to learn its features. Take a look at the three datasets in figure 4.13.

If you provide the model with too much learning capacity (too many hidden units), it might tend to overfit the data and memorize the training set. If your model is overfitting, you might want to decrease the number of hidden units.

Figure 4.13 The more complex the dataset, the more learning capacity the model will need to learn its features.

Generally, it is good to add hidden neurons until the validation error no longer improves. The trade-off is that it is computationally expensive to train deeper networks. Having a small number of units may lead to underfitting, while having more units is usually not harmful, with appropriate regularization (like dropout and others discussed later in this chapter).

Try playing around with the Tensorflow playground (https://playground.tensorflow .org) to develop more intuition. Experiment with different architectures, and gradually add more layers and more units in hidden layers while observing the network’s learning behavior.

Activation type

Activation functions (discussed extensively in chapter 2) introduce nonlinearity to our neurons. Without activations, our neurons would pass linear combinations (weighted sums) to each other and not solve any nonlinear problems. This is a very active area of research: every few weeks, we are introduced to new types of activations, and there are many available. But at the time of writing, ReLU and its variations (like Leaky ReLU) perform the best in hidden layers. And in the output layer, it is very common to use the softmax function for classification problems, with the number of neurons equal to the number of classes in your problem.

Layers and parameters

When considering the number of hidden layers and units in your neural network architecture, it is useful to think in terms of the number of parameters in the network and their effect on computational complexity. The more neurons in your network, the more parameters the network has to optimize. (In chapter 3, we learned how to print the model summary to see the total number of parameters that will be trained.)

Based on your hardware setup for the training process (computational power and memory), you can determine whether you need to reduce the number of parameters. To reduce the number of training parameters, you can do one of the following:

  • Reduce the depth and width of the network (hidden layers and units). This will reduce the number of training parameters and, hence, reduce the neural network complexity.

  • Add pooling layers, or tweak the strides and padding of the convolutional layers to reduce the feature map dimensions. This will lower the number of parameters.

These are just examples to help you see how you will weigh the number of training parameters in real projects and the trade-offs you will need to make. Complex networks have a large number of training parameters, which in turn demand more computational power and memory.

The best way to build your baseline architecture is to look at the popular architectures available to solve specific problems and start from there; evaluate its performance, tune its hyperparameters, and repeat. Remember how we were inspired by AlexNet to design our CNN in the image classification project in chapter 3. In the next chapter, we will explore some of the most popular CNN architectures like LeNet, AlexNet, VGG, ResNet, and Inception.

4.6 Learning and optimization

Now that we have built our network architecture, it is time to discuss the hyperparameters that determine how the network learns and optimizes its parameters to achieve the minimum error.

4.6.1 Learning rate and decay schedule

The learning rate is the single most important hyperparameter, and one should always make sure that it has been tuned. If there is only time to optimize one hyperparameter, then this is the hyperparameter that is worth tuning.

   --Yoshua Bengio

The learning rate (lr value) was covered extensively in chapter 2. As a refresher, let’s think about how gradient descent (GD) works. The GD optimizer searches for the optimal values of weights that yield the lowest error possible. When setting up our optimizer, we need to define the step size that it takes when it descends the error mountain. This step size is the learning rate. It represents how fast or slow the optimizer descends the error curve. When we plot the cost function with only one weight, we get the oversimplified U-curve in figure 4.14, where the weight is randomly initialized at a point on the curve.

Figure 4.14 When we plot the cost function with only one weight, we get an oversimplified U-curve.

GD calculates the gradient (derivative) to find the direction that reduces the error. In figure 4.14, the descending direction is to the right. GD starts taking steps down the curve after each iteration (epoch). Now, as you can see in figure 4.15, if we make a miraculously correct choice of the learning rate value, we land on the weight value that minimizes the error in only one step. This is an impossible case that I'm using purely for illustration. Let's call this the ideal lr value.

Figure 4.15 If we make a miraculously correct choice of the learning rate value, we land on the best weight value that minimizes the error in only one step.

If the learning rate is smaller than the ideal lr value, then the model can continue to learn by taking smaller steps down the error curve until it finds the most optimal value for the weight (figure 4.16). Much smaller means it will eventually converge but will take longer.

Figure 4.16 A learning rate smaller than the ideal lr value: the model takes smaller steps down the error curve.

If the learning rate is larger than the ideal lr value, the optimizer will overshoot the optimal weight value in the first step, and then overshoot again on the other side in the next step (figure 4.17). This could possibly yield a lower error than what we started with and converge to a reasonable value, but not the lowest error that we are trying to reach.

Figure 4.17 A learning rate larger than the ideal lr value: the optimizer overshoots the optimal weight value.

If the learning rate is much larger than the ideal lr value (more than twice as much), the optimizer will not only overshoot the ideal weight, but get farther and farther from the min error (figure 4.18). This phenomenon is called divergence.

Figure 4.18 A learning rate much larger than the ideal lr value: the optimizer gets farther from the min error.

Too-high vs. too-low learning rate

Setting the learning rate high or low is a trade-off between optimizer speed and performance. A too-low lr requires many epochs to converge, often too many. Theoretically, if the learning rate is too small, the algorithm is guaranteed to eventually converge, given infinite training time. On the other hand, a too-high lr might reach a lower error value faster because we take bigger steps down the error curve, but there is a greater chance that the algorithm will oscillate and diverge away from the minimum. So, ideally, we want to pick the lr that is just right (optimal): it reaches the minimum swiftly without being so big that it diverges.

When plotting the loss value against the number of training iterations (epochs), you will notice the following:

  • Much smaller lr--The loss keeps decreasing but needs a lot more time to converge.

  • Larger lr--The loss achieves a better value than what we started with, but is still far from optimal.

  • Much larger lr--The loss might initially decrease, but it starts to increase as the weight values get farther and farther away from the optimal values.

  • Good lr--The loss decreases consistently until it reaches the minimum possible value.

The difference between very high, high, good, and low learning rates

4.6.2 A systematic approach to find the optimal learning rate

The optimal learning rate will be dependent on the topology of your loss landscape, which in turn is dependent on both your model architecture and your dataset. Whether you are using Keras, Tensorflow, PyTorch, or any other DL library, using the default learning rate value of the optimizer is a good start leading to decent results. Each optimizer type has its own default value. Read the documentation of the DL library that you are using to find out the default value of your optimizer. If your model doesn’t train well, you can play around with the lr variable using the usual suspects--0.1, 0.01, 0.001, 0.0001, 0.00001, and 0.000001--to improve performance or speed up training by searching for an optimal learning rate.
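For example, here is a rough sketch of such a search in Keras. build_model is a hypothetical helper that returns a fresh, uncompiled model, and note that the optimizer argument is named lr in older Keras versions and learning_rate in newer ones:

from keras.optimizers import SGD

for lr in [0.1, 0.01, 0.001, 0.0001]:
    model = build_model()                               # hypothetical helper returning a fresh, uncompiled model
    model.compile(optimizer=SGD(lr=lr),
                  loss='categorical_crossentropy', metrics=['accuracy'])
    history = model.fit(train_X, train_y, validation_data=(val_X, val_y),
                        epochs=20, verbose=0)
    print(lr, min(history.history['val_loss']))         # keep the lr with the lowest validation loss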

The way to debug this is to look at the validation loss values in the training verbose:

  • If val_loss decreases after each step, that’s good. Keep training until it stops improving.

  • If training is complete and val_loss is still decreasing, then maybe the learning rate was so small that it didn’t converge yet. In this case, you can do one of two things:

    • Train again with the same learning rate but with more training iterations (epochs) to give the optimizer more time to converge.
    • Increase the lr value a little and train again.
  • If val_loss starts to increase or oscillate up and down, then the learning rate is too high and you need to decrease its value.

4.6.3 Learning rate decay and adaptive learning

Finding the learning rate value that is just right for your problem is an iterative process. You start with a static lr value, wait until training is complete, evaluate, and then tune. Another way to go about tuning your learning rate is to set a learning rate decay: a method by which the learning rate changes during training. It often performs better than a static value, and drastically reduces the time required to get optimal results.

By now, it's clear that when we try lower learning rates, we have a better chance of reaching a lower error point, but training will take longer; in some cases, so long that it becomes infeasible. A good trick is to apply a decay rate to the learning rate. The decay rate tells our network to automatically decrease the lr throughout the training process. For example, we can decrease the lr by a constant amount every n steps. This way, we can start with a higher value to take bigger steps toward the minimum, and then gradually decrease the learning rate every n epochs to avoid overshooting the minimum.

One way to accomplish this is to reduce the learning rate on a fixed schedule, often loosely called linear or step decay. For example, you can decrease it by half every five epochs, as shown in figure 4.19.

Figure 4.19 Decreasing the lr by half every five epochs

Another way is to decrease the lr exponentially (exponential decay). For example, you can multiply it by 0.1 every eight epochs (figure 4.20). Clearly, the network will converge a lot slower than with linear decay, but it will eventually converge.

Figure 4.20 Multiplying the lr by 0.1 every eight epochs
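In Keras, one way to implement a schedule like this is with the LearningRateScheduler callback. The sketch below halves an assumed initial rate of 0.1 every five epochs (the schedule from figure 4.19); swapping the numbers gives the exponential variant:

from keras.callbacks import LearningRateScheduler

def step_decay(epoch):
    initial_lr = 0.1                    # assumed starting learning rate
    drop = 0.5                          # halve the rate...
    epochs_per_drop = 5                 # ...every five epochs
    return initial_lr * (drop ** (epoch // epochs_per_drop))

lr_schedule = LearningRateScheduler(step_decay)
# model.fit(train_X, train_y, epochs=50, callbacks=[lr_schedule])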

Other, cleverer learning algorithms use an adaptive learning rate (adaptive learning). These algorithms use heuristics to update the lr automatically when training stops improving: not only decreasing the lr when needed, but also increasing it when progress is too slow (a too-small lr). Adaptive learning usually works better than other learning-rate-setting strategies. Adam and Adagrad are examples of adaptive learning optimizers; more on adaptive optimizers later in this chapter.

4.6.4 Mini-batch size

Mini-batch size is another hyperparameter that you need to set and tune for the optimizer algorithm. The batch_size hyperparameter has a big effect on the resource requirements and speed of the training process.

In order to understand the mini-batch, let’s back up to the three GD types that we explained in chapter 2--batch, stochastic, and mini-batch:

  • Batch gradient descent (BGD) --We feed the entire dataset to the network all at once, apply the feedforward process, calculate the error, calculate the gradient, and backpropagate to update the weights. The optimizer calculates the gradient by looking at the error generated after it sees all the training data, and the weights are updated only once after each epoch. So, in this case, the mini-batch size equals the entire training dataset. The main advantage of BGD is that it has relatively low noise and bigger steps toward the minimum (see figure 4.21). The main disadvantage is that it can take too long to process the entire training dataset at each step, especially when training on big data. BGD also requires a huge amount of memory for training large datasets, which might not be available. BGD might be a good option if you are training on a small dataset.

    Figure 4.21 Batch GD with low noise on its path to the minimum error

  • Stochastic gradient descent (SGD) --Also called online learning. We feed the network a single instance of the training data at a time and use this one instance to do the forward pass, calculate error, calculate the gradient, and backpropagate to update the weights (figure 4.22). In SGD, the weights are updated after it sees each single instance (as opposed to processing the entire dataset before each step for BGD). SGD can be extremely noisy as it oscillates on its way to the global minimum because it takes a step down after each single instance, which could sometimes be in the wrong direction. This noise can be reduced by using a smaller learning rate, so, on average, it takes you in a good direction and almost always performs better than BGD. With SGD you get to make progress quickly and usually reach very close to the global minimum. The main disadvantage is that by calculating the GD for one instance at a time, you lose the speed gain that comes with matrix multiplication in the training calculations.

    Figure 4.22 Stochastic GD with high noise that oscillates on its path to the minimum error

To recap BGD and SGD, on one extreme, if you set your mini-batch size to 1 (stochastic training), the optimizer will take a step down the error curve after computing the gradient for every single instance of the training data. This is good, but you lose the increased speed of using matrix multiplication. On the other extreme, if your mini-batch size is your entire training dataset, then you are using BGD. It takes too long to make a step toward the minimum error when processing large datasets. Between the two extremes, there is mini-batch GD.

  • Mini-batch gradient descent (MB-GD) --A compromise between batch and stochastic GD. Instead of computing the gradient from one sample (SGD) or all training samples (BGD), we divide the training sample into mini-batches to compute the gradient from. This way, we can take advantage of matrix multiplication for faster training and start making progress instead of having to wait to train the entire training set.

Guidelines for choosing mini-batch size

First, if you have a small dataset (around less than 2,000), you might be better off using BGD. You can train the entire dataset quite fast.

For large datasets, you can use a scale of mini-batch size values. A typical starting value for the mini-batch is 64 or 128. You can then tune it up and down on this scale: 32, 64, 128, 256, 512, 1024, and keep doubling it as needed to speed up training. But make sure that your mini-batch size fits in your CPU/GPU memory. Mini-batch sizes of 1024 and larger are possible but quite rare. A larger mini-batch size allows a computational boost that uses matrix multiplication in the training calculations. But that comes at the expense of needing more memory for the training process and generally more computational resources. The following figure shows the relationship between batch size, computational resources, and number of epochs needed for neural network training:

The relationship between batch size, computational resources, and number of epochs

4.7 Optimization algorithms

In the history of DL, many researchers proposed optimization algorithms and showed that they work well with some problems. But most of them subsequently proved to not generalize well to the wide range of neural networks that we might want to train. In time, the DL community came to feel that the GD algorithm and some of its variants work well. So far, we have discussed batch, stochastic, and mini-batch GD.

We learned that choosing a proper learning rate can be challenging because a too-small learning rate leads to painfully slow convergence, while a too-large learning rate can hinder convergence and cause the loss function to fluctuate around the minimum or even diverge. We need more creative solutions to further optimize GD.

NOTE Optimizer types are well explained in the documentation of most DL frameworks. In this section, I’ll explain the concepts of two of the most popular gradient-descent-based optimizers--Momentum and Adam--that really stand out and have been shown to work well across a wide range of DL architectures. This will help you build a good foundation to dive deeper into other optimization algorithms. For more about optimization algorithms, read “An overview of gradient descent optimization algorithms” by Sebastian Ruder (https://arxiv.org/pdf/1609.04747.pdf).

4.7.1 Gradient descent with momentum

Recall that SGD ends up with some oscillations in the vertical direction toward the minimum error (figure 4.23). These oscillations slow down the convergence process and make it harder to use larger learning rates, which could result in your algorithm overshooting and diverging.

Figure 4.23 SGD oscillates in the vertical direction toward the minimum error.

To reduce these oscillations, a technique called momentum was invented that lets the GD navigate along relevant directions and softens the oscillation in irrelevant directions. In other words, it makes learning slower in the vertical-direction oscillations and faster in the horizontal-direction progress, which will help the optimizer reach the target minimum much faster.

This is similar to the idea of momentum from classical physics: when a snowball rolls down a hill, it accumulates momentum, going faster and faster. In the same way, our momentum term increases for dimensions whose gradients point in the same direction and reduces updates for dimensions whose gradients change direction. This leads to faster convergence and reduces oscillations.

How the math works in momentum

The math here is really simple and straightforward. Momentum is built by adding a velocity term to the equation that updates the weight. The original update rule is

wnew = wold − α (dE/dw)

With momentum, the optimizer first updates the velocity, an exponentially weighted average of the past gradients, and then takes its step using the velocity instead of the raw gradient:

vnew = β vold + (1 − β) (dE/dw)

wnew = wold − α vnew

The velocity term equals the weighted average of the past gradients; β (commonly 0.9) controls how much of that history is kept. Updates therefore build up speed along directions whose gradients consistently agree and are dampened along directions whose gradients keep changing sign.
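In Keras, you typically don't implement the velocity term yourself; the SGD optimizer exposes it through the momentum argument. A minimal sketch, assuming a model you have already defined (the 0.9 value is a common choice, not a rule):

from keras import optimizers

# SGD with momentum: the momentum value controls how much of the past
# gradients' weighted average (the velocity) is kept at each step
sgd = optimizers.SGD(lr=0.01, momentum=0.9)
model.compile(loss='categorical_crossentropy', optimizer=sgd,
              metrics=['accuracy'])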

4.7.2 Adam

Adam stands for adaptive moment estimation. Adam keeps an exponentially decaying average of past gradients, similar to momentum. Whereas momentum can be seen as a ball rolling down a slope, Adam behaves like a heavy ball with friction to slow down the momentum and control it. Adam usually outperforms other optimizers because it helps train a neural network model much more quickly than the techniques we have seen earlier.

Again, we have new hyperparameters to tune. But the good news is that the default values of major DL frameworks often work well, so you may not need to tune at all--except for the learning rate, which is not an Adam-specific hyperparameter:

keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=None, 
    decay=0.0)

The authors of Adam propose these default values:

  • The learning rate needs to be tuned.

  • For the momentum term β1, a common choice is 0.9.

  • For the RMSprop term β2, a common choice is 0.999.

  • ε is set to 10⁻⁸.
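Putting these defaults together, compiling a model with Adam might look like the following minimal sketch (assuming an already-built model; per the guidance above, only the learning rate is typically tuned):

from keras import optimizers

# Adam with the authors' suggested defaults; in practice only lr is tuned
adam = optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=None,
                       decay=0.0)
model.compile(loss='categorical_crossentropy', optimizer=adam,
              metrics=['accuracy'])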

4.7.3 Number of epochs and early stopping criteria

A training iteration, or epoch, is one full cycle in which the model sees the entire training dataset once. The epochs hyperparameter defines how many of these iterations the network trains for. The more training iterations, the more our model learns the features of our training data. To diagnose whether your network needs more or fewer training epochs, keep your eyes on the training and validation error values.

The intuitive way to think about this is that we want to continue training as long as the error value is decreasing. Correct? Let’s take a look at the sample verbose output from a network training in figure 4.24.

Figure 4.24 Sample verbose output of the first five epochs. Both training and validation errors are improving.

You can see that both training and validation errors are decreasing. This means the network is still learning. It doesn’t make sense to stop the training at this point. The network is clearly still making progress toward the minimum error. Let’s let it train for six more epochs and observe the results (figure 4.25).

Figure 4.25 The training error is still improving, but the validation error started oscillating from epoch 8 onward.

It looks like the training error is doing well and still improving. That’s good. This means the network is improving on the training set. However, if you look at epochs 8 and 9, you will see that val_error started to oscillate and increase. Improving train_error while not improving val_error means the network is starting to overfit the training data and failing to generalize to the validation data.

Let’s plot the training and validation errors (figure 4.26). You can see that both the training and validation errors were improving at first, but then the validation error started to increase, leading to overfitting. We need to find a way to stop the training just before it starts to overfit. This technique is called early stopping.

4.7.4 Early stopping

Early stopping is an algorithm widely used to determine the right time to stop the training process before overfitting happens. It simply monitors the validation error value and stops the training when the value starts to increase. Here is the early stopping function in Keras:

from keras.callbacks import EarlyStopping

EarlyStopping(monitor='val_loss', min_delta=0, patience=20)

The EarlyStopping function takes the following arguments:

  • monitor--The metric you monitor during training. Usually we want to keep an eye on val_loss because it represents our internal testing of model performance. If the network is doing well on the validation data, it will probably do well on test data and production.

  • min_delta--The minimum change that qualifies as an improvement. There is no standard value for this variable. To decide the min_delta value, run a few epochs and see the change in error and validation accuracy. Define min_delta according to the rate of change. The default value of 0 works pretty well in many cases.

  • patience--This variable tells the algorithm how many epochs it should wait before stopping the training if the error does not improve. For example, if we set patience equal to 1, the training will stop at the epoch where the error increases. We must be a little flexible, though, because it is very common for the error to oscillate a little and continue improving. We can stop the training if it hasn’t improved in the last 10 or 20 epochs.

TIP The good thing about early stopping is that it allows you to worry less about the epochs hyperparameter. You can set a high number of epochs and let the stopping algorithm take care of stopping the training when error stops improving.
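For example, early stopping might be wired into training like this minimal sketch (assuming a compiled model and training/validation arrays; epochs is deliberately set high, and the callback ends training when val_loss stops improving for 20 consecutive epochs):

from keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', min_delta=0, patience=20)

# Train with a generous epoch budget and let early stopping decide when to quit
model.fit(x_train, y_train, epochs=1000,
          validation_data=(x_valid, y_valid),
          callbacks=[early_stop])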

4.8 Regularization techniques to avoid overfitting

If you observe that your neural network is overfitting the training data, your network might be too complex and need to be simplified. One of the first techniques you should try is regularization. In this section, we will discuss three of the most common regularization techniques: L2, dropout, and data augmentation.

Figure 4.26 Improving train_error while not improving val_error means the network is starting to overfit.

4.8.1 L2 regularization

The basic idea of L2 regularization is that it penalizes the error function by adding a regularization term to it. This, in turn, pushes the weight values of the hidden units toward very small values, close to zero, which helps simplify the model.

Let’s see how regularization works. First, we update the error function by adding the regularization term:

error functionnew = error functionold + regularization term

Note that you can use any of the error functions explained in chapter 2, like MSE or cross entropy. Now, let's take a look at the regularization term:

L2 regularization term = (λ/2m) × Σ ||w||²

where lambda (λ) is the regularization parameter, m is the number of instances, and w is the weight. The updated error function looks like this:

error functionnew = error functionold + (λ/2m) × Σ ||w||²

Why does L2 regularization reduce overfitting? Well, let's look at how the weights are updated during the backpropagation process. We learned from chapter 2 that the optimizer calculates the derivative of the error, multiplies it by the learning rate, and subtracts this value from the old weight. Here is the backpropagation equation that updates the weights:

wnew = wold − α (∂Error/∂wx)

Since we add the regularization term to the error function, the new error becomes larger than the old error. This means its derivative (∂Error/∂wx) is also bigger, leading to a smaller wnew. L2 regularization is also known as weight decay, as it forces the weights to decay toward zero (but not exactly zero).
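To see where the name weight decay comes from, it helps to expand the update rule with the regularized error function. This is only a sketch of the algebra, written in LaTeX notation, using the same symbols as above (α is the learning rate):

w_{new} = w_{old} - \alpha \frac{\partial}{\partial w}\Big(E_{old} + \frac{\lambda}{2m}\sum w^2\Big)
        = w_{old} - \alpha\Big(\frac{\partial E_{old}}{\partial w} + \frac{\lambda}{m} w_{old}\Big)
        = \Big(1 - \frac{\alpha\lambda}{m}\Big) w_{old} - \alpha\,\frac{\partial E_{old}}{\partial w}

The factor (1 − αλ/m) is slightly less than 1, so each update first shrinks (decays) the old weight a little and then applies the usual gradient step.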

Reducing weights leads to a simpler neural network

To see how this works, consider an extreme case: if the regularization term, multiplied by the learning rate, were large enough to equal wold, the new weight would become zero. This would cancel the effect of that neuron, leading to a simpler neural network with fewer effective neurons.

In practice, L2 regularization does not make the weights equal to zero. It just makes them smaller to reduce their effect. A large regularization parameter (λ) leads to negligible weights. When the weights are negligible, the model will not learn much from these units. This makes the network simpler and thus reduces overfitting.

L2 regularization reduces the weights and simplifies the network to reduce overfitting.

This is what L2 regularization looks like in Keras:

from keras import regularizers

model.add(Dense(units=16, kernel_regularizer=regularizers.l2(0.01),
    activation='relu'))

When adding a hidden layer to your network, add the kernel_regularizer argument with the L2 regularizer. Here, lambda is set to 0.01.

The lambda value is a hyperparameter that you can tune. The default value of your DL library usually works well. If you still see signs of overfitting, increase the lambda hyperparameter to reduce the model complexity.

4.8.2 Dropout layers

Dropout is another regularization technique that is very effective for simplifying a neural network and avoiding overfitting. We discussed dropout extensively in chapter 3. The dropout algorithm is fairly simple: at every training iteration, every neuron has a probability p of being temporarily ignored (dropped out) during this training iteration. This means it may be active during subsequent iterations. While it is counterintuitive to intentionally pause the learning on some of the network neurons, it is quite surprising how well this technique works. The probability p is a hyperparameter that is called dropout rate and is typically set in the range of 0.3 to 0.5. Start with 0.3, and if you see signs of overfitting, increase the rate.

TIP I like to think of dropout as tossing a coin every morning with your team to decide who will do a specific critical task. After a few iterations, all your team members will learn how to do this task and not rely on a single member to get it done. The team would become much more resilient to change.

Both L2 regularization and dropout aim to reduce network complexity by reducing its neurons’ effectiveness. The difference is that dropout completely cancels the effect of some neurons with every iteration, while L2 regularization just reduces the weight values to reduce the neurons’ effectiveness. Both lead to a more robust, resilient neural network and reduce overfitting. It is recommended that you use both types of regularization techniques in your network.
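For instance, applying both techniques to a single hidden layer might look like the following minimal sketch (assuming a model under construction; the values 0.01 and 0.3 are just illustrative starting points):

from keras import regularizers
from keras.layers import Dense, Dropout

# L2 regularization shrinks this layer's weights toward zero ...
model.add(Dense(units=64, activation='relu',
                kernel_regularizer=regularizers.l2(0.01)))
# ... while dropout randomly silences 30% of its units at each training iteration
model.add(Dropout(0.3))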

4.8.3 Data augmentation

One way to avoid overfitting is to obtain more data. Since this is not always a feasible option, we can augment our training data by generating new instances of the same images with some transformations. Data augmentation can be an inexpensive way to give your learning algorithm more training data and therefore reduce overfitting.

The many image-augmentation techniques include flipping, rotation, scaling, zooming, lighting conditions, and many other transformations that you can apply to your dataset to provide a variety of images to train on. In figure 4.27, you can see some of the transformation techniques applied to an image of the digit 6.

Figure 4.27 Various image augmentation techniques applied to an image of the digit 6

In figure 4.27, we created 20 new images that the network can learn from. The main advantage of synthesizing images like this is that now you have more data (20×) that tells your algorithm that if an image is the digit 6, then even if you flip it vertically or horizontally or rotate it, it's still the digit 6. This makes the model more robust at detecting the digit 6 in any form and shape.

Data augmentation is considered a regularization technique because allowing the network to see many variants of the object reduces its dependence on the original form of the object during feature learning. This makes the network more resilient when tested on new data.

Data augmentation in Keras looks like this:

from keras.preprocessing.image import ImageDataGenerator                    
 
datagen = ImageDataGenerator(horizontal_flip=True, vertical_flip=True)      
 
datagen.fit(training_set)                                                   

Imports ImageDataGenerator from Keras

Generates batches of new image data. ImageDataGenerator takes transformation types as arguments. Here, we set horizontal and vertical flip to True. See the Keras documentation (or your DL library) for more transformation arguments.

Computes the data augmentation on the training set

4.9 Batch normalization

Earlier in this chapter, we talked about data normalization to speed up learning. The normalization techniques we discussed were focused on preprocessing the training set before feeding it to the input layer. If the input layer benefits from normalization, why not do the same for the extracted features in the hidden units, which are changing all the time, and get an even bigger improvement in training speed and network resilience (figure 4.28)? This process is called batch normalization (BN).

Figure 4.28 Batch normalization is normalizing the extracted features in hidden units.

4.9.1 The covariate shift problem

Before we define covariate shift, let’s take a look at an example to illustrate the problem that batch normalization (BN) confronts. Suppose you are building a cat classifier, and you train your algorithm on images of white cats only. When you test this classifier on images with cats that are different colors, it will not perform well. Why? Because the model has been trained on a training set with a specific distribution (white cats). When the distribution changes in the test set, it confuses the model (figure 4.29).

Figure 4.29 Graph A is the training set of only white cats, and graph B is the testing set with cats of various colors. The circles represent the cat images, and the stars represent the non-cat images.

We should not expect that the model trained on the data in graph A will do very well with the new distribution in graph B. The idea of the change in data distribution goes by the fancy name covariate shift.

DEFINITION If a model is learning to map dataset x to label y, then if the distribution of x changes, it’s known as covariate shift. When that happens, you might need to retrain your learning algorithm.

4.9.2 Covariate shift in neural networks

To understand how covariate shift happens in neural networks, consider the simple four-layer MLP in figure 4.30. Let's look at the network from the third layer's (L3) perspective. Its inputs are the activation values in L2 (a₁⁽²⁾, a₂⁽²⁾, a₃⁽²⁾, and a₄⁽²⁾), which are the features extracted from the previous layers. L3 is trying to map these inputs to ŷ to make it as close as possible to the label y. While the third layer is doing that, the network is adapting the values of the parameters in the previous layers. As the parameters (w, b) change in layer 1, the activation values in the second layer change, too. So from the perspective of the third hidden layer, the values of the second hidden layer are changing all the time: the MLP is suffering from the problem of covariate shift. Batch norm reduces the degree of change in the distribution of the hidden unit values, causing these values to become more stable so that the later layers of the neural network have firmer ground to stand on.

Figure 4.30 A simple four-layer MLP. L1 features are input to the L2 layer. The same is true for layers 2, 3, and 4.

NOTE It is important to realize that batch normalization does not cancel or reduce the change in the hidden unit values. What it does is ensure that the distribution of that change remains the same: even if the exact values of the units change, the mean and variance do not change.

4.9.3 How does batch normalization work?

In their 2015 paper “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift” (https://arxiv.org/abs/1502.03167), Sergey Ioffe and Christian Szegedy proposed the BN technique to reduce covariate shift. Batch normalization adds an operation in the neural network just before the activation function of each layer to do the following:

  1. Zero-center the inputs

  2. Normalize the zero-centered inputs

  3. Scale and shift the results

This operation lets the model learn the optimal scale and mean of the inputs for each layer.

How the math works in batch normalization

  1. To zero-center the inputs, the algorithm needs to calculate the input mean and standard deviation (the input here means the current mini-batch; hence the term batch normalization):

     μB = (1/m) Σ xi              (mini-batch mean)

     σB² = (1/m) Σ (xi − μB)²      (mini-batch variance)

     where m is the number of instances in the mini-batch, μB is the mean, and σB is the standard deviation over the current mini-batch.

  2. Normalize the input:

     x̂i = (xi − μB) / √(σB² + ε)

     where x̂i is the zero-centered and normalized input. Note that there is a variable here that we added (ε). This is a tiny number (typically 10⁻⁵) to avoid division by zero if σ is zero in some estimates.

  3. Scale and shift the results. We multiply the normalized output by a variable γ to scale it and add β to shift it:

     yi ← γ x̂i + β

     where yi is the output of the BN operation, scaled and shifted.

Notice that BN introduces two new learnable parameters to the network: γ and β. So our optimization algorithm will update the parameters γ and β just like it updates weights and biases. In practice, this means you may find that training is rather slow at first, while GD is searching for the optimal scales and offsets for each layer, but it accelerates once it's found reasonably good values.
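If you prefer to see the three steps as code, here is a minimal NumPy sketch of the BN forward pass over one mini-batch (gamma and beta are fixed numbers here purely for illustration; in a real network they are learned parameters):

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # x: a mini-batch of activations with shape (m, number_of_units)
    mu = x.mean(axis=0)                      # step 1: mini-batch mean
    var = x.var(axis=0)                      #         mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # step 2: zero-center and normalize
    return gamma * x_hat + beta              # step 3: scale and shift

x = np.random.randn(128, 4)                  # a mini-batch of 128 instances
y = batch_norm_forward(x, gamma=1.0, beta=0.0)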


4.9.4 Batch normalization implementation in Keras

It is important to know how batch normalization works so you can get a better understanding of what your code is doing. But when using BN in your network, you don’t have to implement all these details yourself. Implementing BN is often done by adding one line of code, using any DL framework. In Keras, the way you add batch normalization to your neural network is by adding a BN layer after the hidden layer, to normalize its results before they are fed to the next layer.

The following code snippet shows you how to add a BN layer when building your neural network:

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.layers.normalization import BatchNormalization     
 
model = Sequential()                                          
 
model.add(Dense(hidden_units, activation='relu'))             
 
model.add(BatchNormalization())                               
 
model.add(Dropout(0.5))                                       
 
model.add(Dense(units, activation='relu'))                    
 
model.add(BatchNormalization())                               
 
model.add(Dense(2, activation='softmax'))                     

Imports the BatchNormalization layer from the Keras library

Initiates the model

Adds the first hidden layer

Adds the batch norm layer to normalize the results of layer 1

If you are adding dropout to your network, it is preferable to add it after the batch norm layer because you don’t want the nodes that are randomly turned off to miss the normalization step.

Adds the second hidden layer

Adds the batch norm layer to normalize the results of layer 2

Output layer

4.9.5 Batch normalization recap

The intuition that I hope you’ll take away from this discussion is that BN applies the normalization process not just to the input layer, but also to the values in the hidden layers in a neural network. This weakens the coupling of the learning process between earlier and later layers, allowing each layer of the network to learn more independently.

From the perspective of the later layers in the network, the earlier layers don't get to shift around as much because they are constrained to have the same mean and variance. This makes the job of learning easier in the later layers. The way this happens is by ensuring that the hidden units have a standardized distribution (mean and variance) controlled by two explicit parameters, γ and β, which the learning algorithm sets during training.

4.10 Project: Achieve high accuracy on image classification

In this project, we will revisit the CIFAR-10 classification project from chapter 3 and apply some of the improvement techniques from this chapter to increase the accuracy from ~65% to ~90%. You can follow along with this example by visiting the book's website, www.manning.com/books/deep-learning-for-vision-systems or www.computervisionbook.com, to see the code notebook.

We will accomplish the project by following these steps:

  1. Import the dependencies.

  2. Get the data ready for training:

    • Download the data from the Keras library.
    • Split it into train, validate, and test datasets.
    • Normalize the data.
    • One-hot encode the labels.
  3. Build the model architecture. In addition to regular convolutional and pooling layers, as in chapter 3, we add the following layers to our architecture:

    • Deeper neural network to increase learning capacity
    • Dropout layers
    • L2 regularization to our convolutional layers
    • Batch normalization layers
  4. Train the model.

  5. Evaluate the model.

  6. Plot the learning curve.

Let’s see how this is implemented.

Step 1: Import dependencies

Here’s the Keras code to import the needed dependencies:

import keras                                          
from keras.datasets import cifar10
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.utils import np_utils
from keras.layers import (Dense, Activation, Flatten, Dropout,
    BatchNormalization, Conv2D, MaxPooling2D)
from keras.callbacks import ModelCheckpoint
from keras import regularizers, optimizers
 
import numpy as np                                    
 
from matplotlib import pyplot                         

Keras library to download the datasets, preprocess images, and network components

Imports numpy for math operations

Imports the matplotlib library to visualize results

Step 2: Get the data ready for training

Keras has some datasets available for us to download and experiment with. These datasets are usually preprocessed and almost ready to be fed to the neural network. In this project, we use the CIFAR-10 dataset, which consists of 50,000 32 × 32 color training images, labeled over 10 categories, and 10,000 test images. Check the Keras documentation for more datasets like CIFAR-100, MNIST, Fashion-MNIST, and more.

Keras provides the CIFAR-10 dataset already split into training and testing sets. We will load them and then split the training dataset into 45,000 images for training and 5,000 images for validation, as explained in this chapter:

(x_train, y_train), (x_test, y_test) = cifar10.load_data()    
x_train = x_train.astype('float32')                           
x_test = x_test.astype('float32')                             
 
(x_train, x_valid) = x_train[5000:], x_train[:5000]           
(y_train, y_valid) = y_train[5000:], y_train[:5000]           

Downloads and splits the data

Breaks the training set into training and validation sets

Let’s print the shape of x_train, x_valid, and x_test:

print('x_train =', x_train.shape)
print('x_valid =', x_valid.shape)
print('x_test =', x_test.shape)
 
>> x_train = (45000, 32, 32, 3)
>> x_valid = (5000, 32, 32, 3)
>> x_test = (10000, 32, 32, 3)

The format of the shape tuple is as follows: (number of instances, width, height, channels).

Normalize the data

Normalizing the pixel values of our images is done by subtracting the mean from each pixel and then dividing the result by the standard deviation:

mean = np.mean(x_train,axis=(0,1,2,3))
std = np.std(x_train,axis=(0,1,2,3))
x_train = (x_train-mean)/(std+1e-7)
x_valid = (x_valid-mean)/(std+1e-7)
x_test = (x_test-mean)/(std+1e-7)
One-hot encode the labels

To one-hot encode the labels in the train, valid, and test datasets, we use the to_categorical function in Keras:

num_classes = 10
y_train = np_utils.to_categorical(y_train,num_classes)
y_valid = np_utils.to_categorical(y_valid,num_classes)
y_test = np_utils.to_categorical(y_test,num_classes)
Data augmentation

For augmentation techniques, we will arbitrarily go with the following transformations: rotation, width and height shift, and horizontal flip. When you are working on problems, view the images that the network missed or provided poor detections for and try to understand why it is not performing well on them. Then create your hypothesis and experiment with it. For example, if the missed images were of shapes that are rotated, you might want to try the rotation augmentation. You would apply that, experiment, evaluate, and repeat. You will come to your decisions purely from analyzing your data and understanding the network performance:

datagen = ImageDataGenerator(       
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
    horizontal_flip=True,
    vertical_flip=False
    )
datagen.fit(x_train)                

Data augmentation

Computes the data augmentation on the training set

Step 3: Build the model architecture

In chapter 3, we built an architecture inspired by AlexNet (3 CONV + 2 FC). In this project, we will build a deeper network for increased learning capacity (6 CONV + 1 FC).

The network has the following configuration:

  • Instead of adding a pooling layer after each convolutional layer, we will add one after every two convolutional layers. This idea was inspired by VGGNet, a popular neural network architecture developed by the Visual Geometry Group (University of Oxford). VGGNet will be explained in chapter 5.

  • Inspired by VGGNet, we will set the kernel_size of our convolutional layers to 3 × 3 and the pool_size of the pooling layer to 2 × 2.

  • We will add dropout layers after every other convolutional layer, with the dropout rate p ranging from 0.2 to 0.4.

  • A batch normalization layer will be added after each convolutional layer to normalize the input for the following layer.

  • In Keras, L2 regularization is added directly to the convolutional layers through the kernel_regularizer argument, as you will see in the code.

Here’s the code:

base_hidden_units = 32                                                  
weight_decay = 1e-4                                                     
model = Sequential()                                                    
 
# CONV1
model.add(Conv2D(base_hidden_units, kernel_size= 3, padding='same',     
         kernel_regularizer=regularizers.l2(weight_decay),              
input_shape=x_train.shape[1:]))
model.add(Activation('relu'))                                           
model.add(BatchNormalization())                                         
 
# CONV2
model.add(Conv2D(base_hidden_units, kernel_size= 3, padding='same', 
         kernel_regularizer=regularizers.l2(weight_decay)))
model.add(Activation('relu'))
model.add(BatchNormalization())
 
# POOL + Dropout
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.2))                                                 
 
# CONV3
model.add(Conv2D(base_hidden_units * 2, kernel_size= 3, padding='same', 
         kernel_regularizer=regularizers.l2(weight_decay)))
model.add(Activation('relu'))
model.add(BatchNormalization())
 
# CONV4
model.add(Conv2D(base_hidden_units * 2, kernel_size= 3, padding='same', 
         kernel_regularizer=regularizers.l2(weight_decay)))
model.add(Activation('relu'))
model.add(BatchNormalization())
 
# POOL + Dropout
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.3))
 
# CONV5
model.add(Conv2D(base_hidden_units * 4, kernel_size= 3, padding='same', 
         kernel_regularizer=regularizers.l2(weight_decay)))
model.add(Activation('relu'))
model.add(BatchNormalization())
 
# CONV6
model.add(Conv2D(base_hidden_units * 4, kernel_size= 3, padding='same',
         kernel_regularizer=regularizers.l2(weight_decay)))
model.add(Activation('relu'))
model.add(BatchNormalization())
 
# POOL + Dropout
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.4))

# FC7
model.add(Flatten())                                                   
model.add(Dense(10, activation='softmax'))                             
 
model.summary()                                                        

Number of hidden units variable. We declare this variable here and use it in our convolutional layers to make it easier to update from one place.

L2 regularization hyperparameter (λ)

Creates a sequential model (a linear stack of layers)

Notice that we define the input_shape here because this is the first convolutional layer. We don’t need to do that for the remaining layers.

Adds L2 regularization to the convolutional layer

Uses a ReLU activation function for all hidden layers

Adds a batch normalization layer

Dropout layer with 20% probability

Number of hidden units = 64

Flattens the feature map into a 1D features vector (explained in chapter 3)

10 hidden units because the dataset has 10 class labels. Softmax activation function is used for the output layer (explained in chapter 2)

Prints the model summary

The model summary is shown in figure 4.31.

Step 4: Train the model

Before we jump into the training code, let’s discuss the strategy behind some of the hyperparameter settings:

  • batch_size--This is the mini-batch hyperparameter that we covered in this chapter. The larger the batch_size, the faster each training epoch runs (thanks to matrix multiplication), at the cost of more memory. You can start with a mini-batch of 64 and double this value to speed up training. I tried 256 on my machine and got the following error, which means my machine was running out of memory. I then lowered it back to 128:

    Resource exhausted: OOM when allocating tensor with shape[256,128,4,4]
  • epochs--I started with 50 training iterations and found that the network was still improving. So I kept adding more epochs and observing the training results. In this project, I was able to achieve >90% accuracy after 125 epochs. As you will see soon, there is still room for improvement if you let it train longer.

    Figure 4.31 Model summary

  • Optimizer --I used the Adam optimizer. See section 4.7 to learn more about optimization algorithms.

NOTE It is important to note that I’m using a GPU for this experiment. The training took around 3 hours. It is recommended that you use your own GPU or any cloud computing service to get the best results. If you don’t have access to a GPU, I recommend that you try a smaller number of epochs or plan to leave your machine training overnight or even for a couple of days, depending on your CPU specifications.

Let’s see the training code:

batch_size = 128                                                          
epochs = 125                                                              
 
checkpointer = ModelCheckpoint(filepath='model.100epochs.hdf5', verbose=1, 
                               save_best_only=True )                      
optimizer = keras.optimizers.adam(lr=0.0001,decay=1e-6)                   
 
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])                                                
 
history = model.fit_generator(datagen.flow(x_train, y_train, batch_size=batch_size), callbacks=[checkpointer], steps_per_epoch=x_train.shape[0] // batch_size, epochs=epochs, verbose=2, validation_data=(x_valid, y_valid))                       

Mini-batch size

Number of training iterations

Path of the file where the best weights will be saved, and a Boolean True to save the weights only when there is an improvement

Adam optimizer with a learning rate = 0.0001

Cross-entropy loss function (explained in chapter 2)

Allows you to do real-time data augmentation on images on CPU in parallel to training your model on GPU. The callback to the checkpointer saves the model weights; you can add other callbacks like an early stopping function.

When you run this code, you will see the verbose output of the network training for each epoch. Keep your eyes on the loss and val_loss values to analyze the network and diagnose bottlenecks. Figure 4.32 shows the verbose output of epochs 121 to 125.

Figure 4.32 Verbose output of epochs 121 to 125

Step 5: Evaluate the model

To evaluate the model, we use a Keras function called evaluate and print the results:

scores = model.evaluate(x_test, y_test, batch_size=128, verbose=1)
print('\nTest result: %.3f loss: %.3f' % (scores[1]*100, scores[0]))
 
>> Test result: 90.260 loss: 0.398
Plot learning curves

Plot the learning curves to analyze the training performance and diagnose overfitting and underfitting (figure 4.33):

pyplot.plot(history.history['acc'], label='train')
pyplot.plot(history.history['val_acc'], label='test')
pyplot.legend()
pyplot.show()
 

Figure 4.33 Learning curves

Further improvements

Accuracy of 90% is pretty good, but you can still improve further. Here are some ideas you can experiment with:

  • More training epochs --Notice that the network was improving until epoch 123. You can increase the number of epochs to 150 or 200 and let the network train longer.

  • Deeper network --Try adding more layers to increase the model complexity, which increases the learning capacity.

  • Lower learning rate --Decrease the lr (you should train longer if you do so).

  • Different CNN architecture --Try something like Inception or ResNet (explained in detail in the next chapter). You can get up to 95% accuracy with the ResNet neural network after 200 epochs of training.

  • Transfer learning --In chapter 6, we will explore the technique of using a pretrained network on your dataset to get higher results with a fraction of the learning time.

Summary

  • The general rule of thumb is that the deeper your network is, the better it learns.

  • At the time of writing, ReLU performs best in hidden layers, and softmax performs best in the output layer.

  • Stochastic gradient descent usually succeeds in finding a minimum. But if you need fast convergence and are training a complex neural network, it’s safe to go with Adam.

  • Usually, the more you train, the better.

  • L2 regularization and dropout work well together to reduce network complexity and overfitting.
