This chapter concludes the first part of this book, providing a foundation for deep learning (DL). In chapter 2, you learned how to build a multilayer perceptron (MLP). In chapter 3, you learned about a neural network architecture topology that is very commonly used in computer vision (CV) problems: convolutional neural networks (CNNs). In this chapter, we will wrap up this foundation by discussing how to structure your machine learning (ML) project from start to finish. You will learn strategies to quickly and efficiently get your DL systems working, analyze the results, and improve network performance.
As you might have already noticed from the previous projects, DL is a very empirical process. It relies on running experiments and observing model performance more than on having one go-to formula for success that fits all problems. We often have an initial idea for a solution, code it up, run the experiment to see how it did, and then use the outcome of this experiment to refine our ideas. When building and tuning a neural network, you will find yourself making many seemingly arbitrary decisions.
In this chapter, you will learn the following:
Defining the performance metrics for your system --In addition to model accuracy, you will use other metrics like precision, recall, and F-score to evaluate your network.
Designing a baseline model --You will choose an appropriate neural network architecture to run your first experiment.
Getting your data ready for training --In real-world problems, data comes in messy, not ready to be fed to a neural network. In this section, you will massage your data to get it ready for learning.
Evaluating your model and interpreting its performance --When training is complete, you analyze your model’s performance to identify bottlenecks and narrow down improvement options. This means diagnosing which of the network components are performing worse than expected and identifying whether poor performance is due to overfitting, underfitting, or a defect in the data.
Improving the network and tuning hyperparameters --Finally, we will dive deep into the most important hyperparameters to help develop your intuition about which hyperparameters you need to tune. You will use tuning strategies to make incremental changes based on your diagnosis from the previous step.
TIP With more practice and experimentation, DL engineers and researchers build their intuition over time as to the most effective ways to make improvements. My advice is to get your hands dirty and try different architectures and approaches to develop your hyperparameter-tuning skills.
Performance metrics allow us to evaluate our system. When we develop a model, we want to find out how well it is working. The simplest way to measure the “goodness” of our model is by measuring its accuracy. The accuracy metric measures how many times our model made the correct prediction. So, if we test the model with 100 input samples, and it made the correct prediction 90 times, this means the model is 90% accurate.
Here is the equation used to calculate model accuracy:
$$\text{accuracy} = \frac{\text{correct predictions}}{\text{total number of examples}}$$
We have been using accuracy as a metric for evaluating our model in earlier projects, and it works fine in many cases. But let’s consider the following problem: you are designing a medical diagnosis test for a rare disease. Suppose that only one in every million people has this disease. Without any training or even building a system at all, if you hardcode the output to be always negative (no disease found), your system will achieve 99.9999% accuracy. Is that good? The system is 99.9999% accurate, which might sound fantastic, but it will never capture the patients with the disease. This means the accuracy metric is not suitable to measure the “goodness” of this model. We need other evaluation metrics that measure different aspects of the model’s prediction ability.
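To make the rare-disease argument concrete, here is a minimal sketch in plain Python. The population of one million people with a single sick patient is a made-up illustration of the scenario above, and the "model" is just a list of hardcoded negative predictions.

```python
# Hypothetical population: 1,000,000 people, exactly one of whom is sick.
population = 1_000_000
labels = [0] * population   # 0 = healthy
labels[0] = 1               # 1 = sick (the single positive case)

# A "model" that is hardcoded to always predict negative (no disease found).
predictions = [0] * population

# Accuracy counts every matching prediction, regardless of class.
correct = sum(1 for y, p in zip(labels, predictions) if y == p)
accuracy = correct / population

# How many of the sick patients did the model actually catch?
sick_found = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)

print(f"accuracy: {accuracy:.4%}")
print(f"sick patients found: {sick_found}")
```

The accuracy is almost perfect, yet the count of sick patients found is zero, which is exactly why accuracy alone is misleading on heavily imbalanced data.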
To set the stage for other metrics, we will use a confusion matrix : a table that describes the performance of a classification model. The confusion matrix itself is relatively simple to understand, but the related terminology can be a little confusing at first. Once you understand it, you’ll find that the concept is really intuitive and makes a lot of sense. Let’s go through it step by step.
The goal is to describe model performance from different angles other than prediction accuracy. For example, suppose we are building a classifier to predict whether a patient is sick or healthy. The expected classifications are either positive (the patient is sick) or negative (the patient is healthy). We run our model on 1,000 patients and enter the model predictions in table 4.1.
Let’s now define the most basic terms, which are whole numbers (not rates):
True positives (TP) --The model correctly predicted yes (the patient has the disease).
True negatives (TN) --The model correctly predicted no (the patient does not have the disease).
False positives (FP) --The model falsely predicted yes, but the patient actually does not have the disease (in some literature known as a Type I error or error of the first kind).
False negatives (FN) --The model falsely predicted no, but the patient actually does have the disease (in some literature known as a Type II error or error of the second kind).
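The four terms above can be counted directly from a list of true labels and a list of predictions. The following is a small sketch, not a library API: the function name `confusion_counts` and the eight example patients are invented for illustration, with 1 meaning sick and 0 meaning healthy.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels where 1 = sick and 0 = healthy."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1          # correctly predicted yes
        elif actual == 0 and predicted == 0:
            tn += 1          # correctly predicted no
        elif actual == 0 and predicted == 1:
            fp += 1          # predicted yes, but the patient is healthy
        else:
            fn += 1          # predicted no, but the patient is sick
    return tp, tn, fp, fn

# Made-up labels for eight patients (illustrative only).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)  # prints: 3 3 1 1
```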
The patients that the model predicts are negative (no disease) are the ones that the model believes are healthy, and we can send them home without further care. The patients that the model predicts are positive (have disease) are the ones that we will send for further investigation. Which mistake would we rather make? Mistakenly diagnosing someone as positive (has disease) and sending them for more investigation is not as bad as mistakenly diagnosing someone as negative (healthy) and sending them home at risk to their life. The obvious choice of evaluation metric here is that we care more about the number of false negatives (FN). We want to find all the sick people, even if the model accidentally classifies some healthy people as sick. This metric is called recall.
Recall (also known as sensitivity) tells us how many of the sick patients our model correctly diagnosed as sick. In other words, it drops every time the model incorrectly diagnoses a sick patient as negative (false negative, FN). Recall is calculated by the following equation:
$$\text{recall} = \frac{TP}{TP + FN}$$
Precision (also known as positive predictive value) complements recall. It tells us how many of the patients that our model diagnosed as sick are actually sick. In other words, it drops every time the model incorrectly diagnoses a well patient as positive (false positive, FP). Precision is calculated by the following equation:
$$\text{precision} = \frac{TP}{TP + FP}$$
In many cases, we want to summarize the performance of a classifier with a single metric that represents both recall and precision. To do so, we can convert precision (p) and recall (r) into a single F-score metric. In mathematics, this is called the harmonic mean of p and r:
$$F\text{-score} = \frac{2pr}{p + r}$$
The F-score gives a good overall representation of how your model is performing. Let’s take a look at the health-diagnostics example again. We agreed that this problem calls for a high-recall model. But what if the model is doing really well on the FN and giving us a high recall score, but it’s performing poorly on the FP and giving us a low precision score? Doing poorly on FP means that, in order to not miss any sick patients, it is mistakenly diagnosing a lot of patients as sick, to be on the safe side. So, while recall might be more important for this problem, it is good to look at the model from both scores--precision and recall--together.
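As a rough sketch, precision, recall, and the F-score can all be computed from the raw counts using their standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and the F-score as the harmonic mean of the two. The counts below are invented to mimic a model that catches most sick patients but raises many false alarms.

```python
def precision(tp, fp):
    """Fraction of positive predictions that are truly positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of truly positive cases that the model found."""
    return tp / (tp + fn)

def f_score(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r)

# Hypothetical counts: 90 sick patients found, 10 missed, 30 false alarms.
tp, fn, fp = 90, 10, 30

p = precision(tp, fp)   # 90 / 120
r = recall(tp, fn)      # 90 / 100
f = f_score(p, r)

print(f"precision={p:.3f} recall={r:.3f} F-score={f:.3f}")
```

Because the harmonic mean is pulled toward the smaller of the two values, the F-score here lands closer to the weaker precision than a simple average would.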
NOTE Defining the model evaluation metric is a necessary step because it will guide your approach to improving the system. Without clearly defined metrics, it can be difficult to tell whether changes to an ML system result in progress or not.
Now that you have selected the metrics you will use to evaluate your system, it is time to establish a reasonable end-to-end system for training your model. Depending on the problem you are solving, you need to design the baseline to suit your network type and architecture. In this step, you will want to answer questions like these:
Should I use an MLP or CNN network (or RNN, explained later in the book)?
Should I use other object detection techniques like YOLO or SSD (explained in later chapters)?
Do I need to add any other regularization layers like dropout or batch normalization to avoid overfitting?
If your problem is similar to another problem that has been studied extensively, you will do well to first copy the model and algorithm already known to perform the best for that task. You can even use a model that was trained on a different dataset for your own problem without having to train it from scratch. This is called transfer learning and will be discussed in detail in chapter 6.
For example, in the last chapter’s project, we used the architecture of the popular AlexNet as a baseline model. Figure 4.1 shows the architecture of an AlexNet deep CNN, with the dimensions of each layer. The input layer is followed by five convolutional layers (CONV1 through CONV5), the output of the fifth convolutional layer is fed into two fully connected layers (FC6 through FC7), and the output layer is a fully connected layer (FC8) with a softmax function:
INPUT ⇒ CONV1 ⇒ POOL1 ⇒ CONV2 ⇒ POOL2 ⇒ CONV3 ⇒ CONV4 ⇒ CONV5 ⇒ POOL3 ⇒ FC6 ⇒ FC7 ⇒ SOFTMAX_8
Looking at the AlexNet architecture, you will find all the network hyperparameters that you need to get started with your own model:
Network depth (number of layers): 5 convolutional layers plus 3 fully connected layers
Layers’ depth (number of filters): CONV1 = 96, CONV2 = 256, CONV3 = 384, CONV4 = 384, CONV5 = 256
ReLU as the activation function in the hidden layers (CONV1 all the way to FC7)
NOTE In the next chapter, we will discuss some of the most popular CNN architectures along with their code implementations in Keras. We will look at networks like LeNet, AlexNet, VGG, ResNet, and Inception that will build your understanding of what architecture works best for different problems and perhaps inspire you to invent your own CNN architecture.
We have defined the performance metrics that we will use to evaluate our model and have built the architecture of our baseline model. Let’s get our data ready for training. It is important to note that this process varies a lot based on the problem and data you have. Here, I’ll explain the basic data-massaging techniques that you need to perform before training your model. I’ll also help you develop an instinct for what “ready data” looks like so you can determine which preprocessing techniques you need.
When we train an ML model, we split the data into train and test datasets (figure 4.2). We use the training dataset to train the model and update the weights, and then we evaluate the model against the test dataset that it hasn’t seen before. The golden rule here is this: never use the test data for training. The reason we should never show the test samples to the model while training is to make sure the model is not cheating. We show the model the training samples to learn their features, and then we test how it generalizes on a dataset that it has never seen, to get an unbiased evaluation of its performance.
After each epoch during the training process, we need to evaluate the model’s accuracy and error to see how it is performing and tune its parameters. If we use the test dataset to evaluate the model during training, we will break our golden rule of never using the testing data during training. The test data is only used to evaluate the final performance of the model after training is complete. So we make an additional split called a validation dataset to evaluate and tune parameters during training (figure 4.3). Once the model has completed training, we test its final performance over the test dataset.
Take a look at this pseudo code for model training:
for each epoch
    for each training data instance
        propagate error through the network
        adjust the weights
        calculate the accuracy and error over training data
    for each validation data instance
        calculate the accuracy and error over the validation data
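The loop structure above can be sketched in runnable form on a deliberately tiny problem. This is not the Keras training loop used in the projects: the model is a single linear unit, the data is synthetic, the loss is mean squared error, and the learning rate and epoch count are hypothetical values chosen for illustration.

```python
import random

rng = random.Random(0)

def make_data(n):
    """Toy regression data: y = 2x + 1 plus a little noise (made up for illustration)."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.05)))
    return data

def mean_squared_error(data, w, b):
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

train_data, val_data = make_data(60), make_data(20)
w, b, lr = 0.0, 0.0, 0.05   # weight, bias, and a hypothetical learning rate
history = []

for epoch in range(30):
    for x, y in train_data:          # for each training data instance
        error = (w * x + b) - y      # forward pass and prediction error
        w -= lr * error * x          # adjust the weights along the gradient
        b -= lr * error
    # The validation data is only measured here; it never updates w or b.
    history.append({
        "train_loss": mean_squared_error(train_data, w, b),
        "val_loss": mean_squared_error(val_data, w, b),
    })

print(history[0])
print(history[-1])
```

Recording both losses after every epoch gives the same kind of per-epoch record that is later plotted as learning curves.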
As we saw in the project in chapter 3, when we train the model, we get train_loss, train_acc, val_loss, and val_acc after each epoch (figure 4.4). We use this data to analyze the network’s performance and diagnose overfitting and underfitting, as you will see in section 4.4.
Traditionally, an 80/20 or 70/30 split between train and test datasets was used in ML projects. With a validation dataset added, the split became 60/20/20 or 70/15/15. But that was back when an entire dataset was just tens of thousands of samples. With the huge amount of data we have now, sometimes 1% for each of the validation and test sets is enough. For example, if our dataset contains 1 million samples, 10,000 samples is very reasonable for each of the test and validation sets, because it doesn’t make sense to hold back several hundred thousand samples of your dataset. It is better to use this data for model training.
So, to recap, if you have a relatively small dataset, the traditional ratios might be okay. But if you are dealing with a large dataset, then it is fine to set your test and validation sets to much smaller fractions.
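Here is a minimal sketch of such a three-way split in plain Python. The helper `train_val_test_split` is a hypothetical name, not a library function, and it assumes the samples can be shuffled freely (no time ordering or grouping to respect).

```python
import random

def train_val_test_split(samples, val_frac, test_frac, seed=0):
    """Shuffle once, carve out validation and test sets, and keep the rest for training."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_frac)
    n_test = round(len(shuffled) * test_frac)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test

# Small dataset: the traditional 60/20/20 split.
small = train_val_test_split(range(1_000), val_frac=0.2, test_frac=0.2)
# Large dataset: 1% each for validation and test, 98% for training.
large = train_val_test_split(range(1_000_000), val_frac=0.01, test_frac=0.01)

print([len(part) for part in small])   # prints: [600, 200, 200]
print([len(part) for part in large])   # prints: [980000, 10000, 10000]
```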
Before you feed your data to the neural network, you will need to do some data cleanup and processing to get it ready for your learning model. There are several preprocessing techniques to choose from, based on the state of your dataset and the problem you are solving. The good news about neural networks is that they require minimal data preprocessing. When given a large amount of training data, they are able to extract and learn features from raw data, unlike the other traditional ML techniques.
With that said, preprocessing still might be required to improve performance or work within specific limitations on the neural network, such as converting images to grayscale, image resizing, normalization, and data augmentation. In this section, we’ll go through these preprocessing concepts; we’ll see their code implementations in the project at the end of the chapter.
We talked in chapter 3 about how color images are represented in three matrices versus only one matrix for grayscale images; color images add computational complexity with their many parameters. You can make a judgment call about converting all your images to grayscale, if your problem doesn’t require color, to save on the computational complexity. A good rule of thumb here is to use the human-level performance rule: if you are able to identify the object with your eyes in grayscale images, then a neural network will probably be able to do the same.
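As an illustration of what the conversion does, here is a sketch that collapses three color channels into one. It assumes the widely used ITU-R BT.601 luminance weights (0.299, 0.587, 0.114); image libraries may use slightly different weights, and the tiny 2 × 2 image is invented for the example.

```python
def to_grayscale(image):
    """Convert an H x W image of (R, G, B) tuples into an H x W grid of luminance values."""
    # Assumed weights: ITU-R BT.601 luma coefficients for red, green, and blue.
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row] for row in image]

# A made-up 2 x 2 color image: white, black, pure red, pure blue.
image = [
    [(255, 255, 255), (0, 0, 0)],
    [(255, 0, 0), (0, 0, 255)],
]

gray = to_grayscale(image)
for row in gray:
    print([round(value, 2) for value in row])
```

Each pixel now holds one value instead of three, so the input that the network has to process is a third of the original size.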
One limitation for neural networks is that they require all images to be the same shape. If you are using MLPs, for example, the number of nodes in the input layer must be equal to the number of pixels in the image (remember how, in chapter 3, we flattened the image to feed it to the MLP). The same is true for CNNs. You need to set the input shape of the first convolutional layer. To demonstrate this, let’s look at the Keras code to add the first CNN layer:
model.add(Conv2D(filters=16, kernel_size=2, padding='same',
                 activation='relu', input_shape=(32, 32, 3)))
As you can see, we have to define the shape of the image at the first convolutional layer. For example, if we have three images with dimensions of 32 × 32, 28 × 28, and 64 × 64, we have to resize all the images to one size before feeding them to the model.
Data normalization is the process of rescaling your data to ensure that each input feature (pixel, in the image case) has a similar data distribution. Often, raw images are composed of pixels with varying scales (ranges of values). For example, one image may have a pixel value range from 0 to 255, and another may have a range of 20 to 200. Although not required, it is preferred to normalize the pixel values to the range of 0 to 1 to boost learning performance and make the network converge faster.
To make learning faster for your neural network, your data should have the following characteristics:
Small values --Typically, most values should be in the [0, 1] range.
Homogeneous --All pixels should have values in the same range.
Data normalization is done by subtracting the mean from each pixel and then dividing the result by the standard deviation. The distribution of such data resembles a Gaussian curve centered at zero. To demonstrate the normalization process, figure 4.5 illustrates the operation in a scatterplot.
TIP Make sure you normalize your training and test data by using the same mean and standard deviation, because you want your data to go through the same transformation and rescale exactly the same way. You will see how this is implemented in the project at the end of this chapter.
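The tip can be sketched in a few lines of plain Python. The pixel values are made up, and the helper name `standardize` is hypothetical; the point is only that the mean and standard deviation are computed once, from the training data, and then reused for the test data.

```python
from statistics import mean, pstdev

# Made-up pixel intensities (illustrative only).
train_pixels = [0, 50, 100, 150, 200, 250]
test_pixels = [20, 120, 220]

# Compute the statistics on the training data only.
mu = mean(train_pixels)
sigma = pstdev(train_pixels)

def standardize(values, mu, sigma):
    """Subtract the mean and divide by the standard deviation."""
    return [(v - mu) / sigma for v in values]

train_norm = standardize(train_pixels, mu, sigma)
# Reuse the training mean and standard deviation for the test data.
test_norm = standardize(test_pixels, mu, sigma)

print([round(v, 3) for v in train_norm])
print([round(v, 3) for v in test_norm])
```

After this transformation, the training values have zero mean and unit standard deviation, and the test values are expressed on exactly the same scale.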
In non-normalized data, the cost function will likely look like a squished, elongated bowl. After you normalize your features, your cost function will look more symmetric. Figure 4.6 shows the cost function of two features, F1 and F2.
As you can see, for normalized features, the GD algorithm goes straight forward toward the minimum error, thereby reaching it quickly. But for non-normalized features, it oscillates toward the direction of the minimum error and ends with a long march down the error mountain. It will eventually reach the minimum, but it will take longer to converge.
TIP Why does GD oscillate for non-normalized features? If we don’t normalize our data, the range of values will likely be different for each feature, and thus a single learning rate will cause corrections in each dimension that are out of proportion to one another. This forces GD to oscillate on its way toward the minimum error, so it ends up taking a longer path down the error mountain.
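To see this effect numerically, here is a minimal sketch on a made-up quadratic cost, f(w) = 0.5 · (c1·w1² + c2·w2²). The curvature values and learning rates are hypothetical: equal curvatures stand in for normalized features (a symmetric bowl), and a 1:100 ratio stands in for non-normalized features (an elongated bowl).

```python
def gd_steps(curvatures, lr, start=(1.0, 1.0), tol=1e-3, max_steps=100_000):
    """Run gradient descent on f(w) = 0.5 * sum(c_i * w_i**2).

    Returns the number of steps until every |w_i| < tol.
    """
    w = list(start)
    for step in range(1, max_steps + 1):
        # The gradient of 0.5 * c * w**2 with respect to w is c * w.
        w = [wi - lr * c * wi for wi, c in zip(w, curvatures)]
        if all(abs(wi) < tol for wi in w):
            return step
    return max_steps

# Symmetric bowl: one learning rate suits both dimensions.
round_bowl = gd_steps(curvatures=(1.0, 1.0), lr=0.5)

# Elongated bowl: the steep dimension forces lr below 2 / 100 to stay stable,
# so the steep weight flips sign every step while the flat weight crawls.
elongated_bowl = gd_steps(curvatures=(1.0, 100.0), lr=0.019)

print(round_bowl, elongated_bowl)
```

With the same stopping tolerance, the elongated bowl needs far more steps, because the learning rate that keeps the steep direction from diverging is much too small for the flat direction.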
Data augmentation will be discussed in more detail later in this chapter, when we cover regularization techniques. But it is important for you to know that this is another preprocessing technique that you have in your toolbelt to use when needed.
After the baseline model is established and the data is preprocessed, it is time to train the model and measure its performance. After training is complete, you need to determine if there are bottlenecks, diagnose which components are performing poorly, and determine whether the poor performance is due to overfitting, underfitting, or a defect in the training data.
One of the main criticisms of neural networks is that they are “black boxes.” Even when they work very well, it is hard to understand why they work so well. Many efforts are being made to improve the interpretability of neural networks, and this field is likely to evolve rapidly in the next few years. In this section, I’ll show you how to diagnose neural networks and analyze their behavior.
After running your experiment, you want to observe its performance, determine if bottlenecks are impacting its performance, and look for indicators of areas you need to improve. The main cause of poor performance in ML is either overfitting or underfitting the training dataset. We talked about overfitting and underfitting in chapter 3, but now we will dive a little deeper to understand how to detect when the system is fitting the training data too much (overfitting) and when it is too simple to fit the data (underfitting):
Underfitting means the model is too simple: it fails to learn the training data, so it performs poorly on the training data. One example of underfitting is using a single perceptron to classify the shapes in figure 4.7. As you can see, a straight line does not split the data accurately.
Overfitting is when the model is too complex for the problem at hand. Instead of learning features that fit the training data, it actually memorizes the training data. So it performs very well on the training data, but it fails to generalize when tested with new data that it hasn’t seen before. In figure 4.8, you see that the model fits the data too well: it splits the training data, but this kind of fitting will fail to generalize.
We want to build a model that is just right for the data: not too complex, causing overfit, or too simple, causing underfit. In figure 4.9, you see that the model missed on a data sample of the shape O, but it looks much more likely to generalize on new data.
TIP The analogy I like to use to explain overfitting and underfitting is a student studying for an exam. Underfitting is when the student doesn’t study very well and so fails the exam. Overfitting is when the student memorizes the book and can answer correctly when asked questions from the book, but answers poorly when asked questions from outside the book. The student failed to generalize. What we want is a student to learn from the book (training data) well enough to be able to generalize when asked questions related to the book material.
To diagnose underfitting and overfitting, the two values to focus on while training are the training error and the validation error:
If the model is doing very well on the training set but relatively poorly on the validation set, then it is overfitting. For example, if train_error is 1% and val_error is 10%, it looks like the model has memorized the training dataset but is failing to generalize on the validation set. In this case, you might consider tuning your hyperparameters to avoid overfitting and iteratively train, test, and evaluate until you achieve an acceptable performance.
If the model is performing poorly on the training set, then it is underfitting. For example, if the train_error is 14% and val_error is 15%, the model might be too simple and is failing to learn the training set. You might want to consider adding more hidden layers or training longer (more epochs), or try different neural network architectures.
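The two rules above can be written as a small helper. This is only a sketch: the function name `diagnose` and both thresholds are hypothetical, and what counts as an acceptable training error or an acceptable gap depends entirely on your problem.

```python
def diagnose(train_error, val_error, max_train_error=0.05, max_gap=0.03):
    """Classify a training run from its final training and validation errors.

    max_train_error and max_gap are hypothetical thresholds, not universal values.
    """
    if train_error > max_train_error:
        # The model cannot even fit the data it was trained on.
        return "underfitting"
    if val_error - train_error > max_gap:
        # Good on training data, much worse on unseen data.
        return "overfitting"
    return "good fit"

print(diagnose(train_error=0.01, val_error=0.10))  # prints: overfitting
print(diagnose(train_error=0.14, val_error=0.15))  # prints: underfitting
print(diagnose(train_error=0.02, val_error=0.03))  # prints: good fit
```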
In the next section, we will discuss several hyperparameter-tuning techniques to avoid overfitting and underfitting.
Instead of looking at the training verbose output and comparing the error numbers, one way to diagnose overfitting and underfitting is to plot your training and validation errors throughout the training, as you see in figure 4.10.
Figure 4.10A shows that the network improves the loss value (aka learns) on the training data but fails to generalize on the validation data. Learning on the validation data progresses in the first couple of epochs and then flattens out and may even get worse. This is a form of overfitting. Note that this graph shows that the network is actually learning on the training data, a good sign that training is happening. So you don’t need to add more hidden units, nor do you need to build a more complex model. If anything, your network is too complex for your data, because it is learning so much that it is actually memorizing the data and failing to generalize to new data. In this case, your next step might be to collect more data or apply techniques to avoid overfitting.
Figure 4.10B shows that the network performs poorly on both training and validation data. In this case, your network is not learning. You don’t need more data, because the network is too simple to learn from the data you already have. Your next step is to build a more complex model.
Figure 4.10C shows that the network is doing a good job of learning the training data and generalizing to the validation data. This means there is a good chance that the network will have good performance out in the wild on test data.
Before we move on to hyperparameter tuning, let’s run a quick experiment to see how we split the data and build, train, and visualize the model results. You can see an exercise notebook for this at www.manning.com/books/deep-learning-for-vision-systems or www.computervisionbook.com.
In this exercise, we will do the following:
from sklearn.datasets import make_blobs ❶
from keras.utils import to_categorical ❷
from keras.models import Sequential ❸
from keras.layers import Dense ❸
from matplotlib import pyplot ❹
❶ The scikit-learn library to generate sample data
❷ Keras method that converts a class vector to a binary class matrix (one-hot encoding)
❸ Neural networks and layers library
Use make_blobs from scikit-learn to generate a toy dataset with only two features and three label classes:
X, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)
Use to_categorical from Keras to one-hot-encode the label:
y = to_categorical(y)
Split the dataset into 80% training data and 20% test data. Note that we did not create a validation dataset in this example, for simplicity:
n_train = 800
train_X, test_X = X[:n_train, :], X[n_train:, :]
train_y, test_y = y[:n_train], y[n_train:]
print(train_X.shape, test_X.shape)
>> (800, 2) (200, 2)
Develop the model architecture--here, a very simple, two-layer MLP network (figure 4.11 shows the model summary):
model = Sequential()
model.add(Dense(25, input_dim=2, activation='relu')) ❶
model.add(Dense(3, activation='softmax')) ❷
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy']) ❸
model.summary()
❶ Two input dimensions because we have two features. ReLU activation function for hidden layers.
❷ Softmax activation for the output layer with three nodes because we have three classes
❸ Cross-entropy loss function (explained in chapter 2) and adam optimizer (explained in the next section)
Train the model for 1,000 epochs:
history = model.fit(train_X, train_y, validation_data=(test_X, test_y), epochs=1000, verbose=1)
_, train_acc = model.evaluate(train_X, train_y)
_, test_acc = model.evaluate(test_X, test_y)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
>> Train: 0.825, Test: 0.819
Plot the learning curves of model accuracy (figure 4.12):
pyplot.plot(history.history['accuracy'], label='train')
pyplot.plot(history.history['val_accuracy'], label='test')
pyplot.legend()
pyplot.show()
Let’s evaluate the network. Looking at the learning curve in figure 4.12, you can see that both train and test curves fit the data with a similar behavior. This means the network is not overfitting, which would be indicated if the train curve were doing well but the test curve were not. But could the network be underfitting? Maybe: 82% on a very simple dataset like this is considered poor performance. To improve the performance of this neural network, I would try to build a more complex network and experiment with other techniques that address underfitting.
After you run your training experiment and diagnose for overfitting and underfitting, you need to decide whether it is more effective to spend your time tuning the network, cleaning up and processing your data, or collecting more data. The last thing you want to do is to spend a few months working in one direction only to find out that it barely improves network performance. So, before discussing the different hyperparameters to tune, let’s answer this question first: should you collect more data?
We know that deep neural networks thrive on lots of data. With that in mind, ML novices often throw more data to the learning algorithm as their first attempt to improve its performance. But collecting and labeling more data is not always a feasible option and, depending on your problem, could be very costly. Plus, it might not even be that effective.
NOTE While efforts are being made to automate some of the data-labeling process, at the time of writing, most labeling is done manually, especially in CV problems. By manually, I mean that actual human beings look at each image and label them one by one (this is called human in the loop). Here is another layer of complexity: if you are labeling lung X-ray images to detect a certain tumor, for example, you need qualified physicians to diagnose the images. This will cost a lot more than hiring people to classify dogs and cats. So collecting more data might be a good solution for some accuracy issues and increase the model’s robustness, but it is not always a feasible option.
In other scenarios, it is much better to collect more data than to improve the learning algorithm. So it would be nice if you had quick and effective ways to figure out whether it is better to collect more data or tune the model hyperparameters.
The process I use to make this decision is as follows:
Determine whether the performance on the training set is acceptable as-is.
Visualize and observe the performance of these two metrics: training accuracy (train_acc) and validation accuracy (val_acc).
If the network yields poor performance on the training dataset, this is a sign of underfitting. There is no reason to gather more data, because the learning algorithm is not using the training data that is already available. Instead, try tuning the hyperparameters or cleaning up the training data.
If performance on the training set is acceptable but is much worse on the validation dataset, then the network is overfitting your training data and failing to generalize to the validation set. In this case, collecting more data could be effective.
TIP When evaluating model performance, the goal is to categorize the high-level problem. If it’s a data problem, spend more time on data preprocessing or collecting more data. If it’s a learning algorithm problem, try to tune the network.
Let’s not get parameters confused with hyperparameters. Hyperparameters are the variables that we set and tune. Parameters are the variables that the network learns and updates during training, with no direct manipulation from us. In neural networks, parameters are the weights and biases that are optimized automatically during the backpropagation process to produce the minimum error. In contrast, hyperparameters are not learned by the network. They are set by the ML engineer before training the model and then tuned. These are variables that define the network structure and determine how the network is trained. Hyperparameter examples include learning rate, batch size, number of epochs, number of hidden layers, and others discussed in the next section.
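One way to see the distinction is to count the parameters that a given set of hyperparameters produces. In this sketch, the layer sizes are hyperparameters that we choose, and the weights and biases they imply are the parameters the network will learn. The helper name `count_dense_parameters` is hypothetical, and the sizes match the small two-layer MLP from the exercise earlier in this chapter (2 inputs, 25 hidden units, 3 outputs).

```python
def count_dense_parameters(layer_sizes):
    """Count the weights and biases of a fully connected network.

    layer_sizes lists the number of units per layer, starting with the input.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out   # one weight per connection
        total += n_out          # one bias per unit in the receiving layer
    return total

# Hyperparameters we set: 2 input features, 25 hidden units, 3 output classes.
layer_sizes = [2, 25, 3]

# Parameters the network learns: (2*25 + 25) + (25*3 + 3).
print(count_dense_parameters(layer_sizes))  # prints: 153
```

Changing a single hyperparameter, such as the hidden layer width, changes how many parameters backpropagation has to optimize.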
DL algorithms come with several hyperparameters that control many aspects of the model’s behavior. Some hyperparameters affect the time and memory cost of running the algorithm, and others affect the model’s prediction ability.
The challenge with hyperparameter tuning is that there are no magic numbers that work for every problem. This is related to the no free lunch theorem that we referred to in chapter 1. Good hyperparameter values depend on the dataset and the task at hand. Choosing the best hyperparameters and knowing how to tune them require an understanding of what each hyperparameter does. In this section, you will build your intuition about why you would want to nudge a hyperparameter one way or another, and I’ll propose good starting values for some of the most effective hyperparameters.
Generally speaking, we can categorize neural network hyperparameters into three main categories:
We discussed all of these hyperparameters in chapters 2 and 3 except the regularization techniques. Next, we will cover them quickly with a focus on understanding what happens when we tune each knob up or down and how to know which hyperparameter to tune.
First, let’s talk about the hyperparameters that define the neural network architecture:
Whether you are designing an MLP, CNN, or other neural network, you need to decide on the number of hidden layers in your network (depth) and the number of neurons in each layer (width). The number of hidden layers and units describes the learning capacity of the network. The goal is to set the number large enough for the network to learn the data features. A smaller network might underfit, and a larger network might overfit. To know what is a “large enough” network, you pick a starting point, observe the performance, and then tune up or down.
The more complex the dataset, the more learning capacity the model will need to learn its features. Take a look at the three datasets in figure 4.13.
If you provide the model with too much learning capacity (too many hidden units), it might tend to overfit the data and memorize the training set. If your model is overfitting, you might want to decrease the number of hidden units.
Generally, it is good to add hidden neurons until the validation error no longer improves. The trade-off is that it is computationally expensive to train deeper networks. Having a small number of units may lead to underfitting, while having more units is usually not harmful, with appropriate regularization (like dropout and others discussed later in this chapter).
Try playing around with the TensorFlow playground (https://playground.tensorflow.org) to develop more intuition. Experiment with different architectures, and gradually add more layers and more units in hidden layers while observing the network’s learning behavior.
Activation functions (discussed extensively in chapter 2) introduce nonlinearity to our neurons. Without activations, our neurons would pass linear combinations (weighted sums) to each other and not solve any nonlinear problems. This is a very active area of research: every few weeks, we are introduced to new types of activations, and there are many available. But at the time of writing, ReLU and its variations (like Leaky ReLU) perform the best in hidden layers. And in the output layer, it is very common to use the softmax function for classification problems, with the number of neurons equal to the number of classes in your problem.
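As a quick refresher, here is a plain-Python sketch of the activations mentioned above. The function names and the Leaky ReLU slope of 0.01 are assumptions made for illustration; DL libraries ship their own optimized versions.

```python
import math


def relu(z):
    """ReLU: pass positive values through, clamp negatives to zero."""
    return max(0.0, z)


def leaky_relu(z, slope=0.01):
    """Leaky ReLU: like ReLU, but negative inputs keep a small slope (assumed 0.01)."""
    return z if z > 0 else slope * z


def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax([2.0, 1.0, 0.1])  # one output neuron per class
print(relu(-3.0), leaky_relu(-3.0), probabilities)
```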
Now that we have built our network architecture, it is time to discuss the hyperparameters that determine how the network learns and optimizes its parameters to achieve the minimum error.
The learning rate is the single most important hyperparameter, and one should always make sure that it has been tuned. If there is only time to optimize one hyperparameter, then this is the hyperparameter that is worth tuning.
The learning rate (lr value) was covered extensively in chapter 2. As a refresher, let’s think about how gradient descent (GD) works. The GD optimizer searches for the optimal values of weights that yield the lowest error possible. When setting up our optimizer, we need to define the step size that it takes when it descends the error mountain. This step size is the learning rate. It represents how fast or slow the optimizer descends the error curve. When we plot the cost function with only one weight, we get the oversimplified U-curve in figure 4.14, where the weight is randomly initialized at a point on the curve.
The GD calculates the gradient to find the direction that reduces the error (derivative). In figure 4.14, the descending direction is to the right. The GD starts taking steps down after each iteration (epoch). Now, as you can see in figure 4.15, if we make a miraculously correct choice of the learning rate value, we land on the best weight value that minimizes the error in only one step. This is an impossible case that I’m using for elaboration purposes. Let’s call this the ideal lr value.
If the learning rate is smaller than the ideal lr value, then the model can continue to learn by taking smaller steps down the error curve until it finds the most optimal value for the weight (figure 4.16). Much smaller means it will eventually converge but will take longer.
If the learning rate is larger than the ideal lr value, the optimizer will overshoot the optimal weight value in the first step, and then overshoot again on the other side in the next step (figure 4.17). This could possibly yield a lower error than what we started with and converge to a reasonable value, but not the lowest error that we are trying to reach.
If the learning rate is much larger than the ideal lr value (more than twice as much), the optimizer will not only overshoot the ideal weight, but get farther and farther from the min error (figure 4.18). This phenomenon is called divergence.
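The four learning rate cases can be reproduced with a toy experiment. This is a minimal sketch, assuming an error curve of error(w) = w**2 (chosen only because its ideal lr is easy to work out: 0.5) and a hypothetical starting weight of 4.0. On this particular curve, any lr above 1.0, which is twice the ideal value, diverges.

```python
def gradient_descent(lr, w0=4.0, steps=10):
    """Run plain GD on error(w) = w**2 and return the weight after each step."""
    w = w0
    history = [w]
    for _ in range(steps):
        grad = 2.0 * w     # derivative of w**2
        w = w - lr * grad  # the GD update rule
        history.append(w)
    return history


IDEAL_LR = 0.5  # for this curve, w - 0.5 * (2 * w) = 0: one step reaches the minimum

for name, lr in [("small", 0.1), ("ideal", IDEAL_LR), ("large", 0.9), ("too large", 1.1)]:
    final_w = gradient_descent(lr)[-1]
    print(f"{name:>9} lr={lr}: |w| after 10 steps = {abs(final_w):.4f}")
```

With the large lr, the weight flips sign at every step (overshooting from side to side) while still shrinking; with the too-large lr, it grows at every step.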
The optimal learning rate will be dependent on the topology of your loss landscape, which in turn is dependent on both your model architecture and your dataset. Whether you are using Keras, Tensorflow, PyTorch, or any other DL library, using the default learning rate value of the optimizer is a good start leading to decent results. Each optimizer type has its own default value. Read the documentation of the DL library that you are using to find out the default value of your optimizer. If your model doesn’t train well, you can play around with the lr variable using the usual suspects--0.1, 0.01, 0.001, 0.0001, 0.00001, and 0.000001--to improve performance or speed up training by searching for an optimal learning rate.
The way to debug this is to look at the validation loss values in the training verbose:
If val_loss decreases after each step, that’s good. Keep training until it stops improving.
If training is complete and val_loss is still decreasing, then maybe the learning rate was so small that it didn’t converge yet. In this case, you can do one of two things:
If val_loss starts to increase or oscillate up and down, then the learning rate is too high and you need to decrease its value.
Finding the learning rate value that is just right for your problem is an iterative process. You start with a static lr value, wait until training is complete, evaluate, and then tune. Another way to go about tuning your learning rate is to set a learning rate decay: a method by which the learning rate changes during training. It often performs better than a static value, and drastically reduces the time required to get optimal results.
By now, it’s clear that when we try lower learning rate values, we have a better chance of getting to a lower error point. But training will take longer. In some cases, training takes so long it becomes infeasible. A good trick is to implement a decay rate in our learning rate. The decay rate tells our network to automatically decrease the lr throughout the training process. For example, we can decrease the lr by a constant value of (x) for each (n) number of steps. This way, we can start with the higher value to take bigger steps toward the minimum, and then gradually decrease the learning rate every (n) epochs to avoid overshooting the minimum.
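One way to accomplish this is by reducing the learning rate at fixed intervals, which this chapter refers to as linear decay (halving at regular intervals is also commonly called step decay). For example, you can decrease it by half every five epochs, as shown in figure 4.19.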
Another way is to decrease the lr exponentially (exponential decay). For example, you can multiply it by 0.1 every eight epochs (figure 4.20). Clearly, the network will converge a lot slower than with linear decay, but it will eventually converge.
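Both schedules in these examples multiply the lr by a fixed factor at fixed intervals, so they can be sketched with one small function. The name step_decay and the initial lr of 0.1 are assumptions made for illustration.

```python
def step_decay(initial_lr, factor, every, epoch):
    """Multiply the learning rate by `factor` once every `every` epochs."""
    return initial_lr * factor ** (epoch // every)


for epoch in (0, 5, 10, 16):
    halved = step_decay(0.1, 0.5, 5, epoch)  # halve every five epochs
    tenth = step_decay(0.1, 0.1, 8, epoch)   # multiply by 0.1 every eight epochs
    print(f"epoch {epoch:2d}: halved={halved:.5f}  tenth={tenth:.5f}")
```

Most DL libraries accept a schedule like this through a callback or scheduler object (Keras, for example, provides a LearningRateScheduler callback); check your library's documentation for the exact function signature it expects.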
Other clever learning algorithms have an adaptive learning rate (adaptive learning). These algorithms use a heuristic approach that automatically updates the lr when training stops improving. This means not only decreasing the lr when needed, but also increasing it when improvements are too slow (too-small lr). Adaptive learning usually works better than other learning rate-setting strategies. Adam and Adagrad are examples of adaptive learning optimizers: more on adaptive optimizers later in this chapter.
Mini-batch size is another hyperparameter that you need to set and tune in the optimizer algorithm. The batch_size hyperparameter has a big effect on the resource requirements and speed of the training process.
In order to understand the mini-batch, let’s back up to the three GD types that we explained in chapter 2--batch, stochastic, and mini-batch:
Batch gradient descent (BGD) --We feed the entire dataset to the network all at once, apply the feedforward process, calculate the error, calculate the gradient, and backpropagate to update the weights. The optimizer calculates the gradient by looking at the error generated after it sees all the training data, and the weights are updated only once after each epoch. So, in this case, the mini-batch size equals the entire training dataset. The main advantage of BGD is that it has relatively low noise and bigger steps toward the minimum (see figure 4.21). The main disadvantage is that it can take too long to process the entire training dataset at each step, especially when training on big data. BGD also requires a huge amount of memory for training large datasets, which might not be available. BGD might be a good option if you are training on a small dataset.
Stochastic gradient descent (SGD) --Also called online learning. We feed the network a single instance of the training data at a time and use this one instance to do the forward pass, calculate error, calculate the gradient, and backpropagate to update the weights (figure 4.22). In SGD, the weights are updated after it sees each single instance (as opposed to processing the entire dataset before each step for BGD). SGD can be extremely noisy as it oscillates on its way to the global minimum because it takes a step down after each single instance, which could sometimes be in the wrong direction. This noise can be reduced by using a smaller learning rate, so, on average, it takes you in a good direction and almost always performs better than BGD. With SGD you get to make progress quickly and usually reach very close to the global minimum. The main disadvantage is that by calculating the gradient for one instance at a time, you lose the speed gain that comes with matrix multiplication in the training calculations.
To recap BGD and SGD, on one extreme, if you set your mini-batch size to 1 (stochastic training), the optimizer will take a step down the error curve after computing the gradient for every single instance of the training data. This is good, but you lose the increased speed of using matrix multiplication. On the other extreme, if your mini-batch size is your entire training dataset, then you are using BGD. It takes too long to make a step toward the minimum error when processing large datasets. Between the two extremes, there is mini-batch GD.
Mini-batch gradient descent (MB-GD) --A compromise between batch and stochastic GD. Instead of computing the gradient from one sample (SGD) or all training samples (BGD), we divide the training set into mini-batches to compute the gradient from. This way, we can take advantage of matrix multiplication for faster training and start making progress instead of having to wait to process the entire training set.
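The relationship between the three GD types comes down to how many weight updates happen per epoch. Here is a minimal sketch; the helper iterate_minibatches and the 10-instance stand-in dataset are hypothetical, and a real training loop would usually shuffle the data before each epoch.

```python
def iterate_minibatches(samples, batch_size):
    """Yield consecutive slices of `samples`, each holding at most `batch_size` items."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


training_set = list(range(10))  # stand-in for 10 training instances

for batch_size in (1, 4, len(training_set)):
    batches = list(iterate_minibatches(training_set, batch_size))
    # One weight update happens per batch, so this is the number of updates per epoch.
    print(f"batch_size={batch_size}: {len(batches)} updates per epoch")
```

A batch size of 1 corresponds to SGD, a batch size equal to the dataset size corresponds to BGD, and anything in between is mini-batch GD.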
In the history of DL, many researchers proposed optimization algorithms and showed that they work well with some problems. But most of them subsequently proved to not generalize well to the wide range of neural networks that we might want to train. In time, the DL community came to feel that the GD algorithm and some of its variants work well. So far, we have discussed batch, stochastic, and mini-batch GD.
We learned that choosing a proper learning rate can be challenging because a too-small learning rate leads to painfully slow convergence, while a too-large learning rate can hinder convergence and cause the loss function to fluctuate around the minimum or even diverge. We need more creative solutions to further optimize GD.
NOTE Optimizer types are well explained in the documentation of most DL frameworks. In this section, I’ll explain the concepts of two of the most popular gradient-descent-based optimizers--Momentum and Adam--that really stand out and have been shown to work well across a wide range of DL architectures. This will help you build a good foundation to dive deeper into other optimization algorithms. For more about optimization algorithms, read “An overview of gradient descent optimization algorithms” by Sebastian Ruder (https://arxiv.org/pdf/1609.04747.pdf).
Recall that SGD ends up with some oscillations in the vertical direction toward the minimum error (figure 4.23). These oscillations slow down the convergence process and make it harder to use larger learning rates, which could result in your algorithm overshooting and diverging.
To reduce these oscillations, a technique called momentum was invented that lets the GD navigate along relevant directions and softens the oscillation in irrelevant directions. In other words, it makes learning slower in the vertical-direction oscillations and faster in the horizontal-direction progress, which will help the optimizer reach the target minimum much faster.
This is similar to the idea of momentum from classical physics: when a snowball rolls down a hill, it accumulates momentum, going faster and faster. In the same way, our momentum term increases for dimensions whose gradients point in the same direction and reduces updates for dimensions whose gradients change direction. This leads to faster convergence and reduces oscillations.
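The snowball idea can be sketched in a few lines. This example assumes one common formulation of momentum (a velocity that keeps a fraction beta of the previous step) and a made-up error surface shaped like an elongated bowl; the function names, the starting point, and the values lr=0.01 and beta=0.9 are all illustrative choices.

```python
def loss(x, y):
    """An elongated bowl: much steeper along y than along x."""
    return 0.5 * (x ** 2 + 25.0 * y ** 2)


def gradient(x, y):
    """Partial derivatives of the loss with respect to x and y."""
    return x, 25.0 * y


def minimize(lr, beta, steps=100, start=(10.0, 1.0)):
    """GD with a velocity term; beta=0 reduces to plain GD."""
    x, y = start
    vx, vy = 0.0, 0.0
    for _ in range(steps):
        gx, gy = gradient(x, y)
        vx = beta * vx - lr * gx  # keep a fraction of the previous step
        vy = beta * vy - lr * gy
        x, y = x + vx, y + vy
    return loss(x, y)


plain_loss = minimize(lr=0.01, beta=0.0)
momentum_loss = minimize(lr=0.01, beta=0.9)
print(f"plain GD: {plain_loss:.4f}   with momentum: {momentum_loss:.6f}")
```

With the same learning rate and number of steps, the run with momentum ends at a much lower error on this surface because consecutive gradients along the shallow direction keep pointing the same way and accumulate.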
Adam stands for adaptive moment estimation. Adam keeps an exponentially decaying average of past gradients, similar to momentum. Whereas momentum can be seen as a ball rolling down a slope, Adam behaves like a heavy ball with friction to slow down the momentum and control it. Adam usually outperforms other optimizers because it helps train a neural network model much more quickly than the techniques we have seen earlier.
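For readers who want to see the mechanics, here is a minimal single-weight sketch of the Adam update rule as described in the Adam paper. The function name adam_minimize and the toy error function are assumptions; the default values (0.001, 0.9, 0.999, and 1e-8) are the ones proposed by the paper's authors.

```python
import math


def adam_minimize(grad_fn, w, steps, lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-8):
    """Minimize a one-weight error function with the Adam update rule."""
    m = 0.0  # decaying average of past gradients (first moment)
    v = 0.0  # decaying average of past squared gradients (second moment)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta_1 * m + (1.0 - beta_1) * g
        v = beta_2 * v + (1.0 - beta_2) * g * g
        m_hat = m / (1.0 - beta_1 ** t)  # bias correction for the zero initialization
        v_hat = v / (1.0 - beta_2 ** t)
        w = w - lr * m_hat / (math.sqrt(v_hat) + epsilon)
    return w


# error(w) = w**2, so the gradient is 2 * w (a toy problem chosen for illustration).
w_after_one_step = adam_minimize(lambda w: 2.0 * w, w=4.0, steps=1)
w_after_many_steps = adam_minimize(lambda w: 2.0 * w, w=4.0, steps=2000, lr=0.01)
print(w_after_one_step, w_after_many_steps)
```

Note that the first step moves the weight by roughly lr regardless of how large the gradient is, because the gradient is divided by the square root of its own running average.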
Again, we have new hyperparameters to tune. But the good news is that the default values of major DL frameworks often work well, so you may not need to tune at all--except for the learning rate, which is not an Adam-specific hyperparameter:
keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0)
The authors of Adam propose these default values:
A training iteration, or epoch, is when the model goes through a full cycle and sees the entire training dataset once. The epoch hyperparameter is set to define how many iterations our network continues training. The more training iterations, the more our model learns the features of our training data. To diagnose whether your network needs more or fewer training epochs, keep your eyes on the training and validation error values.
The intuitive way to think about this is that we want to continue training as long as the error value is decreasing. Correct? Let’s take a look at the sample verbose output from a network training in figure 4.24.
You can see that both training and validation errors are decreasing. This means the network is still learning. It doesn’t make sense to stop the training at this point. The network is clearly still making progress toward the minimum error. Let’s let it train for six more epochs and observe the results (figure 4.25).
It looks like the training error is doing well and still improving. That’s good. This means the network is improving on the training set. However, if you look at epochs 8 and 9, you will see that val_error started to oscillate and increase. Improving train_error while not improving val_error means the network is starting to overfit the training data and failing to generalize to the validation data.
Let’s plot the training and validation errors (figure 4.26). You can see that both the training and validation errors were improving at first, but then the validation error started to increase, leading to overfitting. We need to find a way to stop the training just before it starts to overfit. This technique is called early stopping.
Early stopping is an algorithm widely used to determine the right time to stop the training process before overfitting happens. It simply monitors the validation error value and stops the training when the value starts to increase. Here is the early stopping function in Keras:
EarlyStopping(monitor='val_loss', min_delta=0, patience=20)
The EarlyStopping function takes the following arguments:
monitor --The metric you monitor during training. Usually we want to keep an eye on val_loss because it represents our internal testing of model performance. If the network is doing well on the validation data, it will probably do well on test data and production.
min_delta --The minimum change that qualifies as an improvement. There is no standard value for this variable. To decide the min_delta value, run a few epochs and see the change in error and validation accuracy. Define min_delta according to the rate of change. The default value of 0 works pretty well in many cases.
patience --This variable tells the algorithm how many epochs it should wait before stopping the training if the error does not improve. For example, if we set patience equal to 1, the training will stop at the epoch where the error increases. We must be a little flexible, though, because it is very common for the error to oscillate a little and continue improving. We can stop the training if it hasn’t improved in the last 10 or 20 epochs.
TIP The good thing about early stopping is that it allows you to worry less about the epochs hyperparameter. You can set a high number of epochs and let the stopping algorithm take care of stopping the training when error stops improving.
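To see how min_delta and patience interact, here is a plain-Python sketch of the stopping logic. It illustrates the idea only and is not the Keras implementation; the function name early_stopping_epoch and the validation losses are made up for the example.

```python
def early_stopping_epoch(val_losses, min_delta=0.0, patience=2):
    """Return the epoch index at which training would stop, or None if it never stops."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best - min_delta:  # improved by more than min_delta
            best = val_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Hypothetical validation losses: improving at first, then drifting upward.
history = [1.00, 0.80, 0.70, 0.75, 0.72, 0.71, 0.90]
print(early_stopping_epoch(history, patience=1))
print(early_stopping_epoch(history, patience=2))
print(early_stopping_epoch(history, patience=10))
```

A larger patience tolerates short oscillations in the validation loss, at the cost of training for a few extra epochs before stopping.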
If you observe that your neural network is overfitting the training data, your network might be too complex and need to be simplified. One of the first techniques you should try is regularization. In this section, we will discuss three of the most common regularization techniques: L2, dropout, and data augmentation.
The basic idea of L2 regularization is that it penalizes the error function by adding a regularization term to it. This, in turn, reduces the weight values of the hidden units and makes them very small, close to zero, which helps simplify the model.
Let’s see how regularization works. First, we update the error function by adding the regularization term:
$\text{error function}_{new} = \text{error function}_{old} + \text{regularization term}$
Note that you can use any of the error functions explained in chapter 2, like MSE or cross entropy. Now, let’s take a look at the regularization term:
$\text{L2 regularization term} = \frac{\lambda}{2m} \sum \lVert w \rVert^2$
where lambda (λ) is the regularization parameter, m is the number of instances, and w is the weight. The updated error function looks like this:
$\text{error function}_{new} = \text{error function}_{old} + \frac{\lambda}{2m} \sum \lVert w \rVert^2$
Why does L2 regularization reduce overfitting? Well, let’s look at how the weights are updated during the backpropagation process. We learned from chapter 2 that the optimizer calculates the derivative of the error, multiplies it by the learning rate, and subtracts this value from the old weight. Here is the backpropagation equation that updates the weights:
$W_{new} = W_{old} - \alpha \left( \frac{\partial Error}{\partial W_x} \right)$
Here, α is the learning rate. Since we add the regularization term to the error function, the new error becomes larger than the old error. This means its derivative ($\partial Error / \partial W_x$) is also bigger, leading to a smaller $W_{new}$. L2 regularization is also known as weight decay, as it forces the weights to decay toward zero (but not exactly zero).
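The weight-decay effect can be made explicit with a short derivation. This is a sketch under two assumptions that go beyond the text: α denotes the learning rate, and the derivative is taken with respect to a single weight W, so the L2 term contributes (λ/m)W to the gradient.

```latex
\begin{aligned}
\frac{\partial}{\partial W}\left(\frac{\lambda}{2m} W^2\right) &= \frac{\lambda}{m} W \\
W_{new} &= W_{old} - \alpha\left(\frac{\partial\,\text{error function}_{old}}{\partial W} + \frac{\lambda}{m} W_{old}\right) \\
        &= \left(1 - \frac{\alpha\lambda}{m}\right) W_{old} - \alpha\,\frac{\partial\,\text{error function}_{old}}{\partial W}
\end{aligned}
```

Because the factor (1 - αλ/m) is slightly less than 1 for small positive α and λ, every update first shrinks the weight a little and then applies the usual gradient step, which is why the weights decay toward zero.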
This is what L2 regularization looks like in Keras:
model.add(Dense(units=16, kernel_regularizer=regularizers.l2(ƛ), activation='relu')) ❶
❶ When adding a hidden layer to your network, add the kernel_regularizer argument with the L2 regularizer
The lambda value is a hyperparameter that you can tune. The default value of your DL library usually works well. If you still see signs of overfitting, increase the lambda hyperparameter to reduce the model complexity.
Dropout is another regularization technique that is very effective for simplifying a neural network and avoiding overfitting. We discussed dropout extensively in chapter 3. The dropout algorithm is fairly simple: at every training iteration, every neuron has a probability p of being temporarily ignored (dropped out) during this training iteration. This means it may be active during subsequent iterations. While it is counterintuitive to intentionally pause the learning on some of the network neurons, it is quite surprising how well this technique works. The probability p is a hyperparameter that is called dropout rate and is typically set in the range of 0.3 to 0.5. Start with 0.3, and if you see signs of overfitting, increase the rate.
TIP I like to think of dropout as tossing a coin every morning with your team to decide who will do a specific critical task. After a few iterations, all your team members will learn how to do this task and not rely on a single member to get it done. The team would become much more resilient to change.
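Here is a minimal sketch of what dropout does to one layer's outputs during training. It assumes the commonly used "inverted dropout" variant, in which the surviving activations are scaled up by 1 / (1 - rate) so the expected sum stays the same; that scaling, the function name, and the sample values are assumptions not covered in the text.

```python
import random


def dropout(activations, rate, rng):
    """Zero each activation with probability `rate` and rescale the survivors."""
    keep_probability = 1.0 - rate
    output = []
    for a in activations:
        if rng.random() < rate:
            output.append(0.0)                   # neuron is ignored this iteration
        else:
            output.append(a / keep_probability)  # rescale to preserve the expected sum
    return output


rng = random.Random(0)  # fixed seed so the sketch is repeatable
layer_output = [0.5, 1.2, 0.3, 0.9, 0.7, 1.1]
for iteration in range(3):
    # A different random subset is dropped at every training iteration.
    print(dropout(layer_output, rate=0.3, rng=rng))
```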
Both L2 regularization and dropout aim to reduce network complexity by reducing its neurons’ effectiveness. The difference is that dropout completely cancels the effect of some neurons with every iteration, while L2 regularization just reduces the weight values to reduce the neurons’ effectiveness. Both lead to a more robust, resilient neural network and reduce overfitting. It is recommended that you use both types of regularization techniques in your network.
One way to avoid overfitting is to obtain more data. Since this is not always a feasible option, we can augment our training data by generating new instances of the same images with some transformations. Data augmentation can be an inexpensive way to give your learning algorithm more training data and therefore reduce overfitting.
The many image-augmentation techniques include flipping, rotation, scaling, zooming, lighting conditions, and many other transformations that you can apply to your dataset to provide a variety of images to train on. In figure 4.27, you can see some of the transformation techniques applied to an image of the digit 6.
In figure 4.27, we created 20 new images that the network can learn from. The main advantage of synthesizing images like this is that now you have more data (20×) that tells your algorithm that if an image is the digit 6, then even if you flip it vertically or horizontally or rotate it, it’s still the digit 6. This makes the model more robust to detect the number 6 in any form and shape.
Data augmentation is considered a regularization technique because allowing the network to see many variants of the object reduces its dependence on the original form of the object during feature learning. This makes the network more resilient when tested on new data.
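At their core, flips are just reorderings of the pixel grid. The following sketch uses nested Python lists and made-up pixel values to show the idea; real pipelines operate on image arrays through a library.

```python
def flip_horizontal(image):
    """Mirror an image left to right (reverse every row)."""
    return [row[::-1] for row in image]


def flip_vertical(image):
    """Mirror an image top to bottom (reverse the order of the rows)."""
    return image[::-1]


# A tiny 3 x 3 grayscale "image" with made-up pixel values.
image = [
    [0, 1, 2],
    [3, 4, 5],
    [6, 7, 8],
]
augmented = [image, flip_horizontal(image), flip_vertical(image)]
print(len(augmented), "training instances from 1 original")
```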
Data augmentation in Keras looks like this:
from keras.preprocessing.image import ImageDataGenerator ❶
datagen = ImageDataGenerator(horizontal_flip=True, vertical_flip=True) ❷
datagen.fit(training_set) ❸
❶ Imports ImageDataGenerator from Keras
❷ Generates batches of new image data. ImageDataGenerator takes transformation types as arguments. Here, we set horizontal and vertical flip to True. See the Keras documentation (or your DL library) for more transformation arguments.
❸ Computes the data augmentation on the training set
Earlier in this chapter, we talked about data normalization to speed up learning. The normalization techniques we discussed were focused on preprocessing the training set before feeding it to the input layer. If the input layer benefits from normalization, why not do the same thing for the extracted features in the hidden units, which are changing all the time, and gain even more improvement in training speed and network resilience (figure 4.28)? This process is called batch normalization (BN).
Before we define covariate shift, let’s take a look at an example to illustrate the problem that batch normalization (BN) confronts. Suppose you are building a cat classifier, and you train your algorithm on images of white cats only. When you test this classifier on images with cats that are different colors, it will not perform well. Why? Because the model has been trained on a training set with a specific distribution (white cats). When the distribution changes in the test set, it confuses the model (figure 4.29).
We should not expect that the model trained on the data in graph A will do very well with the new distribution in graph B. The idea of the change in data distribution goes by the fancy name covariate shift.
DEFINITION If a model is learning to map dataset x to label y, then if the distribution of x changes, it’s known as covariate shift. When that happens, you might need to retrain your learning algorithm.
To understand how covariate shift happens in neural networks, consider the simple four-layer MLP in figure 4.30. Let’s look at the network from the third-layer (L3) perspective. Its inputs are the activation values in L2 ($a_1^2$, $a_2^2$, $a_3^2$, and $a_4^2$), which are the features extracted from the previous layers. L3 is trying to map these inputs to ŷ to make it as close as possible to the label y. While the third layer is doing that, the network is adapting the values of the parameters from previous layers. As the parameters (w, b) are changing in layer 1, the activation values in the second layer are changing, too. So from the perspective of the third hidden layer, the values of the second hidden layer are changing all the time: the MLP is suffering from the problem of covariate shift. Batch norm reduces the degree of change in the distribution of the hidden unit values, causing these values to become more stable so that the later layers of the neural network have firmer ground to stand on.
NOTE It is important to realize that batch normalization does not cancel or reduce the change in the hidden unit values. What it does is ensure that the distribution of that change remains the same: even if the exact values of the units change, the mean and variance do not change.
In their 2015 paper “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift” (https://arxiv.org/abs/1502.03167), Sergey Ioffe and Christian Szegedy proposed the BN technique to reduce covariate shift. Batch normalization adds an operation in the neural network just before the activation function of each layer to do the following:
This operation lets the model learn the optimal scale and mean of the inputs for each layer.
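The operation described in the batch normalization paper has four steps: compute the mini-batch mean, compute the mini-batch variance, normalize, and then scale and shift with two learned parameters (γ and β). Here is a plain-Python sketch for a single hidden unit; the function name, the epsilon value of 1e-5, and the sample values are assumptions made for illustration.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, epsilon=1e-5):
    """Normalize one hidden unit's values over a mini-batch, then scale and shift."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    normalized = [(v - mean) / math.sqrt(variance + epsilon) for v in values]
    return [gamma * n + beta for n in normalized]


# Pre-activation values of a single unit across a mini-batch of four instances.
batch = [1.0, 2.0, 3.0, 4.0]
print(batch_norm(batch))                       # roughly zero mean, unit variance
print(batch_norm(batch, gamma=2.0, beta=0.5))  # rescaled and shifted by the learned parameters
```

In a real network, γ and β are updated by the optimizer along with the weights, and a running average of the mean and variance is kept for use at prediction time.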
It is important to know how batch normalization works so you can get a better understanding of what your code is doing. But when using BN in your network, you don’t have to implement all these details yourself. Implementing BN is often done by adding one line of code, using any DL framework. In Keras, the way you add batch normalization to your neural network is by adding a BN layer after the hidden layer, to normalize its results before they are fed to the next layer.
The following code snippet shows you how to add a BN layer when building your neural network:
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.layers.normalization import BatchNormalization ❶
model = Sequential() ❷
model.add(Dense(hidden_units, activation='relu')) ❸
model.add(BatchNormalization()) ❹
model.add(Dropout(0.5)) ❺
model.add(Dense(units, activation='relu')) ❻
model.add(BatchNormalization()) ❼
model.add(Dense(2, activation='softmax')) ❽
❶ Imports the BatchNormalization layer from the Keras library
❹ Adds the batch norm layer to normalize the results of layer 1
❺ If you are adding dropout to your network, it is preferable to add it after the batch norm layer because you don’t want the nodes that are randomly turned off to miss the normalization step.
❻ Adds the second hidden layer
❼ Adds the batch norm layer to normalize the results of layer 2
The intuition that I hope you’ll take away from this discussion is that BN applies the normalization process not just to the input layer, but also to the values in the hidden layers in a neural network. This weakens the coupling of the learning process between earlier and later layers, allowing each layer of the network to learn more independently.
From the perspective of the later layers in the network, the earlier layers don’t get to shift around as much because they are constrained to have the same mean and variance. This makes the job of learning easier in the later layers. The way this happens is by ensuring that the hidden units have a standardized distribution (mean and variance) controlled by two explicit parameters, γ and β, which the learning algorithm sets during training.
In this project, we will revisit the CIFAR-10 classification project from chapter 3 and apply some of the improvement techniques from this chapter to increase the accuracy from ~65% to ~90%. You can follow along with this example by visiting the book’s website, www.manning.com/books/deep-learning-for-vision-systems or www.computervisionbook.com, to see the code notebook.
We will accomplish the project by following these steps:
Build the model architecture. In addition to regular convolutional and pooling layers, as in chapter 3, we add the following layers to our architecture:
Let’s see how this is implemented.
Here’s the Keras code to import the needed dependencies:
import keras ❶
from keras.datasets import cifar10
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.utils import np_utils
from keras.layers import Dense, Activation, Flatten, Dropout, BatchNormalization, Conv2D, MaxPooling2D
from keras.callbacks import ModelCheckpoint
from keras import regularizers, optimizers
import numpy as np ❷
from matplotlib import pyplot ❸
❶ Keras library to download the datasets, preprocess images, and network components
❷ Imports numpy for math operations
❸ Imports the matplotlib library to visualize results
Keras has some datasets available for us to download and experiment with. These datasets are usually preprocessed and almost ready to be fed to the neural network. In this project, we use the CIFAR-10 dataset, which consists of 50,000 32 × 32 color training images, labeled over 10 categories, and 10,000 test images. Check the Keras documentation for more datasets like CIFAR-100, MNIST, Fashion-MNIST, and more.
Keras provides the CIFAR-10 dataset already split into training and testing sets. We will load them and then split the training dataset into 45,000 images for training and 5,000 images for validation, as explained in this chapter:
(x_train, y_train), (x_test, y_test) = cifar10.load_data() ❶
x_train = x_train.astype('float32') ❶
x_test = x_test.astype('float32') ❶
(x_train, x_valid) = x_train[5000:], x_train[:5000] ❷
(y_train, y_valid) = y_train[5000:], y_train[:5000] ❷
❶ Downloads and splits the data
❷ Breaks the training set into training and validation sets
Let’s print the shape of x_train, x_valid, and x_test:
print('x_train =', x_train.shape)
print('x_valid =', x_valid.shape)
print('x_test =', x_test.shape)
>> x_train = (45000, 32, 32, 3)
>> x_valid = (5000, 32, 32, 3)
>> x_test = (10000, 32, 32, 3)
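The format of the shape tuple is as follows: (number of instances, height, width, channels). Because CIFAR-10 images are square (32 × 32), the order of height and width makes no practical difference here.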
Normalizing the pixel values of our images is done by subtracting the mean from each pixel and then dividing the result by the standard deviation:
mean = np.mean(x_train, axis=(0,1,2,3))
std = np.std(x_train, axis=(0,1,2,3))
x_train = (x_train-mean)/(std+1e-7)
x_valid = (x_valid-mean)/(std+1e-7)
x_test = (x_test-mean)/(std+1e-7)
To one-hot encode the labels in the train, valid, and test datasets, we use the to_categorical function in Keras:
num_classes = 10
y_train = np_utils.to_categorical(y_train, num_classes)
y_valid = np_utils.to_categorical(y_valid, num_classes)
y_test = np_utils.to_categorical(y_test, num_classes)
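If one-hot encoding is new to you, this plain-Python sketch shows what happens to a single label. The helper name one_hot is hypothetical and only illustrates the idea; the Keras function above does this for the whole label array at once.

```python
def one_hot(label, num_classes):
    """Return a list with a 1 at index `label` and 0 everywhere else."""
    encoded = [0] * num_classes
    encoded[label] = 1
    return encoded


print(one_hot(3, 10))  # class index 3 out of 10 classes
```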
For augmentation techniques, we will arbitrarily go with the following transformations: rotation, width and height shift, and horizontal flip. When you are working on problems, view the images that the network missed or provided poor detections for and try to understand why it is not performing well on them. Then create your hypothesis and experiment with it. For example, if the missed images were of shapes that are rotated, you might want to try the rotation augmentation. You would apply that, experiment, evaluate, and repeat. You will come to your decisions purely from analyzing your data and understanding the network performance:
datagen = ImageDataGenerator( ❶
    rotation_range=15, width_shift_range=0.1, height_shift_range=0.1,
    horizontal_flip=True, vertical_flip=False)
datagen.fit(x_train) ❷
❷ Computes the data augmentation on the training set
In chapter 3, we built an architecture inspired by AlexNet (3 CONV + 2 FC). In this project, we will build a deeper network for increased learning capacity (6 CONV + 1 FC).
The network has the following configuration:
Instead of adding a pooling layer after each convolutional layer, we will add one after every two convolutional layers. This idea was inspired by VGGNet, a popular neural network architecture developed by the Visual Geometry Group (University of Oxford). VGGNet will be explained in chapter 5.
Inspired by VGGNet, we will set the kernel_size of our convolutional layers to 3 × 3 and the pool_size of the pooling layer to 2 × 2.
We will add a dropout layer after every second convolutional layer, with the dropout probability (p) ranging from 0.2 to 0.4.
A batch normalization layer will be added after each convolutional layer to normalize the input for the following layer.
In Keras, L2 regularization is added to the convolutional layer code.
base_hidden_units = 32 ❶
weight_decay = 1e-4 ❷
model = Sequential() ❸

# CONV1
model.add(Conv2D(base_hidden_units, kernel_size=3, padding='same', ❹
                 kernel_regularizer=regularizers.l2(weight_decay), ❺
                 input_shape=x_train.shape[1:]))
model.add(Activation('relu')) ❻
model.add(BatchNormalization()) ❼

# CONV2
model.add(Conv2D(base_hidden_units, kernel_size=3, padding='same',
                 kernel_regularizer=regularizers.l2(weight_decay)))
model.add(Activation('relu'))
model.add(BatchNormalization())

# POOL + Dropout
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.2)) ❽

# CONV3
model.add(Conv2D(base_hidden_units * 2, kernel_size=3, padding='same', ❾
                 kernel_regularizer=regularizers.l2(weight_decay)))
model.add(Activation('relu'))
model.add(BatchNormalization())

# CONV4
model.add(Conv2D(base_hidden_units * 2, kernel_size=3, padding='same',
                 kernel_regularizer=regularizers.l2(weight_decay)))
model.add(Activation('relu'))
model.add(BatchNormalization())

# POOL + Dropout
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.3))

# CONV5
model.add(Conv2D(base_hidden_units * 4, kernel_size=3, padding='same',
                 kernel_regularizer=regularizers.l2(weight_decay)))
model.add(Activation('relu'))
model.add(BatchNormalization())

# CONV6
model.add(Conv2D(base_hidden_units * 4, kernel_size=3, padding='same',
                 kernel_regularizer=regularizers.l2(weight_decay)))
model.add(Activation('relu'))
model.add(BatchNormalization())

# POOL + Dropout
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.4))

# FC7
model.add(Flatten()) ❿
model.add(Dense(10, activation='softmax')) ⓫
model.summary() ⓬
❶ Number of hidden units variable. We declare this variable here and use it in our convolutional layers to make it easier to update from one place.
❷ L2 regularization hyperparameter (λ)
❸ Creates a sequential model (a linear stack of layers)
❹ Notice that we define the input_shape here because this is the first convolutional layer. We don’t need to do that for the remaining layers.
❺ Adds L2 regularization to the convolutional layer
❻ Uses a ReLU activation function for all hidden layers
❼ Adds a batch normalization layer
❽ Dropout layer with 20% probability
❿ Flattens the feature map into a 1D features vector (explained in chapter 3)
⓫ 10 units in the output layer because the dataset has 10 class labels. The softmax activation function is used for the output layer (explained in chapter 2)
The model summary is shown in figure 4.31.
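The output shapes and parameter counts that a model summary reports can also be worked out by hand. The sketch below does this for the configuration described above, assuming a 32 × 32 × 3 input, 'same' padding (so convolutions keep the height and width), and four parameters per channel for each batch normalization layer (two learned, two running statistics). The helper conv_params and the variable names are hypothetical.

```python
def conv_params(in_channels, out_channels, kernel=3):
    """Kernel weights plus one bias per filter."""
    return (kernel * kernel * in_channels + 1) * out_channels


height, width, channels = 32, 32, 3   # input image shape
base_hidden_units = 32
total = 0
trace = []                            # feature-map shape after each pooling layer

for multiplier in (1, 2, 4):          # three blocks of CONV, CONV, POOL
    filters = base_hidden_units * multiplier
    for _ in range(2):
        total += conv_params(channels, filters)
        total += 4 * filters          # batch norm: gamma, beta, moving mean, moving variance
        channels = filters            # padding='same' keeps height and width
    height, width = height // 2, width // 2   # 2 x 2 max pooling halves each side
    trace.append((height, width, channels))

flattened = height * width * channels
total += (flattened + 1) * 10         # dense softmax layer with 10 units

print(trace)      # [(16, 16, 32), (8, 8, 64), (4, 4, 128)]
print(flattened)  # 2048
print(total)      # 309290
```

Dropout, pooling, activation, and flatten layers add no parameters, so they do not appear in the count. Most of the parameters sit in the last two convolutional layers, where the number of filters is largest.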
Before we jump into the training code, let’s discuss the strategy behind some of the hyperparameter settings:
batch_size --This is the mini-batch hyperparameter that we covered in this chapter. A larger batch_size processes more images per step, so each epoch finishes faster as long as the batch fits in memory. You can start with a mini-batch of 64 and double this value to speed up training. I tried 256 on my machine and got the following error, which means my machine was running out of memory. I then lowered it back to 128:
Resource exhausted: OOM when allocating tensor with shape[256,128,4,4]
epochs --I started with 50 epochs and found that the network was still improving. So I kept adding more epochs and observing the training results. In this project, I was able to achieve >90% accuracy after 125 epochs. As you will see soon, there is still room for improvement if you let it train longer.
Optimizer --I used the Adam optimizer. See section 4.7 to learn more about optimization algorithms.
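The effect of the batch size can be sketched with a little arithmetic. The snippet below assumes 45,000 training images and 32-bit floats; the helpers steps_per_epoch and tensor_megabytes are hypothetical. It shows that doubling the batch size roughly halves the number of weight updates per epoch while doubling the memory needed for a tensor of the shape named in the error message.

```python
train_size = 45000  # training images after the validation split


def steps_per_epoch(batch_size):
    """Number of full mini-batches in one pass over the training set."""
    return train_size // batch_size


def tensor_megabytes(shape, bytes_per_value=4):
    """Memory for one tensor of the given shape, assuming float32 values."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_value / (1024 ** 2)


for batch_size in (64, 128, 256):
    print(batch_size,
          steps_per_epoch(batch_size),
          tensor_megabytes((batch_size, 128, 4, 4)))
```

A single tensor of this shape is small. The out-of-memory error is raised when the allocation that fails is requested, after the weights, the activations of every earlier layer, and their gradients have already filled the device memory, and all of those activations grow with the batch size as well.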
NOTE It is important to note that I’m using a GPU for this experiment. The training took around 3 hours. It is recommended that you use your own GPU or any cloud computing service to get the best results. If you don’t have access to a GPU, I recommend that you try a smaller number of epochs or plan to leave your machine training overnight or even for a couple of days, depending on your CPU specifications.
batch_size = 128 ❶
epochs = 125 ❷
checkpointer = ModelCheckpoint(filepath='model.100epochs.hdf5', verbose=1,
                               save_best_only=True) ❸
optimizer = keras.optimizers.adam(lr=0.0001, decay=1e-6) ❹
model.compile(loss='categorical_crossentropy', optimizer=optimizer,
              metrics=['accuracy']) ❺
history = model.fit_generator(datagen.flow(x_train, y_train, batch_size=batch_size),
                              callbacks=[checkpointer],
                              steps_per_epoch=x_train.shape[0] // batch_size,
                              epochs=epochs, verbose=2,
                              validation_data=(x_valid, y_valid)) ❻
❷ Number of training epochs
❸ Path of the file where the best weights will be saved, and a Boolean True to save the weights only when there is an improvement
❹ Adam optimizer with a learning rate = 0.0001
❺ Cross-entropy loss function (explained in chapter 2)
❻ Allows you to do real-time data augmentation on images on CPU in parallel to training your model on GPU. The callback to the checkpointer saves the model weights; you can add other callbacks like an early stopping function.
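The decay argument passed to the optimizer shrinks the learning rate a little after every weight update. The sketch below assumes the time-based schedule used by the older Keras optimizers, in which the rate at a given iteration is the initial rate divided by (1 + decay × iterations); the function decayed_lr is a hypothetical helper, and you should check the documentation of your Keras version before relying on this formula.

```python
def decayed_lr(initial_lr, decay, iterations):
    """Time-based decay: the rate shrinks as the iteration count grows."""
    return initial_lr / (1.0 + decay * iterations)


steps_per_epoch = 45000 // 128             # weight updates in one epoch
total_iterations = steps_per_epoch * 125   # updates over the whole run
final_lr = decayed_lr(0.0001, 1e-6, total_iterations)

print(total_iterations)  # 43875
print(final_lr)
```

Under this assumption the learning rate drops by only about 4% over the 125 epochs, so with these settings the decay is a gentle adjustment rather than a major factor in the training dynamics.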
When you run this code, you will see the verbose output of the network training for each epoch. Keep your eyes on the loss and val_loss values to analyze the network and diagnose bottlenecks. Figure 4.32 shows the verbose output of epochs 121 to 125.
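Watching val_loss by eye can also be automated. The sketch below shows the logic behind saving only the best weights and behind an early stopping rule: remember the epoch with the lowest validation loss, and stop once it has not improved for a set number of epochs. The function best_epoch_and_stop, the patience value, and the loss values are all hypothetical; this illustrates the idea and is not the Keras callback code.

```python
def best_epoch_and_stop(val_losses, patience=10):
    """Return (best_epoch, stop_epoch) using 1-based epoch numbers.

    stop_epoch is None when the patience budget is never exhausted.
    """
    best_loss = float('inf')
    best_epoch = None
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Improvement: this is where a checkpoint would be saved.
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, None


val_losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65]  # made-up values
print(best_epoch_and_stop(val_losses, patience=3))   # (3, 6)
print(best_epoch_and_stop(val_losses, patience=10))  # (3, None)
```

A small patience stops training quickly but may give up during a temporary plateau; a larger one is safer when the validation loss is noisy.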
To evaluate the model, we use a Keras function called evaluate and print the results:
scores = model.evaluate(x_test, y_test, batch_size=128, verbose=1)
print('\nTest result: %.3f loss: %.3f' % (scores[1]*100, scores[0]))
>> Test result: 90.260 loss: 0.398
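The accuracy figure reported here is simply the fraction of test images whose highest-probability class matches the true label. The following sketch computes it by hand for four made-up softmax outputs over three classes; the function accuracy and the sample values are hypothetical.

```python
def accuracy(probabilities, one_hot_labels):
    """Fraction of samples whose arg-max prediction matches the label."""
    correct = 0
    for probs, target in zip(probabilities, one_hot_labels):
        predicted = probs.index(max(probs))   # class with the highest probability
        actual = target.index(max(target))    # position of the 1 in the one-hot label
        correct += predicted == actual
    return correct / len(one_hot_labels)


probabilities = [[0.7, 0.2, 0.1],
                 [0.1, 0.3, 0.6],
                 [0.5, 0.4, 0.1],
                 [0.2, 0.2, 0.6]]
one_hot_labels = [[1, 0, 0],
                  [0, 0, 1],
                  [0, 1, 0],
                  [0, 0, 1]]

print(accuracy(probabilities, one_hot_labels))  # 0.75
```

The third sample is the only mistake: the network assigns 0.5 to class 0, but the label is class 1. Looking at such individual misses is a good starting point for the error analysis discussed earlier.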
Plot the learning curves to analyze the training performance and diagnose overfitting and underfitting (figure 4.33):
pyplot.plot(history.history['acc'], label='train')
pyplot.plot(history.history['val_acc'], label='test')
pyplot.legend()
pyplot.show()
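Reading the curves comes down to two questions: is the training accuracy high enough, and how far behind is the validation accuracy? The sketch below turns that reasoning into a rough check. The function diagnose and its thresholds (a 90% target and a 5-point gap) are arbitrary assumptions chosen for illustration; pick values that suit your own problem.

```python
def diagnose(train_acc, val_acc, target=0.90, max_gap=0.05):
    """Rough reading of the final points of two accuracy curves."""
    final_train, final_val = train_acc[-1], val_acc[-1]
    if final_train < target and final_val < target:
        return 'underfitting'   # the model does poorly even on data it has seen
    if final_train - final_val > max_gap:
        return 'overfitting'    # good on training data, noticeably worse on new data
    return 'good fit'


print(diagnose([0.60, 0.70], [0.58, 0.68]))  # underfitting
print(diagnose([0.90, 0.99], [0.85, 0.86]))  # overfitting
print(diagnose([0.80, 0.93], [0.80, 0.91]))  # good fit
```

In practice you would pass history.history['acc'] and history.history['val_acc'] to a function like this, and you would still look at the full curves, because the trend over the last epochs tells you whether more training is likely to help.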
Accuracy of 90% is pretty good, but you can still improve further. Here are some ideas you can experiment with:
More training epochs --Notice that the network was improving until epoch 123. You can increase the number of epochs to 150 or 200 and let the network train longer.
Deeper network --Try adding more layers to increase the model complexity, which increases the learning capacity.
Lower learning rate --Decrease the lr (you should train longer if you do so).
Different CNN architecture --Try something like Inception or ResNet (explained in detail in the next chapter). You can get up to 95% accuracy with the ResNet neural network after 200 epochs of training.
Transfer learning --In chapter 6, we will explore the technique of using a pretrained network on your dataset to get higher results with a fraction of the learning time.
A general rule of thumb is that a deeper network has more learning capacity, although it also needs enough data, regularization, and training time to avoid overfitting.
At the time of writing, ReLU performs best in hidden layers, and softmax performs best in the output layer.
Stochastic gradient descent usually succeeds in finding a minimum. But if you need fast convergence and are training a complex neural network, it’s safe to go with Adam.
L2 regularization and dropout work well together to reduce network complexity and overfitting.
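As a reminder of what these two regularizers do numerically, here is a minimal sketch. The L2 term adds the weight-decay factor times the sum of squared weights to the loss, and dropout (in its common "inverted" form, assumed here) zeroes each activation with probability equal to the dropout rate and rescales the survivors so that the expected value stays the same. The helpers l2_penalty and dropout and the sample values are hypothetical.

```python
import random


def l2_penalty(weights, weight_decay):
    """Extra loss term: weight_decay times the sum of squared weights."""
    return weight_decay * sum(w * w for w in weights)


def dropout(activations, rate, rng):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]


weights = [0.5, -1.0, 2.0]
penalty = l2_penalty(weights, weight_decay=1e-4)   # 1e-4 * (0.25 + 1.0 + 4.0)

rng = random.Random(0)                             # seeded for repeatability
dropped = dropout([1.0] * 10, rate=0.2, rng=rng)   # each value is either 0.0 or 1.25
```

The L2 term discourages large weights, while dropout prevents the network from relying on any single unit. Dropout is applied only during training; at test time all units are kept.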