CHAPTER 4
Deep Learning Using Keras

In Chapter 2, you learned about Machine Learning algorithms and techniques. You saw code samples of how to build ML models and evaluate your models using metrics of precision and recall. These models were pretty straightforward to understand with some clever ways of capturing patterns in the data. This chapter gets to the much more complex types of learning models. These models have many learning units organized in layers and many such layers—making the architecture “deep.” Though they are complex to build and train, you will see how effective they are at handling big and complex unstructured data like images. Finally, you will use one of the most popular Deep Learning libraries today—called Keras—to build models that can classify images of handwritten digits and learn to label these digits. I am hoping that these simple examples trigger some big ideas in your mind. You can reuse this code to apply learning to your images to build deep models in your domain area.

Handling Unstructured Data

We saw data used in earlier problems like wine quality analysis. Here each column had a particular significance and meaning. We used the term feature to describe each column and this was an important part of our learning method to understand how these features are correlated. We used techniques like normalization to scale the features so they were on the same value scale. Also, we saw that we could use fewer features to make our models learn faster. In short, we needed to know what our features were and our model captured a pattern between them. This was all structured data.

Now let's imagine an image. When a computer reads an image, it is normally captured by a digital camera or a scanner and stored in digital form in computer memory. When we take a photo with a digital camera, our camera has an optical sensor that captures light from a scene, renders this inside our camera, and saves the image as a series of numbers—basically a large sequence of 0s and 1s. In raw form a two‐dimensional image is basically a matrix or array of pixel values. Here each pixel value represents intensity of a particular color. However, it does not have a human‐readable value like wine alcohol percentage or quality rating. This data is usually referred to as unstructured. The individual values have less significance but as a whole they complement each other and form a bigger domain object like an image. The same goes for audio, video, and text data. You can see more examples of unstructured data and some basic steps to cleanse and extract information from them in Chapter 3. To analyze unstructured data like this, we need much more complex ML models with many learning units, known as neural networks.

Neural Networks

For complex and unstructured data, we build deeper models that use a combination of smaller individual learning units to form a bigger network. This network of learning units can learn complex patterns from a large number of features. This is called a neural network.

A common analogy that is used to represent this is the human brain, which contains a network of biological cells called neurons that are connected via axons and dendrites. If you recall from your biology textbooks, we have signals flowing into neurons through dendrites and processed outputs flowing to other neurons or muscles through axons. In fact, neural networks are heavily inspired by the structure of the human brain. Figure 4.1 shows a simple representation.

Diagram of biological neurons in the human brain that are connected via axons and dendrites, with an inset displaying the neuron cells.

Figure 4.1: Biological neurons in the human brain

(Source: OpenStax college ‐ Wikipedia)

Similar to the human brain, these artificial neural networks contain processing units called neurons and connections between them. These networks are structured into layers, with each layer extracting valuable information from the data that is fed to it. These are Deep Learning networks and they have many layers of learning. They try to map the input space to a set of possible outcomes or classes. Let's look at a very simple neural network, shown in Figure 4.2. Let's understand some basics and then we will get into building more complex networks.

Illustration of a simple neural network where learning units A1, A2, and A3 are connected as a network, with the flow of information depicted by arrows.

Figure 4.2: Simple neural network where learning units are connected as a network

This neural network has two inputs in the first layer, the input layer. The second layer of neurons is the hidden layer, with three neurons. The final layer is the output layer, with one neuron.

Each neuron is a learning unit that may receive inputs from other neurons, perform some calculations, and send output to other neurons. The flow of information is shown by the arrows in Figure 4.2. Now let's look closer at what happens at the neuron stage.

Let's revisit our diagram from the logistic regression function in Chapter 2. We first calculated a weighted sum of inputs (Z1) and then applied a function to return a value A1 between 0 and 1. This result can also be called an activation. This is exactly what is happening at each neuron in the neural network. You have inputs coming into each neuron, activations being calculated using weights, and an activation function—and these activations feed into the next neuron in the network. This is how neural networks work. See Figure 4.3.

Illustration of processing at an individual neuron with inputs coming into each neuron, and activations being calculated and fed into the next neuron.

Figure 4.3: Processing at an individual neuron
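To make the processing at a single neuron concrete, here is a minimal sketch in plain Python. It assumes a Sigmoid activation, as in the logistic regression unit, and the input and weight values are made up purely for illustration.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_activation(inputs, weights, bias=0.0):
    """Compute the weighted sum of the inputs (Z), then the activation (A)."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)


# Illustrative values only: two inputs and two made-up weights.
a1 = neuron_activation(inputs=[0.5, -1.0], weights=[0.8, 0.2])
print(f"activation A1 = {a1:.4f}")
```

The value returned here is what would be passed on as an input to the neurons in the next layer.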

This representation is often called a computational graph or dataflow. You have the nodes as circles where inputs are entered or some computation is done. The edges of this graph represent weights. You can look at this as a flow of data through the edges between nodes, with each node adding some processing to the data. Let's get back to our simple neural network in Figure 4.2.

A neural network has many neurons or learning units organized into layers. Inputs flow into each network layer and calculated outputs or activations move to the next layer. If we have a small number of layers, typically two or three in a network, we call them shallow networks. These take less processing time and can quickly calculate results. However, they cannot learn complex patterns, especially with unstructured data. The basic Machine Learning models we saw in Chapter 2 typically have two layers—one input and one output layer like linear and logistic regressions. These are the shallow learning models. We can find out what's going on inside them and they can be quickly trained—in a few milliseconds.

When we have to capture complex and non‐linear patterns in the data that would not be possible by simpler shallow learning models, we need models with many layers called Deep Learning models. Deep Learning models learn in stages, or layers, with each layer extracting some pattern that is fed into the next layer.

For example, if you are learning to detect faces in an image, your deep network will take as input the pixel array of the image. Then, during the first stage, it may learn to detect lines and curves. Next, it will combine these to form figures like rectangles and circles. Finally, it will combine these to recognize any pattern of faces. This is the power that deep neural networks give us. They learn complex patterns in the data and capture non‐linear relations extremely well.

The last layer of the neural network is the output layer, and the number of neurons there corresponds to the number of outputs we want to learn. If we just want to make a buy/don't buy prediction based on housing variables, then our network will have a single neuron in the output layer, with its value determining the buy decision. If we also want to predict another variable, we can add that as a second neuron in the output layer, provided the training data we supply includes this new output. That's it: no special consideration is needed, and we can use the same network to predict two outputs instead of one.

The key difference between deep neural networks and the other ML algorithms we saw in Chapter 2 is that deep networks learn important features of the data on their own. We configure the inputs and outputs we seek, decide on the number of layers and the number of neurons in each, and build a good training dataset. The network learns all the complex patterns in the data and establishes correlations between inputs and outputs. Basically, it maps the input space (Xs) to the output space (Ys). Hence, neural networks are often referred to as black boxes, because they don't really tell us how they find these relations; they only predict outputs by internally capturing these relationships.

Because these networks are complex, we often analyze them by considering individual layers. Let's look at the neural network shown in Figure 4.2 again. We have an input layer with two neurons representing the input Xs. There is one hidden layer with three neurons and an output layer with one neuron, the Ys.

Now this was just one example of a neural network and a pretty simple one at that. We only have layers where every neuron in the layer is connected to every neuron in the subsequent layer. Such a layer of neurons is called a fully‐connected or dense layer. In a dense layer, every neuron learns features by considering the output generated by every neuron in the previous layer. Hence, these layers tend to be memory consuming. In practice you will find these layers at the end of a deep network, where they learn from features extracted in earlier layers and make predictions. The earlier layers in a network may have more local connections to extract features. We look at some of these feature‐extracting layers in Chapter 5, which discusses advanced Deep Learning.

The hidden and output layer neurons do exactly what we discussed earlier for a logistic regression unit. They get a weighted sum of inputs and apply an activation function. At each neuron these calculations are done and the results are fed forward to the next layer of neurons. This architecture, with all neurons feeding their output forward, is called a feed‐forward architecture. At each layer, many calculations occur that are handled in parallel using multi‐dimensional arrays. We will not go into detail about the equations, but it will help to get an understanding of weights in layers. Let's populate a few weights into our diagram, as shown in Figure 4.4.

Illustration of a neural network with weight values depicting an input layer, middle layer, and output layer.

Figure 4.4: Neural network with weight values

You will see different conventions used in books and articles. Here, the weight W1‐32 belongs to layer 1 and connects neuron 3 in one layer to neuron 2 in the next layer. Between the input layer and the hidden layer, we have (2 × 3), which is six weights. Then, between the hidden and output layers, we have (3 × 1), which is three weights. For this simple network, we have 6 + 3, which is nine weights. These are the weights that need to be “learned” during the training process.

Now let's add a special type of neuron to this network called a bias neuron. We saw the importance of bias in Chapter 2 in the linear regression equation. Bias helps the network learn certain assumptions about the data so that it does not depend only on the variables for generating results. Suppose all the inputs to our network are zero. Then the weighted sum at every neuron will always be zero, regardless of the weights, because the network has no bias that influences its values in the absence of inputs.

All right, so let's add bias neurons to this network. Bias neurons don't do any calculations. Each one simply outputs a constant value of +1 and, just like other neurons, has weights associated with it. See Figure 4.5. We will use the letter B to denote these weights.

Illustration of a neural network with weight and bias values depicting an input layer, middle layer, and output layer.

Figure 4.5: Neural network with weight and bias values

The convention used here is B2‐1, which is the weight associated with a bias neuron from the first layer to neuron 1 from the next layer. We have added bias neurons to the input and hidden layers. Adding to the output does not make sense since we don't calculate anything from this layer.

So, the total number of bias weights will be 3 + 1, which is four. Adding this to the previous nine weights, we have a total of 13 weights that this network has to learn. Now let's talk about the Activation functions.
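The effect of the bias weights can be seen in a small feed‐forward sketch. It assumes layer sizes of 2, 3, and 1, which is what the counts above imply ((2 × 3) + (3 × 1) = 9 weights and 3 + 1 = 4 bias weights), a Sigmoid activation, and weight values that are entirely made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense_layer(inputs, weights, biases):
    """One fully-connected layer.

    weights[j][i] is the weight from input i to neuron j of this layer;
    biases[j] is the weight on the constant +1 bias neuron for neuron j.
    Returns the weighted sums (Z) and the activations (A).
    """
    sums = [
        sum(w * x for w, x in zip(neuron_weights, inputs)) + b
        for neuron_weights, b in zip(weights, biases)
    ]
    return sums, [sigmoid(z) for z in sums]


# Hypothetical weights for a 2-3-1 network (2 inputs, 3 hidden, 1 output).
hidden_w = [[0.1, 0.4], [-0.3, 0.2], [0.5, -0.6]]  # 2 x 3 = 6 weights
hidden_b = [0.05, -0.10, 0.20]                     # 3 bias weights
output_w = [[0.7, -0.2, 0.3]]                      # 3 x 1 = 3 weights
output_b = [0.1]                                   # 1 bias weight

zero_inputs = [0.0, 0.0]

# Without biases, every hidden weighted sum is zero whatever the weights are.
z_no_bias, _ = dense_layer(zero_inputs, hidden_w, [0.0, 0.0, 0.0])

# With biases, the weighted sums equal the bias weights instead.
z_with_bias, hidden_a = dense_layer(zero_inputs, hidden_w, hidden_b)
_, output_a = dense_layer(hidden_a, output_w, output_b)

print("hidden sums without bias:", z_no_bias)
print("hidden sums with bias:   ", z_with_bias)
print("network output:          ", output_a)
```

With all inputs at zero, only the bias weights give the network something to adjust, which is why they are learned alongside the other weights.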

Activation functions take the weighted sum of inputs at each neuron and apply some non‐linear function to them. Popular functions used are Tanh, Sigmoid, and ReLU (Rectified Linear Units). Each function helps in thresholding the output value based on the weighted sum of inputs. Usually you apply the same Activation function to a particular layer. So, every neuron in that layer uses the same Activation function. The “References” section at the end of the book includes a reference to material, with details on each Activation function. But as a rule of thumb, Tanh and ReLU are used for hidden layers and ReLU is more common. Sigmoid is mainly used for output layer neurons. Sigmoid produces a value between 0 and 1 so it can be used for classification problems.
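The three functions named above are short enough to write out directly. This is only an illustrative sketch in plain Python; Deep Learning frameworks ship their own optimized versions.

```python
import math


def sigmoid(z):
    """Output between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    """Output between -1 and 1."""
    return math.tanh(z)


def relu(z):
    """Rectified Linear Unit: zero for negative sums, unchanged otherwise."""
    return max(0.0, z)


# Compare the three functions on a few example weighted sums.
for z in (-2.0, 0.0, 2.0):
    print(f"z={z:+.1f}  sigmoid={sigmoid(z):.3f}  "
          f"tanh={tanh(z):+.3f}  relu={relu(z):.1f}")
```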

Back‐Propagation and Gradient Descent

Let's see how neural networks are trained. I will avoid fancy formulas so you can understand the concepts clearly and then we will see a code example.

In Chapter 2, we saw an overview of the gradient descent method to calculate gradients and optimize weights as part of the learning process. We follow the same process for training a neural network, but at a network level. The most popular algorithm used for training neural networks is called back‐propagation. Back‐propagation is essentially a smart way of calculating the partial derivatives (gradients) of the Cost function with respect to the different weights.

The idea is to have a Cost function similar to mean squared error (MSE) that we saw during regression model training. Then we adjust the weights using Gradient Descent so as to minimize this Cost function. To do this, we calculate the partial derivative of the Cost function with respect to each weight value. Then, based on the error term, we use this partial derivative to find the magnitude and direction of the change in weight and apply the change. After every iteration, the Cost function is calculated and the weights are updated. See Figure 4.6.

Graph depicting how a gradient descent moves toward the minima; the Cost function is calculated and the weights are updated after every iteration.

Figure 4.6: How Gradient Descent moves toward the minima
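Gradient Descent is easiest to follow with a single weight. The sketch below assumes a toy model y = w × x, an MSE Cost function, and a learning rate picked for this toy data; none of these values come from the chapter's examples.

```python
# Toy data generated from y = 2x, so the ideal weight is 2.0.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def mse_cost(w):
    """Mean squared error of the predictions w * x against the targets."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def mse_gradient(w):
    """Partial derivative of the MSE cost with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.0               # starting weight
learning_rate = 0.05  # step size, a hyper-parameter chosen for this toy case
costs = []
for _ in range(50):
    costs.append(mse_cost(w))
    # Move against the gradient: its sign gives the direction of the change
    # and its size (times the learning rate) gives the magnitude.
    w -= learning_rate * mse_gradient(w)

print(f"learned weight: {w:.4f}")
print(f"cost went from {costs[0]:.2f} to {costs[-1]:.6f}")
```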

Before starting the training, we must establish a cleansed and normalized training dataset. This should contain data points with all our input features (Xs) and corresponding expected outputs (Ys). This will be treated as something referred to in the ML community as the ground truth. The model we train will try to learn patterns so it can be good enough to generate results as good as the ground truth. In other words, the ground truth is the standard that our ML model will aim to achieve.

Establishing the ground truth and defining a good training and testing set is a general starting point in the Machine Learning project lifecycle. We cover the ML lifecycle in detail in Chapter 9. For now, you can assume that we have the data for the Xs and Ys properly cleansed and available and we can use it as‐is for training.

Here are the general steps involved in back‐propagation training of neural networks:

  1. Initialize the weight values to small random numbers. (If every weight starts at zero, all the neurons in a layer compute the same thing and receive the same updates, so they cannot learn different features.) Run all the data points in consideration through the network and predict the Y for each X in the dataset.
  2. Compare each predicted Y value with the expected outcome, the ground truth. Based on the difference in values, calculate the error for each output term.
  3. Establish a Cost function, which is basically a function of all the weights in the network, including the weights at every layer and the bias weights. The Cost function gives us a metric of how far our model predictions are from the ground truth. The choice of Cost function is very important in training a good ML model.
  4. The Cost function may be the same as the mean absolute error (MAE) or mean squared error (MSE) that we used earlier. These are ideal when we are predicting a value, because they directly show how far away our predictions are from real values. If we are predicting a class, then the preferred Cost function is cross‐entropy. Cross‐entropy measures the information lost when the predicted class probabilities differ from the true class, and hence it captures the classification loss better.

    The objective of having a Cost function is to establish a relationship between the weights and the error in the prediction. So, as we tune the weights, the error reduces and we get an accurate model. (I provide a link to an article explaining different cost functions in detail in the “References” section at end of the book.)

  5. As the name suggests, we back‐propagate the error calculated through the network. As we go back from the outputs to the inputs, we update the weights using gradients of the Cost function with respect to the corresponding weight. This is the same gradient descent algorithm we saw in Figure 4.6, but we are now applying it to the whole network.

    We establish an overall Cost function of the network and, using partial derivatives, we calculate the gradient of this Cost function with respect to each weight. Now we start from the last layer and calculate the gradients from the errors between predicted values and ground truth. We back‐propagate this error from last to first layer and thus calculate gradients for each neuron in every layer. The gradients are then used to adjust the weight values at each neuron connection in a layer. The details of this algorithm are excellently explained by Andrew Ng in his video from the ML class. I include a link to it in the “References” section at end of the book.

  6. With all the weights adjusted, we run all the data points again and find the Error term. We iterate through this process until a set number of iterations is completed or until we achieve an acceptable Error value.

We will see this process in action with an example in Keras shortly.
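Before moving to Keras, the steps above can be sketched in plain Python for a tiny network. This is an illustration under stated assumptions, not how a framework implements it: one hidden layer, Sigmoid activations, an MSE Cost function, and a four‐row toy dataset (the logical OR of two inputs) that is not part of this chapter's examples.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, params):
    """Feed one data point forward; return hidden activations and prediction."""
    w_hidden, b_hidden, w_out, b_out = params
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
        for ws, b in zip(w_hidden, b_hidden)
    ]
    y_hat = sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)
    return hidden, y_hat


def cost(data, params):
    """Steps 2-4: mean squared error between predictions and ground truth."""
    return sum((forward(x, params)[1] - y) ** 2 for x, y in data) / len(data)


def train(data, n_hidden=3, epochs=2000, learning_rate=0.5, seed=0):
    rng = random.Random(seed)
    n_in = len(data[0][0])
    # Step 1: initialize the weights to small random numbers.
    w_hidden = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    b_hidden = [0.0] * n_hidden
    w_out = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    b_out = 0.0
    history = []
    for _ in range(epochs):
        params = (w_hidden, b_hidden, w_out, b_out)
        history.append(cost(data, params))
        g_w_hidden = [[0.0] * n_in for _ in range(n_hidden)]
        g_b_hidden = [0.0] * n_hidden
        g_w_out = [0.0] * n_hidden
        g_b_out = 0.0
        for x, y in data:
            hidden, y_hat = forward(x, params)
            # Step 5: gradient at the output, then propagated back one layer.
            delta_out = 2 * (y_hat - y) * y_hat * (1 - y_hat) / len(data)
            g_b_out += delta_out
            for j in range(n_hidden):
                g_w_out[j] += delta_out * hidden[j]
                delta_hidden = delta_out * w_out[j] * hidden[j] * (1 - hidden[j])
                g_b_hidden[j] += delta_hidden
                for i in range(n_in):
                    g_w_hidden[j][i] += delta_hidden * x[i]
        # Adjust every weight against its gradient (one update per epoch).
        for j in range(n_hidden):
            w_out[j] -= learning_rate * g_w_out[j]
            b_hidden[j] -= learning_rate * g_b_hidden[j]
            for i in range(n_in):
                w_hidden[j][i] -= learning_rate * g_w_hidden[j][i]
        b_out -= learning_rate * g_b_out
    # Step 6 ends here: a fixed number of iterations has been completed.
    params = (w_hidden, b_hidden, w_out, b_out)
    history.append(cost(data, params))
    return params, history


or_data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
params, history = train(or_data)
print(f"cost before training: {history[0]:.4f}, after: {history[-1]:.4f}")
```

Keras hides all of this bookkeeping: you describe the layers and the Cost function, and the framework works out the gradients for you.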

Batch vs. Stochastic Gradient Descent

Gradient Descent applied to neural networks may be of the batch or stochastic type. We run back‐propagation using training data points and adjust the weights. One pass of all the training data points through the entire network is known as an epoch. In Batch Gradient Descent, we wait for a whole epoch to pass, so our network has seen all the training data, and only then adjust the weights. Each update therefore takes a long time and requires the results for the whole dataset to be held in memory, but it is based on the complete training data and so moves steadily toward the minima. Usually, if we have a limited training dataset and lots of memory, we follow this approach.

The other type of gradient descent is stochastic gradient descent (SGD), where we adjust weights after the pass of every data point through the network. Here, we don't store much data in memory and rapidly update weights. This is a very fast method, but we see fluctuations in the training because we tend to overshoot the local minima. This approach works when we have limited memory and a large training dataset to work with. We learn at every data point and update our model. The problem with this approach is that it's possible to get “lost” and move away from the minima, especially when we have some bad data points, which may cause major errors. Hence, in practice, we use a compromise of these two approaches.

That compromise is called mini‐batch gradient descent. Here, we divide our data into smaller batches and update weights after each batch passes through the network. This is found to be a better approach to training neural networks. At each iteration of training, a smaller subset of the training data is loaded in memory to calculate errors and back‐propagate that error to get gradients. Hence, the memory (RAM) used by the algorithm is less, as compared to loading the whole training dataset in memory. Also, we don't need to wait until all the training data is processed to see results. We usually see a faster convergence to the minima using this approach, which can reduce training time considerably.
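The three variants differ only in how many data points are used for each weight update. A minimal sketch of the batching itself, with a hypothetical helper name and a stand‐in dataset of ten points, might look like this:

```python
import random


def iterate_minibatches(data, batch_size, seed=None):
    """Yield the data in shuffled chunks of at most batch_size items."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]


data = list(range(10))  # stand-in for 10 training data points

# batch_size == len(data) -> batch gradient descent (1 update per epoch)
# batch_size == 1         -> stochastic gradient descent (10 updates per epoch)
# anything in between     -> mini-batch gradient descent
for batch_size in (len(data), 1, 4):
    batches = list(iterate_minibatches(data, batch_size, seed=42))
    print(f"batch_size={batch_size:2d}: {len(batches)} weight updates per epoch")
```

Only one batch needs to be in memory at a time, which is where the memory saving comes from.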

Neural Network Architectures

This particular neural network in Figure 4.4, where we have all layers as dense or fully‐connected, is known as a multi‐layered perceptron (MLP). It is very good at learning patterns and has several applications especially around finding non‐linear patterns in structured data. If you can have your data as a single‐dimensional vector, then the MLP can quickly learn patterns due to its fully connected nature and make predictions. It works really well for structured data. We can apply MLP to unstructured data like images, but with some modification. Since the MLP handles data in single‐dimensional layers, we have to convert the three‐dimensional image data into a large flattened vector and feed it to the MLP. Hence, all the valuable spatial information that is stored in the three dimensions is lost and we have one large vector. There are some other deep architectures that extend the idea of MLP and are more suitable for specific types of unstructured data, which we discuss in Chapter 5. Before getting into their details, let's first build an example of an MLP neural network.
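Flattening is simple to picture with a very small example. The sketch below assumes a two‐dimensional grayscale image stored as a list of pixel rows; the 3×3 "image" is made up for illustration.

```python
def flatten(image):
    """Turn a 2D grid of pixel rows into one long 1D list, row by row."""
    return [pixel for row in image for pixel in row]


# A tiny 3x3 grayscale "image" containing a plus sign.
image = [
    [0, 255, 0],
    [255, 255, 255],
    [0, 255, 0],
]

vector = flatten(image)
print(vector)
```

The top and centre pixels of the plus sign are vertical neighbours in the image, yet they sit at positions 1 and 4 of the flattened vector. That loss of neighbourhood information is the spatial cost described above.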

Welcome to TensorFlow and Keras

We can use libraries like Scikit‐Learn to build a neural network, but for complex networks you will find many limitations, particularly around performance. When you have to do massive parallel processing for a network, it makes sense to look at a Deep Learning framework like TensorFlow or PyTorch. These frameworks let us build computational or dataflow graphs that capture the architecture of our neural network. These graphs can then be scheduled to run in parallel on specialized hardware like CPU clusters or GPUs, to train much faster than they would on a normal CPU machine. The frameworks have their own dedicated runtime, which could run on a CPU or GPU cluster. They expose APIs in common languages like Java, Python, and C++, through which software applications can build, train, and run DL models.

TensorFlow is a framework developed by Google and is available as open source. Google has an active and agile team developing and maintaining TensorFlow and they release new versions every three to four months. Google internally also heavily uses TensorFlow for all sorts of image, video, text, and audio Deep Learning use cases.

Keras is a high‐level API layer in Python written on the TensorFlow framework. It was developed by François Chollet, who now works at Google. With Keras you don't have to get into the nitty‐gritties of defining the computational graphs. You focus on building the layers and defining configuration parameters like the type of layers, number of neurons, connections, etc. Keras internally handles building the computational graph for you.

PyTorch is a similar framework developed and maintained by Facebook. Its tensor API closely resembles NumPy (Numerical Python), a powerful library for math processing in Python, and works together with it. PyTorch also defines computational graphs and has some of the simplicity of Keras built‐in. Which framework you choose is largely a matter of personal preference and of the time you have spent with each one. You should be able to build and run your deep architectures on either TensorFlow or PyTorch.

Let's use Keras to build the MLP neural network. First, we will load a sample dataset that is provided with Keras called MNIST. This is a standard dataset for studying Machine Learning algorithms. It comes with defined training and test datasets. Let's load the data and show it as a plot using the Matplotlib library (see Figure 4.7). This code is pretty standard, as shown in Listing 4.1.

Illustration of a sample from the handwritten digits dataset, loaded and presented as a plot using the Matplotlib library.

Figure 4.7: Sample from the MNIST training dataset

We see an example of the training dataset. Each of these has Xs, which are features, and Ys, which are the output. The Xs are 784 in number which correspond to 28×28 pixels of the images. The output Y is a number between 0 and 9, representing the digit that the image represents.

Let's use Python to understand the size of the X and Y features. This code is very important to clearly understand the features. It is recommended that the features have similar values. Hence, we normalize the pixel values, which can be between 0 and 255 to a number between 0 and 1. Similarly, the output Y is changed from an integer from 0–9 to a one‐hot encoded vector. Basically, each Y is converted to a vector of size 10 with only the relevant element as 1 and all others as 0.

For example: Y = 3 is converted to:

Y = [ 0, 0, 0, 1, 0, 0, 0, 0, 0, 0 ]

Listing 4.2 shows standard code that does this.

Here are the results:

Training X dimensions: (60000, 28, 28)
Training Y dimensions: (60000, 10)
Testing X dimensions: (10000, 28, 28)
Testing Y dimensions: (10000, 10)
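The two preparation steps, scaling the pixels and one‐hot encoding the labels, can be sketched in plain Python as follows. The function names are hypothetical and chosen for illustration; in Keras the label encoding is typically done with the `to_categorical` utility instead.

```python
def normalize_pixels(pixels, max_value=255):
    """Scale pixel intensities from the range 0..255 to the range 0..1."""
    return [p / max_value for p in pixels]


def one_hot(label, num_classes=10):
    """Return a vector of zeros with a single 1 at the label's position."""
    vector = [0] * num_classes
    vector[label] = 1
    return vector


print(normalize_pixels([0, 51, 255]))
print(one_hot(3))
```

Note that applying the one‐hot encoding a second time to labels that are already encoded would add an extra dimension to Y, so it should be run exactly once.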

Now that we have our dataset defined, let's look at the code that actually builds the neural network. We will first create the simple MLP we saw earlier. The input layer will have 784 inputs. This is created by taking the 28×28 image array and making it a single‐dimension vector of size 784. This is done using the Flatten layer in Keras. You don't need to specify dimensions for the Flatten layer because it automatically calculates them using the input layer dimensions.

Next, we will use a hidden layer with 512 neurons. This layer is a dense layer signifying that every neuron from a previous layer is connected to every neuron from the next layer. We will use an ReLU activation function for this layer. As we discussed earlier, ReLU activation for hidden layers helps the network learn much faster.

Finally, we have the output layer, which is again a dense layer with 10 neurons. These 10 neurons signify the prediction of handwritten digits represented by the image—between 0 and 9. Here we use a Softmax Activation function so that we can get outputs between 0 and 1 in each of the 10 neurons. Also, Softmax applied to the whole layer gives us a total probability value for all neurons as 1. So, if the digit indicated in the image is a 5, the training set results will show a value of Y_Train as:

Y = [ 0, 0, 0, 0, 0, 1, 0, 0, 0, 0 ]

After training, we expect our model to make a prediction such that the sum of all predictions is 1 (indicating 100% probability) and for digit 5, maximum probability has been allocated. After we build the model, we will show a summary of the model, as shown in Listing 4.3.

Here are the results:

_________________________________________________________________
Layer (type)                  Output Shape                Param #
=================================================================
flatten_6 (Flatten)           (None, 784)                 0
_________________________________________________________________
dense_11 (Dense)              (None, 512)                 401920
_________________________________________________________________
dense_12 (Dense)              (None, 10)                  5130

Total params: 407,050
Trainable params: 407,050
Non-trainable params: 0
_________________________________________________________________
Train on 40199 samples, validate on 19801 samples

Epoch 1/1
40199/40199 [==============================] - 7s 178us/step - loss: 0.2389 - acc: 0.9298 - val_loss: 0.1346 - val_acc: 0.9606

10000/10000 [==============================] - 0s 44us/step

Evaluation on Test Dataset: [0.11573542263880372, 0.9653]
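The Softmax activation on the output layer is what turns the ten raw neuron values into probabilities that sum to 1. A minimal plain‐Python sketch of that calculation follows; the ten scores are made up for illustration and are not taken from the trained model.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)
    # Subtracting the maximum does not change the result but avoids overflow.
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up raw outputs of the 10 output neurons, for digits 0 through 9.
scores = [0.5, 1.2, -0.3, 0.0, 0.8, 4.0, -1.0, 0.2, 0.1, 0.6]
probs = softmax(scores)
predicted_digit = probs.index(max(probs))

print(f"sum of probabilities: {sum(probs):.6f}")
print(f"predicted digit: {predicted_digit}")
```

The predicted digit is simply the position of the largest probability, which here is the neuron with the highest raw score.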

The number of hidden layers and the number of neurons in a hidden layer are our hyper‐parameters. The network does not learn these; instead, we modify them and see if they make our predictions better. What the network does learn are its weights, also called trainable parameters. The model summary shows the number of trainable parameters. In the previous example, the total number of weights or trainable parameters is 407,050. This calculation is pretty simple and can be used with any network of dense layers:

First weight Layer size    = (Layer1 Neurons + 1) * Layer2 Neurons
                           = (784+1) * 512 = 401920
 
Second weights Layer size  = (Layer2 Neurons + 1) * Layer3 Neurons
                           = (512+1)*10 = 5130
 
Total weights of the Model = 401920 + 5130 = 407050 
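The same arithmetic can be wrapped in a small helper, sketched here with a hypothetical function name. It assumes every layer is dense and that each neuron has one bias weight, as in the calculation above.

```python
def count_trainable_parameters(layer_sizes):
    """Sum (neurons_in + 1 bias) * neurons_out over consecutive dense layers."""
    return sum(
        (n_in + 1) * n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# 784 flattened inputs, 512 hidden neurons, 10 output neurons.
print(count_trainable_parameters([784, 512, 10]))
```

For layer sizes of 2, 3, and 1, the same function gives the total of 13 weights counted by hand earlier in the chapter.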

As we did earlier, we will use training X and Y values to build the model and tune the weights. The testing values will be used exclusively for the final evaluation.

There you go. You have collected image data, normalized it, and trained your first neural network, reaching about 93% accuracy on the training data and 96.5% on the test dataset after a single epoch. Our MLP neural network structure is shown in Figure 4.8. We have 407,050 weights in this model to train. We have all dense layers, which we will indicate in blue. In the next chapter, as we deal with more types of layers, we will use different notations.

Illustration of a multi-layered perceptron (MLP) neural network structure presenting the summary of a neural network.

Figure 4.8: Summary of our neural network, multi‐layered perceptron

Some observations about the code and the results:

  • We used the Adam optimizer. This is very common. Some other types are RMSProp, Adagrad, and SGD (Stochastic Gradient Descent). These are all variations of the traditional Gradient Descent optimization technique so that the model converges faster and our training process is faster. Adam is usually very popular, but you can try others and see if the results get better and faster.
  • Since this was a multi‐class classification problem, we used a categorical Cross‐Entropy Loss function. We ran the training only for one epoch and got pretty good results. This was because the data was clean and of good quality. In reality, you will likely have bad data that will need cleansing and other processing.
  • Another thing to notice is that MNIST was nice enough to give us separate training and testing data. However, when we trained the model, we also included a validation split of 0.33, which is 33%. So, we only used 67% of the training data for training and validated the model using the remaining 33%. Our results show the loss and accuracy on the training data, followed by the loss and accuracy on the validation data. Typically, we will tune hyper‐parameters like the number of layers and the number of neurons in each and see if our validation accuracy increases.
  • The testing dataset is used for evaluating our model and establishing benchmarks. The last line of code evaluates our model on test data and says it's 96.53% accurate. Now if we choose a new architecture or a new algorithm, this will be our benchmark to beat!
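The settings described in these observations can be put together in one short sketch. This is not a reproduction of the chapter's listings: it assumes the Keras API bundled with TensorFlow 2 (`tf.keras`), and the function and constant names are hypothetical. Layer names, timings, and exact accuracy figures will differ from the output shown earlier.

```python
LAYER_SIZES = (784, 512, 10)  # flattened inputs, hidden neurons, output classes
VALIDATION_SPLIT = 0.33       # share of the training data held out for validation


def build_and_train(epochs=1):
    """Build, train, and evaluate an MLP like the one described in this chapter.

    TensorFlow is imported here rather than at the top so that this file can
    be imported on a machine without TensorFlow installed. Calling the
    function downloads the MNIST dataset on first use.
    """
    import tensorflow as tf

    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

    # Scale pixels from 0-255 to 0-1 and one-hot encode the digit labels.
    x_train, x_test = x_train / 255.0, x_test / 255.0
    y_train = tf.keras.utils.to_categorical(y_train, LAYER_SIZES[2])
    y_test = tf.keras.utils.to_categorical(y_test, LAYER_SIZES[2])

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(LAYER_SIZES[1], activation="relu"),
        tf.keras.layers.Dense(LAYER_SIZES[2], activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    model.summary()

    # Train on 67% of the training data and validate on the remaining 33%.
    model.fit(x_train, y_train, epochs=epochs,
              validation_split=VALIDATION_SPLIT)

    # The test data is used only here, to establish the benchmark.
    return model, model.evaluate(x_test, y_test)  # [test loss, test accuracy]
```

Calling `build_and_train()` trains for one epoch by default. Swapping `"adam"` for another optimizer name such as `"rmsprop"` or `"sgd"` is an easy way to compare optimizers against the same benchmark.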

Now let's talk a little about the training, validation, and testing sets and about overfitting and underfitting.

Bias vs. Variance: Underfitting vs. Overfitting

You have seen the concepts of overfitting and underfitting in Chapter 2. Remember the darts example, shown again in Figure 4.9?

Illustration of three dart boards to explain bias and variance: (left) high bias low variance, (middle) low bias high variance, and (right) low bias low variance.

Figure 4.9: Darts example to explain bias and variance

Let's discuss how training and validation results give us an idea of underfitting and overfitting. Figure 4.10 shows an informal chart that can help us make some decisions while building neural networks.

Illustration of an informal chart discussing how training and validation results give an idea of underfitting and overfitting while building neural networks.

Figure 4.10: Training vs. validation set accuracy

As you build a new model, always use separate training and validation datasets and find the accuracy of your model on each. This concept is known as cross‐validation; a single fixed split like this is its simplest form, often called hold‐out validation. The idea is to give your model a set of data to train upon. Then you evaluate its metrics on a new dataset that it has never seen before, in order to see how effective it is.

There is a type of cross‐validation that is popular in industry known as K‐fold cross‐validation. Here the idea is to divide your complete dataset into K groups and, at each iteration, use one of the K groups as the validation set and the rest of the data for training. This way, every group takes a turn as the “unseen” data that the model is evaluated on, and averaging the K results gives you a more reliable estimate of how effective the model is.
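
The K‐fold idea can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: the function name k_fold_indices is hypothetical, the folds are taken in order without shuffling (in practice you would usually shuffle the data first), and the model training step is left out.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    if k < 2 or k > n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = indices[start:stop]           # this fold is held out
        train = indices[:start] + indices[stop:]   # everything else trains
        yield train, validation
        start = stop


for fold, (train, validation) in enumerate(k_fold_indices(10, 5)):
    print("fold", fold, "validation samples:", validation)
```

Each sample lands in the validation set exactly once. You would train a fresh model on each train part, score it on the matching validation part, and average the K scores.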

If you get high accuracy on training data but not on the validation data, then your model is overfitting on the training data. It is learning variations that are specific to the training data and that do not carry over to your validation dataset. In this case, you need to get more training data. There are also techniques to avoid overfitting, like regularization and dropout, that you can use.

Let's quickly look at what regularization is. We saw in the discussion on back‐propagation how a Cost function is a function of all the weights in different layers of the network and helps us find optimum values for the network weights. If your model is overfitting on the training data, that means your weights have become too specific to your training data. The idea with regularization is to add some special terms, based on the size of the network weights, to the Cost function, so that large weights make the cost higher. In other words, we are penalizing the weights so that they don't overfit to the training dataset and are more generic.
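
As a rough illustration of penalizing the weights, the sketch below adds an L2 penalty (the sum of squared weights times a strength factor) to a cost value. L2 is just one common choice of penalty and is an assumption here, as are the function names, the example weights, and the strength lam=0.01.

```python
def l2_penalty(weights, lam):
    """L2 penalty: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)


def regularized_cost(data_loss, weights, lam):
    """Cost the optimizer would minimize: the usual loss plus the penalty."""
    return data_loss + l2_penalty(weights, lam)


# Two hypothetical weight sets that reach the same loss on the training data.
small_weights = [0.1, -0.2, 0.05]
large_weights = [3.0, -4.0, 2.5]

print(regularized_cost(0.30, small_weights, lam=0.01))
print(regularized_cost(0.30, large_weights, lam=0.01))
```

Both weight sets fit the training data equally well, but the second one ends up with roughly twice the cost (about 0.61 versus 0.30), so the optimizer is nudged toward the smaller, more generic weights.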

The second method to prevent overfitting is dropout. In dropout, during the training process, we randomly drop out a percentage of neurons from a layer and use the rest of the network for training. This helps prevent overfitting by keeping certain neurons from getting tied to specific inputs. Since a different random set of neurons is dropped out (outputting zero values) at each training iteration, the network is forced to learn patterns that do not depend on specific training data or neurons.
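
Here is a minimal sketch of what dropout does to one layer's outputs during training, written in plain Python rather than with a Deep Learning library. The function name and the example values are hypothetical. The sketch also assumes the common "inverted dropout" variant, in which the surviving outputs are scaled up so that their expected total stays the same.

```python
import random


def dropout(activations, rate, rng):
    """Zero each activation with probability `rate`; scale up the survivors."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(0)  # fixed seed so the run is repeatable
layer_output = [0.5, 1.2, 0.8, 0.3, 0.9, 1.1]

# A new random mask is drawn at every training step.
for step in range(3):
    print("step", step, dropout(layer_output, rate=0.5, rng=rng))
```

Because a different set of neurons is silenced at each step, no single neuron can be relied upon. At prediction time dropout is switched off and all neurons are used.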

Again, these concepts of regularization and dropout are extremely well explained with equations in Andrew Ng's video lecture. For practical ML, what I covered is good enough. You can now start using these layers in Keras. However, if you are interested in understanding what is happening under the hood, I highly recommend watching Andrew Ng's video classes.
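
As a starting point, the sketch below shows one way regularization and dropout can be added to a small Keras model. The layer sizes, the dropout rate of 0.2, and the L2 strength of 0.001 are illustrative assumptions, not the values used earlier in this chapter. The input shape of 784 assumes flattened 28×28 MNIST images.

```python
from tensorflow import keras

# Illustrative values only; tune them against your validation accuracy.
model = keras.Sequential([
    keras.Input(shape=(784,)),  # assumes flattened 28x28 images
    keras.layers.Dense(
        512,
        activation="relu",
        # L2 penalty on this layer's weights, added to the loss
        kernel_regularizer=keras.regularizers.l2(0.001),
    ),
    # Randomly zero 20% of the previous layer's outputs during training
    keras.layers.Dropout(0.2),
    keras.layers.Dense(10, activation="softmax"),  # one output per digit
])

model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
model.summary()
```

Keras applies the Dropout layer only during training. When you evaluate or predict, it passes the values through unchanged.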

If you get high accuracy on the validation data but your training data gives less promising results, you probably have a complex training dataset and a pretty simple validation set. We divided our data at random in our MNIST example. However, in real problems, you will need to build a validation set that has a good representation of your expected outputs. It is recommended that the validation set include all the variations you expect to see. This way, once you get good accuracy in validation you can be pretty confident that the model performs well on unseen data.

Finally, if you get poor accuracy on both the training and validation sets, that means you need more data or a better model, or sometimes both. By the same token, if you get good accuracy with both, you have a good model that has learned the patterns and works well on unseen data. That's what you should aim for!
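
The four cases discussed above can be condensed into a small helper. This is only an informal sketch: the function name is hypothetical, and the 0.90 threshold for "good" accuracy is borrowed from the MNIST example and will differ from problem to problem.

```python
def diagnose(train_accuracy, validation_accuracy, good=0.90):
    """Map training and validation accuracy to a rough diagnosis."""
    train_ok = train_accuracy >= good
    validation_ok = validation_accuracy >= good
    if train_ok and validation_ok:
        return "good fit: the model works well on unseen data"
    if train_ok:
        return "overfitting: get more data, or try regularization or dropout"
    if validation_ok:
        return "validation set may be too simple: make it more representative"
    return "underfitting: get more data, a better model, or both"


print(diagnose(0.99, 0.80))
print(diagnose(0.70, 0.65))
print(diagnose(0.97, 0.96))
```

Treat the output as a prompt for what to try next rather than as a verdict, since a fixed threshold ignores how hard the problem is.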

For the MNIST example, an accuracy of above 90% is pretty good. We got that with all three datasets—training, validation, and testing. In the next chapter, we will see other model architectures like Convolutional Neural Networks (CNNs) and compare them to our MNIST model.

Summary

This chapter started building deep neural network models for analyzing image data. We used the Keras library on top of the TensorFlow framework to build our model. We saw cross‐validation, where we keep the training data separate from the validation data. We trained models and evaluated their accuracy metrics. In the next chapter, we will start building more complex models. We will go beyond the MLP into Convolutional Neural Networks (CNNs) and show how they are much more effective in building deep models specifically for image analysis. In that chapter, we use different data: a dataset of fashion item images. Hopefully, it will be interesting and you can try it on some of your own image data.
